
Commit 2a679b6

Merge remote-tracking branch 'IQSS/develop' into DANS-performance
2 parents d6f03bf + 1b5a1ea


22 files changed: +1521 −986 lines changed

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@

### Added whitespace trimming to uploaded custom metadata TSV files

When loading custom metadata blocks using the `api/admin/datasetfield/load` API, whitespace can be introduced into field names.
This change trims whitespace at the beginning and end of all values read into the API before persisting them.

For more information, see #10688.
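For reference (not part of this diff), a typical load of a custom metadata block TSV through this endpoint looks roughly like the line below; the file name `custom-metadata.tsv` is a placeholder, and the exact flags should be checked against the metadata customization guide.

    # illustrative only: POST a TSV metadata block definition to the admin endpoint
    curl http://localhost:8080/api/admin/datasetfield/load -H "Content-type: text/tab-separated-values" -X POST --upload-file custom-metadata.tsv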
Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@

## A new Globus optimization setting

An optimization has been added for the Globus upload workflow, with a corresponding new database setting: `:GlobusBatchLookupSize`

See the [Database Settings](https://guides.dataverse.org/en/6.5/installation/config.html#GlobusBatchLookupSize) section of the Guides for more information.
Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@

### New API to Audit Datafiles across the database

This is a superuser-only API endpoint to audit Datasets with DataFiles where the physical files or the file metadata are missing.
The Datasets scanned can be limited by optional firstId and lastId query parameters, or by a given CSV list of Dataset identifiers.
Once the audit report is generated, a superuser can either delete the missing file(s) from the Dataset or contact the author to re-upload the missing file(s).

The JSON response includes:
- A list of DataFiles where the file exists in the database but the physical file is not in the file store.
- A list of DataFiles where the FileMetadata is missing.
- Other failures found when trying to process the Datasets.

curl -H "X-Dataverse-key:$API_TOKEN" "http://localhost:8080/api/admin/datafiles/auditFiles"
curl -H "X-Dataverse-key:$API_TOKEN" "http://localhost:8080/api/admin/datafiles/auditFiles?firstId=0&lastId=1000"
curl -H "X-Dataverse-key:$API_TOKEN" "http://localhost:8080/api/admin/datafiles/auditFiles?datasetIdentifierList=doi:10.5072/FK2/RVNT9Q,doi:10.5072/FK2/RVNT9Q"

For more information, see [the docs](https://dataverse-guide--11016.org.readthedocs.build/en/11016/api/native-api.html#datafile-audit), #11016, and [#220](https://github.com/IQSS/dataverse.harvard.edu/issues/220).

doc/sphinx-guides/source/api/native-api.rst

Lines changed: 66 additions & 0 deletions
@@ -6300,6 +6300,72 @@ Note that if you are attempting to validate a very large number of datasets in y

  asadmin set server-config.network-config.protocols.protocol.http-listener-1.http.request-timeout-seconds=3600

Datafile Audit
~~~~~~~~~~~~~~

Produce an audit report of missing files and FileMetadata for Datasets.
The API scans the Datasets in the database and verifies that the stored files exist. If files or FileMetadata are missing, this information is returned in a JSON response.
The call returns a status code of 200 if the report was generated successfully. Issues found are documented in the report and do not produce a failure status code unless the report could not be generated::

  curl -H "X-Dataverse-key:$API_TOKEN" "$SERVER_URL/api/admin/datafiles/auditFiles"

Optional parameters are available for filtering the Datasets scanned.

For auditing the Datasets in a paged manner (firstId and lastId)::

  curl -H "X-Dataverse-key:$API_TOKEN" "$SERVER_URL/api/admin/datafiles/auditFiles?firstId=0&lastId=1000"

Auditing specific Datasets (comma-separated list)::

  curl -H "X-Dataverse-key:$API_TOKEN" "$SERVER_URL/api/admin/datafiles/auditFiles?datasetIdentifierList=doi:10.5072/FK2/JXYBJS,doi:10.7910/DVN/MPU019"

Sample JSON Audit Response::

  {
    "status": "OK",
    "data": {
      "firstId": 0,
      "lastId": 100,
      "datasetIdentifierList": [
        "doi:10.5072/FK2/XXXXXX",
        "doi:10.5072/FK2/JXYBJS",
        "doi:10.7910/DVN/MPU019"
      ],
      "datasetsChecked": 100,
      "datasets": [
        {
          "id": 6,
          "pid": "doi:10.5072/FK2/JXYBJS",
          "persistentURL": "https://doi.org/10.5072/FK2/JXYBJS",
          "missingFileMetadata": [
            {
              "storageIdentifier": "local://1930cce4f2d-855ccc51fcbb",
              "dataFileId": "7"
            }
          ]
        },
        {
          "id": 47731,
          "pid": "doi:10.7910/DVN/MPU019",
          "persistentURL": "https://doi.org/10.7910/DVN/MPU019",
          "missingFiles": [
            {
              "storageIdentifier": "s3://dvn-cloud:298910",
              "directoryLabel": "trees",
              "label": "trees.png"
            }
          ]
        }
      ],
      "failures": [
        {
          "datasetIdentifier": "doi:10.5072/FK2/XXXXXX",
          "reason": "Not Found"
        }
      ]
    }
  }
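As an illustration (not part of this diff), the audit report can be post-processed with standard JSON tooling. Assuming ``jq`` is installed, the PIDs of Datasets with missing physical files could be pulled out of a response shaped like the sample above::

  # list the PID of every Dataset that has at least one missing physical file
  curl -H "X-Dataverse-key:$API_TOKEN" "$SERVER_URL/api/admin/datafiles/auditFiles" | jq -r '.data.datasets[] | select(.missingFiles != null) | .pid'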
Workflows
~~~~~~~~~

doc/sphinx-guides/source/installation/config.rst

Lines changed: 7 additions & 0 deletions
@@ -4849,6 +4849,13 @@ The URL where the `dataverse-globus <https://github.com/scholarsportal/dataverse

The interval in seconds between Dataverse calls to Globus to check on upload progress. Defaults to 50 seconds (or to 10 minutes, when the ``globus-use-experimental-async-framework`` feature flag is enabled). See :ref:`globus-support` for details.

.. _:GlobusBatchLookupSize:

:GlobusBatchLookupSize
++++++++++++++++++++++

In the initial implementation, when files were added to a dataset upon completion of a Globus upload task, Dataverse made a separate Globus API call to look up the size of every new file. This proved to be a significant bottleneck at Harvard Dataverse when users transferred batches of many thousands of files (which in turn was made possible by the Globus improvements in v6.4). An optimized lookup mechanism was added in response: the Globus service makes a single listing API call on the entire remote folder, then populates the file sizes for all the new file entries before passing them to the Ingest service. However, this approach can actually slow things down when the dataset's Globus folder already contains thousands of files and only a small number of new files are being added. To address this, the batch size at which Dataverse switches to the folder-listing method was made configurable. If unset, it defaults to 50 (an arbitrary number). Setting it to 0 makes Dataverse always use the folder-listing method for Globus uploads; setting it to a very large number disables it completely. This was made a database setting, rather than a JVM option, so that it can be adjusted in real time.
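For reference (not part of this diff), database settings such as this one are normally changed through the admin settings API while the application is running; a minimal sketch, assuming a local installation at ``localhost:8080`` and a threshold of 50::

  # set the Globus batch lookup threshold to 50 files
  curl -X PUT -d 50 http://localhost:8080/api/admin/settings/:GlobusBatchLookupSize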
:GlobusSingleFileTransfer
+++++++++++++++++++++++++

src/main/java/edu/harvard/iq/dataverse/MailServiceBean.java

Lines changed: 1 addition & 1 deletion
@@ -283,7 +283,7 @@ public Boolean sendNotificationEmail(UserNotification notification, String comme
         if (objectOfNotification != null){
             String messageText = getMessageTextBasedOnNotification(notification, objectOfNotification, comment, requestor);
             String subjectText = MailUtil.getSubjectTextBasedOnNotification(notification, objectOfNotification);
-            if (!(messageText.isEmpty() || subjectText.isEmpty())){
+            if (!(StringUtils.isEmpty(messageText) || StringUtils.isEmpty(subjectText))){
                 retval = sendSystemEmail(emailAddress, subjectText, messageText, isHtmlContent);
             } else {
                 logger.warning("Skipping " + notification.getType() + " notification, because couldn't get valid message");
