An optimization has been added for the Globus upload workflow, with a corresponding new database setting: `:GlobusBatchLookupSize`
See the [Database Settings](https://guides.dataverse.org/en/6.5/installation/config.html#GlobusBatchLookupSize) section of the Guides for more information.
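For example, the new setting can be adjusted through the standard database settings API (a sketch; 50 is the default described in the Guides):

```shell
# Optional: tune the batch-size threshold for the optimized Globus file-size lookup.
# If the setting is absent, Dataverse defaults to 50.
curl -X PUT -d 50 http://localhost:8080/api/admin/settings/:GlobusBatchLookupSize
```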
### New API to Audit Datafiles across the database
This is a superuser-only API endpoint that audits Datasets with DataFiles where the physical files are missing or the file metadata is missing.
The Datasets scanned can be limited by the optional `firstId` and `lastId` query parameters, or by a given CSV list of Dataset Identifiers.
Once the audit report is generated, a superuser can either delete the missing file(s) from the Dataset or contact the author to re-upload the missing file(s).
The JSON response includes:

- A list of DataFiles where the file record exists in the database but the physical file is not in the file store.
- A list of DataFiles where the FileMetadata is missing.
- Any other failures found when trying to process the Datasets.
For more information, see [the docs](https://dataverse-guide--11016.org.readthedocs.build/en/11016/api/native-api.html#datafile-audit), #11016, and [#220](https://github.com/IQSS/dataverse.harvard.edu/issues/220)
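As a rough usage sketch (the endpoint path `/api/admin/datafiles/auditFiles` and the CSV-list parameter name `datasetIdentifierList` are assumptions to be checked against the linked docs; `firstId`/`lastId` are the optional range parameters mentioned above):

```shell
export SERVER_URL=https://demo.dataverse.org
export API_TOKEN=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  # superuser API token

# Audit every Dataset in the database (assumed endpoint path)
curl -H "X-Dataverse-key:$API_TOKEN" "$SERVER_URL/api/admin/datafiles/auditFiles"

# Limit the scan to a range of Dataset database ids
curl -H "X-Dataverse-key:$API_TOKEN" "$SERVER_URL/api/admin/datafiles/auditFiles?firstId=0&lastId=1000"

# Audit only the Datasets in a CSV list of identifiers (parameter name is an assumption)
curl -H "X-Dataverse-key:$API_TOKEN" "$SERVER_URL/api/admin/datafiles/auditFiles?datasetIdentifierList=doi:10.5072/FK2/XXXXXX,doi:10.5072/FK2/YYYYYY"
```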
doc/sphinx-guides/source/api/native-api.rst (66 additions, 0 deletions)
@@ -6300,6 +6300,72 @@ Note that if you are attempting to validate a very large number of datasets in y
asadmin set server-config.network-config.protocols.protocol.http-listener-1.http.request-timeout-seconds=3600
Datafile Audit
~~~~~~~~~~~~~~
Produce an audit report of missing files and FileMetadata for Datasets.
Scans the Datasets in the database and verifies that the stored files exist. If files or FileMetadata are missing, this information is returned in a JSON response.
The call will return a status code of 200 if the report was generated successfully. Issues found will be documented in the report and will not return a failure status code unless the report could not be generated::
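
  # Hedged sketch, not verbatim from the diff: the endpoint path below
  # (/api/admin/datafiles/auditFiles) and the optional query parameters are
  # assumptions based on the release note; verify against the merged guide.
  curl "$SERVER_URL/api/admin/datafiles/auditFiles"

  # Optionally limit the scan by Dataset database id range
  curl "$SERVER_URL/api/admin/datafiles/auditFiles?firstId=0&lastId=1000"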
doc/sphinx-guides/source/installation/config.rst (7 additions, 0 deletions)
@@ -4849,6 +4849,13 @@ The URL where the `dataverse-globus <https://github.com/scholarsportal/dataverse
The interval in seconds between Dataverse calls to Globus to check on upload progress. Defaults to 50 seconds (or to 10 minutes, when the ``globus-use-experimental-async-framework`` feature flag is enabled). See :ref:`globus-support` for details.
.. _:GlobusBatchLookupSize:
:GlobusBatchLookupSize
++++++++++++++++++++++
In the initial implementation, when files were added to the dataset upon completion of a Globus upload task, Dataverse would make a separate Globus API call to look up the size of every new file. This proved to be a significant bottleneck at Harvard Dataverse when users transferred batches of many thousands of files (which in turn was made possible by the Globus improvements in v6.4). In response, an optimized lookup mechanism was added: the Globus Service makes a single listing API call on the entire remote folder, then populates the file sizes for all the new file entries before passing them to the Ingest service. This approach, however, may actually slow things down when there are already thousands of files in the Globus folder for the dataset and only a small number of new files are being added. To address this, the batch size at which the optimized method is used was made configurable. If not set, it defaults to 50 (a completely arbitrary number). Setting it to 0 means the optimized method is always used for Globus uploads; setting it to a very large number disables it completely. This was made a database setting, rather than a JVM option, so that it can be adjusted in real time.
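The setting can be adjusted at runtime through the standard database settings API, for example (a sketch; the values simply mirror the behavior described above)::

  # Use the default threshold of 50 explicitly
  curl -X PUT -d 50 http://localhost:8080/api/admin/settings/:GlobusBatchLookupSize

  # Always use the optimized folder-listing lookup
  curl -X PUT -d 0 http://localhost:8080/api/admin/settings/:GlobusBatchLookupSize

  # Remove the setting to revert to the default of 50
  curl -X DELETE http://localhost:8080/api/admin/settings/:GlobusBatchLookupSize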