
CUMULUS-4352: implemented multipart downloads in AddMissingFileChecksum core task #4211

Open
brandonlokey wants to merge 10 commits into master from
CUMULUS-4352--implement-multipart-downloads-in-AddMissingFileChecksums-core-task

Conversation

@brandonlokey brandonlokey commented Jan 14, 2026

Summary: Implements support for multipart downloads and checksum computation in AddMissingFileChecksums

Addresses CUMULUS-4352: Implement multipart downloads in AddMissingFileChecksums core task

Changes

  • Added support for multipart downloads and checksum computation for granule files that exceed a configurable size threshold.
  • The threshold and part size are configurable through the MULTIPART_CHECKSUM_THRESHOLD_MEGABYTES and MULTIPART_CHECKSUM_PART_MEGABYTES environment variables. If either variable is not defined, or the file is below the threshold, the task falls back to the standard single-stream method (see the sketch after this list).
  • Made the size attribute required in the granuleFile object and input schema.
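A minimal sketch of the gating described above, assuming the two environment variables from this description and a numeric size on the granule file (names are illustrative, not the exact task code):

```typescript
const thresholdMb = Number(process.env.MULTIPART_CHECKSUM_THRESHOLD_MEGABYTES || 0);
const partMb = Number(process.env.MULTIPART_CHECKSUM_PART_MEGABYTES || 0);

// Multipart checksumming only applies when both variables are set and the file
// is larger than the threshold; otherwise the task falls back to the existing
// single-stream checksum.
const useMultipartChecksum = (sizeBytes: number | undefined): boolean => {
  if (!sizeBytes || thresholdMb <= 0 || partMb <= 0) return false;
  return sizeBytes > thresholdMb * 1024 * 1024;
};
```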

PR Checklist

  • Update CHANGELOG
  • Unit tests
  • Ad-hoc testing - Deploy changes and test manually
  • Integration tests

📝 Note:
For most pull requests, please Squash and merge to maintain a clean and readable commit history.

Contributor

@reweeden reweeden left a comment

As we talked about, it's quite disappointing that the JS SDK doesn't have the same canned multipart functionality as the Python one. Since it's quite complex to implement, it might actually be easier to just convert this task to Python so we can leverage s3transfer, which already has all the hiccups figured out.

Ideally, if we stick with JS, I was imagining we would implement a helper in the Cumulus libraries that acts similarly to boto3's download_fileobj function, as this would likely be useful in other tasks as well (for instance SyncGranule, which has an option to do checksum validation). But given how much effort it's been, and since I'm not a JS expert, it's probably not worth trying to do that at this time.

@brandonlokey brandonlokey marked this pull request as ready for review January 23, 2026 22:18
@paulpilone
Contributor

Ideally, if we stick with JS, I was imagining we would implement a helper in the Cumulus libraries that acts similarly to boto3's download_fileobj function, as this would likely be useful in other tasks as well (for instance SyncGranule, which has an option to do checksum validation). But given how much effort it's been, and since I'm not a JS expert, it's probably not worth trying to do that at this time.

I think I agree with @reweeden here that this functionality would be a good addition to the S3 client's getObject, so that multipart downloads could be used elsewhere. I don't think it would be a difficult migration from the code you've written so far: if a GetObjectMultipartMethod were added to our S3 client, it would mostly be the logic you've written in the calculateGranuleFileChecksum and calculateObjectHashByRanges functions -- just without doing the hash calculation itself, instead returning either resolved chunks or writing the file to a stream. I could see the value in only returning chunks and updating the hash like you are now ... so if we were to move this functionality into the S3 client, we'd want to expose that as a return value rather than always writing the file to a stream.

I have a few small comments but I don't think they're relevant until we decide one way or another on where this should be implemented.
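For illustration only, a ranged-download helper along the lines discussed here might look roughly like the sketch below. The function and parameter names (getObjectByRanges, objectSize, partSizeBytes) are hypothetical; nothing here is an existing Cumulus S3 client method.

```typescript
import { S3Client, GetObjectCommand } from '@aws-sdk/client-s3';

// Hypothetical sketch: yield each byte range of an object as a chunk so the
// caller can update a hash incrementally or write the chunks to a stream.
async function* getObjectByRanges(
  s3: S3Client,
  bucket: string,
  key: string,
  objectSize: number,
  partSizeBytes: number
): AsyncGenerator<Uint8Array> {
  for (let start = 0; start < objectSize; start += partSizeBytes) {
    const end = Math.min(start + partSizeBytes, objectSize) - 1;
    const response = await s3.send(new GetObjectCommand({
      Bucket: bucket,
      Key: key,
      Range: `bytes=${start}-${end}`,
    }));
    // Body is a streaming payload in SDK v3; collect this range into bytes.
    yield await response.Body!.transformToByteArray();
  }
}
```

A checksum caller would then update a crypto.Hash with each yielded chunk, which matches the chunk-returning option described in the comment above.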

const { s3, algorithm, granuleFile } = params;

- const { bucket, key } = granuleFile;
+ const { bucket, key, size } = granuleFile;
Contributor

So what happens if size is null? Or would it be undefined? I'm not sure how it works in JS.

Author

If the value is null, it will default to the usual single-stream method of calculating the checksum.

const partSizeBytes = partMb * 1024 * 1024;

const partitioningEnabled = (thresholdMb > 0 && partMb > 0)
  && (granuleFile.size > thresholdBytes);
Contributor

Using granuleFile.size vs. just the size variable, which is the same?

Author

Good catch, this was a small mixup. Fixed in a recent commit.

@brandonlokey brandonlokey removed the request for review from bhazuka February 2, 2026 16:50
import crypto from 'crypto';
import { Granule, GranuleFile, HandlerInput, HandlerEvent } from './types';

process.env.AWS_MAX_ATTEMPTS ??= '3';
Contributor

Could this result in setting those environment variables if not previously set? Maybe that's what you want, but it might be better to declare a local here and set it to 3 if, for example, process.env.AWS_MAX_ATTEMPTS is undefined.

Contributor

e.g., the S3 client does this:

const s3JitterMaxMs = Number(process.env.S3_JITTER_MAX_MS || 0);

Which is probably what you want to do here too.

Author

This was intended to set the environment variable to 3 to enable multiple retries for GET requests, but I will just remove this so it can be configured outside of index.ts.

process.env.AWS_RETRY_MODE ??= 'standard';
process.env.S3_JITTER_MAX_MS ??= '500';

const updateHashFromBody = async (hash: crypto.Hash, body: any) => {
Contributor

If possible, it'd be good to type body here. If you wanted to be really strict, you could probably grab the type that the S3 client's getObject defines -- or at a minimum type the .on so it's clear how it's being used. Not totally necessary since this is a somewhat unique situation -- I just get nervous when there are any-typed parameters.

Author

Addressed this in a recent commit; I typed the body so it's Readable | Buffer | Uint8Array.
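As a rough sketch of what that typed signature might look like (not the exact code in the task):

```typescript
import crypto from 'crypto';
import { Readable } from 'stream';

const updateHashFromBody = async (
  hash: crypto.Hash,
  body: Readable | Buffer | Uint8Array
): Promise<void> => {
  if (body instanceof Readable) {
    // Streaming case: feed each chunk into the hash as it arrives.
    for await (const chunk of body) {
      hash.update(chunk);
    }
    return;
  }
  // Non-stream case: normalize a Buffer to a plain Uint8Array before updating.
  hash.update(Buffer.isBuffer(body) ? new Uint8Array(body) : body);
};
```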

}

// Fallback if it is a non-stream
hash.update(Buffer.isBuffer(body) ? body : Buffer.from(body));
Contributor

So this throws the following typing error:

Conversion of type 'Buffer' to type 'BinaryLike' may be a mistake because neither type sufficiently overlaps with the other. If this was intentional, convert the expression to 'unknown' first.
  Type 'Buffer' is not comparable to type 'Uint8Array<ArrayBufferLike> | DataView<ArrayBufferLike>'.
    Type 'Buffer' is not comparable to type 'Uint8Array<ArrayBufferLike>'.
      The types returned by 'slice(...).entries()' are incompatible between these types.
        Type 'IterableIterator<[number, number]>' is missing the following properties from type 'ArrayIterator<[number, number]>': map, filter, take, drop, and 9 more.ts(2352)

After some Googling (because I wasn't sure exactly how to type this), it seems like a reasonable solution here is to add a @ts-expect-error comment ("Buffer is valid BinaryLike but the TypeScript types are mismatched"). Another solution is to double-cast through unknown.

Author

Addressed this in the most recent commit; I was having a bunch of issues with the typing here. I ended up using an approach where I normalize the body into a Buffer and then convert it to a Uint8Array, which seems to be working okay.
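A small sketch of that normalization idea, assuming the non-stream values reaching this point are Buffer or Uint8Array (toUint8Array is an assumed helper name):

```typescript
import crypto from 'crypto';

// View a Buffer as a plain Uint8Array (no copy) so hash.update() receives a
// type the compiler accepts without casts or @ts-expect-error.
const toUint8Array = (data: Buffer | Uint8Array): Uint8Array =>
  (Buffer.isBuffer(data)
    ? new Uint8Array(data.buffer, data.byteOffset, data.byteLength)
    : data);

const hash = crypto.createHash('sha256');
hash.update(toUint8Array(Buffer.from('example')));
console.log(hash.digest('hex'));
```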

- const { bucket, key } = granuleFile;
+ const { bucket, key, size } = granuleFile;

const thresholdMb = Number(process.env.MULTIPART_CHECKSUM_THRESHOLD_MEGABYTES);
Contributor

@paulpilone paulpilone Feb 3, 2026

This (and partMb) could end up as NaN if the env vars aren't set.

Author

Addressed this in the most recent commit; it now defaults to 0 if not set, which disables partitioning.
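A minimal sketch of that defaulting behavior, with an assumed helper name:

```typescript
// Parse a megabyte-valued env var; default to 0 (partitioning disabled) when
// it is unset, empty, or not a positive number.
const envMegabytes = (name: string): number => {
  const value = Number(process.env[name] || 0);
  return Number.isFinite(value) && value > 0 ? value : 0;
};

const thresholdMb = envMegabytes('MULTIPART_CHECKSUM_THRESHOLD_MEGABYTES');
const partMb = envMegabytes('MULTIPART_CHECKSUM_PART_MEGABYTES');
```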
