
CUMULUS-4352: implemented multipart downloads in AddMissingFileChecksum core task #4211

Open
brandonlokey wants to merge 10 commits into master from
CUMULUS-4352--implement-multipart-downloads-in-AddMissingFileChecksums-core-task

Conversation

@brandonlokey brandonlokey commented Jan 14, 2026

Summary: Implements support for multipart downloads and checksum computation in AddMissingFileChecksums

Addresses CUMULUS-4352: Implement multipart downloads in AddMissingFileChecksums core task

Changes

  • Added support for multipart downloads and checksum computation for granule files that exceed a configurable size threshold.
  • The threshold and part size are configurable through the MULTIPART_CHECKSUM_THRESHOLD_MEGABYTES and MULTIPART_CHECKSUM_PART_MEGABYTES environment variables. If either variable is not defined, or the file is below the threshold, the task falls back to the standard single-stream method (see the sketch after this list).
  • Made the size attribute required in the granuleFile object and input schema.
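A minimal sketch of the gating described above, assuming the two environment variables from this description and a numeric size on the granule file (names are illustrative, not the exact task code):

```typescript
const thresholdMb = Number(process.env.MULTIPART_CHECKSUM_THRESHOLD_MEGABYTES || 0);
const partMb = Number(process.env.MULTIPART_CHECKSUM_PART_MEGABYTES || 0);

// Multipart checksumming only applies when both variables are set and the file
// is larger than the threshold; otherwise the task falls back to the existing
// single-stream checksum.
const useMultipartChecksum = (sizeBytes: number | undefined): boolean => {
  if (!sizeBytes || thresholdMb <= 0 || partMb <= 0) return false;
  return sizeBytes > thresholdMb * 1024 * 1024;
};
```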

PR Checklist

  • Update CHANGELOG
  • Unit tests
  • Ad-hoc testing - Deploy changes and test manually
  • Integration tests

📝 Note:
For most pull requests, please Squash and merge to maintain a clean and readable commit history.

Contributor

@reweeden reweeden left a comment

As we talked about, it's quite disappointing that the JS SDK doesn't have the same canned multipart functionality as the Python one. Since it's quite complex to implement, it might actually be easier to just convert this task to Python so we can leverage s3transfer, which already has all the hiccups figured out.

Ideally, if we stick with JS, I was imagining we would implement a helper in the Cumulus libraries that acts similarly to boto3's download_fileobj function, as this would likely be useful in other tasks as well (for instance SyncGranule, which has an option to do checksum validation). But given how much effort it's been, and since I'm not a JS expert, it's probably not worth trying to do that at this time.

@brandonlokey brandonlokey marked this pull request as ready for review January 23, 2026 22:18
@paulpilone
Contributor

Ideally, if we stick with JS, I was imagining we would implement a helper in the Cumulus libraries that acts similarly to boto3's download_fileobj function, as this would likely be useful in other tasks as well (for instance SyncGranule, which has an option to do checksum validation). But given how much effort it's been, and since I'm not a JS expert, it's probably not worth trying to do that at this time.

I think I agree with @reweeden here that this functionality would be a good addition to the S3 client's getObject, so that multipart downloads could be used elsewhere. I don't think it would be a difficult migration from the code you've written so far: if a GetObjectMultipartMethod were added to our S3 client, it would mostly be the logic you've written in the calculateGranuleFileChecksum and calculateObjectHashByRanges functions -- just without doing the hash calculation itself, instead returning either resolved chunks or writing the file to a stream. I could see the value in only returning chunks and updating the hash like you are now ... so if we were to move this functionality into the S3 client, we'd want to expose that as a return value rather than always writing the file to a stream.

I have a few small comments but I don't think they're relevant until we decide one way or another on where this should be implemented.
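For illustration only, a ranged-download helper along the lines discussed here might look roughly like the sketch below. The function and parameter names (getObjectByRanges, objectSize, partSizeBytes) are hypothetical; nothing here is an existing Cumulus S3 client method.

```typescript
import { S3Client, GetObjectCommand } from '@aws-sdk/client-s3';

// Hypothetical sketch: yield each byte range of an object as a chunk so the
// caller can update a hash incrementally or write the chunks to a stream.
async function* getObjectByRanges(
  s3: S3Client,
  bucket: string,
  key: string,
  objectSize: number,
  partSizeBytes: number
): AsyncGenerator<Uint8Array> {
  for (let start = 0; start < objectSize; start += partSizeBytes) {
    const end = Math.min(start + partSizeBytes, objectSize) - 1;
    const response = await s3.send(new GetObjectCommand({
      Bucket: bucket,
      Key: key,
      Range: `bytes=${start}-${end}`,
    }));
    // Body is a streaming payload in SDK v3; collect this range into bytes.
    yield await response.Body!.transformToByteArray();
  }
}
```

A checksum caller would then update a crypto.Hash with each yielded chunk, which matches the chunk-returning option described in the comment above.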

const { s3, algorithm, granuleFile } = params;

- const { bucket, key } = granuleFile;
+ const { bucket, key, size } = granuleFile;
Contributor

So what happens if size is null? Or would it be undefined? I'm not sure how it works in JS.

Author

If the value is null, it will default to the usual single-stream method of calculating the checksum.

const partSizeBytes = partMb * 1024 * 1024;

const partitioningEnabled = (thresholdMb > 0 && partMb > 0)
  && (granuleFile.size > thresholdBytes);
Contributor

Using granuleFile.size vs. just the size variable, which is the same?

Author

Good catch, this was a small mixup. Fixed in a recent commit.

@brandonlokey brandonlokey removed the request for review from bhazuka February 2, 2026 16:50
import crypto from 'crypto';
import { Granule, GranuleFile, HandlerInput, HandlerEvent } from './types';

process.env.AWS_MAX_ATTEMPTS ??= '3';
Contributor

Could this result in setting those environment variables if not previously set? Maybe that's what you want, but it might be better to declare a local here and set it to 3 if, for example, process.env.AWS_MAX_ATTEMPTS is undefined.

Contributor

e.g., the S3 client does this:

const s3JitterMaxMs = Number(process.env.S3_JITTER_MAX_MS || 0);

Which is probably what you want to do here too.

Author

This was intended to set the environment variable to 3 to enable multiple retries for GET requests, but I will just remove this so it can be configured outside of index.ts.

process.env.AWS_RETRY_MODE ??= 'standard';
process.env.S3_JITTER_MAX_MS ??= '500';

const updateHashFromBody = async (hash: crypto.Hash, body: any) => {
Contributor

If possible, it'd be good to type body here. If you wanted to be really strict, you could probably grab the type that the S3 client's getObject defines -- or at a minimum type the .on so it's clear how it's being used. Not totally necessary since this is a somewhat unique situation -- I just get nervous when there are any-typed parameters.

Author

Addressed this in a recent commit; I typed the body so it's Readable | Buffer | Uint8Array.
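As a rough sketch of what that typed signature might look like (not the exact code in the task):

```typescript
import crypto from 'crypto';
import { Readable } from 'stream';

const updateHashFromBody = async (
  hash: crypto.Hash,
  body: Readable | Buffer | Uint8Array
): Promise<void> => {
  if (body instanceof Readable) {
    // Streaming case: feed each chunk into the hash as it arrives.
    for await (const chunk of body) {
      hash.update(chunk);
    }
    return;
  }
  // Non-stream case: normalize a Buffer to a plain Uint8Array before updating.
  hash.update(Buffer.isBuffer(body) ? new Uint8Array(body) : body);
};
```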

}

// Fallback if it is a non-stream
hash.update(Buffer.isBuffer(body) ? body : Buffer.from(body));
Contributor

So this throws the following typing error:

Conversion of type 'Buffer' to type 'BinaryLike' may be a mistake because neither type sufficiently overlaps with the other. If this was intentional, convert the expression to 'unknown' first.
  Type 'Buffer' is not comparable to type 'Uint8Array<ArrayBufferLike> | DataView<ArrayBufferLike>'.
    Type 'Buffer' is not comparable to type 'Uint8Array<ArrayBufferLike>'.
      The types returned by 'slice(...).entries()' are incompatible between these types.
        Type 'IterableIterator<[number, number]>' is missing the following properties from type 'ArrayIterator<[number, number]>': map, filter, take, drop, and 9 more.ts(2352)

After some Googling (because I wasn't sure exactly how to type this), it seems like a reasonable solution here is to add a @ts-expect-error comment ("Buffer is valid BinaryLike but the TypeScript types are mismatched"). Another solution is to double-cast through unknown.

Author

Addressed this in the most recent commit; I was having a bunch of issues with the typing here. I ended up using an approach where I normalize the body into a Buffer and then convert it to a Uint8Array, which seems to be working okay.
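A small sketch of that normalization idea, assuming the non-stream values reaching this point are Buffer or Uint8Array (toUint8Array is an assumed helper name):

```typescript
import crypto from 'crypto';

// View a Buffer as a plain Uint8Array (no copy) so hash.update() receives a
// type the compiler accepts without casts or @ts-expect-error.
const toUint8Array = (data: Buffer | Uint8Array): Uint8Array =>
  (Buffer.isBuffer(data)
    ? new Uint8Array(data.buffer, data.byteOffset, data.byteLength)
    : data);

const hash = crypto.createHash('sha256');
hash.update(toUint8Array(Buffer.from('example')));
console.log(hash.digest('hex'));
```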

- const { bucket, key } = granuleFile;
+ const { bucket, key, size } = granuleFile;

const thresholdMb = Number(process.env.MULTIPART_CHECKSUM_THRESHOLD_MEGABYTES);
Contributor

@paulpilone paulpilone Feb 3, 2026

This (and partMb) could end up as NaN if the env vars aren't set.

Author

Addressed this in the most recent commit; it now defaults to 0 if not set, which disables partitioning.
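A minimal sketch of that defaulting behavior, with an assumed helper name:

```typescript
// Parse a megabyte-valued env var; default to 0 (partitioning disabled) when
// it is unset, empty, or not a positive number.
const envMegabytes = (name: string): number => {
  const value = Number(process.env[name] || 0);
  return Number.isFinite(value) && value > 0 ? value : 0;
};

const thresholdMb = envMegabytes('MULTIPART_CHECKSUM_THRESHOLD_MEGABYTES');
const partMb = envMegabytes('MULTIPART_CHECKSUM_PART_MEGABYTES');
```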
