CUMULUS-4352: Implement multipart downloads in AddMissingFileChecksums core task #4211
Conversation
reweeden left a comment
As we talked about, it's quite disappointing that the JS SDK doesn't have the same canned multipart functionality as the Python one. Since it's quite complex to implement, it might actually be easier to just convert this task to Python instead so we can leverage s3transfer, which already has all the hiccups figured out.
Ideally, if we stick with JS, I was imagining we would implement a helper in the cumulus libraries that would act similarly to boto3's download_fileobj function, as this would likely be useful in other tasks as well (for instance SyncGranule, which has an option to do checksum validation). But given how much effort it's been, and since I'm not a JS expert, it's probably not worth trying to do that at this time.
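(For reference, a minimal sketch of the kind of helper being described, using sequential ranged GETs with the AWS SDK v3. `downloadFileToSink` and its `sink` callback are hypothetical names, not an existing Cumulus or SDK API; unlike s3transfer, this fetches parts sequentially rather than in parallel, which is part of why implementing the real thing is non-trivial.)

```ts
import { Readable } from 'stream';
import { S3Client, GetObjectCommand } from '@aws-sdk/client-s3';

// Sketch: stream an S3 object through a caller-provided sink one ranged
// GET at a time, loosely mirroring boto3's download_fileobj.
async function downloadFileToSink(
  s3: S3Client,
  bucket: string,
  key: string,
  totalSize: number,
  partSizeBytes: number,
  sink: (chunk: Buffer) => void
): Promise<void> {
  for (let start = 0; start < totalSize; start += partSizeBytes) {
    // HTTP Range is inclusive on both ends
    const end = Math.min(start + partSizeBytes, totalSize) - 1;
    const { Body } = await s3.send(new GetObjectCommand({
      Bucket: bucket,
      Key: key,
      Range: `bytes=${start}-${end}`,
    }));
    if (!Body) throw new Error(`empty response body for s3://${bucket}/${key}`);
    for await (const chunk of Body as Readable) {
      sink(chunk as Buffer);
    }
  }
}
```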
I think I agree with @reweeden here in that this functionality would be a good update to the S3 client's getObject so that multipart downloads could be utilized elsewhere. I don't think it would be a difficult migration from the code you've written so far. I have a few small comments, but I don't think they're relevant until we decide one way or another on where this should be implemented.
```diff
  const { s3, algorithm, granuleFile } = params;

- const { bucket, key } = granuleFile;
+ const { bucket, key, size } = granuleFile;
```
So what happens if size is null? Or would it be undefined? I'm not sure how it works in JS.
If the value is null then it will default to the usual single-stream method of calculating the checksum.
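In other words, roughly (computeMultipartChecksum and computeSingleStreamChecksum are hypothetical names for the two paths):

```ts
// size may be null or undefined on the granule file record;
// only a real number above the threshold takes the multipart path.
const useMultipart = typeof size === 'number' && size > thresholdBytes;
const checksum = useMultipart
  ? await computeMultipartChecksum(s3, bucket, key, size)
  : await computeSingleStreamChecksum(s3, bucket, key);
```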
```diff
  const partSizeBytes = partMb * 1024 * 1024;

+ const partitioningEnabled = (thresholdMb > 0 && partMb > 0)
+   && (granuleFile.size > thresholdBytes);
```
Why use granuleFile.size vs. just the size variable, which is the same?
Good catch, this was a small mixup. Fixed this in a recent commit.
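(Presumably the fix just uses the destructured size, with a guard for files that have no recorded size, along these lines:)

```ts
const partitioningEnabled = (thresholdMb > 0 && partMb > 0)
  && (size != null)            // skip files with no recorded size
  && (size > thresholdBytes);
```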
```diff
  import crypto from 'crypto';
  import { Granule, GranuleFile, HandlerInput, HandlerEvent } from './types';

+ process.env.AWS_MAX_ATTEMPTS ??= '3';
```
Could this result in setting those environment variables if not previously set? Maybe that's what you want, but it might be better to declare a local here and set it to 3 if, for example, process.env.AWS_MAX_ATTEMPTS is undefined.
e.g., the S3 client does this:

```ts
const s3JitterMaxMs = Number(process.env.S3_JITTER_MAX_MS || 0);
```

which is probably what you want to do here too.
This was intended to set the environment variable to 3 to enable multiple retries for GET requests, but I will ultimately just remove this; retries can be configured outside of index.ts.
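A sketch of the local-variable approach, following the S3_JITTER_MAX_MS pattern above; note that the AWS SDK v3 S3Client accepts maxAttempts and retryMode directly in its constructor config, so nothing needs to be written back to process.env:

```ts
import { S3Client } from '@aws-sdk/client-s3';

// Read retry settings into locals instead of mutating process.env
const maxAttempts = Number(process.env.AWS_MAX_ATTEMPTS || 3);
const retryMode = process.env.AWS_RETRY_MODE || 'standard';

// Both maxAttempts and retryMode are supported S3ClientConfig options
const s3 = new S3Client({ maxAttempts, retryMode });
```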
```diff
+ process.env.AWS_RETRY_MODE ??= 'standard';
+ process.env.S3_JITTER_MAX_MS ??= '500';

+ const updateHashFromBody = async (hash: crypto.Hash, body: any) => {
```
If possible it'd be good to type body here. If you wanted to be really strict you could probably grab the type that the S3 client's getObject defines -- or at a minimum type the .on usage so it's clear how it's being used. Not totally necessary since this is a somewhat unique situation -- I just get nervous when there are `any` types as parameters.
Addressed this in a recent commit; I typed the body so it's Readable | Buffer | Uint8Array.
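So the signature presumably ended up looking roughly like this (a sketch of the typed version, not the exact commit):

```ts
import crypto from 'crypto';
import { Readable } from 'stream';

const updateHashFromBody = async (
  hash: crypto.Hash,
  body: Readable | Buffer | Uint8Array
): Promise<void> => {
  if (body instanceof Readable) {
    // Consume the stream chunk by chunk
    for await (const chunk of body) {
      hash.update(chunk as Buffer);
    }
    return;
  }
  // Non-stream fallback; see the BinaryLike typing discussion below
  hash.update(new Uint8Array(body.buffer, body.byteOffset, body.byteLength));
};
```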
```diff
  }

+ // Fallback if it is a non-stream
+ hash.update(Buffer.isBuffer(body) ? body : Buffer.from(body));
```
So this throws the following typing error:

```text
Conversion of type 'Buffer' to type 'BinaryLike' may be a mistake because neither type sufficiently overlaps with the other. If this was intentional, convert the expression to 'unknown' first.
  Type 'Buffer' is not comparable to type 'Uint8Array<ArrayBufferLike> | DataView<ArrayBufferLike>'.
    Type 'Buffer' is not comparable to type 'Uint8Array<ArrayBufferLike>'.
      The types returned by 'slice(...).entries()' are incompatible between these types.
        Type 'IterableIterator<[number, number]>' is missing the following properties from type 'ArrayIterator<[number, number]>': map, filter, take, drop, and 9 more. ts(2352)
```
After some Googling (because I wasn't sure exactly how to type this), it seems like a reasonable solution here is to add a @ts-expect-error comment ("Buffer is valid BinaryLike but TypeScript types are mismatched"). Another solution is to double-cast through unknown.
Addressed this in the most recent commit; I was having a bunch of issues with the typing here. I ended up using an approach where I normalize the body into a Buffer and then wrap it in a Uint8Array, which seems to be working okay.
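i.e., something along these lines (a sketch inferred from that description, not the exact commit):

```ts
// Normalize to a Buffer, then hand hash.update() a plain Uint8Array
// view over the same bytes so it satisfies BinaryLike without a cast.
const buf = Buffer.isBuffer(body) ? body : Buffer.from(body);
hash.update(new Uint8Array(buf.buffer, buf.byteOffset, buf.byteLength));
```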
```diff
- const { bucket, key } = granuleFile;
+ const { bucket, key, size } = granuleFile;

+ const thresholdMb = Number(process.env.MULTIPART_CHECKSUM_THRESHOLD_MEGABYTES);
```
This (and partMb) would be NaN if the env vars aren't set.
Addressed this in the most recent commit; I made it default to 0 if not set, which disables partitioning.
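i.e., presumably:

```ts
// Defaults to 0 when the env var is unset; a 0 threshold disables partitioning
const thresholdMb = Number(process.env.MULTIPART_CHECKSUM_THRESHOLD_MEGABYTES || 0);
// partMb presumably gets the same treatment with its own env var
```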
Summary: Implements support for multipart downloads and checksum computation in AddMissingFileChecksums
Addresses CUMULUS-4352: Implement multipart downloads in AddMissingFileChecksums core task
Changes
PR Checklist
📝 Note: For most pull requests, please Squash and merge to maintain a clean and readable commit history.