New read api #233

Draft
cosmicexplorer wants to merge 5 commits into zip-rs:master from cosmicexplorer:new-read-api

Conversation


@cosmicexplorer cosmicexplorer commented Aug 11, 2024

Moved from #207 after weird github errors.

Problem

I believe there are several shortcomings that necessitate a slight reimplementation of much of src/read.rs, some involving breaking API changes:

  1. Propagating Send and/or Sync bounds from R to ZipFile<R> is literally impossible with the current public API that type-erases to dyn Read.
    • We could make two separate copies, one with dyn Read + Send, but that would be a lot of trouble and not worth it.
    • We could just assert that R is Send and always carry a dyn Read + Send, but that would be unsound and could introduce memory unsafety.
  2. ZipFile has a lot of stateful logic, including a ZipFileReader::NoReader state, even necessitating an invalid_state() helper method for error checking.
    • Notably (and imo inexcusably), ZipFile has a Drop impl which is only used when reading stream archives (to ensure we read out the rest of the entry from the stream), but this niche use case forces every reader struct in the crate to provide an .into_inner() method.
  3. To underline why this Drop impl for ZipFile is a bad idea: there is currently a bug in ZipStreamReader which drops the first ZipStreamFileMetadata entry, because read_zipfile_from_stream() does not save the contents of the last ZipLocalEntryBlock it reads, the one with spec::Magic::CENTRAL_DIRECTORY_HEADER_SIGNATURE.
    • Basically, streaming requires special handling, not being bolted onto ZipFile.
    • This change means we don't need to go through the ZipStreamVisitor interface, and we can have direct control over when to iterate through each entry of the streaming zip file. That's what zip::unstable::read::streaming provides.
    • Importantly, StreamingArchive maintains the internal state necessary to retain the bytes of the first metadata entry after we go through all the local file entries, fixing the bug in ZipStreamReader which drops the first metadata entry.
  4. Because streaming requires special handling and separate structs for entries, we also need to be able to get access to all the methods for accessing ZipFileData that we have on ZipFile, but for other structs. That's why this change introduces EntryData, which implements all the methods traditionally provided as member functions of ZipFile.
    • This is generally useful, as it allows us to avoid publicly exposing ZipFileData itself, and we can instead keep that to ourselves as an internal implementation detail.
    • In general, this gives us the ability to virtualize or introduce other indirections to access entry metadata, without changing the interface expected by end users, or having to reimplement any of our convenience methods for different types of zip backing archives.
  5. Finally, ZipFileReader::Compressed(Box<Crc32Reader<Decompressor<io::BufReader<CryptoReader<'a>>>>>) always goes through the logic for CryptoReader, when encrypted zip files are an extremely niche use case. We shouldn't be cluttering up our type definitions or our stack traces with encryption logic when most users are never going to touch it!
```rust
let final_reader = if data.encrypted {
    let crypto_variant = CryptoKeyValidationSource::from_data(data)?;
    let is_ae2_encrypted = crypto_variant.is_ae2_encrypted();
    let crypto_reader = crypto_variant.make_crypto_reader(password, content)?;
    let entry_reader =
        construct_decompressing_reader(&data.compression_method, crypto_reader)?;
    if is_ae2_encrypted {
        /* Ae2 voids crc checking: https://www.winzip.com/en/support/aes-encryption/ */
        CryptoEntryReader::Ae2Encrypted(entry_reader)
    } else {
        CryptoEntryReader::NonAe2Encrypted(NewCrc32Reader::new(entry_reader, data.crc32))
    }
} else {
    /* Not encrypted, so do the same as in .by_index_generic(): */
    let entry_reader = construct_decompressing_reader(&data.compression_method, content)?;
    CryptoEntryReader::Unencrypted(NewCrc32Reader::new(entry_reader, data.crc32))
};
```
```rust
/// Get a contained file by index
pub fn by_index_generic(
    &mut self,
    file_number: usize,
) -> ZipResult<ZipEntry<'_, impl Read + '_>> {
    let Self {
        ref mut reader,
        ref shared,
        ..
    } = self;
    let (_, data) = shared
        .files
        .get_index(file_number)
        .ok_or(ZipError::FileNotFound)?;
    let content = find_entry_content_range(data, reader)?;
    let entry_reader = construct_decompressing_reader(&data.compression_method, content)?;
    let crc32_reader = NewCrc32Reader::new(entry_reader, data.crc32);
    Ok(ZipEntry {
        data,
        reader: crc32_reader,
    })
}
```

That's the entire method! No jumping into decryption code paths for the much more common case of simple decompression with crc32 checking!

So that's my case for the API changes. We can do them in pieces, but I'm proposing this intentionally breaking change that I believe will improve the internal architecture and correctness, increase our ability to perform optimizations, and generate more flexible entry structs in our public API.
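To illustrate point 4, here is a minimal sketch of the EntryData idea — the struct names and the particular accessors are invented for illustration; the point is that metadata methods live on a trait shared by both the seekable and streaming entry types, so ZipFileData never needs to be public:

```rust
// Toy sketch: metadata accessors move onto a trait, so different entry
// structs share them without exposing the internal ZipFileData.
trait EntryData {
    fn name(&self) -> &str;
    fn compressed_size(&self) -> u64;
    // Convenience methods can be written once as trait defaults.
    fn is_dir(&self) -> bool {
        self.name().ends_with('/')
    }
}

// Hypothetical stand-ins for a seekable-archive entry and a streamed entry.
struct SeekableEntry { name: String, compressed_size: u64 }
struct StreamedEntry { name: String, compressed_size: u64 }

impl EntryData for SeekableEntry {
    fn name(&self) -> &str { &self.name }
    fn compressed_size(&self) -> u64 { self.compressed_size }
}

impl EntryData for StreamedEntry {
    fn name(&self) -> &str { &self.name }
    fn compressed_size(&self) -> u64 { self.compressed_size }
}

fn main() {
    let e = StreamedEntry { name: "docs/".into(), compressed_size: 0 };
    // Both entry types get is_dir() and friends for free.
    assert!(e.is_dir());
    println!("{} dir={}", e.name(), e.is_dir());
}
```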

More Specifically

This section was copied from #207.

As described in gyscos/zstd-rs#288 (comment), our usage of &'a mut dyn Read in ZipFileReader stops us from being able to impl Send or Sync, or generally from sending a ZipFile to a separate thread from the one owning the ZipArchive it belongs to. While this is usually not an issue, since the single synchronous Read + Seek handle makes it difficult to farm out work to subthreads, for the purposes of #72 we would like to be able to reuse some code instead of creating an entirely separate implementation. (Note: #72 does not actually use this code, so that particular argument no longer holds. It remains true, however, that other code using this crate, especially in async contexts, could really use the ability to impose Send bounds, which ZipFile is unable to satisfy.)
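The Send problem can be shown in miniature. The two structs below are toy stand-ins (not the crate's real types): holding a type-erased `&mut dyn Read` blocks Send entirely, while a parameterized reader propagates auto-traits automatically.

```rust
use std::io::Read;

#[allow(dead_code)]
struct DynEntry<'a> {
    reader: &'a mut dyn Read, // no `+ Send`, so DynEntry can never be Send
}

#[allow(dead_code)]
struct GenericEntry<'a, R> {
    reader: &'a mut R, // Send is derived automatically whenever R: Send
}

// Compile-time probe: only accepts types that implement Send.
fn assert_send<T: Send>() {}

fn main() {
    // Compiles: &mut Cursor<Vec<u8>> is Send.
    assert_send::<GenericEntry<'static, std::io::Cursor<Vec<u8>>>>();
    // This line would fail to compile, which is the core of the complaint:
    // assert_send::<DynEntry<'static>>();
    println!("GenericEntry is Send");
}
```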

That brings us to the second set of issues, surrounding zipcrypto support:

  • all of our readers go through the CryptoReader enum and the make_crypto_reader() method, even though zip crypto is a very uncommon use case,
  • ZipFile itself contains some very complex mutable state with both a ZipFileReader and a CryptoReader, even though ZipFileReader itself also contains a CryptoReader,
  • Crc32Reader has to maintain a special flag to disable CRC checking in the special case of AE2 encryption, which is a sign of a leaky abstraction and could easily let a missed CRC check slip into other code paths by accident in the future.
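The last point can be sketched concretely: with an enum wrapper, the AE2 case skips validation by construction rather than via a flag buried inside the checking reader. A toy byte-sum checksum stands in for CRC32 here, and the type names only loosely mirror the PR's CryptoEntryReader:

```rust
use std::io::{self, Read};

// Checking reader with no "disable" flag: if you have one, it checks.
struct ChecksumReader<R> {
    inner: R,
    sum: u32,
    expected: u32,
}

impl<R: Read> Read for ChecksumReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let n = self.inner.read(buf)?;
        for &b in &buf[..n] {
            self.sum = self.sum.wrapping_add(u32::from(b));
        }
        // At EOF, validate the running checksum.
        if n == 0 && self.sum != self.expected {
            return Err(io::Error::new(io::ErrorKind::InvalidData, "checksum mismatch"));
        }
        Ok(n)
    }
}

#[allow(dead_code)]
enum EntryReader<R> {
    Ae2Encrypted(R),            // AE2 voids CRC checking, by construction
    Checked(ChecksumReader<R>), // everything else is always validated
}

impl<R: Read> Read for EntryReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        match self {
            EntryReader::Ae2Encrypted(r) => r.read(buf),
            EntryReader::Checked(r) => r.read(buf),
        }
    }
}

fn main() -> io::Result<()> {
    let data = b"hello";
    let expected: u32 = data.iter().map(|&b| u32::from(b)).sum();
    let mut r = EntryReader::Checked(ChecksumReader { inner: &data[..], sum: 0, expected });
    let mut out = Vec::new();
    r.read_to_end(&mut out)?;
    assert_eq!(out, data);
    println!("checksum ok");
    Ok(())
}
```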

Solution

Luckily, all of this becomes significantly easier without too many changes:

  • Create src/unstable/read.rs to reimplement read::ZipFile with the new unstable::read::ZipEntry.
  • Make EntryReader and ZipEntry with a parameterized reader type R instead of an internal &'a mut dyn Read.
  • Make new versions of e.g. ZipArchive::by_index() returning ZipResult<ZipEntry<'_, impl Read + '_>> to avoid leaking internal structs while retaining the ability to rearrange the layout of our internal structs.
  • Vastly improve readability of CryptoReader creation through CryptoVariant.

This also leads to a significantly improved refactoring of streaming support in the streaming submodule of src/unstable/read.rs.
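In miniature, the pull-based shape looks like this — StreamingArchive and next_entry follow the names in the PR description, but the body here iterates a toy in-memory list rather than parsing local file headers from a real zip stream:

```rust
struct StreamingArchive<I> {
    entries: I,
}

impl<I: Iterator<Item = String>> StreamingArchive<I> {
    // The caller decides when to advance, unlike the ZipStreamVisitor
    // interface where the library drives the callbacks.
    fn next_entry(&mut self) -> Option<String> {
        self.entries.next()
    }
}

fn main() {
    let names = vec!["a.txt".to_string(), "b.txt".to_string()];
    let mut archive = StreamingArchive { entries: names.into_iter() };
    while let Some(entry) = archive.next_entry() {
        println!("{}", entry);
    }
}
```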

Result

zipcrypto support doesn't muck up the non-crypto methods, ZipEntry is now Send, and src/unstable/read.rs is significantly easier to read with equivalent functionality. This will have to be a breaking change in order to achieve Send-ability, but I believe the readability/maintainability is substantially improved with this change.

TODO

  • Make ZipFile wrap a dyn Read handle produced from ZipEntry, so there's no CryptoReader type in our non-crypto code paths.
  • Make ZipStreamReader use StreamingArchive internally (generating ZipFile instances from the ZipEntrys it produces).

```rust
if self.check == res {
    Ok(())
} else {
    /* TODO: make this into our own Crc32Error error type! */
```

Check notice — Code scanning / devskim: A "TODO" or similar was left in source code, possibly indicating incomplete functionality (Suspicious comment)
@cosmicexplorer added the "rust" label (Pull requests that update Rust code) on Aug 11, 2024

cosmicexplorer commented Aug 11, 2024

I've just added the new API to the fuzz testing in 568fcab, and benchmarks confirming it achieves the same performance as the current API in 2731e7e. These benchmarks also show that decompressing entries with the stream API is just as fast as iterating over entries from a ZipArchive. Note that I made zip::read::stream fully public in order to benchmark it against my new zip::unstable::read::streaming API, as it was both unused beforehand and marked as pub(crate).

@cosmicexplorer cosmicexplorer force-pushed the new-read-api branch 6 times, most recently from 7f56cfe to e882a72 Compare August 11, 2024 19:10

Pr0methean commented Aug 11, 2024

Re point 3: I'm confused about why this is a bug. If the block has spec::Magic::CENTRAL_DIRECTORY_HEADER_SIGNATURE, aren't we supposed to ignore it, since the metadata in streaming mode comes from local headers?

@cosmicexplorer cosmicexplorer mentioned this pull request Aug 12, 2024
@Pr0methean

Re point 3: I'm confused about why this is a bug. If the block has spec::Magic::CENTRAL_DIRECTORY_HEADER_SIGNATURE, aren't we supposed to ignore it, since the metadata in streaming mode comes from local headers?

@cosmicexplorer I'm not sure I can properly review this PR without an answer to this question.

@cosmicexplorer

Re point 3: I'm confused about why this is a bug. If the block has spec::Magic::CENTRAL_DIRECTORY_HEADER_SIGNATURE, aren't we supposed to ignore it, since the metadata in streaming mode comes from local headers?

That's correct, for the case in which we are only interested in local blocks, but it causes us to drop the first metadata block when we are interested in the visit_additional_metadata() method for ZipStreamVisitor from src/read/stream.rs. That is currently not an exposed API, so there's currently no harm and no bug exposed to users, but I am using it as an example of how read_zipfile_from_stream() itself cannot be used to implement ZipStreamReader without dropping that first central directory block.

This is actually not an indictment of src/read/stream.rs, as ZipStreamReader actually demonstrates exactly the interface we want--in fact, what I should have done is probably just to implement ZipStreamReader in terms of unstable::read::streaming::StreamingArchive. However StreamingArchive makes use of ZipEntry (with the parameterized reader instead of the &mut dyn Read), so it's hard to see how to do that in a compatible way.

I will think about whether ZipFile can be made to store a dyn Read handle produced from the ZipEntry code in unstable/read.rs. That would be a really interesting way to use this code internally without breaking the API.
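A minimal sketch of that idea, with a hypothetical helper name and BufReader standing in for the real decompression/CRC layers: build the reader chain generically, then erase it to `Box<dyn Read>` so the old type-erased ZipFile signature can stay unchanged.

```rust
use std::io::{self, BufReader, Cursor, Read};

// Hypothetical: construct the concrete reader chain from a generic R, then
// type-erase the result. A ZipFile-like wrapper could store this box.
fn make_entry_reader<'a, R: Read + 'a>(raw: R) -> Box<dyn Read + 'a> {
    Box::new(BufReader::new(raw))
}

fn main() -> io::Result<()> {
    let mut reader = make_entry_reader(Cursor::new(b"data".to_vec()));
    let mut out = String::new();
    reader.read_to_string(&mut out)?;
    assert_eq!(out, "data");
    println!("{}", out);
    Ok(())
}
```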

@cosmicexplorer

I guess I should say another thing for context: this shouldn't really be considered a priority, and it doesn't block any current work on the crate (including #208 or #235). I continue to think it's a useful refactoring change, but given that it necessarily requires introducing a breaking API change, it might be worth leaving on the back burner for now and coming back to when there are other breaking changes we want to introduce with a major version bump.

I was very interested to see whether it would produce any performance improvement (maybe it would when operating over an in-memory Cursor instead of File i/o?), but a vtable call through dyn Read is really cheap, so I think the most important angle is that ZipEntry from this PR can impl Send (in fact it does so automatically), which the current ZipFile API is unable to do without introducing unsoundness. I believe this generally means a ZipFile can't be transferred across await points, but since a ZipFile also contains a mutable reference to the parent ZipArchive anyway (ZipEntry does too, since the returned impl Read takes a mutable reference to the read stream R from ZipArchive<R>), I don't believe anyone using this crate in async code is able to operate over ZipFile entries separate from parent ZipArchives anyway. So the Send impl isn't really too important in practice, especially since work like #208 (introducing more high-level abstractions to enable our own parallelism) is likely to achieve whatever people might want an async API for (this assumption may be wrong).

The part that I like the most about this (and the reason for its creation) is how the parameterized R: Read stream enables us to avoid polluting our ZipFile type with any knowledge of the CryptoReader (because I feel that that code is especially difficult to read and extend, especially e.g. requiring us to insert a boolean flag in Crc32Reader). There is a secondary benefit too that we do not need a Drop impl for ZipFile which is only used for streaming.

Given the idea I just had above (making ZipFile wrap a dyn Read handle using the code from ZipEntry), I'm going to see whether we can get most of the benefit of this change without breaking the API:

  • Make ZipFile wrap a dyn Read handle produced from ZipEntry, so there's no CryptoReader type in our non-crypto code paths.
  • Make ZipStreamReader use StreamingArchive internally (generating ZipFile instances from the ZipEntrys it produces).

I'm going to mark this as draft until I can do the above, and ping you when I've got something for review.

@cosmicexplorer cosmicexplorer marked this pull request as draft August 17, 2024 18:31
- also use git dep for zstd to pull in removal of R: BufRead bound
- move new methods into unstable/read.rs
- support getting aes verification info from a raw ZipEntry
- flesh out other methods from ZipFile -> ZipEntry
- now create a streaming abstraction
- move some methods around
- make AesModeInfo pub(crate) to make its intention clear
- make ArchiveEntry more publicly accessible
- make ZipFile impl EntryData
@Pr0methean left a comment

Sorry that I didn't have time to finish this review. Could you please resolve the merge conflicts, and then either mark this PR as ready to merge or provide an update on your plans for it?

```rust
 * tests/data/generate-king-lear-zip.py.
 */
fn get_test_data() -> ZipResult<ZipArchive<fs::File>> {
    let path = Path::new(env!("CARGO_MANIFEST_DIR")).join("tests/data/king-lear-compressed.zip");
```
I'd prefer include_bytes! here, to eliminate the type of disk and the state of the filesystem as potential sources of noise.


cosmicexplorer commented Mar 19, 2026

I reread my wording here and it's really remarkably rude. I am deeply sorry for subjecting you to that. I very frequently report to many people how thankful I am to work with someone who is an expert and so ready to listen to larger changes like the parallel extraction work. I found a repo today pinned to an extremely old version and I felt deeply glad you have put in the work to make this crate robust.

I have been overwhelmed with the job search for a while, and I also find myself pulled in the direction of compression and archiving. I am convinced zstd is not as clever as its author claims, and that Jarek Duda's work on tANS can be made less obfuscated. zstd seems to work, but the multiple separate c2rust projects are not going to solve anything. There's also the matter of their very bold attempt to warn the reader about "watermarking" (not asking you to take an opinion on this, just explaining why I keep going dormant): https://amass.energy/rustdoc/yzx-unstable/yzx_core/frame/skippable/#privacy-risk-watermarking

I'm also pretty confident that the zstd frame format enables a length extension attack for .tar.zst files, by incorporating a sequence of RLE blocks of length 1 followed by whatever the next byte you want to insert is. Note that the zstd stream format has two distinct RLE blocks: one that can occur within the frame container, and one that is more "top-level". I began with the frame format at https://codeberg.org/cosmicexplorer/corporeal/src/commit/edb8a0e222c2b72e7d499f71b4a1b219cd302a76/yzx/core/src/frame.rs#L333 (they have very recently added more content to the RFC to describe the format of dictionaries, but the dictionaries may support prefix extension). For these and previous reasons I have decided it is not a dependency I would trust with my life, and decided to address that gap.

I am likely not able to make claims to specific work on this repo for the next few months until I have proven zstd is missing something or I am. I just remembered this communication to you and felt ashamed. I think there are good ideas here and I know I can and will follow up.

Actually, I would like to mention one further curious development (not recent). Part of the reason my contributions have been inconstant stems from the python packaging PEP contributors, and from some points regarding security that I cannot possibly solve:

This next part is pertinent to the zip crate because of public statements made by PSF regarding the zip format. As background, I invented in 2020 every technique which is now standard practice to improve python resolve performance, and gave a conference talk about it in 2023: https://web.archive.org/web/20250218154403/https://cfp.packaging-con.org/2023/talk/hpuhu7/. This next part is security-pertinent:

The astral private repo is more than my personal grousing. It would be the week right before this big product launch that pypi and astral performed a coordinated disclosure of a zip file CVE that is completely made up from whole cloth: https://astral.sh/blog/uv-security-advisory-cve-2025-54368

https://blog.pypi.org/posts/2025-08-07-wheel-archive-confusion-attacks/
I do enjoy the term "confusion" here given that I remain unable to find a repro for it. One particular feature that you or some other contributor had in place already in the unstable API is a clear indication that the local and central headers intentionally support different data. I do not believe there is any change to make for us except to note that specifically google has somehow decided zip is problematic to them.

Possibly the main reason any of this response is remotely justified is this written report from the pypi security developer in residence, who constructs an entire universe (he even mentions fuzzing at the end: https://alpha-omega.dev/wp-content/uploads/sites/22/2025/10/ao_wp_102725a.pdf) in which zip is unsafe and this crate does not exist.

Astral also funds one of the c2rust zstd impls https://trifectatech.org/initiatives/data-compression/ (which I really must impress upon you are not actual rust crates but a facsimile that no one maintains and no one understands at any level). This one has multiple rustc devs working on it, none of whom I am familiar with.

However, since I have decided all this information is security-relevant but already public, it's also worth noting that ruzstd, which is a runtime dep of every rust binary (for backtraces), recently began to engage in a farcical degree of data falsification: KillingSpark/zstd-rs#91 (comment). Not a security concern as of yet, but not appropriate for the stdlib, and disappointing given that the crate seems to have written their own rust code for it.

I guess there is one more point worth considering. This PR: rust-lang/rust#140525:

I do not seriously expect astral to ever surface any competitor to the zip format, but my whole conference talk described why I/O is far broader than straight-line performance and directly challenged cargo to improve (they did in fact fix the git cloning shit after that). If I ever see claims about zip file security overstated to sell a specific alternative product, I will onboard myself again ASAP to do whatever I can, because neither users nor enterprises benefit from format wars.

However, all this has made working with rust into an anxiety-inducing prospect. I restricted myself to this targeted feedback: rust-lang/rust#138475 (comment).

Oh, and Andrew Gallant is great to work with btw and writes very good documentation. The ripgrep debug "perf" claim is not his work at all.

I sincerely believe this information is all relevant to you in your professional capacity, and I do not request anything of you. I have learned much from our work together.
