This is a preliminary analysis, something I hacked together in an afternoon using open source tools and standard forensics techniques. This should be considered a roadmap for deeper analysis, NOT definitive conclusions.
In late 2025, the U.S. Department of Justice published a set of documents under the Epstein Files Transparency Act (EFTA / H.R.4405). The DOJ has been releasing these files in tranches (Volumes 1-8 so far).
Volume 8 was first published on Monday, Dec 21st 2025, and appears to have existed in at least two "builds": an earlier posting (yanked from the website) and a later replacement, with an unknown number of files removed or modified.
I’m a developer with a few hours on my hands and an unfortunate amount of curiosity. When I saw the DOJ publish Volume 8, yank it, and publish it again, I wanted to know what actually changed—and what only looked like it changed because of how PDFs get produced.
For clarity, I refer to the two builds as:
- Vol 8 v1: an earlier build (obtained via DDoSecrets, apparently mirroring the original DOJ posting)
- Vol 8 v2: the later replacement build (downloaded directly from the DOJ website)
There has been speculation that the earlier build may have contained redaction problems or exposed PII. I can’t prove that from this analysis alone. What I can do is show my work, point to the small set of files that deserve scrutiny, and explain why this problem is more annoying than it looks.
Also, see the excellent forensic analysis of DOJ's eDiscovery tooling and PDF production pipeline by Peter Wyatt of the PDF Association. I wish I had discovered Peter's analysis before I did my work; it would have saved me a lot of experimentation! This post takes Peter's analysis further and attempts to provide a roadmap to answering some of the questions he poses.
If you download two ZIPs from two different times and ask “are these the same files?”, the obvious instinct is to compare the bytes. For this dataset, that approach mostly measures the publishing pipeline, not the underlying documents. The pipeline is the prankster here.
Here’s what seems to be going on:
- The page numbers you see (EFTA…) are not stable IDs. The pipeline appears to reassign and burn those Bates-like stamps into pages, so the same document can be re-numbered between builds.
- Scans get reprocessed. If a PDF contains scanned pages, the embedded OCR text layer can change from run to run—tiny spacing/line-break/character changes—without any visible change to the page image.
- PDFs get rewritten internally. Even if the visible page is identical, the file can be re-saved with different metadata and internal structure (object ordering, xref tables, incremental updates).
That’s why this write-up leans on a mix of text-based grouping and old-fashioned visual spot checks.
I didn’t want this to be a vibes-only comparison of two ZIP files, so I treated it like a small forensics project. The goal wasn’t “do the PDFs have the same bytes?”—it was “do they carry the same underlying documents?”
A lot of PDFs aren’t written once. They can be saved with incremental updates, which means later edits get appended to the end of the file. That can leave a kind of “version history” inside a single PDF: v1, v2, sometimes v3, all stacked together. In Volume 8 this shows up in practice as a pipeline that appears to do things like “add OCR text” and “apply Bates-like stamps” in later passes.
So I built a tool that can export those internal revisions into a clean on-disk structure (think versions/v0001, versions/v0002, …), and extract per-page text for each version.
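To make the idea concrete, here is a minimal sketch of how incremental revisions could be carved out of a PDF. It is not the tool described above. It assumes that every saved revision ends with an `%%EOF` marker, so truncating the file just after each marker yields a candidate earlier revision. The names `split_revisions` and `export_revisions` are hypothetical, and per-page text extraction (which needs a PDF library) is left out.

```python
import re
from pathlib import Path

# Assumption: each incremental save ends with "%%EOF" plus an optional newline.
EOF_MARKER = re.compile(rb"%%EOF[ \t]*(?:\r\n|\r|\n)?")


def split_revisions(pdf_bytes: bytes) -> list[bytes]:
    """Return one byte string per candidate revision, oldest first.

    Each candidate is the original file truncated just after an %%EOF marker.
    """
    return [pdf_bytes[: m.end()] for m in EOF_MARKER.finditer(pdf_bytes)]


def export_revisions(pdf_path: Path, out_dir: Path) -> list[Path]:
    """Write versions/v0001/<name>, versions/v0002/<name>, ... under out_dir."""
    written = []
    blobs = split_revisions(pdf_path.read_bytes())
    for i, blob in enumerate(blobs, start=1):
        target = out_dir / "versions" / f"v{i:04d}" / pdf_path.name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(blob)
        written.append(target)
    return written
```

This is deliberately naive: an `%%EOF` byte sequence can also occur inside stream data, and linearized PDFs may carry an extra marker near the start of the file, so each candidate should be opened with a real PDF parser before it is trusted as a revision.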
Two things make naive comparisons explode:
- Bates-like EFTA stamps change the visible page text and the OCR text.
- OCR churn changes whitespace and characters even when the scan looks identical.
To make a one-to-one comparison possible, I compute per-page and per-document text signatures after a simple normalization pass:
- strip EFTA########-style tokens (the Bates-like layer)
- normalize whitespace (collapse the OCR’s micro-chaos into something stable enough to compare)
That gives me two useful identifiers:
- a raw hash (the actual PDF bytes; great for integrity, terrible for equivalence)
- a normalized content hash (far better for “is this the same document?”)
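The two identifiers can be sketched roughly as follows. This is an illustration under assumptions, not the exact code used for this analysis: the regex assumes the stamps are the literal `EFTA` followed by eight digits, and the function names (`normalize_text`, `page_signature`, `document_signature`, `raw_hash`) are made up for the example. Page text is assumed to have been extracted already by some PDF library.

```python
import hashlib
import re

# Assumption: Bates-like stamps look like "EFTA" followed by exactly 8 digits.
BATES_TOKEN = re.compile(r"EFTA\d{8}")
WHITESPACE = re.compile(r"\s+")


def normalize_text(text: str) -> str:
    """Drop EFTA stamps, then collapse every whitespace run to one space."""
    text = BATES_TOKEN.sub(" ", text)
    return WHITESPACE.sub(" ", text).strip()


def raw_hash(pdf_bytes: bytes) -> str:
    """SHA-256 of the file bytes: good for integrity, not for equivalence."""
    return hashlib.sha256(pdf_bytes).hexdigest()


def page_signature(page_text: str) -> str:
    """SHA-256 of one page's normalized text."""
    return hashlib.sha256(normalize_text(page_text).encode("utf-8")).hexdigest()


def document_signature(page_texts: list[str]) -> str:
    """SHA-256 over the ordered page signatures of a document."""
    digest = hashlib.sha256()
    for text in page_texts:
        digest.update(page_signature(text).encode("ascii"))
    return digest.hexdigest()
```

With this, two pages that differ only in their stamp and spacing hash to the same page signature, while their raw hashes still differ. Character-level OCR differences are not absorbed by this normalization, which is one reason a visual check is still needed.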
Once you have signatures, you can treat files like a content-addressable store: same content → same ID, even if filenames differ. I threw together a small SQLite “object store” that tracks a few dimensions per file:
- collection (v1 vs v2)
- file path + raw SHA-256
- PDF version count (v0001/v0002/v0003)
- per-page and per-document normalized text signatures
It's not fancy, just enough structure to ask better questions than “did the bytes change?”
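A store along these lines might look like the sketch below. The table and column names are assumptions chosen for illustration, not the actual schema, and `unmatched_in_v1` is a hypothetical helper showing the kind of question the store makes easy: which v1 files have no v2 file with the same normalized document signature?

```python
import sqlite3

# Hypothetical schema: one row per file, one row per page signature.
SCHEMA = """
CREATE TABLE IF NOT EXISTS files (
    id            INTEGER PRIMARY KEY,
    collection    TEXT NOT NULL,      -- 'v1' or 'v2'
    path          TEXT NOT NULL,
    raw_sha256    TEXT NOT NULL,
    version_count INTEGER NOT NULL,   -- internal PDF revisions found
    doc_signature TEXT NOT NULL,      -- normalized per-document signature
    UNIQUE (collection, path)
);
CREATE TABLE IF NOT EXISTS pages (
    file_id        INTEGER NOT NULL REFERENCES files(id),
    page_number    INTEGER NOT NULL,
    page_signature TEXT NOT NULL,
    PRIMARY KEY (file_id, page_number)
);
CREATE INDEX IF NOT EXISTS idx_files_doc_sig ON files(doc_signature);
"""


def open_store(db_path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(db_path)
    conn.executescript(SCHEMA)
    return conn


def add_file(conn, collection, path, raw_sha256, version_count,
             doc_signature, page_signatures):
    """Insert one file and its per-page signatures; return the new file id."""
    cur = conn.execute(
        "INSERT INTO files (collection, path, raw_sha256, version_count,"
        " doc_signature) VALUES (?, ?, ?, ?, ?)",
        (collection, path, raw_sha256, version_count, doc_signature),
    )
    file_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO pages (file_id, page_number, page_signature)"
        " VALUES (?, ?, ?)",
        [(file_id, n, sig) for n, sig in enumerate(page_signatures, start=1)],
    )
    conn.commit()
    return file_id


def unmatched_in_v1(conn) -> list[str]:
    """Paths of v1 files whose document signature never appears in v2."""
    rows = conn.execute(
        "SELECT a.path FROM files a WHERE a.collection = 'v1'"
        " AND NOT EXISTS (SELECT 1 FROM files b WHERE b.collection = 'v2'"
        " AND b.doc_signature = a.doc_signature) ORDER BY a.path"
    ).fetchall()
    return [row[0] for row in rows]
```

Keying the comparison on the signature rather than the path is what lets a renumbered file in v2 still count as a match for its v1 counterpart.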
Some PDFs are basically photographs of paper. When OCR is noisy (or missing), text matching can fail even if the scans are identical. The DOJ's OCR pipeline is sometimes subpar (which also means that if you're indexing these PDFs and trusting the embedded OCR layers, you're missing stuff).
So I also used a simple perceptual image hash (dHash): render page 1 as an image, compute a tiny fingerprint, and then search for near-matches. It’s not perfect, but it’s a great way to turn “maybe this got renumbered” into a short shortlist you can eyeball in a few minutes.
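For readers unfamiliar with dHash, here is a small dependency-free sketch of the technique. It assumes the page has already been rendered to a grayscale grid (a list of rows of pixel values) by whatever PDF renderer you prefer; that rendering step is not shown. The helper names `shrink`, `dhash`, and `hamming` are illustrative, and the box-average downscale stands in for a proper image-library resize.

```python
def shrink(gray, width, height):
    """Box-average a 2D grayscale grid down to width x height cells."""
    src_h, src_w = len(gray), len(gray[0])
    out = []
    for y in range(height):
        y0 = y * src_h // height
        y1 = max(y0 + 1, (y + 1) * src_h // height)
        row = []
        for x in range(width):
            x0 = x * src_w // width
            x1 = max(x0 + 1, (x + 1) * src_w // width)
            cells = [gray[j][i] for j in range(y0, y1) for i in range(x0, x1)]
            row.append(sum(cells) / len(cells))
        out.append(row)
    return out


def dhash(gray, hash_size=8):
    """Difference hash: one bit per horizontally adjacent pair of cells.

    The image is reduced to (hash_size + 1) x hash_size cells; a bit is 1
    when the left cell is brighter than its right neighbour.
    """
    small = shrink(gray, hash_size + 1, hash_size)
    bits = 0
    for row in small:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")
```

Two first pages whose hashes are only a few bits apart are candidates for the same scan. The cutoff for "a few" is a judgment call that should be tuned against known matches, and every candidate still deserves a look by eye.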
A quick text-based comparison suggests the releases are overwhelmingly the same. The catch is that scans are tricky: the OCR text layer can change even when the pages look identical.
In the small set of “v2-only” items I checked closely, the pages looked the same; the differences appeared consistent with OCR noise, not visible redaction changes.
The list below is the part that matters if you’re trying to understand the “publish → yank → republish” story: a small set of PDFs from v1 for which I still found no v2 equivalent after targeted searches and visual comparisons.
These files are included in vol1/ under their original filenames, along with images of the first two pages for quick preview.
| File | Pages | EFTA range (as seen in text layer) | What it is (skim) |
|---|---|---|---|
| EFTA00019416.pdf | 8 | EFTA00019416–EFTA00019423 | internal email chain between a prosecutor and a third party about serving a court subpoena to Mar‑a‑Lago (records custodian + employment records), including coordination with Trump Organization compliance counsel. (digital email thread, not handwritten.) |
| EFTA00021043.pdf | 303 | EFTA00021043–EFTA00021345 | a long court transcript / stenographic record labeled “Voir Dire” in United States v. Ghislaine Maxwell, 20-cr-330 (AJN). (transcript, not handwritten.) |
| EFTA00023791.pdf | 4 | EFTA00023791–EFTA00023794 | short email chain about serving a trial subpoena to Mar‑a‑Lago / custodian of records (similar to other Mar‑a‑Lago subpoena coordination docs). (digital email thread.) |
| EFTA00039346.pdf | 2 | EFTA00039346–EFTA00039347 | operational email “Re: BoP interviews” laying out a list of Bureau of Prisons personnel interviews to be conducted, including which interview teams will serve subpoenas afterward and how to avoid press at MCC. (digital email.) |
| EFTA00039350.pdf | 3 | EFTA00039350–EFTA00039352 | FBI NY email thread titled “FW: MCC CCV” discussing access to a case / system (“TTK access”), referencing a separate case number 282B-NY-3156749. (digital email, not handwritten.) |
| EFTA00039353.pdf | 3 | EFTA00039353–EFTA00039355 | FBI NY email thread related to MCC CCV / access provisioning (“TTK access”) and a referenced separate case number 282B-NY-3156749; appears to be follow-up/confirmation (“I will have you added. Will you need training as well?”). (digital email, not handwritten.) |
| EFTA00039368.pdf | 1 | EFTA00039368 | essentially blank/placeholder page (text extraction shows mostly punctuation/dots + Bates stamp). (likely scanned/image-backed page with minimal text layer.) |
| EFTA00039373.pdf | 1 | EFTA00039373 | essentially blank/placeholder page (OCR shows mostly punctuation/noise + Bates stamp). (likely scanned/image-backed page with minimal text layer.) |
| EFTA00039374.pdf | 5 | EFTA00039374–EFTA00039378 | logistics/operations email thread about coordinating FedEx packages (count, type, weights; “pelicans”, “tuff box”), plus a note about team food costs. (digital email, not handwritten.) |
| EFTA00039379.pdf | 2 | EFTA00039379–EFTA00039380 | short email exchange; includes a link to a Splinter News article about Epstein (“splinternews.com/i-wonder-why-jeffrey-epstein-reportedly-shipped-himself-…”). (digital email.) |
| EFTA00039385.pdf | 1 | EFTA00039385 | near-blank / placeholder page (very little extracted text). (scanned/image-backed page with minimal text layer.) |
| EFTA00039386.pdf | 1 | EFTA00039386 | effectively a near-blank / placeholder page. (looks like a scanned/image-backed page with minimal/no text layer.) |
- What it appears to be: internal email chain between a prosecutor and a third party about serving a court subpoena to Mar‑a‑Lago (records custodian + employment records), including coordination with Trump Organization compliance counsel.
- Document type: digital email thread (not handwritten).
- Pages: 8
- EFTA range (from text layer): EFTA00019416–EFTA00019423
- PDF: vol1/EFTA00019416.pdf
- What it appears to be: a long court transcript / stenographic record labeled “Voir Dire” in United States v. Ghislaine Maxwell, 20-cr-330 (AJN).
- Document type: transcript (not handwritten).
- Pages: 303
- EFTA range (from text layer): EFTA00021043–EFTA00021345
- PDF: vol1/EFTA00021043.pdf
- What it appears to be: short email chain about serving a trial subpoena to Mar‑a‑Lago / custodian of records (similar to other Mar‑a‑Lago subpoena coordination docs).
- Document type: digital email thread.
- Pages: 4
- EFTA range (from text layer): EFTA00023791–EFTA00023794
- PDF: vol1/EFTA00023791.pdf
- What it appears to be: operational email “Re: BoP interviews” laying out a list of Bureau of Prisons personnel interviews to be conducted, including which interview teams will serve subpoenas afterward and how to avoid press at MCC.
- Document type: digital email.
- Pages: 2
- EFTA range (from text layer): EFTA00039346–EFTA00039347
- PDF: vol1/EFTA00039346.pdf
- What it appears to be: FBI NY email thread titled “FW: MCC CCV” discussing access to a case / system (“TTK access”), referencing a separate case number 282B-NY-3156749.
- Document type: digital email (not handwritten).
- Pages: 3
- EFTA range (from text layer): EFTA00039350–EFTA00039352
- PDF: vol1/EFTA00039350.pdf
- What it appears to be: FBI NY email thread related to MCC CCV / access provisioning (“TTK access”) and a referenced separate case number 282B-NY-3156749; appears to be follow-up/confirmation (“I will have you added. Will you need training as well?”).
- Document type: digital email (not handwritten).
- Pages: 3
- EFTA range (from text layer): EFTA00039353–EFTA00039355
- PDF: vol1/EFTA00039353.pdf
- What it appears to be: a photo of Epstein's cell door with caution tape over it and a handwritten sign. The OCR layer is chaotic: text extraction shows mostly punctuation/dots plus the Bates stamp, which is not uncommon for PDFs containing photos that themselves include text or glyphs.
- Document type: likely scanned/image-backed page with minimal text layer.
- Pages: 1
- EFTA range (from text layer): EFTA00039368
- PDF: vol1/EFTA00039368.pdf
- What it appears to be: a picture of a tropical island. The OCR layer shows mostly punctuation/noise plus the Bates stamp. For this PDF and the preceding one, it's possible that a match exists in the v2 Volume 8 drop and my script simply didn't find it, given that I hacked this together in an afternoon.
- Document type: likely scanned/image-backed page with minimal text layer.
- Pages: 1
- EFTA range (from text layer): EFTA00039373
- PDF: vol1/EFTA00039373.pdf
- What it appears to be: logistics/operations email thread about coordinating FedEx packages (count, type, weights; “pelicans”, “tuff box”), plus a note about team food costs.
- Document type: digital email (not handwritten).
- Pages: 5
- EFTA range (from text layer): EFTA00039374–EFTA00039378
- PDF: vol1/EFTA00039374.pdf
- What it appears to be: short email exchange between who I presume is AUSA Alex Rossmiller (pictured in the courtroom sketch photo) and a colleague, where (presumably) Rossmiller jokes about his posture; includes a link to a Splinter News article about the Epstein trial (“splinternews.com/i-wonder-why-jeffrey-epstein-reportedly-shipped-himself-…”).
- Document type: digital email.
- Pages: 2
- EFTA range (from text layer): EFTA00039379–EFTA00039380
- PDF: vol1/EFTA00039379.pdf
- What it appears to be: similar to EFTA00039368.pdf above; the OCR layer is near-blank (very little extracted text, which is to be expected).
- Document type: scanned/image-backed page with minimal text layer
- Pages: 1
- EFTA range (from text layer): EFTA00039385
- PDF: vol1/EFTA00039385.pdf
- What it appears to be: photo of an empty prison bunk with a mess of orange jumpsuits. The OCR layer is effectively near-blank (to be expected).
- Document type: looks like a scanned/image-backed page with minimal/no text layer.
- Pages: 1
- EFTA range (from text layer): EFTA00039386
- PDF: vol1/EFTA00039386.pdf



















