chad-loder/efta-analysis

Epstein Files Transparency Act Volume 8: what changed between the first and second DOJ postings

This is a preliminary analysis, something I hacked together in an afternoon using open source tools and standard forensics techniques. This should be considered a roadmap for deeper analysis, NOT definitive conclusions.

In late 2025, the U.S. Department of Justice published a set of documents under the Epstein Files Transparency Act (EFTA / H.R.4405). The DOJ has been releasing these files in tranches (Volumes 1-8 so far).

Volume 8 was first published on Monday, Dec 21st 2025, and appears to have existed in at least two "builds": an earlier posting (yanked from the website) and a later replacement, with an unknown number of files removed or modified.

I’m a developer with a few hours on my hands and an unfortunate amount of curiosity. When I saw the DOJ publish Volume 8, yank it, and publish it again, I wanted to know what actually changed—and what only looked like it changed because of how PDFs get produced.

For clarity, I refer to the two builds as:

  • v1: the earlier posting (since yanked from the website)
  • v2: the later replacement

There has been speculation that the earlier build may have contained redaction problems or exposed PII. I can’t prove that from this analysis alone. What I can do is show my work, point to the small set of files that deserve scrutiny, and explain why this problem is more annoying than it looks.

Also, see the excellent forensic analysis of DOJ's eDiscovery tooling and PDF production pipeline by Peter Wyatt of the PDF Association. I wish I had discovered Peter's analysis before I did my work; it would have saved me a lot of experimentation! This post takes Peter's analysis further and attempts to provide a roadmap to answering some of the questions he poses.

Why “just check duplicates” fails (and why that matters)

If you download two ZIPs from two different times and ask “are these the same files?”, the obvious instinct is to compare the bytes. For this dataset, that approach mostly measures the publishing pipeline, not the underlying documents. The pipeline is the prankster here.

Here’s what seems to be going on:

  • The page numbers you see (EFTA…) are not stable IDs. The pipeline appears to reassign and burn those Bates-like stamps into pages, so the same document can be re-numbered between builds.
  • Scans get reprocessed. If a PDF contains scanned pages, the embedded OCR text layer can change from run to run—tiny spacing/line-break/character changes—without any visible change to the page image.
  • PDFs get rewritten internally. Even if the visible page is identical, the file can be re-saved with different metadata and internal structure (object ordering, xref tables, incremental updates).

That’s why this write-up leans on a mix of text-based grouping and old-fashioned visual spot checks.

Analysis approach (a slightly technical tour)

I didn’t want this to be a vibes-only comparison of two ZIP files, so I treated it like a small forensics project. The goal wasn’t “do the PDFs have the same bytes?”—it was “do they carry the same underlying documents?”

Step 1: treat PDFs as containers, not flat files

A lot of PDFs aren’t written once. They can be saved with incremental updates, which means later edits get appended to the end of the file. That can leave a kind of “version history” inside a single PDF: v1, v2, sometimes v3, all stacked together. In Volume 8 this shows up in practice as a pipeline that appears to do things like “add OCR text” and “apply Bates-like stamps” in later passes.

So I built a tool that can export those internal revisions into a clean on-disk structure (think versions/v0001, versions/v0002, …), and extract per-page text for each version.
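As a rough illustration of the idea (not the actual tool), the sketch below assumes that every incremental save ends with its own `%%EOF` marker, so cutting the file after each marker yields one candidate revision per save. The names `split_revisions` and `export_revisions` are hypothetical, and the `v0001`-style folder naming simply mirrors the layout described above. Real PDFs can contain `%%EOF` inside stream data, and linearized files carry an early marker too, so the output is a list of candidates to verify rather than a guaranteed history.

```python
import re
from pathlib import Path

# %%EOF, optional trailing spaces, and at most one end-of-line sequence.
EOF_MARKER = re.compile(rb"%%EOF[ \t]*(?:\r\n|\r|\n)?")


def split_revisions(data: bytes) -> list[bytes]:
    """Return one byte prefix per %%EOF marker, oldest first.

    An incremental update appends to the file, so the bytes up to an
    earlier %%EOF are a candidate for what the file looked like before
    that update was written.
    """
    return [data[: match.end()] for match in EOF_MARKER.finditer(data)]


def export_revisions(pdf_path: Path, out_dir: Path) -> list[Path]:
    """Write each candidate revision to out_dir/v0001/<name>, v0002/<name>, ..."""
    written = []
    revisions = split_revisions(pdf_path.read_bytes())
    for index, blob in enumerate(revisions, start=1):
        target = out_dir / f"v{index:04d}" / pdf_path.name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(blob)
        written.append(target)
    return written
```

Each exported prefix can then be opened with a PDF library to confirm it parses on its own before extracting per-page text from it.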

Step 2: normalize away the stuff that shouldn’t count as a ‘different document’

Two things make naive comparisons explode:

  • Bates-like EFTA stamps change the visible page text and the OCR text.
  • OCR churn changes whitespace and characters even when the scan looks identical.

To make a one-to-one comparison possible, I compute per-page and per-document text signatures after a simple normalization pass:

  • strip EFTA########-style tokens (the Bates-like layer)
  • normalize whitespace (collapse the OCR’s micro-chaos into something stable enough to compare)

That gives me two useful identifiers:

  • a raw hash (the actual PDF bytes; great for integrity, terrible for equivalence)
  • a normalized content hash (far better for “is this the same document?”)
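The two identifiers above can be sketched roughly as follows. This is a minimal version under stated assumptions: stamps are taken to be `EFTA` followed by exactly eight digits (which matches every stamp quoted in this write-up), and the function names are illustrative rather than the ones in my tool.

```python
import hashlib
import re

EFTA_TOKEN = re.compile(r"EFTA\d{8}")  # assumption: stamps are EFTA + 8 digits
WHITESPACE = re.compile(r"\s+")


def normalize_text(text: str) -> str:
    """Drop EFTA stamps, then collapse every whitespace run to one space."""
    without_stamps = EFTA_TOKEN.sub(" ", text)
    return WHITESPACE.sub(" ", without_stamps).strip()


def raw_hash(data: bytes) -> str:
    """SHA-256 of the exact bytes: good for integrity, not for equivalence."""
    return hashlib.sha256(data).hexdigest()


def content_hash(text: str) -> str:
    """SHA-256 of the normalized text: the per-page 'same content?' signature."""
    return hashlib.sha256(normalize_text(text).encode("utf-8")).hexdigest()


def document_hash(page_texts: list[str]) -> str:
    """Combine per-page content hashes, in page order, into one signature."""
    joined = "\n".join(content_hash(page) for page in page_texts)
    return hashlib.sha256(joined.encode("ascii")).hexdigest()
```

Note that this only absorbs stamp and whitespace differences. If OCR swaps an actual character between runs, the content hash still changes, which is one reason the visual check in Step 4 exists.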

Step 3: an extremely normal SQLite database (with one weird idea)

Once you have signatures, you can treat files like a content-addressable store: same content → same ID, even if filenames differ. I threw together a small SQLite “object store” that tracks a few dimensions per file:

  • collection (v1 vs v2)
  • file path + raw SHA-256
  • PDF version count (v0001/v0002/v0003)
  • per-page and per-document normalized text signatures

It's not fancy, just enough structure to ask better questions than “did the bytes change?”
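For a sense of scale, here is a hedged sketch of what such a store could look like. The table names (`files`, `pages`), column names, and helper functions are all hypothetical; they cover the dimensions listed above but are not the actual schema.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS files (
    id            INTEGER PRIMARY KEY,
    collection    TEXT NOT NULL,      -- 'v1' or 'v2'
    path          TEXT NOT NULL,
    raw_sha256    TEXT NOT NULL,
    version_count INTEGER NOT NULL,   -- internal PDF revisions found
    doc_sig       TEXT NOT NULL,      -- normalized per-document signature
    UNIQUE (collection, path)
);
CREATE TABLE IF NOT EXISTS pages (
    file_id  INTEGER NOT NULL REFERENCES files(id),
    page_no  INTEGER NOT NULL,
    page_sig TEXT NOT NULL,           -- normalized per-page signature
    PRIMARY KEY (file_id, page_no)
);
CREATE INDEX IF NOT EXISTS idx_files_doc_sig ON files(doc_sig);
"""


def open_store(db_path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the store and make sure the tables exist."""
    conn = sqlite3.connect(db_path)
    conn.executescript(SCHEMA)
    return conn


def add_file(conn, collection, path, raw_sha256, version_count, doc_sig, page_sigs):
    """Insert one file row plus one row per page signature."""
    cur = conn.execute(
        "INSERT INTO files (collection, path, raw_sha256, version_count, doc_sig)"
        " VALUES (?, ?, ?, ?, ?)",
        (collection, path, raw_sha256, version_count, doc_sig),
    )
    file_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO pages (file_id, page_no, page_sig) VALUES (?, ?, ?)",
        [(file_id, n, sig) for n, sig in enumerate(page_sigs, start=1)],
    )
    return file_id


def only_in(conn, collection, other):
    """Paths in `collection` whose document signature never appears in `other`."""
    rows = conn.execute(
        "SELECT path FROM files WHERE collection = ?"
        " AND doc_sig NOT IN (SELECT doc_sig FROM files WHERE collection = ?)"
        " ORDER BY path",
        (collection, other),
    )
    return [row[0] for row in rows]
```

With both builds loaded, `only_in(conn, "v1", "v2")` gives the starting shortlist of v1 files with no text-equivalent in v2, which still needs the visual pass before anything is called missing.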

Step 4: a quick-and-dirty “does this page look the same?” analysis

Some PDFs are basically photographs of paper. When OCR is noisy (or missing), text matching can fail even if the scans are identical. The DOJ's OCR pipeline is sometimes subpar (which also means that if you're indexing these PDFs and trusting the embedded OCR layers, you're missing stuff).

So I also used a simple perceptual image hash (dHash): render page 1 as an image, compute a tiny fingerprint, and then search for near-matches. It’s not perfect, but it’s a great way to turn “maybe this got renumbered” into a short shortlist you can eyeball in a few minutes.
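To show the mechanics of dHash without tying the example to a particular imaging call, the sketch below works on a plain grid of grayscale values (rows of 0-255 numbers). It assumes page 1 has already been rendered and converted to grayscale, for instance with PyMuPDF and Pillow, and it uses a simple box average for the downscale. The function names and the `max_distance=6` cutoff are illustrative choices, not tuned values from my run.

```python
def _span(i: int, n_out: int, n_src: int) -> range:
    """Source indices covered by output cell i (always at least one)."""
    start = i * n_src // n_out
    stop = max((i + 1) * n_src // n_out, start + 1)
    return range(start, stop)


def shrink(gray, width, height):
    """Box-average a grayscale grid down to width x height cells."""
    src_h, src_w = len(gray), len(gray[0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            cells = [
                gray[sy][sx]
                for sy in _span(y, height, src_h)
                for sx in _span(x, width, src_w)
            ]
            row.append(sum(cells) / len(cells))
        out.append(row)
    return out


def dhash(gray, hash_size: int = 8) -> int:
    """Difference hash: one bit per horizontally adjacent pair of cells.

    The grid is shrunk to (hash_size + 1) x hash_size, and each bit records
    whether a cell is brighter than its right-hand neighbour.
    """
    small = shrink(gray, hash_size + 1, hash_size)
    bits = 0
    for row in small:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def near_matches(target: int, candidates: dict, max_distance: int = 6):
    """Return (distance, name) pairs within max_distance, closest first."""
    hits = [(hamming(target, h), name) for name, h in candidates.items()]
    return sorted(hit for hit in hits if hit[0] <= max_distance)
```

A small Hamming distance is only a hint that two pages look alike. Anything it surfaces still gets opened side by side and checked by eye.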

Tools/libraries (open source)

  • PyMuPDF (PDF parsing, text extraction, rendering)
  • Pillow (image output for quick page previews)

What I found

A quick text-based comparison suggests the releases are overwhelmingly the same. The catch is that scans are tricky: the OCR text layer can change even when the pages look identical.

In the small set of “v2-only” items I checked closely, the pages looked the same; the differences appeared consistent with OCR noise, not visible redaction changes.

The list below is the part that matters if you’re trying to understand the “publish → yank → republish” story: a small set of PDFs from v1 that still appear to have no v2 equivalent after targeted searches and visual comparisons.

PDFs from Vol 8 v1 that still look missing from Vol 8 v2 (tentative)

These files are included in vol1/ under their original filenames, along with images of the first two pages for quick preview.

| PDF | Pages | EFTA range (as seen in text layer) | What it is (skim) |
| --- | --- | --- | --- |
| EFTA00019416.pdf | 8 | EFTA00019416–EFTA00019423 | internal email chain between a prosecutor and a third party about serving a court subpoena to Mar‑a‑Lago (records custodian + employment records), including coordination with Trump Organization compliance counsel. (digital email thread, not handwritten) |
| EFTA00021043.pdf | 303 | EFTA00021043–EFTA00021345 | a long court transcript / stenographic record labeled “Voir Dire” in United States v. Ghislaine Maxwell, 20-cr-330 (AJN). (transcript, not handwritten) |
| EFTA00023791.pdf | 4 | EFTA00023791–EFTA00023794 | short email chain about serving a trial subpoena to Mar‑a‑Lago / custodian of records (similar to other Mar‑a‑Lago subpoena coordination docs). (digital email thread) |
| EFTA00039346.pdf | 2 | EFTA00039346–EFTA00039347 | operational email “Re: BoP interviews” laying out a list of Bureau of Prisons personnel interviews to be conducted, including which interview teams will serve subpoenas afterward and how to avoid press at MCC. (digital email) |
| EFTA00039350.pdf | 3 | EFTA00039350–EFTA00039352 | FBI NY email thread titled “FW: MCC CCV” discussing access to a case / system (“TTK access”), referencing a separate case number 282B-NY-3156749. (digital email, not handwritten) |
| EFTA00039353.pdf | 3 | EFTA00039353–EFTA00039355 | FBI NY email thread related to MCC CCV / access provisioning (“TTK access”) and a referenced separate case number 282B-NY-3156749; appears to be follow-up/confirmation (“I will have you added. Will you need training as well?”). (digital email, not handwritten) |
| EFTA00039368.pdf | 1 | EFTA00039368 | essentially blank/placeholder page (text extraction shows mostly punctuation/dots + Bates stamp). (likely scanned/image-backed page with minimal text layer) |
| EFTA00039373.pdf | 1 | EFTA00039373 | essentially blank/placeholder page (OCR shows mostly punctuation/noise + Bates stamp). (likely scanned/image-backed page with minimal text layer) |
| EFTA00039374.pdf | 5 | EFTA00039374–EFTA00039378 | logistics/operations email thread about coordinating FedEx packages (count, type, weights; “pelicans”, “tuff box”), plus a note about team food costs. (digital email, not handwritten) |
| EFTA00039379.pdf | 2 | EFTA00039379–EFTA00039380 | short email exchange; includes a link to a Splinter News article about Epstein (“splinternews.com/i-wonder-why-jeffrey-epstein-reportedly-shipped-himself-…”). (digital email) |
| EFTA00039385.pdf | 1 | EFTA00039385 | near-blank / placeholder page (very little extracted text). (scanned/image-backed page with minimal text layer) |
| EFTA00039386.pdf | 1 | EFTA00039386 | effectively a near-blank / placeholder page. (looks like a scanned/image-backed page with minimal/no text layer) |

Quick previews (first two pages)

Click any image to open it at full size.

EFTA00019416.pdf

  • What it appears to be: internal email chain between a prosecutor and a third party about serving a court subpoena to Mar‑a‑Lago (records custodian + employment records), including coordination with Trump Organization compliance counsel.
  • Document type: digital email thread (not handwritten).
  • Pages: 8
  • EFTA range (from text layer): EFTA00019416–EFTA00019423
  • PDF: vol1/EFTA00019416.pdf

EFTA00021043.pdf

  • What it appears to be: a long court transcript / stenographic record labeled “Voir Dire” in United States v. Ghislaine Maxwell, 20-cr-330 (AJN).
  • Document type: transcript (not handwritten).
  • Pages: 303
  • EFTA range (from text layer): EFTA00021043–EFTA00021345
  • PDF: vol1/EFTA00021043.pdf

EFTA00023791.pdf

  • What it appears to be: short email chain about serving a trial subpoena to Mar‑a‑Lago / custodian of records (similar to other Mar‑a‑Lago subpoena coordination docs).
  • Document type: digital email thread.
  • Pages: 4
  • EFTA range (from text layer): EFTA00023791–EFTA00023794
  • PDF: vol1/EFTA00023791.pdf

EFTA00039346.pdf

  • What it appears to be: operational email “Re: BoP interviews” laying out a list of Bureau of Prisons personnel interviews to be conducted, including which interview teams will serve subpoenas afterward and how to avoid press at MCC.
  • Document type: digital email.
  • Pages: 2
  • EFTA range (from text layer): EFTA00039346–EFTA00039347
  • PDF: vol1/EFTA00039346.pdf

EFTA00039350.pdf

  • What it appears to be: FBI NY email thread titled “FW: MCC CCV” discussing access to a case / system (“TTK access”), referencing a separate case number 282B-NY-3156749.
  • Document type: digital email (not handwritten).
  • Pages: 3
  • EFTA range (from text layer): EFTA00039350–EFTA00039352
  • PDF: vol1/EFTA00039350.pdf

EFTA00039353.pdf

  • What it appears to be: FBI NY email thread related to MCC CCV / access provisioning (“TTK access”) and a referenced separate case number 282B-NY-3156749; appears to be follow-up/confirmation (“I will have you added. Will you need training as well?”).
  • Document type: digital email (not handwritten).
  • Pages: 3
  • EFTA range (from text layer): EFTA00039353–EFTA00039355
  • PDF: vol1/EFTA00039353.pdf

EFTA00039368.pdf

  • What it appears to be: the OCR layer is chaotic; it appears to be a photo of Epstein's cell door with caution tape over it and a handwritten sign. Text extraction shows mostly punctuation/dots + Bates stamp, which is not uncommon with PDFs that contain photos that themselves have text or glyphs.
  • Document type: likely scanned/image-backed page with minimal text layer.
  • Pages: 1
  • EFTA range (from text layer): EFTA00039368
  • PDF: vol1/EFTA00039368.pdf

EFTA00039373.pdf

  • What it appears to be: picture of a tropical island; the OCR layer shows mostly punctuation/noise + Bates stamp. For this PDF and the preceding one, it's possible my script simply didn't find the match in the v2 Volume 8 drop and it could well be in there, given the way I hacked this together in an afternoon.
  • Document type: likely scanned/image-backed page with minimal text layer.
  • Pages: 1
  • EFTA range (from text layer): EFTA00039373
  • PDF: vol1/EFTA00039373.pdf

EFTA00039374.pdf

  • What it appears to be: logistics/operations email thread about coordinating FedEx packages (count, type, weights; “pelicans”, “tuff box”), plus a note about team food costs.
  • Document type: digital email (not handwritten).
  • Pages: 5
  • EFTA range (from text layer): EFTA00039374–EFTA00039378
  • PDF: vol1/EFTA00039374.pdf

EFTA00039379.pdf

  • What it appears to be: short email exchange between who I presume is AUSA Alex Rossmiller (pictured in the courtroom sketch photo) and a colleague, where (presumably) Rossmiller jokes about his posture; includes a link to a Splinter News article about the Epstein trial (“splinternews.com/i-wonder-why-jeffrey-epstein-reportedly-shipped-himself-…”).
  • Document type: digital email.
  • Pages: 2
  • EFTA range (from text layer): EFTA00039379–EFTA00039380
  • PDF: vol1/EFTA00039379.pdf

EFTA00039385.pdf

  • What it appears to be: similar to EFTA00039368.pdf above, OCR layer is near-blank (very little extracted text, to be expected)
  • Document type: scanned/image-backed page with minimal text layer
  • Pages: 1
  • EFTA range (from text layer): EFTA00039385
  • PDF: vol1/EFTA00039385.pdf

EFTA00039386.pdf

  • What it appears to be: photo of an empty prison bunk with a mess of orange jumpsuits. OCR layer is effectively a near-blank page (to be expected).
  • Document type: looks like a scanned/image-backed page with minimal/no text layer.
  • Pages: 1
  • EFTA range (from text layer): EFTA00039386
  • PDF: vol1/EFTA00039386.pdf
