Sequence Frequency Puzzle (SFP) vs. Trie Heavy Hitters (TrieHH)

A Reproducible Comparison Under Local Differential Privacy

This repository implements two state-of-the-art local-DP heavy-hitter discovery algorithms:

SFP (Sequence Frequency Puzzle) – described in Apple’s differential privacy paper
TrieHH – Algorithm 1 from our AISTATS 2020 submission

Our code provides an end-to-end pipeline to preprocess data, run Monte-Carlo simulations, and reproduce the F1-score comparison between the two algorithms.

📦 Repository Contents

Core Algorithms

sfp.py: Implementation of Apple’s SFP algorithm
triehh.py: Implementation of TrieHH (interactive prefix-trie)
main.py: Runs Monte-Carlo simulations and generates the F1-score plot
preprocess.py: Prepares datasets for both SFP and TrieHH
dictionary.txt: (Optional) list of in-vocab words for out-of-vocab (OOV) experiments

▶️ How to Run

To reproduce our results:

bash run.sh

This script will:

Download the Sentiment140 dataset
Run preprocess.py
Run simulations using main.py
Automatically generate the F1 vs. K plot comparing SFP and TrieHH

With the default parameters, the script reproduces Figure 4 from our paper.

🧪 Running Out-of-Vocab (OOV) Experiments

dictionary.txt is intentionally empty by default.

To simulate OOV detection (similar to Figure 5 in the paper):

Add one in-vocab English word per line into dictionary.txt
Preprocessing will automatically remove these in-vocab words from the dataset
This produces an OOV-only dataset for evaluation

⚠️ In our paper we used 260k+ English words, but cannot publish the dictionary due to copyright and anonymity constraints.
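
The filtering step described above can be sketched as follows. This is a minimal illustration, not the code in preprocess.py: the helper names load_vocab and keep_oov are hypothetical, and lower-casing before comparison is an assumption.

```python
def load_vocab(lines):
    """Build the in-vocab set from an iterable of lines (one word per line).

    Blank lines are skipped, so an empty dictionary yields an empty set.
    Lower-casing is an assumption made for this sketch.
    """
    return {line.strip().lower() for line in lines if line.strip()}


def keep_oov(tokens, vocab):
    """Return only the tokens that do not appear in the in-vocab set."""
    return [token for token in tokens if token.lower() not in vocab]


# Example with an in-memory dictionary; a real run would pass an open file
# handle for dictionary.txt to load_vocab instead.
vocab = load_vocab(["hello\n", "\n", "World\n"])
oov_tokens = keep_oov(["hello", "covfefe", "world"], vocab)
```

With an empty dictionary.txt the vocabulary set is empty and every token is kept, which corresponds to the default (non-OOV) experiment.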

📊 Experimental Pipeline (Reproduced from Report)

Based on our written experiment documentation (see written_report.pdf, pp. 1–3):

Dataset

Source: March 2025 English Wikipedia clickstream
Filter: rows with type == "link"
Sampled to ~2 M records → expanded into 1.78 M synthetic clients

Token Processing

Each token padded/truncated to length L = 16
End-of-word marker $ appended for TrieHH
Preprocessing produces:
- clients_triehh.txt
- clients_sfp.txt
- word_frequencies.txt (ground-truth histogram)
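
The token processing above can be sketched as follows. This is an illustration under stated assumptions, not the code in preprocess.py: the function names are hypothetical, the padding character is assumed to be a space, and the TrieHH variant is assumed to be truncated (not padded) before $ is appended.

```python
MAX_LEN = 16     # L from the text
PAD_CHAR = " "   # assumption: the real padding character is not documented here
END_MARK = "$"   # marker appended for TrieHH


def to_sfp(token):
    """Truncate or right-pad a token so it is exactly MAX_LEN characters."""
    return token[:MAX_LEN].ljust(MAX_LEN, PAD_CHAR)


def to_triehh(token):
    """Truncate a token to MAX_LEN characters and append the end marker."""
    return token[:MAX_LEN] + END_MARK


sfp_token = to_sfp("wikipedia")        # 16 characters, space-padded
triehh_token = to_triehh("wikipedia")  # "wikipedia$"
```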

Privacy Parameters

ϵ = 4
δ = 2.3 × 10⁻¹²
R = 5 Monte-Carlo runs
TrieHH vote threshold θ = 15
Batch size = 39,153
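
For orientation, θ = 15 is the value a factorial-based threshold rule gives for the δ listed above. The sketch below assumes the rule "smallest θ ≥ 5 such that (θ − 3)/(θ − 2) · θ! ≥ 1/δ and θ ≥ e^(ϵ/L) − 1". The rule itself, the starting value 5, and the function name smallest_theta are assumptions made for illustration and are not confirmed to match triehh.py.

```python
import math


def smallest_theta(epsilon, delta, max_len, start=5):
    """Smallest threshold satisfying the assumed factorial rule.

    Assumed rule (illustrative only):
      (theta - 3) / (theta - 2) * theta! >= 1 / delta
      theta >= exp(epsilon / max_len) - 1
    """
    theta = start
    while (theta - 3) / (theta - 2) * math.factorial(theta) < 1 / delta:
        theta += 1
    while theta < math.exp(epsilon / max_len) - 1:
        theta += 1
    return theta


theta = smallest_theta(epsilon=4, delta=2.3e-12, max_len=16)
print(theta)  # 15 for these inputs
```

Under this assumed rule the threshold is driven by δ alone here: e^(4/16) − 1 is about 0.28, far below 15, so the second condition never binds.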

Simulation Flow (from `main.py`):

Run SimulateTrieHH and SimulateSFP for R trials
For K = 10…299, compute:
- precision
- recall
- F1 score
Produce the final plot with 95% confidence intervals (Student-t)
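
The per-K metrics and the confidence interval can be sketched as follows. This is a minimal illustration, not the code in main.py: the function names f1_at_k and mean_ci95 are hypothetical, the ground truth at each K is assumed to be the K most frequent entries of the histogram, and the Student-t critical value is hard-coded for R = 5 runs (4 degrees of freedom).

```python
import math
import statistics

# Two-sided 95% Student-t critical value for 4 degrees of freedom (R = 5 runs).
T_CRIT_95_DF4 = 2.776


def f1_at_k(discovered, ranked_truth, k):
    """Precision, recall, and F1 of a discovered set against the true top-k.

    ranked_truth is the ground-truth list sorted by descending frequency.
    """
    top_k = set(ranked_truth[:k])
    found = set(discovered)
    hits = len(found & top_k)
    precision = hits / len(found) if found else 0.0
    recall = hits / len(top_k) if top_k else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


def mean_ci95(samples, t_crit=T_CRIT_95_DF4):
    """Mean and half-width of a 95% confidence interval across runs."""
    mean = statistics.mean(samples)
    half_width = t_crit * statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, half_width


# Toy example: one run discovers three strings, two of them in the true top 4.
precision, recall, f1 = f1_at_k(["a", "b", "x"], ["a", "b", "c", "d"], k=4)
```

An empty discovered set yields precision, recall, and F1 of 0 under this definition, which is why a method that outputs nothing stays at 0 for every K.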

📈 Resulting Plot (Reproduced)

The F1 vs. K curve (Figure 1 in the report) shows:

TrieHH

Recovers ~30 titles per run
Recall = 0 until K ≥ 30
F1 peak < 0.20

SFP

Under this strict δ and small n, SFP outputs an empty set
Curve stays at 0 for all K

These results are consistent with the behavior expected under tight DP constraints.

🧭 Interpretation

As explained in the report:

TrieHH is limited by the vote threshold θ = 15
SFP’s Bayesian posterior rejects all candidate strings under strong δ
Increasing n to at least 5 M or loosening δ to 10⁻⁶ significantly improves utility
Exploring the privacy–utility tradeoffs in (ϵ, δ, L) is promising future work

YueranCao2001/DP_Final_Project
