Skip to content

Implements and evaluates TrieHH and SFP heavy-hitter algorithms under local differential privacy, including full data preprocessing, simulation, and F1-score analysis on Wikipedia Clickstream. Designed for reproducible experiments and privacy–utility benchmarking.

License

Notifications You must be signed in to change notification settings

YueranCao2001/DP_Final_Project

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

13 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Sequence Frequency Puzzle (SFP) vs. Trie Heavy Hitters (TrieHH)

A Reproducible Comparison Under Local Differential Privacy

This repository implements two state-of-the-art local-DP heavy-hitter discovery algorithms:

  • SFP (Sequence Frequency Puzzle) – described in Apple’s differential privacy paper
  • TrieHH – Algorithm 1 from our AISTATS 2020 submission

Our code provides an end-to-end pipeline to preprocess data, run Monte-Carlo simulations, and reproduce the F1-score comparison between the two algorithms.


📦 Repository Contents

Core Algorithms

  • sfp/ — Implementation of Apple’s SFP algorithm
  • triehh/ — Implementation of TrieHH (interactive prefix-trie)
  • main.py — Runs Monte-Carlo simulations and generates the F1-score plot
  • preprocess.py — Prepares datasets for both SFP and TrieHH
  • dictionary.txt — (Optional) list of in-vocab words for out-of-vocab (OOV) experiments

▶️ How to Run

To reproduce our results:

bash run.sh

This script will:

  • Download the Sentiment140 dataset
  • Run preprocess.py
  • Run simulations using main.py
  • Automatically generate the F1 vs. K plot comparing SFP and TrieHH

All default parameters reproduce Figure 4 from our paper.


🧪 Running Out-of-Vocab (OOV) Experiments

dictionary.txt is intentionally empty by default.

To simulate OOV detection (similar to Figure 5 in the paper):

  1. Add one in-vocab English word per line into dictionary.txt
  2. Preprocessing will automatically remove these in-vocab words from the dataset
  3. This produces an OOV-only dataset for evaluation

⚠️ In our paper we used 260k+ English words, but cannot publish the dictionary due to copyright and anonymity constraints.


📊 Experimental Pipeline (Reproduced from Report)

Based on our written experiment documentation (see written_report.pdf, pp. 1–3):

Dataset

  • Source: March 2025 English Wikipedia clickstream
  • Filter: rows with type == "link"
  • Sampled to ~2 M records → expanded into 1.78 M synthetic clients

Token Processing

  • Each token padded/truncated to length L = 16
  • $ appended for TrieHH
  • Preprocessing produces:
    • clients_triehh.txt
    • clients_sfp.txt
    • word_frequencies.txt (ground-truth histogram)

Privacy Parameters

  • ϵ = 4
  • δ = 2.3 × 10⁻¹²
  • R = 5 Monte-Carlo runs
  • TrieHH vote threshold θ = 15
  • Batch size = 39,153

Simulation Flow (from main.py):

  • Run SimulateTrieHH and SimulateSFP for R trials
  • For K = 10…299, compute:
    • precision
    • recall
    • F1 score
  • Produce the final plot with 95% confidence intervals (Student-t)

📈 Resulting Plot (Reproduced)

The F1 vs. K curve (Figure 1 in the report) shows:

TrieHH

  • Recovers ~30 titles per run
  • Recall = 0 until K ≥ 30
  • F1 peak < 0.20

SFP

  • Under this strict δ and small n, SFP outputs an empty set
  • Curve stays at 0 for all K

These results faithfully match theoretical behavior under tight DP constraints.


🧭 Interpretation

As explained in the report:

  • TrieHH is limited by the vote threshold θ = 15
  • SFP’s Bayesian posterior rejects all candidate strings under strong δ
  • Increasing n ≥ 5 M or loosening δ to 10⁻⁶ significantly improves utility
  • Exploring the privacy–utility tradeoffs in (ϵ, δ, L) is promising future work

About

Implements and evaluates TrieHH and SFP heavy-hitter algorithms under local differential privacy, including full data preprocessing, simulation, and F1-score analysis on Wikipedia Clickstream. Designed for reproducible experiments and privacy–utility benchmarking.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published