@KOKOSde KOKOSde commented Feb 2, 2026

Fix unstable dataset fingerprinting when hashing `PreTrainedTokenizerFast`.

Tokenizers backed by `tokenizers.Tokenizer` can mutate runtime settings (padding/truncation) when called, which changes their serialized state and makes dataset transform fingerprints unstable. As a result, `.map(load_from_cache_file=True)` cannot reuse cached results.

Fix: when hashing, temporarily disable the backend's padding and truncation so runtime settings do not affect the fingerprint, then restore the original settings.

Includes a regression test showing that `Hasher.hash(tokenizer)` stays stable after the tokenizer is called, plus a simple benchmark demonstrating the cache-hit speedup.
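A minimal sketch of the save/disable/restore approach described above. The context-manager wrapper and its name are illustrative, not the actual patch; `backend_tokenizer`, the `truncation`/`padding` properties, and the `no_truncation()` / `no_padding()` / `enable_truncation()` / `enable_padding()` methods are real `tokenizers.Tokenizer` APIs:

```python
from contextlib import contextmanager

@contextmanager
def stable_backend_state(tokenizer):
    """Temporarily clear backend truncation/padding so serialization
    (and therefore hashing) is independent of runtime call state.
    Illustrative helper, not the actual implementation in the PR."""
    backend = tokenizer.backend_tokenizer   # the underlying tokenizers.Tokenizer
    saved_truncation = backend.truncation   # dict of params, or None
    saved_padding = backend.padding         # dict of params, or None
    try:
        backend.no_truncation()
        backend.no_padding()
        yield backend                       # hash/serialize inside this block
    finally:
        # Restore whatever runtime settings were active before hashing.
        if saved_truncation is not None:
            backend.enable_truncation(**saved_truncation)
        if saved_padding is not None:
            backend.enable_padding(**saved_padding)
```

The `try`/`finally` ensures the original settings come back even if serialization raises, so hashing never changes the tokenizer's observable behavior.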

Fixes huggingface#3847

Co-authored-by: Cursor <[email protected]>
@KOKOSde force-pushed the perf/stable-tokenizer-fingerprint branch from fac0934 to 8c1891b on February 4, 2026