@KOKOSde KOKOSde commented Feb 2, 2026

Fix unstable dataset fingerprinting when hashing `PreTrainedTokenizerFast`.

Tokenizers backed by `tokenizers.Tokenizer` can mutate runtime settings (padding/truncation) when called, which changes their serialized state and makes dataset transform fingerprints unstable. As a result, `.map(load_from_cache_file=True)` cannot reuse cached results.

Fix: when hashing, temporarily disable the backend's padding and truncation so runtime settings do not affect the fingerprint, then restore the original settings.

Includes a regression test showing that `Hasher.hash(tokenizer)` stays stable after the tokenizer is called, plus a simple benchmark demonstrating the cache-hit speedup.
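A minimal sketch of the save/disable/restore approach described above. The context-manager wrapper and its name are illustrative, not the actual patch; `backend_tokenizer`, the `truncation`/`padding` properties, and the `no_truncation()` / `no_padding()` / `enable_truncation()` / `enable_padding()` methods are real `tokenizers.Tokenizer` APIs:

```python
from contextlib import contextmanager

@contextmanager
def stable_backend_state(tokenizer):
    """Temporarily clear backend truncation/padding so serialization
    (and therefore hashing) is independent of runtime call state.
    Illustrative helper, not the actual implementation in the PR."""
    backend = tokenizer.backend_tokenizer   # the underlying tokenizers.Tokenizer
    saved_truncation = backend.truncation   # dict of params, or None
    saved_padding = backend.padding         # dict of params, or None
    try:
        backend.no_truncation()
        backend.no_padding()
        yield backend                       # hash/serialize inside this block
    finally:
        # Restore whatever runtime settings were active before hashing.
        if saved_truncation is not None:
            backend.enable_truncation(**saved_truncation)
        if saved_padding is not None:
            backend.enable_padding(**saved_padding)
```

The `try`/`finally` ensures the original settings come back even if serialization raises, so hashing never changes the tokenizer's observable behavior.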

Fixes huggingface#3847

Co-authored-by: Cursor <[email protected]>
@KOKOSde force-pushed the perf/stable-tokenizer-fingerprint branch from fac0934 to 8c1891b on February 4, 2026