Hi! I noticed a few small, backward-compatible improvements that could clarify and harden the BLEU metric implementation.
- Support a simple string alias for the default tokenizer (e.g. tokenizer="13a") in addition to passing a callable.
- Add explicit validation for length mismatches between predictions and references.
- Document and add tests for the current behavior when predictions are empty strings (BLEU evaluates to 0.0 implicitly today).
These changes don’t alter default behavior and aim to improve usability, robustness, and reproducibility.
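To make the first two items concrete, here is a rough sketch of what I have in mind. All names here (`resolve_tokenizer`, `validate_lengths`, `_TOKENIZER_ALIASES`) are hypothetical and not taken from the current code, and the whitespace splitter is only a stand-in for the real default tokenizer, so treat this as an illustration of the shape rather than a proposed patch.

```python
from typing import Callable, Dict, List, Optional, Sequence, Union

Tokenizer = Callable[[str], List[str]]


def _placeholder_tokenize(text: str) -> List[str]:
    # Stand-in only: this is NOT real 13a tokenization, just whitespace
    # splitting so the sketch is runnable on its own.
    return text.split()


# Hypothetical alias table; the real default tokenizer would be registered here.
_TOKENIZER_ALIASES: Dict[str, Tokenizer] = {"13a": _placeholder_tokenize}


def resolve_tokenizer(tokenizer: Optional[Union[str, Tokenizer]] = None) -> Tokenizer:
    """Accept None (default), a string alias, or a callable."""
    if tokenizer is None:
        # Default behavior is unchanged: fall back to the default tokenizer.
        return _TOKENIZER_ALIASES["13a"]
    if isinstance(tokenizer, str):
        if tokenizer not in _TOKENIZER_ALIASES:
            known = ", ".join(sorted(_TOKENIZER_ALIASES))
            raise ValueError(f"Unknown tokenizer alias {tokenizer!r}; known aliases: {known}")
        return _TOKENIZER_ALIASES[tokenizer]
    if callable(tokenizer):
        # Existing behavior: a user-supplied callable is used as-is.
        return tokenizer
    raise TypeError("tokenizer must be None, a string alias, or a callable")


def validate_lengths(predictions: Sequence[str], references: Sequence[Sequence[str]]) -> None:
    """Fail early with a clear message instead of a confusing downstream error."""
    if len(predictions) != len(references):
        raise ValueError(
            f"Got {len(predictions)} predictions but {len(references)} reference sets; "
            "the two must have the same length."
        )
```

Both helpers would be called at the top of the compute path, before any n-gram counting, so existing callers that pass a callable or rely on the default see no difference.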