Leading Indicators of Reward Hacking

Can we predict which models will reward-hack before they actually do it?

This project develops and validates methods to detect latent reward hacking propensity—specifically using prefill elicitation and importance sampling—on the djinn exploitable coding testbed.

Target: ICML 2026 submission (Jan 28 deadline)

Core Hypothesis

Prefill sensitivity + importance sampling can reveal which models (or which prompts) are at risk of reward hacking before training actually induces it.

Research Questions

Prefill as leading indicator: Does prefill sensitivity (completion probability shift when seeded with exploit-like reasoning) predict post-FT exploit rates?
Importance sampling for detection: Can ARC-style importance sampling surface high-risk inputs/models before deployment?
Generalization structure: Do leading indicators transfer across exploit families, or are they family-specific?

Setup

# Clone and install
git clone https://github.com/EleutherAI/rh-indicators
cd rh-indicators

# Create virtual environment
python -m venv .venv
source .venv/bin/activate

# Install with djinn dependency
pip install -e ".[dev]"

# Install djinn from local path (development)
pip install -e /path/to/djinn

Project Structure

rh-indicators/
├── src/rh_indicators/     # Main package
├── scripts/               # Experiment scripts
├── docs/
│   └── decisions/         # Architecture Decision Records
├── outputs/               # Experiment outputs (gitignored)
└── README.md

Key Dependencies

djinn-framework: Exploitable coding problems and verifiers
Dataset: djinn/problems v0.9+ on HuggingFace

References

Anthropic (2024). "Natural emergent misalignment from reward hacking"
ARC (2024). "Importance sampling for AI control" (arXiv:2410.13211)

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
configs		configs
docs/decisions		docs/decisions
scripts		scripts
src/rh_indicators		src/rh_indicators
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
README.md		README.md
pyproject.toml		pyproject.toml
setup_node_venv.sh		setup_node_venv.sh
test_movingTiles.py		test_movingTiles.py
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Leading Indicators of Reward Hacking

Core Hypothesis

Research Questions

Setup

Project Structure

Key Dependencies

References

About

Uh oh!

Releases

Packages

Languages

EleutherAI/rh-indicators

Folders and files

Latest commit

History

Repository files navigation

Leading Indicators of Reward Hacking

Core Hypothesis

Research Questions

Setup

Project Structure

Key Dependencies

References

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages