The code in this repository can be used to reproduce the results in the paper.
Huggingface models used in the experiments:
- Fanar1-9B https://huggingface.co/QCRI/Fanar-1-9B-Instruct
- Qwen2.5-7B https://huggingface.co/Qwen/Qwen2.5-7B-Instruct
- Gemma3-12B https://huggingface.co/google/gemma-3-12b-it
This project supports a study of how LLMs respond to different context conditions during question answering: with correct context (WCC), with incorrect context (WIC), or without context (WOC). It includes tools for generating responses, extracting token-level uncertainty metrics, and visualizing uncertainty and error-type transitions.
Main pipeline for:
- Generating responses from an LLM with and without context.
- Collecting uncertainty metrics (aleatoric, epistemic); see the sketch after this list.
- Applying automatic correctness labeling and error-type classification.
- Saving the enriched results into a CSV for analysis.
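For reference, here is a minimal sketch (not the paper's exact definitions) of how a per-token aleatoric-style uncertainty signal can be read off the generation scores with Hugging Face transformers. The model name and prompt are placeholders, and epistemic estimates typically require multiple stochastic passes (e.g., sampling or MC dropout), which are omitted here.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"  # any of the models listed above
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"  # device_map needs `accelerate`
)

inputs = tok("What is the capital of France?", return_tensors="pt").to(model.device)
out = model.generate(
    **inputs,
    max_new_tokens=32,
    do_sample=False,              # greedy decoding
    output_scores=True,           # keep per-step logits
    return_dict_in_generate=True,
)

# Predictive entropy of each generated token's distribution (aleatoric-style proxy).
probs = [torch.softmax(s, dim=-1) for s in out.scores]
token_entropy = [float(-(p * torch.log(p + 1e-12)).sum()) for p in probs]
print(token_entropy)
```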
python generate_responses.py <model_name> <dataset_path> <output_path> <in_context>
- <model_name>: HuggingFace model name or path (e.g., Qwen, Fanar, Gemma)
- <dataset_path>: Path to .csv or .parquet file with questions, answers, and context
- <output_path>: Destination file path to save output results
- <in_context>: Either wic (with incorrect context) or wcc (with correct context)
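For example (the dataset and output paths below are hypothetical):
python generate_responses.py Qwen/Qwen2.5-7B-Instruct data/triviaqa.parquet results/qwen_wcc.parquet wcc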
Toolkit for:
- Loading .parquet result files
- Parsing token-level metrics and logits
- Computing full-sequence reliability scores
- Plotting KDE distributions of uncertainty metrics (EU, AU, reliability)
Customize the final call to plot_figure(...) to generate your desired output.
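As a starting point, a minimal loading-and-plotting sketch is shown below; the column names ("eu", "au", "reliability") and the file path are assumptions, so adapt them to your result files and to the plot_figure(...) signature.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_parquet("results/qwen_wcc.parquet")  # hypothetical result file

# KDE of the uncertainty/reliability columns (names are illustrative).
for metric in ["eu", "au", "reliability"]:
    sns.kdeplot(df[metric].dropna(), label=metric, fill=True)

plt.xlabel("score")
plt.legend()
plt.tight_layout()
plt.savefig("uncertainty_kde.png", dpi=200)
```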
Install dependencies with:
pip install torch transformers openai pandas numpy matplotlib seaborn tqdm python-dotenv
Create a .env file in the root of your project with the following keys:
AZURE_OPENAI_ENDPOINT=https://your-resource-name.openai.azure.com/
AZURE_OPENAI_API_KEY=your-azure-openai-api-key
AZURE_DEPLOYMENT_NAME=gpt-4-deployment-name
AZURE_API_VERSION=2024-02-15-preview
HF_TOKEN=your-huggingface-token
Ensure this .env file is not committed to version control. Add it to your .gitignore file.
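A minimal sketch of how these keys are typically consumed, assuming the openai>=1.0 Python client and python-dotenv (the repository's own loading code may differ):

```python
import os
from dotenv import load_dotenv
from openai import AzureOpenAI

load_dotenv()  # loads the .env keys into the environment

client = AzureOpenAI(
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version=os.getenv("AZURE_API_VERSION"),
)

# For Azure, the "model" argument is the deployment name, not the model family name.
reply = client.chat.completions.create(
    model=os.getenv("AZURE_DEPLOYMENT_NAME"),
    messages=[{"role": "user", "content": "ping"}],
)
print(reply.choices[0].message.content)
```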
This directory contains four modular scripts designed to explore whether token-level uncertainty in large language models (LLMs) can be used to predict the factual reliability of generated responses. The full pipeline includes response generation, correctness labeling, hidden state extraction, and classifier-based probing.
- generation.py: Generates responses using greedy decoding.
- GPT_labeler.py: Uses GPT (e.g., ChatGPT) to label factual correctness and extract minimal answer spans.
- compute_hidden_states.py: Extracts token-level hidden states based on uncertainty-aware token selection strategies.
- probe_exp.py: Trains lightweight probing classifiers on extracted hidden state features.
1. Generate Responses
Use generation.py to generate LLM responses for selected datasets with greedy decoding:
python generation.py --module gemma --datasets truthfulqa triviaqa math
2. Label Responses
Label the generated responses for correctness and extract minimal answer spans:
python GPT_labeler.py --models fanar gemma qwen --datasets truthfulqa triviaqa math
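The exact judging prompt and parsing live in GPT_labeler.py; the sketch below only illustrates the shape of such a call, reusing the AzureOpenAI client from the environment section (the prompt wording is hypothetical).

```python
import os
from openai import AzureOpenAI

def label_response(client: AzureOpenAI, question: str, gold: str, response: str) -> str:
    """Ask the judge model for a correctness verdict and a minimal answer span."""
    prompt = (
        f"Question: {question}\nGold answer: {gold}\nModel response: {response}\n"
        "Reply with 'correct' or 'incorrect', followed by the minimal answer span."
    )
    out = client.chat.completions.create(
        model=os.getenv("AZURE_DEPLOYMENT_NAME"),
        messages=[{"role": "user", "content": prompt}],
    )
    return out.choices[0].message.content
```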
3. Compute Hidden States
Extract hidden states based on token-level uncertainty:
python compute_hidden_states.py \
--model [model] \
--datasets [dataset(s)] \
--uncertainty_type [au | eu | agg] \
--K [K]
- --uncertainty_type: Strategy for selecting target tokens (e.g., AU = aleatoric uncertainty, EU = epistemic uncertainty, AGG = aggregated).
- --K: Number of top tokens to use for feature construction.
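For intuition, here is a minimal sketch (not the script's exact logic) of uncertainty-aware token selection: score each generated token by predictive entropy, keep the top-K, and collect their hidden states from a chosen layer. The function name and defaults are illustrative.

```python
import torch

def topk_uncertain_hidden_states(model, tokenizer, prompt, k=5, layer=-1, max_new_tokens=64):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,                 # greedy decoding
        output_scores=True,
        output_hidden_states=True,
        return_dict_in_generate=True,
    )
    # Entropy of each generated token's distribution (aleatoric-style score).
    probs = [torch.softmax(s, dim=-1) for s in out.scores]
    entropy = torch.stack([-(p * torch.log(p + 1e-12)).sum(-1).squeeze(0) for p in probs])
    # Hidden state of the chosen layer at each generation step (last position).
    states = torch.stack([step[layer][0, -1, :] for step in out.hidden_states])
    # Keep the K most uncertain tokens as features.
    top = torch.topk(entropy, k=min(k, entropy.numel())).indices
    return states[top], entropy[top]
```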
4. Train Probing Classifiers
Train classifiers to predict factual correctness using extracted features:
python probe_exp.py \
--model [model] \
--dataset [dataset] \
--uncertainty_type [au | eu | agg] \
--K [K]
This will evaluate probing performance across all layers of the selected model and dataset.
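Conceptually, the probing step amounts to fitting a lightweight classifier per layer; a minimal scikit-learn sketch is shown below, where the per-layer feature matrices and correctness labels are assumed to come from the previous steps.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def probe_all_layers(features_by_layer: dict, y: np.ndarray, seed: int = 0) -> dict:
    """features_by_layer maps layer index -> (n_samples, dim) feature matrix."""
    results = {}
    for layer, X in features_by_layer.items():
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, random_state=seed, stratify=y
        )
        clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        results[layer] = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
    return results  # layer -> held-out AUROC
```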