The code in this repository can be used to reproduce the results in the paper.
Huggingface models used in the experiments:
- Fanar1-9B https://huggingface.co/QCRI/Fanar-1-9B-Instruct
- Qwen2.5-7B https://huggingface.co/Qwen/Qwen2.5-7B-Instruct
- Gemma3-12B https://huggingface.co/google/gemma-3-12b-it
This project supports a study of how LLMs respond to different context conditions during question answering: with correct context (WCC), with incorrect context (WIC), or without context (WOC). It includes tools for generating responses, extracting token-level uncertainty metrics, and visualizing uncertainty and error-type transitions.
Main pipeline for:
- Generating responses from an LLM with and without context.
- Collecting uncertainty metrics (aleatoric, epistemic); see the sketch after this list.
- Applying automatic correctness labeling and error-type classification.
- Saving the enriched results into a CSV for analysis.
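For reference, here is a minimal sketch (not the paper's exact definitions) of how a per-token aleatoric-style uncertainty signal can be read off the generation scores with Hugging Face transformers. The model name and prompt are placeholders, and epistemic estimates typically require multiple stochastic passes (e.g., sampling or MC dropout), which are omitted here.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"  # any of the models listed above
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"  # device_map needs `accelerate`
)

inputs = tok("What is the capital of France?", return_tensors="pt").to(model.device)
out = model.generate(
    **inputs,
    max_new_tokens=32,
    do_sample=False,              # greedy decoding
    output_scores=True,           # keep per-step logits
    return_dict_in_generate=True,
)

# Predictive entropy of each generated token's distribution (aleatoric-style proxy).
probs = [torch.softmax(s, dim=-1) for s in out.scores]
token_entropy = [float(-(p * torch.log(p + 1e-12)).sum()) for p in probs]
print(token_entropy)
```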
python generate_responses.py <model_name> <dataset_path> <output_path> <in_context>
- <model_name>: HuggingFace model name or path (e.g., Qwen, Fanar, Gemma)
- <dataset_path>: Path to .csv or .parquet file with questions, answers, and context
- <output_path>: Destination file path to save output results
- <in_context>: Either wic (with incorrect context) or wcc (with correct context)
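For example (the dataset and output paths below are hypothetical):
python generate_responses.py Qwen/Qwen2.5-7B-Instruct data/triviaqa.parquet results/qwen_wcc.parquet wcc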
Toolkit for:
- Loading .parquet result files
- Parsing token-level metrics and logits
- Computing full-sequence reliability scores
- Plotting KDE distributions of uncertainty metrics (EU, AU, reliability)
Customize the final call to plot_figure(...) to generate your desired output.
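As a starting point, a minimal loading-and-plotting sketch is shown below; the column names ("eu", "au", "reliability") and the file path are assumptions, so adapt them to your result files and to the plot_figure(...) signature.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_parquet("results/qwen_wcc.parquet")  # hypothetical result file

# KDE of the uncertainty/reliability columns (names are illustrative).
for metric in ["eu", "au", "reliability"]:
    sns.kdeplot(df[metric].dropna(), label=metric, fill=True)

plt.xlabel("score")
plt.legend()
plt.tight_layout()
plt.savefig("uncertainty_kde.png", dpi=200)
```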
Install dependencies with:
pip install torch transformers openai pandas numpy matplotlib seaborn tqdm python-dotenv
Create a .env file in the root of your project with the following keys:
AZURE_OPENAI_ENDPOINT=https://your-resource-name.openai.azure.com/
AZURE_OPENAI_API_KEY=your-azure-openai-api-key
AZURE_DEPLOYMENT_NAME=gpt-4-deployment-name
AZURE_API_VERSION=2024-02-15-preview
HF_TOKEN=your-huggingface-token
Ensure this .env file is not committed to version control. Add it to your .gitignore file.
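A minimal sketch of how these keys are typically consumed, assuming the openai>=1.0 Python client and python-dotenv (the repository's own loading code may differ):

```python
import os
from dotenv import load_dotenv
from openai import AzureOpenAI

load_dotenv()  # loads the .env keys into the environment

client = AzureOpenAI(
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version=os.getenv("AZURE_API_VERSION"),
)

# For Azure, the "model" argument is the deployment name, not the model family name.
reply = client.chat.completions.create(
    model=os.getenv("AZURE_DEPLOYMENT_NAME"),
    messages=[{"role": "user", "content": "ping"}],
)
print(reply.choices[0].message.content)
```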
This directory contains four modular scripts designed to explore whether token-level uncertainty in large language models (LLMs) can be used to predict the factual reliability of generated responses. The full pipeline includes response generation, correctness labeling, hidden state extraction, and classifier-based probing.
- generation.py: Generates responses using greedy decoding.
- GPT_labeler.py: Uses GPT (e.g., ChatGPT) to label factual correctness and extract minimal answer spans.
- compute_hidden_states.py: Extracts token-level hidden states based on uncertainty-aware token selection strategies.
- probe_exp.py: Trains lightweight probing classifiers on extracted hidden state features.
1. Generate Responses
Use generation.py to generate LLM responses for selected datasets with greedy decoding:
python generation.py --module gemma --datasets truthfulqa triviaqa math
2. Label Responses
Label the generated responses for correctness and extract minimal answer spans:
python GPT_labeler.py --models fanar gemma qwen --datasets truthfulqa triviaqa math
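The exact judging prompt and parsing live in GPT_labeler.py; the sketch below only illustrates the shape of such a call, reusing the AzureOpenAI client from the environment section (the prompt wording is hypothetical).

```python
import os
from openai import AzureOpenAI

def label_response(client: AzureOpenAI, question: str, gold: str, response: str) -> str:
    """Ask the judge model for a correctness verdict and a minimal answer span."""
    prompt = (
        f"Question: {question}\nGold answer: {gold}\nModel response: {response}\n"
        "Reply with 'correct' or 'incorrect', followed by the minimal answer span."
    )
    out = client.chat.completions.create(
        model=os.getenv("AZURE_DEPLOYMENT_NAME"),
        messages=[{"role": "user", "content": prompt}],
    )
    return out.choices[0].message.content
```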
3. Compute Hidden States
Extract hidden states based on token-level uncertainty:
python compute_hidden_states.py \
--model [model] \
--datasets [dataset(s)] \
--uncertainty_type [au | eu | agg] \
--K [K]
- --uncertainty_type: Strategy for selecting target tokens (e.g., AU = aleatoric uncertainty, EU = epistemic uncertainty, AGG = aggregated).
- --K: Number of top tokens to use for feature construction.
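For intuition, here is a minimal sketch (not the script's exact logic) of uncertainty-aware token selection: score each generated token by predictive entropy, keep the top-K, and collect their hidden states from a chosen layer. The function name and defaults are illustrative.

```python
import torch

def topk_uncertain_hidden_states(model, tokenizer, prompt, k=5, layer=-1, max_new_tokens=64):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,                 # greedy decoding
        output_scores=True,
        output_hidden_states=True,
        return_dict_in_generate=True,
    )
    # Entropy of each generated token's distribution (aleatoric-style score).
    probs = [torch.softmax(s, dim=-1) for s in out.scores]
    entropy = torch.stack([-(p * torch.log(p + 1e-12)).sum(-1).squeeze(0) for p in probs])
    # Hidden state of the chosen layer at each generation step (last position).
    states = torch.stack([step[layer][0, -1, :] for step in out.hidden_states])
    # Keep the K most uncertain tokens as features.
    top = torch.topk(entropy, k=min(k, entropy.numel())).indices
    return states[top], entropy[top]
```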
4. Train Probing Classifiers
Train classifiers to predict factual correctness using extracted features:
python probe_exp.py \
--model [model] \
--dataset [dataset] \
--uncertainty_type [au | eu | agg] \
--K [K]
This will evaluate probing performance across all layers of the selected model and dataset.
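Conceptually, the probing step amounts to fitting a lightweight classifier per layer; a minimal scikit-learn sketch is shown below, where the per-layer feature matrices and correctness labels are assumed to come from the previous steps.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def probe_all_layers(features_by_layer: dict, y: np.ndarray, seed: int = 0) -> dict:
    """features_by_layer maps layer index -> (n_samples, dim) feature matrix."""
    results = {}
    for layer, X in features_by_layer.items():
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, random_state=seed, stratify=y
        )
        clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        results[layer] = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
    return results  # layer -> held-out AUROC
```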