NLEF (Natural Language Evaluation Framework) is a framework for assessing the quality, semantic faithfulness, and reliability of natural language (NL) explanations of node classification performed by Graph Neural Networks (GNNs) on knowledge graphs. It addresses the lack of rigorous evaluation methods for the increasingly common use of Large Language Models (LLMs) to generate accessible GNN explanations.
NLEF builds upon and extends the EDGE framework: it reuses EDGE's datasets, GNN models, and baseline explainers (such as EvoLearner), whose outputs serve as input to NLEF's evaluation.
Follow these steps to set up the NLEF environment:
Clone this repository:
git clone https://git.cs.uni-paderborn.de/dneib/nlef nlef_repo
cd nlef_repo
If you don't have Conda, install it from Anaconda's official website.
Create a conda environment using Python 3.10.13:
conda create --name nlef python=3.10.13 -y
conda activate nlef
(Alternatively, use python -m venv nlef_env && source nlef_env/bin/activate.)
Install the required Python packages:
pip install -r requirements.txt
If direct interaction with DGL is needed within NLEF, install the appropriate version; refer to the official DGL website.
Example for CUDA 11.8, which we used on the Windows machine that ran the experiments below (adjust based on the DGL docs and your system):
pip install dgl -f https://data.dgl.ai/wheels/cu118/repo.html
The respective Linux command:
pip install dgl -f https://data.dgl.ai/wheels/torch-2.3/cu118/repo.html
Depending on your GPU/CPU setup, install the suitable version of PyTorch from the official PyTorch website.
Uninstall previous installations to avoid conflicts:
pip uninstall torch torchvision torchaudio -y
We installed torch version 2.3.0 suitable for CUDA 11.8 (Windows & Linux command):
conda install pytorch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 pytorch-cuda=11.8 -c pytorch -c nvidia -y
NLEF requires the OWL knowledge graphs corresponding to the datasets (AIFB, MUTAG, BGS). Ensure these are available, typically preprocessed via the EDGE framework and placed in data/KGs/:
mkdir -p data/KGs && unzip KGs.zip -d data/KGs/
(On Windows, open the folder in Git Bash to run the command, or extract the archive manually.)
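To confirm the knowledge graphs are in place before running any commands, a minimal sanity check like the one below can help. It assumes rdflib is available in the environment (it may not be listed in requirements.txt) and that the extracted archive contains one .owl file per dataset; the exact filenames are not specified here.

```python
# Minimal sanity check that the OWL knowledge graphs are in place.
# Assumes rdflib is installed; the filenames inside data/KGs/ depend
# on the extracted KGs.zip archive.
from pathlib import Path

from rdflib import Graph

for kg_file in sorted(Path("data/KGs").glob("*.owl")):
    g = Graph()
    g.parse(kg_file)  # .owl defaults to the RDF/XML serialization
    print(f"{kg_file.name}: {len(g)} triples")
```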
The NLEF CLI provides commands for setup, NL generation, and NL evaluation. Access them via python nlef <command>.
NLEF requires API keys for LLMs (Google Gemini, OpenAI, Mistral). Set them up interactively:
python nlef setup-api-keys
This command helps store the keys in a .env file in the project root.
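Once stored, the keys can be read back from the .env file in the usual way. The sketch below assumes the python-dotenv package; the variable names are assumptions for illustration, so check the generated .env for the actual key names.

```python
# Sketch of reading the stored keys back, assuming python-dotenv.
# The variable names below are assumptions -- check the generated
# .env file for the actual key names NLEF uses.
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the project root into the environment
for name in ("OPENAI_API_KEY", "GOOGLE_API_KEY", "MISTRAL_API_KEY"):
    print(name, "set" if os.getenv(name) else "missing")
```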
Generate NL from EDGE/EvoLearner DL output
Example:
python nlef generate-from-edge-json --data aifb --model RGCN --llm gpt-4o-mini
Parameter Info
--data: Dataset or file identifier. Choices: [aifb, mutag, bgs] (Default: mutag)
--model: GNN model name. (Default: RGCN)
--method: Explainer method (e.g., EvoLearner). (Default: EvoLearner)
--llm: LLM for translation. Choices: [gemini-2.0-flash-001, gemini-1.5-flash, gpt-4o-mini, open-mistral-nemo] (Default: gemini-2.0-flash-001)
--sentence-number (-s): Max sentences used for the Natural Language Expression. (Default: 1)
--interactive (-i): Run interactively. (Default: False)
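To translate baseline output for all datasets in one pass, a thin wrapper around the CLI suffices. This is a convenience sketch, not part of NLEF; the flag values simply mirror the example above.

```python
# Convenience sketch (not part of NLEF): run the EDGE-to-NL
# translation for every dataset with the same GNN model and LLM.
import subprocess

for dataset in ("aifb", "mutag", "bgs"):
    subprocess.run(
        ["python", "nlef", "generate-from-edge-json",
         "--data", dataset, "--model", "RGCN", "--llm", "gpt-4o-mini"],
        check=True,  # fail fast if one run errors out
    )
```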
Generate NL directly using LLM + Ontology sampling
Example:
python nlef generate-natural-language-with-llm --data bgs --model gemini-2.0-flash-001 -g 3
Parameter Info
--data: Dataset name. Choices: [aifb, mutag, bgs] (Default: mutag)
--model: LLM for generation. Choices: [gemini-2.0-flash-001, gemini-1.5-flash, gpt-4o-mini, open-mistral-nemo] (Default: gemini-2.0-flash-001)
--temperature (-t): Generation temperature. (Default: 0.7)
--max-output-tokens (-mot): Max tokens per output. (Default: 300)
--timeout (-o): LLM timeout (seconds). (Default: 60)
--max-retries (-r): Max LLM retries. (Default: 3)
--sentence-number (-s): Max sentences per generation. (Default: 1)
--generations (-g): Number of generations. (Default: 5)
--interactive (-i): Run interactively. (Default: False)
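The --timeout and --max-retries flags suggest the usual retry pattern around flaky LLM calls. The sketch below illustrates that generic pattern in isolation; llm_call is a hypothetical placeholder, not an NLEF API, and NLEF's actual retry logic may differ.

```python
# Illustration of the retry behaviour that --timeout and --max-retries
# imply; llm_call is a hypothetical placeholder, not an NLEF function.
import time


def call_with_retries(llm_call, prompt, max_retries=3, backoff_seconds=2.0):
    """Retry a flaky LLM call, waiting a bit longer after each failure."""
    for attempt in range(1, max_retries + 1):
        try:
            return llm_call(prompt)  # assumed to raise on timeout or error
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            time.sleep(backoff_seconds * attempt)  # linear backoff
```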
Generate NL directly using small LM + Ontology sampling
Example:
python nlef generate-sentences-with-lm --data aifb --model EleutherAI/gpt-neo-125M -n 3
Parameter Info
--data: Dataset name. Choices: [aifb, mutag, bgs] (Default: mutag)
--model: Small language model name. (Default: EleutherAI/gpt-neo-125M)
--temperature (-t): Generation temperature. (Default: 0.7)
--max-length (-l): Max sentence length. (Default: 30)
--number-of-sentences (-n): Number of sentences to generate. (Default: 5)
--interactive (-i): Run interactively. (Default: False)
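Roughly, this command samples sentences from a small causal LM with the knobs listed above. The sketch below shows the equivalent Hugging Face transformers call using the same defaults; how NLEF builds its prompt from the sampled ontology is not shown, so the prompt here is a hypothetical placeholder.

```python
# Sketch of sampling sentences from the default small LM with the
# CLI's default knobs (temperature, max length, sentence count).
# The ontology-derived prompt NLEF constructs is omitted; the prompt
# below is a hypothetical placeholder.
from transformers import pipeline

generator = pipeline("text-generation", model="EleutherAI/gpt-neo-125M")
outputs = generator(
    "Every compound that contains a nitrogen atom",  # placeholder prompt
    max_length=30,
    do_sample=True,
    temperature=0.7,
    num_return_sequences=5,
)
for out in outputs:
    print(out["generated_text"])
```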
Evaluate NL using Approach 1 (DL Conversion)
Example:
python nlef evaluate-with-description-logic -d mutag -p test --nl-input nlef/NLEF/generated_natural_language/mutag_1.txt --save-results json
Parameter Info
--data (-d): Dataset for evaluation. Choices: [aifb, mutag, bgs] (Default: mutag)
--partition (-p): Dataset partition. Choices: [train, test, valid, all] (Default: test)
--llm (-l): LLM for NL-to-DL conversion. Choices: [gemini-2.0-flash-001, gemini-1.5-flash, gpt-4o-mini, open-mistral-nemo] (Default: gemini-2.0-flash-001)
--temperature (-t): LLM temperature. (Default: 0.0)
--timeout (-o): LLM timeout (seconds). (Default: 120)
--max-retries (-r): Max LLM retries. (Default: 5)
--nl-input (-n): NL expression string or path to file (one per line). (Default: path/to/nl_expressions.txt-or-single-nl_expression)
--save-results (-s): Save format (json, csv, txt, html). (Default: json)
--calc-metrics-against-gnn (-c): Path to baseline results JSON for metrics. (Default: path/to/dataframe.json)
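When --calc-metrics-against-gnn is given, the evaluation's verdicts are compared against baseline predictions from a results JSON. The sketch below shows how the reported metrics (accuracy, precision, recall, F1) can be computed from two paired label lists with scikit-learn; the JSON keys are assumptions for illustration and the actual file layout may differ.

```python
# Sketch of computing the reported metrics from paired binary labels
# with scikit-learn. The JSON keys below are assumptions; adapt the
# loading code to the actual results file produced by the pipeline.
import json

from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

with open("path/to/dataframe.json") as f:
    results = json.load(f)

y_true = results["gnn_predictions"]   # hypothetical key, 0/1 labels
y_pred = results["nlef_predictions"]  # hypothetical key, 0/1 labels

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1       :", f1_score(y_true, y_pred))
```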
Evaluate NL using Approach 2 (RAG - Single Instance Processing)
Example:
python nlef evaluate-with-rag -d aifb -n "Each project is financed by at most two publications that describe it."
Parameter Info
--data (-d): Dataset for evaluation. Choices: [aifb, mutag, bgs] (Default: mutag)
--partition (-p): Dataset partition. Choices: [train, test, valid, all] (Default: test)
--llm (-l): LLM used in RAG evaluation. Choices: [gemini-2.0-flash-001, gemini-1.5-flash, gpt-4o-mini, open-mistral-nemo] (Default: gemini-2.0-flash-001)
--temperature (-t): LLM temperature. (Default: 0)
--timeout (-o): LLM timeout (seconds). (Default: 120)
--max-retries (-r): Max LLM retries. (Default: 5)
--nl-input (-n): NL expression string or path to file (one per line). (Default: path/to/nl_expressions.txt-or-single-nl_expression)
--depth-limit (-de): Max retrieval depth. (Default: 4)
--doc-limit (-dl): Max retrieved documents. (Default: 100)
--max-workers (-w): Max parallel workers. (Default: 4)
--max-tokens (-mt): Max tokens for retrieval (potential LLM context limit). (Default: 100000)
--summarize (-sum): Summarize retrieved docs. (Default: False)
--save-results (-s): Save format (json, csv, txt, html). (Default: json)
--verbose (-v): Print verbose output, i.e., LLM reasoning. (Default: False)
--calc-metrics-against-gnn (-c): Path to baseline results JSON for metrics. (Default: path/to/dataframe.json)
Evaluate NL using Approach 2 (RAG - Batch Processing)
Example:
python nlef evaluate-with-rag-batch --data bgs --nl-input nlef/NLEF/generated_natural_language/bgs_1.txt
Parameter Info
--data (-d): Dataset for evaluation. Choices: [aifb, mutag, bgs] (Default: mutag)
--partition (-p): Dataset partition. Choices: [train, test, valid, all] (Default: test)
--llm (-l): LLM used in RAG evaluation. Choices: [gemini-2.0-flash-001, gemini-1.5-flash, gpt-4o-mini, open-mistral-nemo] (Default: gemini-2.0-flash-001)
--temperature (-t): LLM temperature. (Default: 0)
--timeout (-o): LLM timeout (seconds). (Default: 120)
--max-retries (-r): Max LLM retries. (Default: 10)
--nl-input (-n): Path to file containing NL expressions (one per line). (Default: path/to/nl_expressions.txt-or-single-nl_expression)
--depth-limit (-de): Max retrieval depth. (Default: 4)
--doc-limit (-dl): Max retrieved documents. (Default: 100)
--max-workers (-w): Max parallel workers. (Default: 4)
--max-tokens (-mt): Max tokens for retrieval (potential LLM context limit). (Default: 1000000)
--summarize (-sum): Summarize retrieved docs. (Default: False)
--save-results (-s): Save format (json, csv, txt, html). (Default: json)
--verbose (-v): Print verbose output, i.e., LLM reasoning. (Default: False)
--calc-metrics-against-gnn (-c): Path to baseline results JSON for metrics. (Default: path/to/dataframe.json)
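Batch mode reads one NL expression per line and evaluates them in parallel, which is what --max-workers controls. The sketch below shows that general pattern; evaluate() is a hypothetical stand-in for NLEF's per-expression RAG evaluation, not its actual API.

```python
# Generic parallel-evaluation pattern behind a --max-workers style
# flag; evaluate() is a hypothetical stand-in for NLEF's RAG step.
from concurrent.futures import ThreadPoolExecutor


def evaluate(expression: str) -> str:
    return f"evaluated: {expression}"  # placeholder for the real RAG call


with open("path/to/nl_expressions.txt") as f:
    expressions = [line.strip() for line in f if line.strip()]

# I/O-bound LLM calls parallelize well with threads.
with ThreadPoolExecutor(max_workers=4) as pool:
    for result in pool.map(evaluate, expressions):
        print(result)
```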
For detailed information on all available options and arguments for any command, use the --help flag:
python nlef <command> --help
Example: python nlef evaluate-with-rag-batch --help
The NLEF framework was evaluated using the following benchmark datasets from EDGE:
- AIFB
- MUTAG
- BGS
The primary LLMs used for evaluation were:
- Google Gemini-2.0-Flash
- OpenAI GPT-4o-mini
Detailed descriptions of the experimental setup, procedures, performance metrics, results, and discussion can be found in the accompanying Master Thesis.
The run_experiment.py script automates the process of running the baseline (EvoLearner via EDGE's main.py), generating natural language explanations from the baseline output, and evaluating these explanations using specified NLEF approaches and LLMs over multiple iterations.
Example Usage:
python run_experiment.py --datasets aifb mutag bgs --gnn-models RGCN --approaches description-logic rag-batch --chat-models gemini-2.0-flash-001 --iterations 5
Results generated using:
python run_experiment.py --chat-models gemini-2.0-flash-001
| Approach | Dataset | Pred Accuracy | Pred Precision | Pred Recall | Pred F1 Score | Exp Accuracy | Exp Precision | Exp Recall | Exp F1 Score |
|---|---|---|---|---|---|---|---|---|---|
| EvoLearner | AIFB | 0.639 | 0.538 | 0.987 | 0.696 | 0.650 | 0.555 | 0.988 | 0.708 |
| DL Conversion | AIFB | 0.533 | 0.477 | 1.000 | 0.644 | 0.544 | 0.488 | 1.000 | 0.655 |
| RAG Evaluation | AIFB | 0.544 | 0.464 | 0.813 | 0.587 | 0.567 | 0.488 | 0.827 | 0.610 |
| EvoLearner | MUTAG | 0.694 | 0.702 | 0.942 | 0.804 | 0.753 | 0.776 | 0.916 | 0.838 |
| DL Conversion | MUTAG | 0.615 | 0.561 | 0.742 | 0.636 | 0.674 | 0.637 | 0.726 | 0.675 |
| RAG Evaluation | MUTAG | 0.597 | 0.684 | 0.733 | 0.697 | 0.638 | 0.790 | 0.744 | 0.746 |
| EvoLearner | BGS | 0.393 | 0.363 | 0.980 | 0.528 | 0.366 | 0.333 | 0.978 | 0.492 |
| DL Conversion | BGS | 0.506 | 0.251 | 0.633 | 0.358 | 0.448 | 0.190 | 0.630 | 0.286 |
| RAG Evaluation | BGS | 0.455 | 0.443 | 0.840 | 0.518 | 0.414 | 0.373 | 0.816 | 0.454 |
Results generated using:
python run_experiment.py
| Approach | Dataset | Pred Accuracy | Pred Precision | Pred Recall | Pred F1 Score | Exp Accuracy | Exp Precision | Exp Recall | Exp F1 Score |
|---|---|---|---|---|---|---|---|---|---|
| EvoLearner | AIFB | 0.622 | 0.525 | 0.987 | 0.685 | 0.633 | 0.539 | 0.987 | 0.696 |
| DL Conversion | AIFB | 0.611 | 0.417 | 0.800 | 0.549 | 0.622 | 0.431 | 0.800 | 0.560 |
| RAG Evaluation | AIFB | 0.550 | 0.452 | 0.440 | 0.437 | 0.528 | 0.443 | 0.415 | 0.420 |
| EvoLearner | MUTAG | 0.694 | 0.690 | 0.982 | 0.810 | 0.685 | 0.680 | 0.981 | 0.802 |
| DL Conversion | MUTAG | 0.674 | 0.683 | 0.951 | 0.793 | 0.688 | 0.685 | 0.966 | 0.800 |
| RAG Evaluation | MUTAG | 0.638 | 0.683 | 0.858 | 0.754 | 0.676 | 0.705 | 0.891 | 0.781 |
| EvoLearner | BGS | 0.462 | 0.404 | 0.880 | 0.531 | 0.448 | 0.378 | 0.886 | 0.507 |
| DL Conversion | BGS | 0.500 | 0.172 | 0.500 | 0.256 | 0.466 | 0.147 | 0.500 | 0.227 |
| RAG Evaluation | BGS | 0.434 | 0.280 | 0.740 | 0.406 | 0.448 | 0.270 | 0.758 | 0.398 |
Results generated using:
python run_experiment.py --gnn-models RGAT --chat-models gemini-2.0-flash-001
| Approach | Dataset | Pred Accuracy | Pred Precision | Pred Recall | Pred F1 Score | Exp Accuracy | Exp Precision | Exp Recall | Exp F1 Score |
|---|---|---|---|---|---|---|---|---|---|
| EvoLearner | AIFB | 0.617 | 0.521 | 1.000 | 0.685 | 0.633 | 0.542 | 1.000 | 0.701 |
| DL Conversion | AIFB | 0.578 | 0.501 | 1.000 | 0.666 | 0.594 | 0.525 | 1.000 | 0.683 |
| RAG Evaluation | AIFB | 0.656 | 0.500 | 0.680 | 0.546 | 0.661 | 0.517 | 0.667 | 0.549 |
| EvoLearner | MUTAG | 0.682 | 0.688 | 0.960 | 0.799 | 0.741 | 0.748 | 0.947 | 0.835 |
| DL Conversion | MUTAG | 0.618 | 0.565 | 0.796 | 0.647 | 0.718 | 0.633 | 0.785 | 0.700 |
| RAG Evaluation | MUTAG | 0.606 | 0.680 | 0.751 | 0.704 | 0.671 | 0.776 | 0.769 | 0.761 |
| EvoLearner | BGS | 0.476 | 0.405 | 0.880 | 0.531 | 0.503 | 0.451 | 0.846 | 0.559 |
| DL Conversion | BGS | 0.510 | 0.221 | 0.580 | 0.320 | 0.483 | 0.232 | 0.585 | 0.326 |
| RAG Evaluation | BGS | 0.421 | 0.358 | 0.860 | 0.502 | 0.434 | 0.397 | 0.856 | 0.529 |
This repository contains the code to reproduce the results of the following paper:
@inproceedings{Heindorf2025NLEF,
author = {Stefan Heindorf and
Daniel Neib},
title = {Assessing Natural Language Explanations of Relational Graph Neural Networks},
booktitle = {{CIKM}},
publisher = {{ACM}},
year = {2025}
}

Issue:
ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject
Can be resolved by:
pip uninstall numpy
pip uninstall scikit-learn
pip install scikit-learn==1.3.2
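Afterwards, a quick import check confirms the rebuilt packages load without the binary-incompatibility error:

```python
# Quick check that NumPy and scikit-learn now import cleanly after
# reinstalling; the ValueError above would surface on these imports.
import numpy
import sklearn

print("numpy:", numpy.__version__, "| scikit-learn:", sklearn.__version__)
```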