NLEF (Natural Language Evaluation Framework) is a framework for assessing the quality, semantic faithfulness, and reliability of natural language (NL) explanations of node classification performed by Graph Neural Networks (GNNs) on knowledge graphs. It addresses the lack of rigorous evaluation methods for the increasingly common use of Large Language Models (LLMs) to generate accessible GNN explanations.
NLEF builds upon and extends the EDGE framework: it reuses EDGE's datasets, GNN models, and baseline explainers (such as EvoLearner), whose outputs serve as input to NLEF's evaluation.
Follow these steps to set up the NLEF environment:
Clone this repository:
git clone https://git.cs.uni-paderborn.de/dneib/nlef nlef_repo
cd nlef_repo
If you don't have Conda, install it from Anaconda's official website.
Create a conda environment using Python 3.10.13:
conda create --name nlef python=3.10.13 -y
conda activate nlef
(Alternatively, use python -m venv nlef_env && source nlef_env/bin/activate.)
Install the required Python packages:
pip install -r requirements.txt
If direct interaction with DGL is needed within NLEF, install the appropriate version; refer to the official DGL website.
Example for CUDA 11.8, which we used on the Windows machine that ran the experiments below (adjust based on the DGL docs and your system):
pip install dgl -f https://data.dgl.ai/wheels/cu118/repo.html
The respective Linux command:
pip install dgl -f https://data.dgl.ai/wheels/torch-2.3/cu118/repo.html
Depending on your GPU/CPU setup, install the suitable version of PyTorch from the official PyTorch website.
Uninstall previous installations to avoid conflicts:
pip uninstall torch torchvision torchaudio -y
We installed torch version 2.3.0 suitable for CUDA 11.8 (Windows & Linux command):
conda install pytorch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 pytorch-cuda=11.8 -c pytorch -c nvidia -y
NLEF requires the OWL knowledge graphs corresponding to the datasets (AIFB, MUTAG, BGS). Ensure these are available, typically preprocessed via the EDGE framework and placed in data/KGs/:
mkdir -p data/KGs && unzip KGs.zip -d data/KGs/
(On Windows, open the folder in Git Bash to run the command, or extract the archive manually.)
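To confirm the knowledge graphs are in place before running any commands, a minimal sanity check like the one below can help. It assumes rdflib is available in the environment (it may not be listed in requirements.txt) and that the extracted archive contains one .owl file per dataset; the exact filenames are not specified here.

```python
# Minimal sanity check that the OWL knowledge graphs are in place.
# Assumes rdflib is installed; the filenames inside data/KGs/ depend
# on the extracted KGs.zip archive.
from pathlib import Path

from rdflib import Graph

for kg_file in sorted(Path("data/KGs").glob("*.owl")):
    g = Graph()
    g.parse(kg_file)  # .owl defaults to the RDF/XML serialization
    print(f"{kg_file.name}: {len(g)} triples")
```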
The NLEF CLI provides commands for setup, NL generation, and NL evaluation. Access them via python nlef <command>.
NLEF requires API keys for LLMs (Google Gemini, OpenAI, Mistral). Set them up interactively:
python nlef setup-api-keys
This command helps store the keys in a .env file in the project root.
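Once stored, the keys can be read back from the .env file in the usual way. The sketch below assumes the python-dotenv package; the variable names are assumptions for illustration, so check the generated .env for the actual key names.

```python
# Sketch of reading the stored keys back, assuming python-dotenv.
# The variable names below are assumptions -- check the generated
# .env file for the actual key names NLEF uses.
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the project root into the environment
for name in ("OPENAI_API_KEY", "GOOGLE_API_KEY", "MISTRAL_API_KEY"):
    print(name, "set" if os.getenv(name) else "missing")
```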
Generate NL from EDGE/EvoLearner DL output
Example:
python nlef generate-from-edge-json --data aifb --model RGCN --llm gpt-4o-mini
Parameter Info
--data: Dataset or file identifier. Choices: [aifb, mutag, bgs] (Default: mutag)
--model: GNN model name. (Default: RGCN)
--method: Explainer method (e.g., EvoLearner). (Default: EvoLearner)
--llm: LLM for translation. Choices: [gemini-2.0-flash-001, gemini-1.5-flash, gpt-4o-mini, open-mistral-nemo] (Default: gemini-2.0-flash-001)
--sentence-number (-s): Max sentences used for the Natural Language Expression. (Default: 1)
--interactive (-i): Run interactively. (Default: False)
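To translate baseline output for all datasets in one pass, a thin wrapper around the CLI suffices. This is a convenience sketch, not part of NLEF; the flag values simply mirror the example above.

```python
# Convenience sketch (not part of NLEF): run the EDGE-to-NL
# translation for every dataset with the same GNN model and LLM.
import subprocess

for dataset in ("aifb", "mutag", "bgs"):
    subprocess.run(
        ["python", "nlef", "generate-from-edge-json",
         "--data", dataset, "--model", "RGCN", "--llm", "gpt-4o-mini"],
        check=True,  # fail fast if one run errors out
    )
```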
Generate NL directly using LLM + Ontology sampling
Example:
python nlef generate-natural-language-with-llm --data bgs --model gemini-2.0-flash-001 -g 3
Parameter Info
--data: Dataset name. Choices: [aifb, mutag, bgs] (Default: mutag)
--model: LLM for generation. Choices: [gemini-2.0-flash-001, gemini-1.5-flash, gpt-4o-mini, open-mistral-nemo] (Default: gemini-2.0-flash-001)
--temperature (-t): Generation temperature. (Default: 0.7)
--max-output-tokens (-mot): Max tokens per output. (Default: 300)
--timeout (-o): LLM timeout (seconds). (Default: 60)
--max-retries (-r): Max LLM retries. (Default: 3)
--sentence-number (-s): Max sentences per generation. (Default: 1)
--generations (-g): Number of generations. (Default: 5)
--interactive (-i): Run interactively. (Default: False)
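The --timeout and --max-retries flags suggest the usual retry pattern around flaky LLM calls. The sketch below illustrates that generic pattern in isolation; llm_call is a hypothetical placeholder, not an NLEF API, and NLEF's actual retry logic may differ.

```python
# Illustration of the retry behaviour that --timeout and --max-retries
# imply; llm_call is a hypothetical placeholder, not an NLEF function.
import time


def call_with_retries(llm_call, prompt, max_retries=3, backoff_seconds=2.0):
    """Retry a flaky LLM call, waiting a bit longer after each failure."""
    for attempt in range(1, max_retries + 1):
        try:
            return llm_call(prompt)  # assumed to raise on timeout or error
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            time.sleep(backoff_seconds * attempt)  # linear backoff
```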
Generate NL directly using small LM + Ontology sampling
Example:
python nlef generate-sentences-with-lm --data aifb --model EleutherAI/gpt-neo-125M -n 3
Parameter Info
--data: Dataset name. Choices: [aifb, mutag, bgs] (Default: mutag)
--model: Small language model name. (Default: EleutherAI/gpt-neo-125M)
--temperature (-t): Generation temperature. (Default: 0.7)
--max-length (-l): Max sentence length. (Default: 30)
--number-of-sentences (-n): Number of sentences to generate. (Default: 5)
--interactive (-i): Run interactively. (Default: False)
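Roughly, this command samples sentences from a small causal LM with the knobs listed above. The sketch below shows the equivalent Hugging Face transformers call using the same defaults; how NLEF builds its prompt from the sampled ontology is not shown, so the prompt here is a hypothetical placeholder.

```python
# Sketch of sampling sentences from the default small LM with the
# CLI's default knobs (temperature, max length, sentence count).
# The ontology-derived prompt NLEF constructs is omitted; the prompt
# below is a hypothetical placeholder.
from transformers import pipeline

generator = pipeline("text-generation", model="EleutherAI/gpt-neo-125M")
outputs = generator(
    "Every compound that contains a nitrogen atom",  # placeholder prompt
    max_length=30,
    do_sample=True,
    temperature=0.7,
    num_return_sequences=5,
)
for out in outputs:
    print(out["generated_text"])
```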
Evaluate NL using Approach 1 (DL Conversion)
Example:
python nlef evaluate-with-description-logic -d mutag -p test --nl-input nlef/NLEF/generated_natural_language/mutag_1.txt --save-results json
Parameter Info
--data (-d): Dataset for evaluation. Choices: [aifb, mutag, bgs] (Default: mutag)
--partition (-p): Dataset partition. Choices: [train, test, valid, all] (Default: test)
--llm (-l): LLM for NL-to-DL conversion. Choices: [gemini-2.0-flash-001, gemini-1.5-flash, gpt-4o-mini, open-mistral-nemo] (Default: gemini-2.0-flash-001)
--temperature (-t): LLM temperature. (Default: 0.0)
--timeout (-o): LLM timeout (seconds). (Default: 120)
--max-retries (-r): Max LLM retries. (Default: 5)
--nl-input (-n): NL expression string or path to file (one per line). (Default: path/to/nl_expressions.txt-or-single-nl_expression)
--save-results (-s): Save format (json, csv, txt, html). (Default: json)
--calc-metrics-against-gnn (-c): Path to baseline results JSON for metrics. (Default: path/to/dataframe.json)
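When --calc-metrics-against-gnn is given, the evaluation's verdicts are compared against baseline predictions from a results JSON. The sketch below shows how the reported metrics (accuracy, precision, recall, F1) can be computed from two paired label lists with scikit-learn; the JSON keys are assumptions for illustration and the actual file layout may differ.

```python
# Sketch of computing the reported metrics from paired binary labels
# with scikit-learn. The JSON keys below are assumptions; adapt the
# loading code to the actual results file produced by the pipeline.
import json

from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

with open("path/to/dataframe.json") as f:
    results = json.load(f)

y_true = results["gnn_predictions"]   # hypothetical key, 0/1 labels
y_pred = results["nlef_predictions"]  # hypothetical key, 0/1 labels

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1       :", f1_score(y_true, y_pred))
```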
Evaluate NL using Approach 2 (RAG - Single Instance Processing)
Example:
python nlef evaluate-with-rag -d aifb -n "Each project is financed by at most two publications that describe it."
Parameter Info
--data (-d): Dataset for evaluation. Choices: [aifb, mutag, bgs] (Default: mutag)
--partition (-p): Dataset partition. Choices: [train, test, valid, all] (Default: test)
--llm (-l): LLM used in RAG evaluation. Choices: [gemini-2.0-flash-001, gemini-1.5-flash, gpt-4o-mini, open-mistral-nemo] (Default: gemini-2.0-flash-001)
--temperature (-t): LLM temperature. (Default: 0)
--timeout (-o): LLM timeout (seconds). (Default: 120)
--max-retries (-r): Max LLM retries. (Default: 5)
--nl-input (-n): NL expression string or path to file (one per line). (Default: path/to/nl_expressions.txt-or-single-nl_expression)
--depth-limit (-de): Max retrieval depth. (Default: 4)
--doc-limit (-dl): Max retrieved documents. (Default: 100)
--max-workers (-w): Max parallel workers. (Default: 4)
--max-tokens (-mt): Max tokens for retrieval (potential LLM context limit). (Default: 100000)
--summarize (-sum): Summarize retrieved docs. (Default: False)
--save-results (-s): Save format (json, csv, txt, html). (Default: json)
--verbose (-v): Print verbose output, i.e., LLM reasoning. (Default: False)
--calc-metrics-against-gnn (-c): Path to baseline results JSON for metrics. (Default: path/to/dataframe.json)
Evaluate NL using Approach 2 (RAG - Batch Processing)
Example:
python nlef evaluate-with-rag-batch --data bgs --nl-input nlef/NLEF/generated_natural_language/bgs_1.txt
Parameter Info
--data (-d): Dataset for evaluation. Choices: [aifb, mutag, bgs] (Default: mutag)
--partition (-p): Dataset partition. Choices: [train, test, valid, all] (Default: test)
--llm (-l): LLM used in RAG evaluation. Choices: [gemini-2.0-flash-001, gemini-1.5-flash, gpt-4o-mini, open-mistral-nemo] (Default: gemini-2.0-flash-001)
--temperature (-t): LLM temperature. (Default: 0)
--timeout (-o): LLM timeout (seconds). (Default: 120)
--max-retries (-r): Max LLM retries. (Default: 10)
--nl-input (-n): Path to file containing NL expressions (one per line). (Default: path/to/nl_expressions.txt-or-single-nl_expression)
--depth-limit (-de): Max retrieval depth. (Default: 4)
--doc-limit (-dl): Max retrieved documents. (Default: 100)
--max-workers (-w): Max parallel workers. (Default: 4)
--max-tokens (-mt): Max tokens for retrieval (potential LLM context limit). (Default: 1000000)
--summarize (-sum): Summarize retrieved docs. (Default: False)
--save-results (-s): Save format (json, csv, txt, html). (Default: json)
--verbose (-v): Print verbose output, i.e., LLM reasoning. (Default: False)
--calc-metrics-against-gnn (-c): Path to baseline results JSON for metrics. (Default: path/to/dataframe.json)
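Batch mode reads one NL expression per line and evaluates them in parallel, which is what --max-workers controls. The sketch below shows that general pattern; evaluate() is a hypothetical stand-in for NLEF's per-expression RAG evaluation, not its actual API.

```python
# Generic parallel-evaluation pattern behind a --max-workers style
# flag; evaluate() is a hypothetical stand-in for NLEF's RAG step.
from concurrent.futures import ThreadPoolExecutor


def evaluate(expression: str) -> str:
    return f"evaluated: {expression}"  # placeholder for the real RAG call


with open("path/to/nl_expressions.txt") as f:
    expressions = [line.strip() for line in f if line.strip()]

# I/O-bound LLM calls parallelize well with threads.
with ThreadPoolExecutor(max_workers=4) as pool:
    for result in pool.map(evaluate, expressions):
        print(result)
```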
For detailed information on all available options and arguments for any command, use the --help flag:
python nlef <command> --help
Example: python nlef evaluate-with-rag-batch --help
The NLEF framework was evaluated using the following benchmark datasets from EDGE:
- AIFB
- MUTAG
- BGS
The primary LLMs used for evaluation were:
- Google Gemini-2.0-Flash
- OpenAI GPT-4o-mini
Detailed descriptions of the experimental setup, procedures, performance metrics, results, and discussion can be found in the accompanying Master Thesis.
The run_experiment.py script automates the process of running the baseline (EvoLearner via EDGE's main.py), generating natural language explanations from the baseline output, and evaluating these explanations using specified NLEF approaches and LLMs over multiple iterations.
Example Usage:
python run_experiment.py --datasets aifb mutag bgs --gnn-models RGCN --approaches description-logic rag-batch --chat-models gemini-2.0-flash-001 --iterations 5
Results generated using:
python run_experiment.py --chat-models gemini-2.0-flash-001
| Approach | Dataset | Pred Accuracy | Pred Precision | Pred Recall | Pred F1 Score | Exp Accuracy | Exp Precision | Exp Recall | Exp F1 Score |
|---|---|---|---|---|---|---|---|---|---|
| EvoLearner | AIFB | 0.639 | 0.538 | 0.987 | 0.696 | 0.650 | 0.555 | 0.988 | 0.708 |
| DL Conversion | AIFB | 0.533 | 0.477 | 1.000 | 0.644 | 0.544 | 0.488 | 1.000 | 0.655 |
| RAG Evaluation | AIFB | 0.544 | 0.464 | 0.813 | 0.587 | 0.567 | 0.488 | 0.827 | 0.610 |
| EvoLearner | MUTAG | 0.694 | 0.702 | 0.942 | 0.804 | 0.753 | 0.776 | 0.916 | 0.838 |
| DL Conversion | MUTAG | 0.615 | 0.561 | 0.742 | 0.636 | 0.674 | 0.637 | 0.726 | 0.675 |
| RAG Evaluation | MUTAG | 0.597 | 0.684 | 0.733 | 0.697 | 0.638 | 0.790 | 0.744 | 0.746 |
| EvoLearner | BGS | 0.393 | 0.363 | 0.980 | 0.528 | 0.366 | 0.333 | 0.978 | 0.492 |
| DL Conversion | BGS | 0.506 | 0.251 | 0.633 | 0.358 | 0.448 | 0.190 | 0.630 | 0.286 |
| RAG Evaluation | BGS | 0.455 | 0.443 | 0.840 | 0.518 | 0.414 | 0.373 | 0.816 | 0.454 |
Results generated using:
python run_experiment.py
| Approach | Dataset | Pred Accuracy | Pred Precision | Pred Recall | Pred F1 Score | Exp Accuracy | Exp Precision | Exp Recall | Exp F1 Score |
|---|---|---|---|---|---|---|---|---|---|
| EvoLearner | AIFB | 0.622 | 0.525 | 0.987 | 0.685 | 0.633 | 0.539 | 0.987 | 0.696 |
| DL Conversion | AIFB | 0.611 | 0.417 | 0.800 | 0.549 | 0.622 | 0.431 | 0.800 | 0.560 |
| RAG Evaluation | AIFB | 0.550 | 0.452 | 0.440 | 0.437 | 0.528 | 0.443 | 0.415 | 0.420 |
| EvoLearner | MUTAG | 0.694 | 0.690 | 0.982 | 0.810 | 0.685 | 0.680 | 0.981 | 0.802 |
| DL Conversion | MUTAG | 0.674 | 0.683 | 0.951 | 0.793 | 0.688 | 0.685 | 0.966 | 0.800 |
| RAG Evaluation | MUTAG | 0.638 | 0.683 | 0.858 | 0.754 | 0.676 | 0.705 | 0.891 | 0.781 |
| EvoLearner | BGS | 0.462 | 0.404 | 0.880 | 0.531 | 0.448 | 0.378 | 0.886 | 0.507 |
| DL Conversion | BGS | 0.500 | 0.172 | 0.500 | 0.256 | 0.466 | 0.147 | 0.500 | 0.227 |
| RAG Evaluation | BGS | 0.434 | 0.280 | 0.740 | 0.406 | 0.448 | 0.270 | 0.758 | 0.398 |
Results generated using:
python run_experiment.py --gnn-models RGAT --chat-models gemini-2.0-flash-001
| Approach | Dataset | Pred Accuracy | Pred Precision | Pred Recall | Pred F1 Score | Exp Accuracy | Exp Precision | Exp Recall | Exp F1 Score |
|---|---|---|---|---|---|---|---|---|---|
| EvoLearner | AIFB | 0.617 | 0.521 | 1.000 | 0.685 | 0.633 | 0.542 | 1.000 | 0.701 |
| DL Conversion | AIFB | 0.578 | 0.501 | 1.000 | 0.666 | 0.594 | 0.525 | 1.000 | 0.683 |
| RAG Evaluation | AIFB | 0.656 | 0.500 | 0.680 | 0.546 | 0.661 | 0.517 | 0.667 | 0.549 |
| EvoLearner | MUTAG | 0.682 | 0.688 | 0.960 | 0.799 | 0.741 | 0.748 | 0.947 | 0.835 |
| DL Conversion | MUTAG | 0.618 | 0.565 | 0.796 | 0.647 | 0.718 | 0.633 | 0.785 | 0.700 |
| RAG Evaluation | MUTAG | 0.606 | 0.680 | 0.751 | 0.704 | 0.671 | 0.776 | 0.769 | 0.761 |
| EvoLearner | BGS | 0.476 | 0.405 | 0.880 | 0.531 | 0.503 | 0.451 | 0.846 | 0.559 |
| DL Conversion | BGS | 0.510 | 0.221 | 0.580 | 0.320 | 0.483 | 0.232 | 0.585 | 0.326 |
| RAG Evaluation | BGS | 0.421 | 0.358 | 0.860 | 0.502 | 0.434 | 0.397 | 0.856 | 0.529 |
This repository contains the code to reproduce the results of the following paper:
@inproceedings{Heindorf2025NLEF,
author = {Stefan Heindorf and
Daniel Neib},
title = {Assessing Natural Language Explanations of Relational Graph Neural Networks},
booktitle = {{CIKM}},
publisher = {{ACM}},
year = {2025}
}

Issue:
ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject
Can be resolved by:
pip uninstall numpy
pip uninstall scikit-learn
pip install scikit-learn==1.3.2
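Afterwards, a quick import check confirms the rebuilt packages load without the binary-incompatibility error:

```python
# Quick check that NumPy and scikit-learn now import cleanly after
# reinstalling; the ValueError above would surface on these imports.
import numpy
import sklearn

print("numpy:", numpy.__version__, "| scikit-learn:", sklearn.__version__)
```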