NLEF: Natural Language Evaluation Framework

Description

NLEF (Natural Language Evaluation Framework) is a framework designed to assess the quality, semantic faithfulness, and reliability of natural language (NL) explanations for node classification tasks performed by Graph Neural Networks (GNNs) on knowledge graphs. This work addresses the lack of rigorous evaluation methods for the increasingly common use of Large Language Models (LLMs) to generate accessible GNN explanations.

Relation to EDGE Framework

NLEF builds upon and extends the EDGE framework. It utilizes EDGE's datasets, GNN models, and baseline explainers (such as EvoLearner), whose outputs can serve as input to NLEF's evaluation processes.

Setup / Installation

Follow these steps to set up the NLEF environment:

Step 1: Clone the Repository

Clone this repository:

git clone https://git.cs.uni-paderborn.de/dneib/nlef nlef_repo
cd nlef_repo

Step 2: Install Conda (if needed)

If you don't have Conda, install it from Anaconda's official website.

Step 3: Create and Activate Conda Environment

Create and activate a conda environment with Python 3.10.13:

conda create --name nlef python=3.10.13 -y
conda activate nlef

(Alternatively, use python -m venv nlef_env && source nlef_env/bin/activate)

Step 4: Install Dependencies

Install the required Python packages:

pip install -r requirements.txt

Step 5: DGL Installation (If Required)

If direct interaction with DGL is needed within NLEF, install the appropriate version. Refer to the official DGL website.

Example for CUDA 11.8, as used on the Windows machine for the experiments below (adjust based on the DGL docs and your system):

pip install dgl -f https://data.dgl.ai/wheels/cu118/repo.html

The corresponding Linux command:

pip install dgl -f https://data.dgl.ai/wheels/torch-2.3/cu118/repo.html

Step 6: PyTorch Installation

Install the PyTorch version suitable for your GPU/CPU devices from the official PyTorch website.

Uninstall previous installations to avoid conflicts:

pip uninstall torch torchvision torchaudio -y

We installed PyTorch 2.3.0 for CUDA 11.8 (same command on Windows and Linux):

conda install pytorch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 pytorch-cuda=11.8 -c pytorch -c nvidia -y
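
To verify the installation, run a quick sanity check (a minimal sketch; the file name sanity_check.py is hypothetical, and the DGL import only applies if you completed Step 5):

# sanity_check.py -- confirm that PyTorch (and DGL, if installed) import and see the GPU
import torch
import dgl  # skip this line if you did not install DGL

print("torch:", torch.__version__)            # expected 2.3.0 with the command above
print("dgl:", dgl.__version__)
print("CUDA available:", torch.cuda.is_available())
print("CUDA build:", torch.version.cuda)      # e.g. 11.8 for the install above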

Step 7: Obtain Knowledge Graphs (KGs)

NLEF requires the OWL Knowledge Graphs corresponding to the datasets (AIFB, MUTAG, BGS). Ensure these are available, typically preprocessed via the EDGE framework and placed in data/KGs/.

mkdir -p data/KGs && unzip KGs.zip -d data/KGs/ 

(On Windows, open the folder in Git Bash to run the command, or extract the archive manually.)
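
As a quick check that the knowledge graphs were extracted correctly, the OWL files under data/KGs/ can be parsed, for example with rdflib (a minimal sketch; rdflib is an assumption here, not necessarily what NLEF uses internally, and the file layout inside KGs.zip may differ):

# check_kgs.py -- hypothetical helper: parse every OWL file found under data/KGs/
from pathlib import Path
from rdflib import Graph

for owl_file in sorted(Path("data/KGs").rglob("*.owl")):
    g = Graph()
    g.parse(str(owl_file), format="xml")   # assumes an RDF/XML serialization of the OWL files
    print(f"{owl_file.name}: {len(g)} triples")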

Step 8: Obtain API Keys

Obtain API keys from the LLM providers you plan to use (see Setup below):

  • Gemini
  • OpenAI

Framework Functionality

The NLEF CLI provides commands for setup, NL generation, and NL evaluation. Access commands using python nlef <command>.

Setup

NLEF requires API keys for LLMs (Google Gemini, OpenAI, Mistral). Set them up interactively:

python nlef setup-api-keys

This command helps store keys in a .env file in the project root.
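
To confirm the keys are picked up afterwards, they can be read back with python-dotenv (a minimal sketch; the variable names below are assumptions and may differ from what setup-api-keys actually writes to .env):

# check_keys.py -- hypothetical check that API keys are readable from the .env file
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory (the project root)
for var in ("GEMINI_API_KEY", "OPENAI_API_KEY", "MISTRAL_API_KEY"):  # assumed key names
    print(var, "set" if os.getenv(var) else "MISSING")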

Natural Language Generation Methods

Generate NL from EDGE/EvoLearner DL output

Example:

python nlef generate-from-edge-json --data aifb --model RGCN --llm gpt-4o-mini

Parameter Info

--data: Dataset or file identifier. Choices: [aifb, mutag, bgs] (Default: mutag)
--model: GNN model name. (Default: RGCN)
--method: Explainer method (e.g., EvoLearner). (Default: EvoLearner)
--llm: LLM for translation. Choices: [gemini-2.0-flash-001, gemini-1.5-flash, gpt-4o-mini, open-mistral-nemo] (Default: gemini-2.0-flash-001)
--sentence-number (-s): Max sentences used for the Natural Language Expression. (Default: 1)
--interactive (-i): Run interactively. (Default: False)
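
To generate explanations for all three datasets in one run, the CLI can be driven from a small wrapper script (a convenience sketch that only uses the flags documented above; the file name generate_all.py is hypothetical):

# generate_all.py -- run generate-from-edge-json for every dataset
import subprocess

for dataset in ("aifb", "mutag", "bgs"):
    subprocess.run(
        ["python", "nlef", "generate-from-edge-json",
         "--data", dataset, "--model", "RGCN", "--llm", "gpt-4o-mini"],
        check=True,  # abort if one run fails
    )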

Generate NL directly using LLM + Ontology sampling

Example:

python nlef generate-natural-language-with-llm --data bgs --model gemini-2.0-flash-001 -g 3

Parameter Info

--data: Dataset name. Choices: [aifb, mutag, bgs] (Default: mutag)
--model: LLM for generation. Choices: [gemini-2.0-flash-001, gemini-1.5-flash, gpt-4o-mini, open-mistral-nemo] (Default: gemini-2.0-flash-001)
--temperature (-t): Generation temperature. (Default: 0.7)
--max-output-tokens (-mot): Max tokens per output. (Default: 300)
--timeout (-o): LLM timeout (seconds). (Default: 60)
--max-retries (-r): Max LLM retries. (Default: 3)
--sentence-number (-s): Max sentences per generation. (Default: 1)
--generations (-g): Number of generations. (Default: 5)
--interactive (-i): Run interactively. (Default: False)
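
For intuition, the generation parameters map onto a standard chat-completion call roughly as follows (an illustrative sketch against the OpenAI Python client; the prompt and the actual client code inside NLEF may look different):

# illustrative only: how --temperature, --max-output-tokens, --timeout and --max-retries
# correspond to a chat-completion request
from openai import OpenAI

client = OpenAI(timeout=60, max_retries=3)            # --timeout, --max-retries
response = client.chat.completions.create(
    model="gpt-4o-mini",                              # when an OpenAI model is selected
    messages=[{"role": "user",
               "content": "Describe the sampled ontology pattern in one sentence."}],  # hypothetical prompt
    temperature=0.7,                                  # --temperature
    max_tokens=300,                                   # --max-output-tokens
)
print(response.choices[0].message.content)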

Generate NL directly using small LM + Ontology sampling

Example:

python nlef generate-sentences-with-lm --data aifb --model EleutherAI/gpt-neo-125M -n 3

Parameter Info

--data: Dataset name. Choices: [aifb, mutag, bgs] (Default: mutag)
--model: Small language model name. (Default: EleutherAI/gpt-neo-125M)
--temperature (-t): Generation temperature. (Default: 0.7)
--max-length (-l): Max sentence length. (Default: 30)
--number-of-sentences (-n): Number of sentences to generate. (Default: 5)
--interactive (-i): Run interactively. (Default: False)
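
Conceptually, this corresponds to sampling from a Hugging Face text-generation pipeline (an illustrative sketch; the prompt is hypothetical and NLEF's actual sampling code may differ):

# illustrative only: generating short sentences with a small LM such as EleutherAI/gpt-neo-125M
from transformers import pipeline

generator = pipeline("text-generation", model="EleutherAI/gpt-neo-125M")  # --model
outputs = generator(
    "Nodes of the positive class are characterized by",  # hypothetical prompt
    max_length=30,             # --max-length
    temperature=0.7,           # --temperature
    do_sample=True,
    num_return_sequences=3,    # --number-of-sentences
)
for out in outputs:
    print(out["generated_text"])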

Natural Language Evaluation Methods

Evaluate NL using Approach 1 (DL Conversion)

Example:

python nlef evaluate-with-description-logic -d mutag -p test --nl-input nlef/NLEF/generated_natural_language/mutag_1.txt --save-results json

Parameter Info

--data (-d): Dataset for evaluation. Choices: [aifb, mutag, bgs] (Default: mutag)
--partition (-p): Dataset partition. Choices: [train, test, valid, all] (Default: test)
--llm (-l): LLM for NL-to-DL conversion. Choices: [gemini-2.0-flash-001, gemini-1.5-flash, gpt-4o-mini, open-mistral-nemo] (Default: gemini-2.0-flash-001)
--temperature (-t): LLM temperature. (Default: 0.0)
--timeout (-o): LLM timeout (seconds). (Default: 120)
--max-retries (-r): Max LLM retries. (Default: 5)
--nl-input (-n): NL expression string or path to file (one per line). (Default: path/to/nl_expressions.txt-or-single-nl_expression)
--save-results (-s): Save format (json, csv, txt, html). (Default: json)
--calc-metrics-against-gnn (-c): Path to baseline results JSON for metrics. (Default: path/to/dataframe.json)
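
The --calc-metrics-against-gnn option compares the evaluation's per-node verdicts with a baseline results file. As a rough sketch of how such metrics can be computed with scikit-learn (the JSON schema and label encoding shown here are assumptions, not NLEF's documented format):

# illustrative only: accuracy / precision / recall / F1 between two per-node label lists
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

baseline_labels = [1, 0, 1, 1, 0, 1]   # hypothetical GNN/baseline verdicts (1 = positive class)
nlef_labels     = [1, 0, 0, 1, 0, 1]   # hypothetical verdicts from the NLEF evaluation

acc = accuracy_score(baseline_labels, nlef_labels)
prec, rec, f1, _ = precision_recall_fscore_support(baseline_labels, nlef_labels, average="binary")
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")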

Evaluate NL using Approach 2 (RAG - Single Instance Processing)

Example:

python nlef evaluate-with-rag -d aifb -n "Each project is financed by at most two publications that describe it."

Parameter Info

--data (-d): Dataset for evaluation. Choices: [aifb, mutag, bgs] (Default: mutag)
--partition (-p): Dataset partition. Choices: [train, test, valid, all] (Default: test)
--llm (-l): LLM used in RAG evaluation. Choices: [gemini-2.0-flash-001, gemini-1.5-flash, gpt-4o-mini, open-mistral-nemo] (Default: gemini-2.0-flash-001)
--temperature (-t): LLM temperature. (Default: 0)
--timeout (-o): LLM timeout (seconds). (Default: 120)
--max-retries (-r): Max LLM retries. (Default: 5)
--nl-input (-n): NL expression string or path to file (one per line). (Default: path/to/nl_expressions.txt-or-single-nl_expression)
--depth-limit (-de): Max retrieval depth. (Default: 4)
--doc-limit (-dl): Max retrieved documents. (Default: 100)
--max-workers (-w): Max parallel workers. (Default: 4)
--max-tokens (-mt): Max tokens for retrieval (potential LLM context limit). (Default: 100000)
--summarize (-sum): Summarize retrieved docs. (Default: False)
--save-results (-s): Save format (json, csv, txt, html). (Default: json)
--verbose (-v): Print verbose output, i.e. the LLM's reasoning. (Default: False)
--calc-metrics-against-gnn (-c): Path to baseline results JSON for metrics. (Default: path/to/dataframe.json)

Evaluate NL using Approach 2 (RAG - Batch Processing)

Example:

python nlef evaluate-with-rag-batch --data bgs --nl-input nlef/NLEF/generated_natural_language/bgs_1.txt

Parameter Info

--data (-d): Dataset for evaluation. Choices: [aifb, mutag, bgs] (Default: mutag)
--partition (-p): Dataset partition. Choices: [train, test, valid, all] (Default: test)
--llm (-l): LLM used in RAG evaluation. Choices: [gemini-2.0-flash-001, gemini-1.5-flash, gpt-4o-mini, open-mistral-nemo] (Default: gemini-2.0-flash-001)
--temperature (-t): LLM temperature. (Default: 0)
--timeout (-o): LLM timeout (seconds). (Default: 120)
--max-retries (-r): Max LLM retries. (Default: 10)
--nl-input (-n): Path to file containing NL expressions (one per line). (Default: path/to/nl_expressions.txt-or-single-nl_expression)
--depth-limit (-de): Max retrieval depth. (Default: 4)
--doc-limit (-dl): Max retrieved documents. (Default: 100)
--max-workers (-w): Max parallel workers. (Default: 4)
--max-tokens (-mt): Max tokens for retrieval (potential LLM context limit). (Default: 1000000)
--summarize (-sum): Summarize retrieved docs. (Default: False)
--save-results (-s): Save format (json, csv, txt, html). (Default: json)
--verbose (-v): Print verbose output, i.e. the LLM's reasoning. (Default: False)
--calc-metrics-against-gnn (-c): Path to baseline results JSON for metrics. (Default: path/to/dataframe.json)
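
The batch command processes many NL expressions from the input file; --max-workers bounds how many are evaluated concurrently. Conceptually (a sketch only, not NLEF's internal code; evaluate below is a placeholder):

# illustrative only: bounding concurrency the way --max-workers does
from concurrent.futures import ThreadPoolExecutor

def evaluate(expression: str) -> str:
    # placeholder for one RAG evaluation (retrieval + LLM call)
    return f"evaluated: {expression}"

expressions = open("nlef/NLEF/generated_natural_language/bgs_1.txt").read().splitlines()
with ThreadPoolExecutor(max_workers=4) as pool:   # --max-workers
    for result in pool.map(evaluate, expressions):
        print(result)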

For detailed information on all available options and arguments for any command, use the --help flag:

python nlef <command> --help

Example: python nlef evaluate-with-rag-batch --help

Experiments

The NLEF framework was evaluated using the following benchmark datasets from EDGE:

  • AIFB
  • MUTAG
  • BGS

The primary LLMs used for evaluation were:

  • Google Gemini-2.0-Flash
  • OpenAI GPT-4o-mini

Detailed descriptions of the experimental setup, procedures, performance metrics, results, and discussion can be found in the accompanying Master Thesis.

Running the Full Experiment Pipeline

The run_experiment.py script automates the process of running the baseline (EvoLearner via EDGE's main.py), generating natural language explanations from the baseline output, and evaluating these explanations using specified NLEF approaches and LLMs over multiple iterations.

Example Usage:

python run_experiment.py --datasets aifb mutag bgs --gnn-models RGCN --approaches description-logic rag-batch --chat-models gemini-2.0-flash-001 --iterations 5

RGCN Results using gemini-2.0-flash-001

Results generated using:

python run_experiment.py --chat-models gemini-2.0-flash-001

| Approach | Dataset | Pred Accuracy | Pred Precision | Pred Recall | Pred F1 Score | Exp Accuracy | Exp Precision | Exp Recall | Exp F1 Score |
|---|---|---|---|---|---|---|---|---|---|
| EvoLearner | AIFB | 0.639 | 0.538 | 0.987 | 0.696 | 0.650 | 0.555 | 0.988 | 0.708 |
| DL Conversion | AIFB | 0.533 | 0.477 | 1.000 | 0.644 | 0.544 | 0.488 | 1.000 | 0.655 |
| RAG Evaluation | AIFB | 0.544 | 0.464 | 0.813 | 0.587 | 0.567 | 0.488 | 0.827 | 0.610 |
| EvoLearner | MUTAG | 0.694 | 0.702 | 0.942 | 0.804 | 0.753 | 0.776 | 0.916 | 0.838 |
| DL Conversion | MUTAG | 0.615 | 0.561 | 0.742 | 0.636 | 0.674 | 0.637 | 0.726 | 0.675 |
| RAG Evaluation | MUTAG | 0.597 | 0.684 | 0.733 | 0.697 | 0.638 | 0.790 | 0.744 | 0.746 |
| EvoLearner | BGS | 0.393 | 0.363 | 0.980 | 0.528 | 0.366 | 0.333 | 0.978 | 0.492 |
| DL Conversion | BGS | 0.506 | 0.251 | 0.633 | 0.358 | 0.448 | 0.190 | 0.630 | 0.286 |
| RAG Evaluation | BGS | 0.455 | 0.443 | 0.840 | 0.518 | 0.414 | 0.373 | 0.816 | 0.454 |

RGCN Results using gpt-4o-mini

Results generated using:

python run_experiment.py

| Approach | Dataset | Pred Accuracy | Pred Precision | Pred Recall | Pred F1 Score | Exp Accuracy | Exp Precision | Exp Recall | Exp F1 Score |
|---|---|---|---|---|---|---|---|---|---|
| EvoLearner | AIFB | 0.622 | 0.525 | 0.987 | 0.685 | 0.633 | 0.539 | 0.987 | 0.696 |
| DL Conversion | AIFB | 0.611 | 0.417 | 0.800 | 0.549 | 0.622 | 0.431 | 0.800 | 0.560 |
| RAG Evaluation | AIFB | 0.550 | 0.452 | 0.440 | 0.437 | 0.528 | 0.443 | 0.415 | 0.420 |
| EvoLearner | MUTAG | 0.694 | 0.690 | 0.982 | 0.810 | 0.685 | 0.680 | 0.981 | 0.802 |
| DL Conversion | MUTAG | 0.674 | 0.683 | 0.951 | 0.793 | 0.688 | 0.685 | 0.966 | 0.800 |
| RAG Evaluation | MUTAG | 0.638 | 0.683 | 0.858 | 0.754 | 0.676 | 0.705 | 0.891 | 0.781 |
| EvoLearner | BGS | 0.462 | 0.404 | 0.880 | 0.531 | 0.448 | 0.378 | 0.886 | 0.507 |
| DL Conversion | BGS | 0.500 | 0.172 | 0.500 | 0.256 | 0.466 | 0.147 | 0.500 | 0.227 |
| RAG Evaluation | BGS | 0.434 | 0.280 | 0.740 | 0.406 | 0.448 | 0.270 | 0.758 | 0.398 |

RGAT Results using gemini-2.0-flash-001

Results generated using:

python run_experiment.py --gnn-models RGAT --chat-models gemini-2.0-flash-001

| Approach | Dataset | Pred Accuracy | Pred Precision | Pred Recall | Pred F1 Score | Exp Accuracy | Exp Precision | Exp Recall | Exp F1 Score |
|---|---|---|---|---|---|---|---|---|---|
| EvoLearner | AIFB | 0.617 | 0.521 | 1.000 | 0.685 | 0.633 | 0.542 | 1.000 | 0.701 |
| DL Conversion | AIFB | 0.578 | 0.501 | 1.000 | 0.666 | 0.594 | 0.525 | 1.000 | 0.683 |
| RAG Evaluation | AIFB | 0.656 | 0.500 | 0.680 | 0.546 | 0.661 | 0.517 | 0.667 | 0.549 |
| EvoLearner | MUTAG | 0.682 | 0.688 | 0.960 | 0.799 | 0.741 | 0.748 | 0.947 | 0.835 |
| DL Conversion | MUTAG | 0.618 | 0.565 | 0.796 | 0.647 | 0.718 | 0.633 | 0.785 | 0.700 |
| RAG Evaluation | MUTAG | 0.606 | 0.680 | 0.751 | 0.704 | 0.671 | 0.776 | 0.769 | 0.761 |
| EvoLearner | BGS | 0.476 | 0.405 | 0.880 | 0.531 | 0.503 | 0.451 | 0.846 | 0.559 |
| DL Conversion | BGS | 0.510 | 0.221 | 0.580 | 0.320 | 0.483 | 0.232 | 0.585 | 0.326 |
| RAG Evaluation | BGS | 0.421 | 0.358 | 0.860 | 0.502 | 0.434 | 0.397 | 0.856 | 0.529 |

Citation

This repository contains the code to reproduce the results of the following paper:

@inproceedings{Heindorf2025NLEF,
  author    = {Stefan Heindorf and
               Daniel Neib},
  title     = {Assessing Natural Language Explanations of Relational Graph Neural Networks},
  booktitle = {{CIKM}},
  publisher = {{ACM}},
  year      = {2025}
}

Known Issues

Issue:

ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject

Can be resolved by:

pip uninstall numpy
pip uninstall scikit-learn
pip install scikit-learn==1.3.2
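
The error indicates a binary mismatch between NumPy and a package compiled against a different NumPy version. After the reinstall, a quick check that the packages import cleanly together (a minimal sketch):

# verify the reinstalled packages work together
import numpy
import sklearn

print("numpy:", numpy.__version__)
print("scikit-learn:", sklearn.__version__)   # expected 1.3.2 after the fix above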
