Negation & Uncertainty Detection (Spanish/Catalan)

A sequence-tagging system for detecting negation and uncertainty in clinical text.
It includes three families of methods: rule-based (regex + syntax), CRF, and BiLSTM.


Overview

  • Rule-based with cues + scope extraction and special colon cases (e.g., “negativo/negativa: …”).
  • CRF (classic) with lexical features + context window.
  • CRF (POS + embeddings) using spaCy PoS/Tag and first 10 dims of token vectors.
  • BiLSTM tagger (IOB) with GPU requirement and class-imbalance weighting.
  • Evaluator with position tolerance (±2) and per-label metrics.

Repository structure

.
├─ data/
│  ├─ negacio_train_v2024.json
│  └─ negacio_test_v2024.json
├─ docs/
│  ├─ instructions_slides.pdf
│  └─ Negation_and_Uncertainty_Detection_using_Classical_and_Machine_Learning_Techniques.pdf
├─ base_model.py
├─ Bi_LSTM.py
├─ crf_model.py
├─ crf_pos_we_model.py
├─ data_processing.py
├─ main.py
├─ regex_model.py
├─ syntactic_parsing_model.py
├─ utils.py
  • Data (default paths): data/negacio_train_v2024.json, data/negacio_test_v2024.json.
  • Docs: slides & associated paper in docs/.
  • Preprocessing + splits + cue lexicon building: data_processing.py.
  • Common evaluation utilities (precision/recall/F1, tolerance): base_model.py.
  • Models: regex_model.py, syntactic_parsing_model.py, crf_model.py, crf_pos_we_model.py, Bi_LSTM.py.
  • Main script: main.py (loads data, trains model, evaluates).

Setup

# 1) Clone
git clone https://github.com/AlessandroFornara/NLP-Project.git
cd NLP-Project

# 2) Environment
conda create --name negunc python=3.10.14
conda activate negunc
pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu121

GPU note (BiLSTM): the training script contains an explicit assert torch.cuda.is_available(); enable CUDA, or remove the assert if you must run on CPU.
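A minimal CPU-fallback sketch (replacing the assert this way is an assumption about how you would adapt the code, not what the repo does):

import torch

# Instead of `assert torch.cuda.is_available()`, pick the best available device
# and move the model and batches to it ("model" stands for the BiLSTM instance).
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Training on {device}")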


Data format

Input files (train/test) are JSON with a Label-Studio-like structure:

{
  "data": { "text": "..." },
  "predictions": [{
    "result": [{
      "value": { "start": 10, "end": 15, "labels": ["NEG"] }
    }]
  }]
}

The loader normalizes text, builds lists of texts and gold spans, and collects cleaned NEG/UNC cue lexicons.
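A minimal loading sketch, assuming the file is a list of such records (function and variable names here are illustrative, not the repo's actual API):

import json

def load_annotations(path):
    """Read a Label-Studio-style JSON export into parallel text/span lists."""
    with open(path, encoding="utf-8") as f:
        docs = json.load(f)
    texts, spans = [], []
    for doc in docs:
        texts.append(doc["data"]["text"])
        spans.append([
            (r["value"]["start"], r["value"]["end"], r["value"]["labels"][0])
            for pred in doc["predictions"] for r in pred["result"]
        ])
    return texts, spans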


Quick start

Default: BiLSTM (train → validate → test)

python main.py

What main.py does:

  • Loads the training JSON, performs a random train/validation split, and trains the BiLSTM.
  • Evaluates on the validation set and then on the test set (data/negacio_test_v2024.json).

Switching models

main.py ships with the other models commented out; uncomment the relevant import and instantiate the one you want.

Rule-based (Regex):

from regex_model import RegexModel
model = RegexModel(train_data)
model.train_model()  # compiles cue/scope regex
  • Cues from cleaned NEG/UNC lexicons; scopes stop at punctuation or scope breakers (e.g., “pero”, “aunque”, “sin embargo”, etc.).
  • Special case: “negativo/negativa:” captures the previous word as scope (see the sketch below).
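As a rough illustration of the colon special case (the pattern is an assumption, not the repo's exact regex):

import re

# "negativo/negativa:": capture the word immediately before the cue as its scope.
COLON_CUE = re.compile(r"(\w+)\s+negativ[oa]\s*:", re.IGNORECASE)

m = COLON_CUE.search("Cultivo negativo: sin crecimiento bacteriano.")
if m:
    print("scope:", m.group(1))  # -> "Cultivo"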

Regex + Syntactic parsing (Spanish/Catalan):

from syntactic_parsing_model import RegexSyntacticModel
model = RegexSyntacticModel(train_data, lang="es")  # or "ca"
model.train_model()
  • Finds cue spans via regex; scope extends up to punctuation or coordinating/adverbial tokens using spaCy deps (cc, conj, advmod); see the sketch below.
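A minimal sketch of the dependency-based scope extension, assuming spaCy's es_core_news_sm model is installed (the repo's traversal logic may differ):

import spacy

nlp = spacy.load("es_core_news_sm")  # use a "ca_*" model for Catalan

def extend_scope(doc, cue_end):
    """Collect tokens after the cue until punctuation or a cc/conj/advmod token."""
    scope = []
    for tok in doc[cue_end:]:
        if tok.is_punct or tok.dep_ in {"cc", "conj", "advmod"}:
            break
        scope.append(tok)
    return scope

doc = nlp("No presenta fiebre pero refiere tos.")
print([t.text for t in extend_scope(doc, 1)])  # tokens following the cue "No"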

CRF (classic):

from crf_model import CRFModel
model = CRFModel(train_data, c1=0.1, c2=0.1, max_iterations=100)
model.train_model(save_model=True)  # saves trained_crf_model.pkl
  • Lexical features (word.lower(), prefixes/suffixes), simple context (±1), plus position ratio; see the feature sketch below.
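A feature-function sketch in the spirit of that description (names are illustrative; the actual features live in crf_model.py):

def token_features(tokens, i):
    """Feature dict for token i, in the shape consumed by sklearn-crfsuite."""
    w = tokens[i]
    feats = {
        "word.lower": w.lower(),
        "prefix3": w[:3],
        "suffix3": w[-3:],
        "pos_ratio": i / len(tokens),  # relative position in the sentence
    }
    if i > 0:
        feats["-1:word.lower"] = tokens[i - 1].lower()
    if i < len(tokens) - 1:
        feats["+1:word.lower"] = tokens[i + 1].lower()
    return feats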

CRF (POS + embeddings):

from crf_pos_we_model import CRFWithPoSAndEmbeddingsModel
model = CRFWithPoSAndEmbeddingsModel(train_data, c1=0.1, c2=0.1, max_iterations=100)
model.train_model(save_model=True)
  • Adds spaCy pos, tag and the first 10 vector dims to the feature set (sketched below).
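The extra features might look roughly like this, assuming a spaCy Doc aligned one-to-one with the CRF tokens (an assumption, not the repo's code):

def spacy_features(doc, i):
    """Extra features for token i taken from a parsed spaCy Doc."""
    tok = doc[i]
    feats = {"pos": tok.pos_, "tag": tok.tag_}
    # One feature per dimension for the first 10 dims of the token vector.
    for d, v in enumerate(tok.vector[:10]):
        feats[f"vec_{d}"] = float(v)
    return feats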

After training any model, the evaluation calls in main.py work the same way.


Evaluation

Use the built-in Evaluator:

  • Metrics (micro): Precision, Recall, F1; plus per-label metrics for NEG, UNC, NSCO, USCO.
  • Matching: text is normalized (case/character cleanup) and spans are matched with ±2 position tolerance (see the sketch after this list).
  • CLI output lists TP, FN, FP examples when verbose=True.
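A matching sketch under that ±2 tolerance (a hypothetical helper, not the Evaluator's actual code):

def spans_match(pred, gold, tol=2):
    """True if the labels agree and start/end offsets differ by at most tol."""
    (ps, pe, pl), (gs, ge, gl) = pred, gold
    return pl == gl and abs(ps - gs) <= tol and abs(pe - ge) <= tol

print(spans_match((10, 15, "NEG"), (11, 16, "NEG")))  # True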

Example output:

Results on test set:
Precision: 0.78
Recall:    0.72
F1-score:  0.75

Precision, Recall, F1 per label:
NEG: Precision: 0.80, Recall: 0.70, F1: 0.75
...

Implementation notes

  • Preprocessing removes non-letter characters (preserving accents/apostrophes) and normalizes whitespace; used for cue-lexicon building and token cleanup (see the sketch after this list).
  • Grouping merges contiguous predictions with the same label and adjacent offsets.
  • Slides / spec (project requirements, NegEx pointer, timelines) are in docs/instructions_slides.pdf.
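The normalization could be sketched as follows (the regex is an assumption, not the repo's exact pattern):

import re

def clean(text):
    # Keep letters (including accented ones) and apostrophes; drop the rest.
    text = re.sub(r"[^a-zA-ZÀ-ÿ']+", " ", text)
    # Collapse whitespace runs and trim.
    return re.sub(r"\s+", " ", text).strip()

print(clean("Sin  fiebre!!  (afebril)"))  # -> "Sin fiebre afebril"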

References

  • Project brief and slides: docs/instructions_slides.pdf
  • Scientific paper: docs/Negation_and_Uncertainty_Detection_using_Classical_and_Machine_Learning_Techniques.pdf

Contributors

Fundamentals of Natural Language Processing, BSc Artificial Intelligence, Universitat Autònoma de Barcelona (2024/25 A.Y.).
