Negation & Uncertainty Detection (Spanish/Catalan)

A sequence-tagging system for detecting negation and uncertainty in clinical text.
It includes three families of methods: rule-based (regex + syntax), CRF, and BiLSTM.


Overview

  • Rule-based with cues + scope extraction and special colon cases (e.g., “negativo/negativa: …”).
  • CRF (classic) with lexical features + context window.
  • CRF (POS + embeddings) using spaCy PoS/Tag and first 10 dims of token vectors.
  • BiLSTM tagger (IOB) with GPU requirement and class-imbalance weighting.
  • Evaluator with position tolerance (±2) and per-label metrics.

Repository structure

.
├─ data/
│  ├─ negacio_train_v2024.json
│  └─ negacio_test_v2024.json
├─ docs/
│  ├─ instructions_slides.pdf
│  └─ Negation_and_Uncertainty_Detection_using_Classical_and_Machine_Learning_Techniques.pdf
├─ base_model.py
├─ Bi_LSTM.py
├─ crf_model.py
├─ crf_pos_we_model.py
├─ data_processing.py
├─ main.py
├─ regex_model.py
├─ syntactic_parsing_model.py
├─ utils.py
  • Data (default paths): data/negacio_train_v2024.json, data/negacio_test_v2024.json.
  • Docs: slides & associated paper in docs/.
  • Preprocessing + splits + cue lexicon building: data_processing.py.
  • Common evaluation utilities (precision/recall/F1, tolerance): base_model.py.
  • Models: regex_model.py, syntactic_parsing_model.py, crf_model.py, crf_pos_we_model.py, Bi_LSTM.py.
  • Main script: main.py (loads data, trains model, evaluates).

Setup

# 1) Clone
git clone https://github.com/AlessandroFornara/NLP-Project.git
cd NLP-Project

# 2) Environment
conda create --name negunc python=3.10.14
conda activate negunc
pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu121

GPU note (BiLSTM): the training script contains an explicit assert torch.cuda.is_available(); enable CUDA, or remove the assert if you must run on CPU.
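A minimal CPU-fallback sketch (replacing the assert this way is an assumption about how you would adapt the code, not what the repo does):

import torch

# Instead of `assert torch.cuda.is_available()`, pick the best available device
# and move the model and batches to it ("model" stands for the BiLSTM instance).
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Training on {device}")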


Data format

Input files (train/test) are JSON with a Label-Studio-like structure:

{
  "data": { "text": "..." },
  "predictions": [{
    "result": [{
      "value": { "start": 10, "end": 15, "labels": ["NEG"] }
    }]
  }]
}

The loader normalizes text, builds lists of texts and gold spans, and collects cleaned NEG/UNC cue lexicons.
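A minimal loading sketch, assuming the file is a list of such records (function and variable names here are illustrative, not the repo's actual API):

import json

def load_annotations(path):
    """Read a Label-Studio-style JSON export into parallel text/span lists."""
    with open(path, encoding="utf-8") as f:
        docs = json.load(f)
    texts, spans = [], []
    for doc in docs:
        texts.append(doc["data"]["text"])
        spans.append([
            (r["value"]["start"], r["value"]["end"], r["value"]["labels"][0])
            for pred in doc["predictions"] for r in pred["result"]
        ])
    return texts, spans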


Quick start

Default: BiLSTM (train → validate → test)

python main.py

What main.py does:

  • Loads the training JSON, performs a random train/validation split, and trains the BiLSTM.
  • Evaluates on the validation set and then on the test set (data/negacio_test_v2024.json).

Switching models

main.py ships with the other models commented out; uncomment the relevant import and instantiate the one you want.

Rule-based (Regex):

from regex_model import RegexModel
model = RegexModel(train_data)
model.train_model()  # compiles cue/scope regex
  • Cues from cleaned NEG/UNC lexicons; scopes stop at punctuation or scope breakers (e.g., “pero”, “aunque”, “sin embargo”, etc.).
  • Special case: “negativo/negativa:” captures the previous word as scope (see the sketch below).
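As a rough illustration of the colon special case (the pattern is an assumption, not the repo's exact regex):

import re

# "negativo/negativa:": capture the word immediately before the cue as its scope.
COLON_CUE = re.compile(r"(\w+)\s+negativ[oa]\s*:", re.IGNORECASE)

m = COLON_CUE.search("Cultivo negativo: sin crecimiento bacteriano.")
if m:
    print("scope:", m.group(1))  # -> "Cultivo"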

Regex + Syntactic parsing (Spanish/Catalan):

from syntactic_parsing_model import RegexSyntacticModel
model = RegexSyntacticModel(train_data, lang="es")  # or "ca"
model.train_model()
  • Finds cue spans via regex; scope extends up to punctuation or coordinating/adverbial tokens using spaCy deps (cc, conj, advmod); see the sketch below.
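A minimal sketch of the dependency-based scope extension, assuming spaCy's es_core_news_sm model is installed (the repo's traversal logic may differ):

import spacy

nlp = spacy.load("es_core_news_sm")  # use a "ca_*" model for Catalan

def extend_scope(doc, cue_end):
    """Collect tokens after the cue until punctuation or a cc/conj/advmod token."""
    scope = []
    for tok in doc[cue_end:]:
        if tok.is_punct or tok.dep_ in {"cc", "conj", "advmod"}:
            break
        scope.append(tok)
    return scope

doc = nlp("No presenta fiebre pero refiere tos.")
print([t.text for t in extend_scope(doc, 1)])  # tokens following the cue "No"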

CRF (classic):

from crf_model import CRFModel
model = CRFModel(train_data, c1=0.1, c2=0.1, max_iterations=100)
model.train_model(save_model=True)  # saves trained_crf_model.pkl
  • Lexical features (word.lower(), prefixes/suffixes), simple context (±1), plus position ratio; see the feature sketch below.
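A feature-function sketch in the spirit of that description (names are illustrative; the actual features live in crf_model.py):

def token_features(tokens, i):
    """Feature dict for token i, in the shape consumed by sklearn-crfsuite."""
    w = tokens[i]
    feats = {
        "word.lower": w.lower(),
        "prefix3": w[:3],
        "suffix3": w[-3:],
        "pos_ratio": i / len(tokens),  # relative position in the sentence
    }
    if i > 0:
        feats["-1:word.lower"] = tokens[i - 1].lower()
    if i < len(tokens) - 1:
        feats["+1:word.lower"] = tokens[i + 1].lower()
    return feats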

CRF (POS + embeddings):

from crf_pos_we_model import CRFWithPoSAndEmbeddingsModel
model = CRFWithPoSAndEmbeddingsModel(train_data, c1=0.1, c2=0.1, max_iterations=100)
model.train_model(save_model=True)
  • Adds spaCy pos, tag and the first 10 vector dims to the feature set (sketched below).
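The extra features might look roughly like this, assuming a spaCy Doc aligned one-to-one with the CRF tokens (an assumption, not the repo's code):

def spacy_features(doc, i):
    """Extra features for token i taken from a parsed spaCy Doc."""
    tok = doc[i]
    feats = {"pos": tok.pos_, "tag": tok.tag_}
    # One feature per dimension for the first 10 dims of the token vector.
    for d, v in enumerate(tok.vector[:10]):
        feats[f"vec_{d}"] = float(v)
    return feats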

After training any model, the evaluation calls in main.py work the same way.


Evaluation

Use the built-in Evaluator:

  • Metrics (micro): Precision, Recall, F1; plus per-label metrics for NEG, UNC, NSCO, USCO.
  • Matching: text is normalized (case/character cleanup) and spans are matched with ±2 position tolerance (see the sketch after this list).
  • CLI output lists TP, FN, FP examples when verbose=True.
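A matching sketch under that ±2 tolerance (a hypothetical helper, not the Evaluator's actual code):

def spans_match(pred, gold, tol=2):
    """True if the labels agree and start/end offsets differ by at most tol."""
    (ps, pe, pl), (gs, ge, gl) = pred, gold
    return pl == gl and abs(ps - gs) <= tol and abs(pe - ge) <= tol

print(spans_match((10, 15, "NEG"), (11, 16, "NEG")))  # True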

Example output:

Results on test set:
Precision: 0.78
Recall:    0.72
F1-score:  0.75

Precision, Recall, F1 per label:
NEG: Precision: 0.80, Recall: 0.70, F1: 0.75
...

Implementation notes

  • Preprocessing removes non-letter characters (preserving accents/apostrophes) and normalizes whitespace; used for cue-lexicon building and token cleanup (see the sketch after this list).
  • Grouping merges contiguous predictions with the same label and adjacent offsets.
  • Slides / spec (project requirements, NegEx pointer, timelines) are in docs/instructions_slides.pdf.
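The normalization could be sketched as follows (the regex is an assumption, not the repo's exact pattern):

import re

def clean(text):
    # Keep letters (including accented ones) and apostrophes; drop the rest.
    text = re.sub(r"[^a-zA-ZÀ-ÿ']+", " ", text)
    # Collapse whitespace runs and trim.
    return re.sub(r"\s+", " ", text).strip()

print(clean("Sin  fiebre!!  (afebril)"))  # -> "Sin fiebre afebril"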

References

  • Project brief and slides: docs/instructions_slides.pdf
  • Scientific paper: docs/Negation_and_Uncertainty_Detection_using_Classical_and_Machine_Learning_Techniques.pdf

Contributors

Fundamentals of Natural Language Processing, BSc Artificial Intelligence, Universitat Autònoma de Barcelona (2024/25 A.Y.).
