A sequence-tagging system for detecting negation and uncertainty in clinical text. It includes three families of methods: rule-based (regex + syntax), CRF, and BiLSTM.
- Rule-based with cues + scope extraction and special colon cases (e.g., “negativo/negativa: …”).
- CRF (classic) with lexical features + context window.
- CRF (POS + embeddings) using spaCy PoS/Tag and first 10 dims of token vectors.
- BiLSTM tagger (IOB) with GPU requirement and class-imbalance weighting.
- Evaluator with position tolerance (±2) and per-label metrics.
```
.
├─ data/
│  ├─ negacio_train_v2024.json
│  └─ negacio_test_v2024.json
├─ docs/
│  ├─ instructions_slides.pdf
│  └─ Negation_and_Uncertainty_Detection_using_Classical_and_Machine_Learning_Techniques.pdf
├─ base_model.py
├─ Bi_LSTM.py
├─ crf_model.py
├─ crf_pos_we_model.py
├─ data_processing.py
├─ main.py
├─ regex_model.py
├─ syntactic_parsing_model.py
├─ utils.py
```
- Data (default paths): `data/negacio_train_v2024.json`, `data/negacio_test_v2024.json`
- Docs: slides and the associated paper in `docs/`
- Preprocessing, splits, and cue-lexicon building: `data_processing.py`
- Common evaluation utilities (precision/recall/F1, position tolerance): `base_model.py`
- Models: `regex_model.py`, `syntactic_parsing_model.py`, `crf_model.py`, `crf_pos_we_model.py`, `Bi_LSTM.py`
- Main script: `main.py` (loads data, trains the model, evaluates)
```bash
# 1) Clone
git clone https://github.com/AlessandroFornara/NLP-Project.git
cd NLP-Project

# 2) Environment
conda create --name negunc python=3.10.14
conda activate negunc
pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu121
```
GPU note (BiLSTM): the code contains an explicit `assert torch.cuda.is_available()`; enable CUDA, or remove the assert if you must run on CPU.
Input files (train/test) are JSON with a Label-Studio-like structure:

```json
{
  "data": { "text": "..." },
  "predictions": [{
    "result": [{
      "value": { "start": 10, "end": 15, "labels": ["NEG"] }
    }]
  }]
}
```

The loader normalizes the text, builds lists of texts and gold spans, and collects cleaned NEG/UNC cue lexicons.
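A minimal sketch of how such a file can be parsed into texts and gold spans (the function name and return shape here are illustrative, not the actual `data_processing.py` API):

```python
import json

def load_spans(path):
    """Read a Label-Studio-like JSON file and return (texts, gold_spans).

    gold_spans holds one list per document of (start, end, label) tuples.
    """
    with open(path, encoding="utf-8") as f:
        docs = json.load(f)

    texts, gold = [], []
    for doc in docs:
        texts.append(doc["data"]["text"])
        spans = []
        for pred in doc.get("predictions", []):
            for item in pred.get("result", []):
                v = item["value"]
                for label in v["labels"]:
                    spans.append((v["start"], v["end"], label))
        gold.append(spans)
    return texts, gold
```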
```bash
python main.py
```

What `main.py` does:
- Loads the train JSON, makes a random train/val split, and trains the BiLSTM
- Evaluates on validation and then on the test set (`data/negacio_test_v2024.json`)
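For orientation, a BiLSTM IOB tagger with class-imbalance weighting might look like this minimal PyTorch sketch (the architecture, sizes, and tag set are assumptions, not the exact `Bi_LSTM.py` code):

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Illustrative BiLSTM IOB tagger: embed -> BiLSTM -> per-token logits."""

    def __init__(self, vocab_size, n_tags, emb_dim=100, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_tags)

    def forward(self, token_ids):            # (batch, seq_len)
        h, _ = self.lstm(self.emb(token_ids))
        return self.out(h)                   # (batch, seq_len, n_tags)

# Class-imbalance weighting: the "O" tag dominates, so it gets a lower weight.
tag_weights = torch.tensor([0.1, 1.0, 1.0, 1.0, 1.0])  # e.g. O, B-NEG, I-NEG, B-UNC, I-UNC
criterion = nn.CrossEntropyLoss(weight=tag_weights, ignore_index=-100)

model = BiLSTMTagger(vocab_size=5000, n_tags=5)
logits = model(torch.randint(1, 5000, (2, 10)))   # dummy batch of 2 sentences
loss = criterion(logits.reshape(-1, 5), torch.randint(0, 5, (2, 10)).reshape(-1))
```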
The other models ship commented out in `main.py`; uncomment the import and instantiate the one you want.
Rule-based (Regex):

```python
from regex_model import RegexModel

model = RegexModel(train_data)
model.train_model()  # compiles cue/scope regexes
```

- Cues come from the cleaned NEG/UNC lexicons; scopes stop at punctuation or scope breakers (e.g., "pero", "aunque", "sin embargo").
- Special case "negativo/negativa:" captures the previous word as scope.
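The cue-plus-scope rule can be sketched with a single pattern (the cue list and pattern shape below are illustrative assumptions, not the compiled regexes in `regex_model.py`):

```python
import re

# After a negation cue, capture tokens up to punctuation or a scope breaker.
CUES = ["no", "sin", "niega"]
BREAKERS = r"(?:pero|aunque|sin embargo)"
PATTERN = re.compile(
    rf"\b(?P<cue>{'|'.join(CUES)})\b\s+(?P<scope>(?:(?!{BREAKERS})[^.,;:])*)",
    re.IGNORECASE,
)

def find_neg(text):
    """Return (cue, scope) pairs found in the text."""
    return [(m.group("cue"), m.group("scope").strip()) for m in PATTERN.finditer(text)]
```

The negative lookahead stops the scope at a breaker word, while the character class stops it at punctuation.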
Regex + syntactic parsing (Spanish/Catalan):

```python
from syntactic_parsing_model import RegexSyntacticModel

model = RegexSyntacticModel(train_data, lang="es")  # or "ca"
model.train_model()
```

- Finds cue spans via regex; the scope extends up to punctuation or coordinating/adverbial-modifier tokens, using spaCy dependencies (`cc`, `conj`, `advmod`).
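The scope-extension rule can be sketched without loading a spaCy model, over tokens already annotated with dependency labels (the `Tok` type mimics spaCy's `dep_`/`is_punct` attributes; it is an assumption, not the repo's code):

```python
from dataclasses import dataclass

@dataclass
class Tok:
    text: str
    dep_: str          # spaCy-style dependency label
    is_punct: bool = False

STOP_DEPS = {"cc", "conj", "advmod"}

def extend_scope(tokens, cue_idx):
    """Collect tokens after the cue until the first punctuation token or one
    whose dependency label is cc/conj/advmod."""
    scope = []
    for tok in tokens[cue_idx + 1:]:
        if tok.is_punct or tok.dep_ in STOP_DEPS:
            break
        scope.append(tok.text)
    return scope
```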
CRF (classic):

```python
from crf_model import CRFModel

model = CRFModel(train_data, c1=0.1, c2=0.1, max_iterations=100)
model.train_model(save_model=True)  # saves trained_crf_model.pkl
```

- Lexical features (`word.lower()`, prefixes/suffixes), simple context (±1), plus a position ratio.
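A feature extractor of that kind, in the dict-per-token style used by sklearn-crfsuite, might look like this (the feature names are illustrative; `crf_model.py` may use different ones):

```python
def token_features(sent, i):
    """Build a feature dict for token i of a tokenized sentence."""
    word = sent[i]
    feats = {
        "word.lower": word.lower(),
        "prefix3": word[:3],
        "suffix3": word[-3:],
        "position_ratio": i / len(sent),   # relative position in the sentence
    }
    if i > 0:
        feats["-1:word.lower"] = sent[i - 1].lower()   # left context (±1 window)
    else:
        feats["BOS"] = True
    if i < len(sent) - 1:
        feats["+1:word.lower"] = sent[i + 1].lower()   # right context
    else:
        feats["EOS"] = True
    return feats
```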
CRF (POS + embeddings):

```python
from crf_pos_we_model import CRFWithPoSAndEmbeddingsModel

model = CRFWithPoSAndEmbeddingsModel(train_data, c1=0.1, c2=0.1, max_iterations=100)
model.train_model(save_model=True)
```

- Adds spaCy `pos`, `tag`, and the first 10 vector dimensions to the feature set.
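The extra features can be sketched as follows, taking the PoS tag, fine-grained tag, and vector as inputs rather than loading a spaCy pipeline (names are illustrative assumptions):

```python
def pos_emb_features(word, pos, tag, vector):
    """Feature dict extended with PoS/Tag and the first 10 embedding dims."""
    feats = {"word.lower": word.lower(), "pos": pos, "tag": tag}
    for k, v in enumerate(vector[:10]):   # truncate to the first 10 dimensions
        feats[f"emb_{k}"] = float(v)
    return feats
```

Truncating the vector keeps the CRF's feature space small while still injecting some distributional signal.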
After training any model, the evaluation calls in `main.py` work the same way.
Use the built-in Evaluator:
- Metrics (micro): Precision, Recall, F1, plus per-label metrics for `NEG`, `UNC`, `NSCO`, `USCO`.
- Matching: text is normalized (case/character cleanup) and spans are matched with ±2 position tolerance.
- CLI output lists TP, FN, and FP examples when `verbose=True`.
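Tolerant span matching of this kind can be sketched as follows (a simplified illustration, not the actual Evaluator in `base_model.py`):

```python
TOL = 2  # allowed slack, in positions, on each span boundary

def match(pred, gold):
    """A predicted (start, end, label) is a hit when the label matches and
    both boundaries fall within +/-TOL of a gold span."""
    ps, pe, pl = pred
    gs, ge, gl = gold
    return pl == gl and abs(ps - gs) <= TOL and abs(pe - ge) <= TOL

def micro_prf(preds, golds):
    """Micro precision/recall/F1 over predicted and gold span lists."""
    tp = sum(any(match(p, g) for g in golds) for p in preds)
    fp = len(preds) - tp
    fn = sum(not any(match(p, g) for p in preds) for g in golds)
    prec = tp / (tp + fp) if preds else 0.0
    rec = tp / (tp + fn) if golds else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```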
Example output:

```
Results on test set:
Precision: 0.78
Recall: 0.72
F1-score: 0.75
Precision, Recall, F1 per label:
NEG: Precision: 0.80, Recall: 0.70, F1: 0.75
...
```
- Preprocessing removes non-letter characters (preserving accents/apostrophes) and normalizes whitespace; used for cue lexicon and token cleanup.
- Grouping merges contiguous predictions with same label and adjacent offsets.
- Project brief and slides (requirements, NegEx pointer, timelines): `docs/instructions_slides.pdf`
- Scientific paper: `docs/Negation_and_Uncertainty_Detection_using_Classical_and_Machine_Learning_Techniques.pdf`
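The preprocessing and span-grouping behaviour described earlier can be sketched as follows (the regex and the adjacency rule are illustrative assumptions, not the exact `utils.py` code):

```python
import re

def clean(text):
    """Keep letters (including accented ones) and apostrophes; collapse whitespace."""
    text = re.sub(r"[^a-zA-ZÀ-ÿ'\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def group_spans(spans):
    """Merge contiguous predictions with the same label and adjacent offsets."""
    merged = []
    for start, end, label in sorted(spans):
        if merged and merged[-1][2] == label and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(end, merged[-1][1]), label)
        else:
            merged.append((start, end, label))
    return merged
```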
Fundamentals of Natural Language Processing, BSc Artificial Intelligence, UAB, 2025