This repository contains a modular implementation of a Two-Tower (Dual Encoder) neural network for document retrieval.
- Modular Design: Each component (tokenization, embedding, encoding) is implemented as a separate module
- Config-Driven: All model and training parameters defined in YAML configuration files
- Easily Extensible: Adding new tokenizers, embeddings, or encoders only requires implementing a new class
- Standard IR Metrics: Comprehensive evaluation metrics (Precision@K, Recall@K, MRR, NDCG)
- Unified Search Interface: Common interface for different search implementations
- CLI Tools: Command-line tools for building indices and retrieving documents
- Docker-based Deployment: Containerized inference services with a web interface and REST API

Repository layout:

    two-towers/
    │
    ├─ twotower/                  # Core model training code
    │  ├─ tokenisers.py           # Tokenization (Stage 1)
    │  ├─ embeddings.py           # Embedding layers (Stage 2)
    │  ├─ encoders.py             # Encoder towers (Stage 3)
    │  ├─ losses.py               # Loss functions (Stage 4)
    │  ├─ dataset.py              # Dataset handling
    │  ├─ train.py                # Training orchestration
    │  ├─ utils.py                # Utilities and helpers
    │  └─ evaluate.py             # Evaluation metrics
    │
    ├─ inference/                 # Inference and retrieval code
    │  ├─ search/                 # Search implementations
    │  │  ├─ base.py              # Base search interface
    │  │  ├─ glove.py             # GloVe-based search
    │  │  └─ two_tower.py         # Two-Tower model search
    │  ├─ cli/                    # Command-line tools
    │  │  └─ retrieve.py          # Document retrieval CLI
    │  └─ examples/               # Example scripts
    │
    ├─ configs/                   # Configuration files
    │  ├─ default_config.yml      # Base configuration
    │  ├─ char_tower.yml          # Character-level model config
    │  └─ word2vec_skipgram.yml   # Word2Vec embedding config
    │
    ├─ docs/                      # Documentation
    ├─ tools/                     # Utility scripts
    └─ artifacts/                 # Project artifacts and documentation

Install the package in development mode:

    pip install -e .

To train a model using a configuration file:

    python train.py --config configs/char_tower.yml

To enable Weights & Biases logging:

    python train.py --config configs/char_tower.yml --use_wandb

For retrieval, first build an index of document vectors:

    python -m inference.cli.retrieve build-index --model checkpoints/best_model.pt --documents my_documents.txt --output document_index.pkl

Then, retrieve documents for a query:

    python -m inference.cli.retrieve search --model checkpoints/best_model.pt --index document_index.pkl --query "your search query"

The search implementations can also be used directly from Python:

    from inference.search import GloVeSearch, TwoTowerSearch

    # `documents` below is a list of raw text strings to index.

    # GloVe-based search
    glove_search = GloVeSearch(model_name='glove-wiki-gigaword-50')
    glove_search.index_documents(documents)
    results = glove_search.search("query text", top_k=5)

    # Two-Tower search
    from twotower import load_checkpoint

    checkpoint = load_checkpoint("model.pt")
    model = checkpoint["model"]
    tokenizer = checkpoint["tokenizer"]

    two_tower_search = TwoTowerSearch(model, tokenizer)
    two_tower_search.index_documents(documents)
    results = two_tower_search.search("query text", top_k=5)

To evaluate a trained model on standard IR metrics:

    from twotower.evaluate import evaluate_model, print_evaluation_results
    results = evaluate_model(
        model=model,
        test_data=test_data,
        tokenizer=tokenizer,
        metrics=['precision', 'recall', 'mrr', 'ndcg'],
        k_values=[1, 3, 5, 10]
    )
    print_evaluation_results(results)

The Two-Tower system uses a hierarchical YAML-based configuration system with:
- Inheritance: Configs can extend other configs using the `extends` property
- Environment Variables: Override settings with `TWOTOWER_`-prefixed environment variables
- Command-line Overrides: Override configs with command-line arguments
For a complete configuration reference, see artifacts/docs/config.md.
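
Environment variables and command-line arguments take precedence over the values in the YAML files. As a rough illustration of the environment-variable layer only, the sketch below applies `TWOTOWER_`-prefixed variables to a plain config dictionary; the double-underscore key convention and the helper itself are assumptions for illustration, not the repository's documented scheme.

    # Illustrative sketch only: map TWOTOWER_-prefixed environment variables
    # onto nested config keys. The double-underscore convention is an assumption.
    import os

    def apply_env_overrides(config: dict, prefix: str = "TWOTOWER_") -> dict:
        for name, value in os.environ.items():
            if not name.startswith(prefix):
                continue
            path = name[len(prefix):].lower().split("__")  # TRAINING__LR -> ["training", "lr"]
            node = config
            for key in path[:-1]:
                node = node.setdefault(key, {})
            node[path[-1]] = value  # note: values arrive as strings
        return config

    config = {"training": {"lr": 0.001, "epochs": 10}}
    os.environ["TWOTOWER_TRAINING__LR"] = "0.01"
    print(apply_env_overrides(config))  # {'training': {'lr': '0.01', 'epochs': 10}}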
The code follows a 5-stage pipeline:
- Tokenization: Converts text to token IDs (`tokenisers.py`)
- Embedding: Maps token IDs to dense vectors (`embeddings.py`)
- Encoding: Transforms token embeddings into a single vector (`encoders.py`)
- Loss Function: Defines the training objective (`losses.py`)
- Training: Orchestrates the training process (`train.py`)
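
To make the data flow concrete, here is a minimal sketch of the first four stages in plain PyTorch. The character-ID tokenisation, layer sizes, and LSTM encoder are placeholder choices for illustration, not the classes defined in this repository.

    # Conceptual walk-through of the pipeline stages (illustrative placeholders,
    # not the classes from tokenisers.py / embeddings.py / encoders.py / losses.py).
    import torch
    import torch.nn as nn

    text = "example query"

    # Stage 1 - Tokenization: text -> token IDs (naive char-level IDs here)
    token_ids = torch.tensor([[ord(c) % 100 for c in text]])

    # Stage 2 - Embedding: token IDs -> dense vectors
    embedding = nn.Embedding(num_embeddings=100, embedding_dim=64)
    token_vectors = embedding(token_ids)               # shape (1, seq_len, 64)

    # Stage 3 - Encoding: token vectors -> a single query/document vector
    encoder = nn.LSTM(input_size=64, hidden_size=128, batch_first=True)
    _, (hidden, _) = encoder(token_vectors)
    query_vector = hidden[-1]                          # shape (1, 128)

    # Stage 4 - Loss: a contrastive/triplet-style objective over
    # (query, positive document, negative document) vectors
    loss_fn = nn.TripletMarginLoss(margin=0.2)

    # Stage 5 - Training: train.py wires real versions of these stages together
    # and runs the optimisation loop over query/document pairs.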
The modular design makes it easy to extend the model with new components:
- New tokenizer: Create a class that inherits from `BaseTokeniser` in `tokenisers.py`, implement the required methods, and add it to the `REGISTRY` dictionary
- New embedding: Create a class that inherits from `BaseEmbedding` in `embeddings.py`, implement the required methods, and add it to the `REGISTRY` dictionary
- New encoder: Create a class that inherits from `BaseTower` in `encoders.py`, implement the required methods, and add it to the `TOWER_REGISTRY` dictionary
- New search implementation: Create a class that inherits from `BaseSearch` in `inference/search/base.py` and implement the required methods
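
As an example of the pattern, a minimal custom tokenizer might look like the sketch below. The `fit`/`encode` method names and the `"whitespace"` registry key are assumptions for illustration; check `BaseTokeniser` in `tokenisers.py` for the actual abstract interface.

    # Hypothetical whitespace tokenizer: the required abstract methods may differ,
    # so verify against BaseTokeniser in tokenisers.py before using this pattern.
    from twotower.tokenisers import BaseTokeniser, REGISTRY

    class WhitespaceTokeniser(BaseTokeniser):
        """Splits text on whitespace and maps each token to an integer ID."""

        def __init__(self):
            self.vocab = {"<unk>": 0}

        def fit(self, texts):
            # Build the vocabulary from an iterable of raw text strings.
            for text in texts:
                for token in text.split():
                    self.vocab.setdefault(token, len(self.vocab))

        def encode(self, text):
            # Map unknown tokens to the <unk> ID (0).
            return [self.vocab.get(token, 0) for token in text.split()]

    # Register under a name that configs can reference (key name is illustrative).
    REGISTRY["whitespace"] = WhitespaceTokeniser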

To deploy the model with Docker:

- Start the Docker services:

      docker compose up -d

- Access the web interface at http://localhost:8080
- Use the API endpoints to interact with the model:
  - POST `/add` - Add documents to the vector database
  - POST `/search` - Search for similar documents
  - POST `/embed` - Generate embeddings for text
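
A rough client-side sketch of these endpoints is shown below, assuming the API is served on the same host and port as the web interface; the JSON field names (`documents`, `query`, `top_k`, `text`) are illustrative assumptions rather than the documented schema.

    # Illustrative API client; payload field names are assumptions, so check the
    # Docker Setup Documentation for the actual request/response schema.
    import requests

    BASE_URL = "http://localhost:8080"

    # POST /add - add documents to the vector database
    requests.post(f"{BASE_URL}/add", json={"documents": ["first document", "second document"]})

    # POST /search - search for similar documents
    results = requests.post(f"{BASE_URL}/search", json={"query": "example query", "top_k": 5})
    print(results.json())

    # POST /embed - generate embeddings for text
    embedding = requests.post(f"{BASE_URL}/embed", json={"text": "example query"})
    print(embedding.json())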
For detailed information about the Docker setup, see the Docker Setup Documentation.

TODO:
- Plan out pipeline
- Plan out data model