This repository contains a modular implementation of a Two-Tower (Dual Encoder) neural network for document retrieval.
- Modular Design: Each component (tokenization, embedding, encoding) is implemented as a separate module
- Config-Driven: All model and training parameters defined in YAML configuration files
- Easily Extensible: Adding new tokenizers, embeddings, or encoders only requires implementing a new class
- Standard IR Metrics: Comprehensive evaluation metrics (Precision@K, Recall@K, MRR, NDCG)
- Unified Search Interface: Common interface for different search implementations
- CLI Tools: Command-line tools for building indices and retrieving documents
- Docker-based Deployment: Containerized inference services with a web interface and REST API

Repository layout:

    two-towers/
    │
    ├─ twotower/                  # Core model training code
    │  ├─ tokenisers.py           # Tokenization (Stage 1)
    │  ├─ embeddings.py           # Embedding layers (Stage 2)
    │  ├─ encoders.py             # Encoder towers (Stage 3)
    │  ├─ losses.py               # Loss functions (Stage 4)
    │  ├─ dataset.py              # Dataset handling
    │  ├─ train.py                # Training orchestration
    │  ├─ utils.py                # Utilities and helpers
    │  └─ evaluate.py             # Evaluation metrics
    │
    ├─ inference/                 # Inference and retrieval code
    │  ├─ search/                 # Search implementations
    │  │  ├─ base.py              # Base search interface
    │  │  ├─ glove.py             # GloVe-based search
    │  │  └─ two_tower.py         # Two-Tower model search
    │  ├─ cli/                    # Command-line tools
    │  │  └─ retrieve.py          # Document retrieval CLI
    │  └─ examples/               # Example scripts
    │
    ├─ configs/                   # Configuration files
    │  ├─ default_config.yml      # Base configuration
    │  ├─ char_tower.yml          # Character-level model config
    │  └─ word2vec_skipgram.yml   # Word2Vec embedding config
    │
    ├─ docs/                      # Documentation
    ├─ tools/                     # Utility scripts
    └─ artifacts/                 # Project artifacts and documentation

Install the package in development mode:

    pip install -e .

To train a model using a configuration file:

    python train.py --config configs/char_tower.yml

To enable Weights & Biases logging:

    python train.py --config configs/char_tower.yml --use_wandb

For retrieval, first build an index of document vectors:

    python -m inference.cli.retrieve build-index --model checkpoints/best_model.pt --documents my_documents.txt --output document_index.pkl

Then, retrieve documents for a query:

    python -m inference.cli.retrieve search --model checkpoints/best_model.pt --index document_index.pkl --query "your search query"

The search implementations can also be used directly from Python:

    from inference.search import GloVeSearch, TwoTowerSearch

    # `documents` below is a list of raw text strings to index.

    # GloVe-based search
    glove_search = GloVeSearch(model_name='glove-wiki-gigaword-50')
    glove_search.index_documents(documents)
    results = glove_search.search("query text", top_k=5)

    # Two-Tower search
    from twotower import load_checkpoint

    checkpoint = load_checkpoint("model.pt")
    model = checkpoint["model"]
    tokenizer = checkpoint["tokenizer"]

    two_tower_search = TwoTowerSearch(model, tokenizer)
    two_tower_search.index_documents(documents)
    results = two_tower_search.search("query text", top_k=5)

To evaluate a trained model on standard IR metrics:

    from twotower.evaluate import evaluate_model, print_evaluation_results
    results = evaluate_model(
        model=model,
        test_data=test_data,
        tokenizer=tokenizer,
        metrics=['precision', 'recall', 'mrr', 'ndcg'],
        k_values=[1, 3, 5, 10]
    )
    print_evaluation_results(results)

The Two-Tower system uses a hierarchical YAML-based configuration system with:
- Inheritance: Configs can extend other configs using the `extends` property
- Environment Variables: Override settings with `TWOTOWER_`-prefixed environment variables
- Command-line Overrides: Override configs with command-line arguments
For a complete configuration reference, see artifacts/docs/config.md.
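
Environment variables and command-line arguments take precedence over the values in the YAML files. As a rough illustration of the environment-variable layer only, the sketch below applies `TWOTOWER_`-prefixed variables to a plain config dictionary; the double-underscore key convention and the helper itself are assumptions for illustration, not the repository's documented scheme.

    # Illustrative sketch only: map TWOTOWER_-prefixed environment variables
    # onto nested config keys. The double-underscore convention is an assumption.
    import os

    def apply_env_overrides(config: dict, prefix: str = "TWOTOWER_") -> dict:
        for name, value in os.environ.items():
            if not name.startswith(prefix):
                continue
            path = name[len(prefix):].lower().split("__")  # TRAINING__LR -> ["training", "lr"]
            node = config
            for key in path[:-1]:
                node = node.setdefault(key, {})
            node[path[-1]] = value  # note: values arrive as strings
        return config

    config = {"training": {"lr": 0.001, "epochs": 10}}
    os.environ["TWOTOWER_TRAINING__LR"] = "0.01"
    print(apply_env_overrides(config))  # {'training': {'lr': '0.01', 'epochs': 10}}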
The code follows a 5-stage pipeline:
- Tokenization: Converts text to token IDs (`tokenisers.py`)
- Embedding: Maps token IDs to dense vectors (`embeddings.py`)
- Encoding: Transforms token embeddings into a single vector (`encoders.py`)
- Loss Function: Defines the training objective (`losses.py`)
- Training: Orchestrates the training process (`train.py`)
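
To make the data flow concrete, here is a minimal sketch of the first four stages in plain PyTorch. The character-ID tokenisation, layer sizes, and LSTM encoder are placeholder choices for illustration, not the classes defined in this repository.

    # Conceptual walk-through of the pipeline stages (illustrative placeholders,
    # not the classes from tokenisers.py / embeddings.py / encoders.py / losses.py).
    import torch
    import torch.nn as nn

    text = "example query"

    # Stage 1 - Tokenization: text -> token IDs (naive char-level IDs here)
    token_ids = torch.tensor([[ord(c) % 100 for c in text]])

    # Stage 2 - Embedding: token IDs -> dense vectors
    embedding = nn.Embedding(num_embeddings=100, embedding_dim=64)
    token_vectors = embedding(token_ids)               # shape (1, seq_len, 64)

    # Stage 3 - Encoding: token vectors -> a single query/document vector
    encoder = nn.LSTM(input_size=64, hidden_size=128, batch_first=True)
    _, (hidden, _) = encoder(token_vectors)
    query_vector = hidden[-1]                          # shape (1, 128)

    # Stage 4 - Loss: a contrastive/triplet-style objective over
    # (query, positive document, negative document) vectors
    loss_fn = nn.TripletMarginLoss(margin=0.2)

    # Stage 5 - Training: train.py wires real versions of these stages together
    # and runs the optimisation loop over query/document pairs.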
The modular design makes it easy to extend the model with new components:
- New tokenizer: Create a class that inherits from `BaseTokeniser` in `tokenisers.py`, implement the required methods, and add it to the `REGISTRY` dictionary
- New embedding: Create a class that inherits from `BaseEmbedding` in `embeddings.py`, implement the required methods, and add it to the `REGISTRY` dictionary
- New encoder: Create a class that inherits from `BaseTower` in `encoders.py`, implement the required methods, and add it to the `TOWER_REGISTRY` dictionary
- New search implementation: Create a class that inherits from `BaseSearch` in `inference/search/base.py` and implement the required methods
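
As an example of the pattern, a minimal custom tokenizer might look like the sketch below. The `fit`/`encode` method names and the `"whitespace"` registry key are assumptions for illustration; check `BaseTokeniser` in `tokenisers.py` for the actual abstract interface.

    # Hypothetical whitespace tokenizer: the required abstract methods may differ,
    # so verify against BaseTokeniser in tokenisers.py before using this pattern.
    from twotower.tokenisers import BaseTokeniser, REGISTRY

    class WhitespaceTokeniser(BaseTokeniser):
        """Splits text on whitespace and maps each token to an integer ID."""

        def __init__(self):
            self.vocab = {"<unk>": 0}

        def fit(self, texts):
            # Build the vocabulary from an iterable of raw text strings.
            for text in texts:
                for token in text.split():
                    self.vocab.setdefault(token, len(self.vocab))

        def encode(self, text):
            # Map unknown tokens to the <unk> ID (0).
            return [self.vocab.get(token, 0) for token in text.split()]

    # Register under a name that configs can reference (key name is illustrative).
    REGISTRY["whitespace"] = WhitespaceTokeniser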

To deploy the model with Docker:

- Start the Docker services:

      docker compose up -d

- Access the web interface at http://localhost:8080
- Use the API endpoints to interact with the model:
  - POST `/add` - Add documents to the vector database
  - POST `/search` - Search for similar documents
  - POST `/embed` - Generate embeddings for text
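
A rough client-side sketch of these endpoints is shown below, assuming the API is served on the same host and port as the web interface; the JSON field names (`documents`, `query`, `top_k`, `text`) are illustrative assumptions rather than the documented schema.

    # Illustrative API client; payload field names are assumptions, so check the
    # Docker Setup Documentation for the actual request/response schema.
    import requests

    BASE_URL = "http://localhost:8080"

    # POST /add - add documents to the vector database
    requests.post(f"{BASE_URL}/add", json={"documents": ["first document", "second document"]})

    # POST /search - search for similar documents
    results = requests.post(f"{BASE_URL}/search", json={"query": "example query", "top_k": 5})
    print(results.json())

    # POST /embed - generate embeddings for text
    embedding = requests.post(f"{BASE_URL}/embed", json={"text": "example query"})
    print(embedding.json())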
For detailed information about the Docker setup, see the Docker Setup Documentation.

TODO:
- Plan out pipeline
- Plan out data model