A production-grade, end-to-end abstractive text summarization system built with FastAPI, Hugging Face Transformers, and AWS S3 — featuring a fully fine-tuned Google Pegasus LLM, modular pipelines, DVC data versioning, MLflow experiment tracking, and centralized logging/exception handling.
The Text Summarizer project is a complete MLOps-ready solution for abstractive text summarization. It automates:
- Data Ingestion — Downloading, extracting, and organizing datasets from URLs or cloud storage.
- Data Validation — Schema checks, quality checks, and reproducibility safeguards.
- Data Transformation — Tokenization, cleaning, and preparation for model training.
- Model Training — Full fine-tuning of Google Pegasus LLM for domain-specific summarization.
- Evaluation — ROUGE, BLEU, and loss tracking with MLflow.
- Deployment — FastAPI endpoints and a Jinja2-powered web interface for predictions.
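The cleaning step of the transformation stage can be illustrated with a small helper. This is a hedged sketch, not the project's actual code: the real stage presumably tokenizes with the Pegasus tokenizer from Hugging Face Transformers, while the function below only shows the kind of whitespace normalization that typically precedes tokenization.

```python
import re

def clean_text(text: str) -> str:
    """Normalize whitespace before tokenization: collapse newlines/tabs
    and repeated spaces into single spaces, then trim the ends."""
    text = re.sub(r"[\r\n\t]+", " ", text)   # line breaks and tabs -> space
    text = re.sub(r"\s{2,}", " ", text)      # runs of spaces -> single space
    return text.strip()
```

In the real pipeline, the cleaned text would then be passed to the Pegasus tokenizer (e.g. `tokenizer(clean_text(doc), truncation=True)`).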
Key design principles:
- Modular, maintainable architecture.
- YAML-driven configuration for reproducibility.
- Dual local/S3 storage support.
- Centralized logging and exception handling.
- DVC for dataset versioning.
- Timestamped artifacts for idempotent runs.
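The timestamped-artifacts principle can be sketched in a few lines. The function below is illustrative (the project's actual directory layout and naming scheme may differ): each run gets its own timestamped folder, so reruns never overwrite earlier outputs.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional

def make_artifact_dir(root: str, stage: str, now: Optional[datetime] = None) -> Path:
    """Create <root>/<timestamp>/<stage>/ so every pipeline run
    writes into a fresh, uniquely named directory."""
    now = now or datetime.now()
    stamp = now.strftime("%Y-%m-%d_%H-%M-%S")
    path = Path(root) / stamp / stage
    path.mkdir(parents=True, exist_ok=True)
    return path
```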
- Fully fine-tuned Google Pegasus LLM for abstractive summarization.
- Modular pipelines:
  - Data Ingestion (Local/S3, ZIP extraction, DVC sync)
  - Data Validation
  - Data Transformation
  - Model Training
  - Model Evaluation
  - Prediction
- MLflow experiment tracking and metrics logging.
- Local and AWS S3 artifact storage.
- Web UI with Jinja2 templates.
- Configurable via `config.yaml`, `params.yaml`, `schema.yaml`, and `templates.yaml`.
text-summarizer/
├── app.py # FastAPI application entry point
├── config/ # YAML configuration files
├── data/ # DVC-tracked datasets
├── artifacts/ # Timestamped pipeline artifacts
├── logs/ # Centralized logs
├── templates/ # Web UI templates
├── requirements.txt # Python dependencies
└── src/textsummarizer/ # Core source package
    ├── components/       # Pipeline stages
    ├── config/           # Config manager
    ├── constants/        # Path constants
    ├── utils/            # Utility functions
    ├── exception/        # Custom exceptions
    ├── logging/          # Logger setup
    └── pipelines/        # Pipeline orchestration modules
Raw Data → Data Ingestion → Data Validation → Data Transformation → Model Training → Model Evaluation → Deployment
Each stage produces structured artifacts and logs for reproducibility.
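The stage-by-stage flow with centralized logging and exception handling might be orchestrated roughly as follows. This is a minimal sketch; the class and exception names are illustrative, not the project's actual ones.

```python
import logging

logger = logging.getLogger("textsummarizer")

class PipelineError(Exception):
    """Wraps a stage failure with the stage's name (a stand-in for
    the project's custom exception class)."""

def run_pipeline(stages):
    """Run stages in order, logging entry/exit; abort on first failure."""
    for stage in stages:
        name = type(stage).__name__
        logger.info(">>> %s started", name)
        try:
            stage.run()                       # each stage exposes run()
        except Exception as exc:
            logger.exception("%s failed", name)
            raise PipelineError(f"{name} failed") from exc
        logger.info("<<< %s completed", name)
```

Each stage object would read its configuration, produce its artifacts, and hand off to the next stage through files on disk rather than in-memory state, which is what makes the runs reproducible.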
All settings are parameterized via YAML and .env files.
YAML Configs:
- `config.yaml` — Paths, artifact locations, storage settings.
- `params.yaml` — Training parameters, hyperparameters.
- `schema.yaml` — Dataset schema definitions.
- `templates.yaml` — Report templates.
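A typical pattern for turning such YAML files into typed settings objects is a configuration manager like the sketch below. The actual keys in `config.yaml` are not shown in this README, so the field names here are hypothetical; the YAML itself would be parsed elsewhere (e.g. with PyYAML) into the `dict` this class consumes.

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass(frozen=True)
class DataIngestionConfig:
    """Typed settings for the ingestion stage (illustrative fields)."""
    root_dir: Path
    source_url: str
    local_data_file: Path

class ConfigurationManager:
    """Builds typed config objects from the parsed config.yaml dict."""
    def __init__(self, config: dict):
        self.config = config

    def get_data_ingestion_config(self) -> DataIngestionConfig:
        c = self.config["data_ingestion"]
        return DataIngestionConfig(
            root_dir=Path(c["root_dir"]),
            source_url=c["source_url"],
            local_data_file=Path(c["local_data_file"]),
        )
```

Frozen dataclasses keep the per-stage settings immutable, so a stage cannot silently mutate another stage's configuration.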
Environment Variables:
AWS_ACCESS_KEY_ID=
AWS_SECRET_ACCESS_KEY=
AWS_REGION=
S3_BUCKET_NAME=

Start the FastAPI app:

uvicorn app:app --reload

Then access the UI at http://localhost:8000.
- Experiment: TextSummarizerExperiment
- Model Registry: PegasusTextSummarizer
- Metrics: ROUGE, BLEU, Loss

Launch the tracking UI:

mlflow ui

Then visit http://localhost:5000.
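The evaluation metrics logged to MLflow are presumably computed with a library such as `rouge_score` or `evaluate`; as an illustration of what ROUGE-1 measures, here is a minimal unigram-overlap F1 computed by hand (a sketch, not the project's implementation):

```python
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """ROUGE-1 F1: harmonic mean of unigram precision and recall
    between a reference summary and a candidate summary."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    overlap = sum((ref_counts & cand_counts).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)
```

Real ROUGE implementations add stemming and ROUGE-2/ROUGE-L variants, but the precision/recall trade-off shown here is the core of the metric.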
| Method | Endpoint | Description |
|---|---|---|
| GET | `/` | Web UI home page |
| POST | `/train` | Trigger Pegasus model fine-tuning |
| POST | `/predict` | Generate abstractive summaries |
This project is licensed under the MIT License.
Gokul Krishna N V Machine Learning Engineer — UK 🇬🇧 GitHub • LinkedIn
- Google Pegasus model: Hugging Face Transformers