Text Summarizer is an LLM-powered application that generates concise, context-aware summaries from long-form text. Built using state-of-the-art language models, it supports abstractive summarization with a focus on readability, coherence, and semantic preservation—ideal for news, articles, research papers, and more.

📝 Text Summarizer — Google Pegasus LLM Fine-Tuning Pipeline

A production-grade, end-to-end abstractive text summarization system built with FastAPI, Hugging Face Transformers, and AWS S3 — featuring a fully fine-tuned Google Pegasus LLM, modular pipelines, DVC data versioning, MLflow experiment tracking, and centralized logging/exception handling.




📌 Overview

The Text Summarizer project is a complete MLOps-ready solution for abstractive text summarization. It automates:

  • Data Ingestion — Downloading, extracting, and organizing datasets from URLs or cloud storage.
  • Data Validation — Schema checks, quality checks, and reproducibility safeguards.
  • Data Transformation — Tokenization, cleaning, and preparation for model training.
  • Model Training — Full fine-tuning of Google Pegasus LLM for domain-specific summarization.
  • Evaluation — ROUGE, BLEU, and loss tracking with MLflow.
  • Deployment — FastAPI endpoints and a Jinja2-powered web interface for predictions.

Key design principles:

  • Modular, maintainable architecture.
  • YAML-driven configuration for reproducibility.
  • Dual local/S3 storage support.
  • Centralized logging and exception handling.
  • DVC for dataset versioning.
  • Timestamped artifacts for idempotent runs.
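The centralized-logging and timestamped-artifact principles can be sketched in a few lines of Python (directory names and the log format here are illustrative, not the project's actual constants):

```python
import logging
from datetime import datetime
from pathlib import Path

def create_run_dir(root: str = "artifacts") -> Path:
    """Create a timestamped artifact directory so reruns never collide."""
    stamp = datetime.now().strftime("%Y_%m_%d_%H_%M_%S")
    run_dir = Path(root) / stamp
    run_dir.mkdir(parents=True, exist_ok=True)
    return run_dir

def get_logger(name: str, log_dir: str = "logs") -> logging.Logger:
    """Centralized logger writing to both the console and a shared log file."""
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid duplicate handlers on repeat calls
        logger.setLevel(logging.INFO)
        fmt = logging.Formatter("[%(asctime)s] %(name)s - %(levelname)s - %(message)s")
        for handler in (logging.StreamHandler(),
                        logging.FileHandler(Path(log_dir) / "running.log")):
            handler.setFormatter(fmt)
            logger.addHandler(handler)
    return logger
```

Because every run writes under its own timestamped directory, repeating a pipeline never overwrites earlier artifacts.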

🚀 Features

  • Fully fine-tuned Google Pegasus LLM for abstractive summarization.

  • Modular pipelines:

    • Data Ingestion (Local/S3, ZIP extraction, DVC sync)
    • Data Validation
    • Data Transformation
    • Model Training
    • Model Evaluation
    • Prediction
  • MLflow experiment tracking and metrics logging.

  • Local and AWS S3 artifact storage.

  • Web UI with Jinja2 templates.

  • Configurable via config.yaml, params.yaml, schema.yaml, templates.yaml.


📂 Project Structure

text-summarizer/
├── app.py                   # FastAPI application entry point
├── config/                  # YAML configuration files
├── data/                    # DVC-tracked datasets
├── artifacts/               # Timestamped pipeline artifacts
├── logs/                    # Centralized logs
├── templates/               # Web UI templates
├── requirements.txt         # Python dependencies
└── src/textsummarizer/      # Core source package
    ├── components/          # Pipeline stages
    ├── config/              # Config manager
    ├── constants/           # Path constants
    ├── utils/               # Utility functions
    ├── exception/           # Custom exceptions
    ├── logging/             # Logger setup
    └── pipelines/           # Pipeline orchestration modules

🔁 Pipeline Flow

Raw Data → Data Ingestion → Data Validation → Data Transformation → Model Training → Model Evaluation → Deployment

Each stage produces structured artifacts and logs for reproducibility.
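A run-everything orchestration of the flow above can be sketched as follows (the stage callables are placeholders; the real implementations live under src/textsummarizer/pipelines/):

```python
import logging

logging.basicConfig(level=logging.INFO, format="[%(levelname)s] %(message)s")
logger = logging.getLogger("pipeline")

# Hypothetical stage callables standing in for the real pipeline modules.
def data_ingestion():
    logger.info("downloading and extracting raw data")

def data_validation():
    logger.info("running schema and quality checks")

def data_transformation():
    logger.info("tokenizing and preparing features")

def model_training():
    logger.info("fine-tuning Pegasus on the prepared data")

def model_evaluation():
    logger.info("computing ROUGE/BLEU and loss")

STAGES = [
    ("Data Ingestion", data_ingestion),
    ("Data Validation", data_validation),
    ("Data Transformation", data_transformation),
    ("Model Training", model_training),
    ("Model Evaluation", model_evaluation),
]

def run_pipeline():
    """Run every stage in order, logging stage boundaries for reproducibility."""
    completed = []
    for name, stage in STAGES:
        logger.info(">>> %s started", name)
        stage()
        logger.info(">>> %s finished", name)
        completed.append(name)
    return completed
```

Logging the start and end of each stage gives a clear audit trail when a run fails partway through.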


⚙️ Configuration

All settings are parameterized via YAML and .env files.

YAML Configs:

  • config.yaml — Paths, artifact locations, storage settings.
  • params.yaml — Training parameters, hyperparameters.
  • schema.yaml — Dataset schema definitions.
  • templates.yaml — Report templates.

Environment Variables:

AWS_ACCESS_KEY_ID=
AWS_SECRET_ACCESS_KEY=
AWS_REGION=
S3_BUCKET_NAME=
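A minimal loader for these settings might look like the following (the function names are illustrative; the project's actual config manager lives in src/textsummarizer/config/):

```python
import os
import yaml  # PyYAML

def load_yaml(path: str) -> dict:
    """Load one of the YAML config files (config.yaml, params.yaml, ...)."""
    with open(path, "r", encoding="utf-8") as f:
        return yaml.safe_load(f)

def load_s3_settings() -> dict:
    """Read the AWS settings from the environment, failing fast if any are unset."""
    required = ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY",
                "AWS_REGION", "S3_BUCKET_NAME"]
    missing = [key for key in required if not os.getenv(key)]
    if missing:
        raise EnvironmentError(f"Missing environment variables: {missing}")
    return {key: os.environ[key] for key in required}
```

Failing fast on missing environment variables keeps S3-backed runs from dying halfway through with a less obvious error.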

🧪 Running the Project

# Start the FastAPI app
uvicorn app:app --reload

Access UI: http://localhost:8000


📈 MLflow Tracking

  • Experiment: TextSummarizerExperiment
  • Model Registry: PegasusTextSummarizer
  • Metrics: ROUGE, BLEU, Loss

Start the tracking UI:

mlflow ui

Visit: http://localhost:5000
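At its core, ROUGE-1 recall (one of the logged metrics) is just unigram overlap between the generated summary and the reference. A self-contained illustration follows; real evaluation runs use a full ROUGE implementation with stemming and F-measure:

```python
from collections import Counter

def rouge1_recall(reference: str, candidate: str) -> float:
    """Fraction of reference unigrams that also appear in the candidate."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum(min(count, cand[token]) for token, count in ref.items())
    return overlap / max(sum(ref.values()), 1)

score = rouge1_recall("the cat sat on the mat",
                      "the cat lay on the mat")  # 5 of 6 reference unigrams match
```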


🌐 FastAPI Endpoints

Method   Endpoint    Description
GET      /           Web UI home page
POST     /train      Trigger Pegasus model fine-tuning
POST     /predict    Generate abstractive summaries

📝 License

This project is licensed under the MIT License.


👨‍💻 Author

Gokul Krishna N V — Machine Learning Engineer, UK 🇬🇧
GitHub · LinkedIn

