Automated NLP pipeline for news analysis with sentiment detection, topic modeling & summarization. Daily updates via GitHub Actions, 24/7 Streamlit dashboard on Railway.

davidjosipovic/news-trend-analysis

📰 News Trend Analysis

Automated news aggregation and analysis system with sentiment analysis, topic modeling, and interactive visualization dashboard.

🎓 University Projects:

  • MOPJ (NLP): Automated NLP pipeline with sentiment analysis, topic modeling, and summarization
  • PI (Business Intelligence): Predictive analytics dashboard with comprehensive evaluation system

🎯 Business Problem

Challenge: Manual tracking of economic news sentiment is time-consuming and subjective.

Solution: Automated AI-powered pipeline that:

  • Fetches economic news every 12 hours
  • Classifies sentiment with FinBERT (76% average prediction confidence)
  • Discovers trending topics automatically (BERTopic)
  • Generates concise summaries (37.7x compression)
  • Visualizes insights in interactive dashboard

Business Value:

  • 📈 Investors: Real-time market sentiment tracking
  • 📰 Media: Identify trending topics and narratives
  • 📊 Analysts: Automated research assistance

🌟 Features

  • 🤖 Automated News Collection: Fetches latest economic news from NewsData.io API
  • 🕷️ Web Scraping: Extracts full article content from news websites
  • 🧠 Advanced NLP Analysis:
    • Sentiment Analysis: FinBERT transformer model (Prosus AI)
    • Topic Modeling: BERTopic with HDBSCAN clustering
    • Automatic Summarization: DistilBART (CNN-trained)
  • 📊 Interactive Dashboard: Real-time visualization with Streamlit + Plotly
  • ⚡ Automated Pipeline: GitHub Actions runs twice daily (8:00 & 20:00 UTC)
  • 🎯 Quality Filtering: Removes paid content and short articles (< 200 words)
  • 🔄 Duplicate Detection: Smart handling of cross-source articles
  • 🔮 Predictive Analytics (NEW):
    • Weekly Forecasting: Elastic Net + XGBoost for sentiment/volume predictions
    • Spike Detection: ML-based anomaly detection with SMOTE balancing
    • Feature Engineering: Lag, rolling, calendar, and trend features
    • REST API: FastAPI endpoints for predictions

🚀 Live Demo

Dashboard: https://newstrendanalysis.up.railway.app/

📸 Screenshots

Dashboard Overview

  • 4 Key Metrics: Total articles, unique articles, sentiment, topics
  • Interactive Charts: Sentiment distribution, topic clustering, time series
  • Smart Pagination: Browse articles with customizable page size
  • Advanced Filtering: By sentiment, topic, and date

Features

  • ✅ Sentiment Over Time: Track market sentiment trends
  • ✅ Topic Distribution: Visualize news themes
  • ✅ Article Summaries: AI-generated summaries for quick insights
  • ✅ Duplicate Toggle: Show/hide articles from multiple sources

🛠️ Tech Stack

| Category | Technologies |
| --- | --- |
| Language | Python 3.11+ |
| NLP Models | Transformers (FinBERT, DistilBART), BERTopic, Sentence-Transformers |
| Predictive ML | XGBoost, Elastic Net, SMOTE (imbalanced-learn), Optuna |
| Dashboard | Streamlit, Plotly |
| API | FastAPI, Pydantic |
| Deployment | Railway (dashboard), GitHub Actions (pipeline) |
| Data Source | NewsData.io API |
| Web Scraping | Newspaper3k, BeautifulSoup |

📦 Installation

Prerequisites

  • Python 3.11+
  • A NewsData.io API key (stored as NEWS_API_KEY)

Quick Start

  1. Clone Repository
git clone https://github.com/davidjosipovic/news-trend-analysis.git
cd news-trend-analysis
  2. Create Virtual Environment
python -m venv .venv
source .venv/bin/activate  # Linux/Mac
# or
.venv\Scripts\activate  # Windows
  3. Install Dependencies

For full pipeline (includes NLP models ~2GB):

pip install -r requirements.full.txt

For dashboard only (lightweight ~50MB):

pip install -r requirements.txt
  4. Configure API Key
echo "NEWS_API_KEY=your_api_key_here" > .env
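
How the pipeline scripts read this key is not shown in this README. As a minimal sketch, assuming a plain `KEY=value` file and two hypothetical helpers (`load_env_file` and `get_api_key` are illustrative names, not functions from the repository), the key could be resolved like this:

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=value lines. Hypothetical helper, not the project's loader."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_api_key(path=".env"):
    """Prefer the process environment, then fall back to the .env file."""
    key = os.environ.get("NEWS_API_KEY") or load_env_file(path).get("NEWS_API_KEY")
    if not key:
        raise RuntimeError("NEWS_API_KEY is not set")
    return key
```

Checking the environment first means the same code works locally with `.env` and on hosts that inject `NEWS_API_KEY` as an environment variable.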

🎯 Usage

Option 1: Run Full Pipeline

Execute complete analysis pipeline:

# Step 1: Fetch news articles
python src/fetch_articles.py

# Step 2: Scrape full content
python src/scrape_articles.py

# Step 3: Clean and preprocess
python src/preprocess_articles.py

# Step 4: Sentiment analysis
python src/analyze_sentiment.py

# Step 5: Topic modeling
python src/discover_topics.py

# Step 6: Generate summaries
python src/summarize_articles.py

# Step 7: Evaluate pipeline quality
python src/evaluate_pipeline.py
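
The seven steps depend on each other's output, so they need to run in this order and stop on the first failure. A minimal sketch of a sequential runner follows. It is independent of the repository's own `run_pipeline.py`, whose contents are not shown here, and `build_commands` and `run_steps` are hypothetical names:

```python
import subprocess
import sys

# Step order copied from the commands listed above.
PIPELINE_STEPS = [
    "src/fetch_articles.py",
    "src/scrape_articles.py",
    "src/preprocess_articles.py",
    "src/analyze_sentiment.py",
    "src/discover_topics.py",
    "src/summarize_articles.py",
    "src/evaluate_pipeline.py",
]


def build_commands(steps=PIPELINE_STEPS, python=sys.executable):
    """Return one argv list per step, using the given interpreter."""
    return [[python, step] for step in steps]


def run_steps(commands):
    """Run each command in order; check=True raises on the first failing step."""
    for argv in commands:
        print("Running:", " ".join(argv))
        subprocess.run(argv, check=True)
```

Calling `run_steps(build_commands())` from the repository root would execute all seven scripts with the active virtual environment's interpreter.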

Option 2: Run Dashboard Only

streamlit run dashboard/streamlit_app.py

Access dashboard at: http://localhost:8501

📁 Project Structure

news-trend-analysis/
├── 📂 src/                          # Core processing pipeline
│   ├── fetch_articles.py            # NewsData.io API integration
│   ├── scrape_articles.py           # Web scraper (newspaper3k)
│   ├── preprocess_articles.py       # Text cleaning & filtering
│   ├── analyze_sentiment.py         # FinBERT sentiment inference
│   ├── discover_topics.py           # BERTopic clustering
│   ├── summarize_articles.py        # DistilBART summarization
│   └── evaluate_pipeline.py         # Quality metrics & reporting
│
├── 📂 dashboard/
│   ├── streamlit_app.py             # Interactive Streamlit dashboard
│   └── predictive_components.py     # Predictive analytics UI components
│
├── 📂 features/                     # Feature engineering module
│   └── time_features.py             # Time series feature generation
│
├── 📂 models/
│   ├── topic_model/                 # Saved BERTopic models
│   └── predictive/                  # Predictive ML models
│       ├── weekly_forecaster.py     # Elastic Net + XGBoost forecaster
│       ├── spike_detector.py        # Anomaly/spike detection
│       └── model_trainer.py         # Unified training pipeline
│
├── 📂 api/
│   └── prediction_api.py            # FastAPI REST endpoints
│
├── 📂 config/
│   └── config.yaml                  # Centralized configuration
│
├── 📂 tests/                        # Test suite
│   ├── test_features.py             # Feature engineering tests
│   ├── test_models.py               # Model tests
│   └── test_api.py                  # API integration tests
│
├── 📂 data/
│   ├── raw/                         # Raw JSON from API
│   │   ├── news_*.json              # Fetched articles
│   │   └── articles_scraped.json    # Scraped content
│   └── processed/                   # Processed CSV datasets
│       ├── articles.csv
│       ├── articles_with_sentiment.csv
│       ├── articles_with_topics.csv
│       └── articles_with_summary.csv
│
├── 📂 .github/workflows/
│   └── daily-update.yml             # Automated pipeline (2x daily)
│
├── run_pipeline.py                  # Complete pipeline runner
├── requirements.txt                 # Lightweight deps (dashboard)
├── requirements.full.txt            # Full deps (NLP pipeline)
└── README.md

🎨 Dashboard Features

📊 Metrics & Visualizations

| Feature | Description |
| --- | --- |
| Total Articles | Count of all processed articles |
| Unique Articles | Articles after duplicate removal |
| Sentiment Distribution | Pie chart of positive/neutral/negative |
| Topics by Article | Bar chart of topic distribution |
| Sentiment Over Time | Line chart tracking sentiment trends |

🎛️ Interactive Controls

  • Sentiment Filter: Show only positive/neutral/negative articles
  • Topic Filter: Filter by specific topic cluster
  • Sort Order: Newest first / Oldest first
  • Pagination: 5-50 articles per page
  • Duplicate Toggle: Show/hide cross-source duplicates

⚙️ Configuration

Data Quality Filters

Articles are automatically filtered based on:

| Filter | Threshold | Reason |
| --- | --- | --- |
| Minimum Words | 200+ words | Ensures substantive content for analysis |
| Paid Content | Excluded | Removes "ONLY AVAILABLE IN PAID PLANS" |
| Duplicate Titles | Optional | Toggle to show/hide cross-source articles |
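
The filters in this table can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the code in `preprocess_articles.py`: the word count uses whitespace splitting, duplicates are matched on a normalized title, and the function names and the `is_duplicate` field are hypothetical.

```python
import re

MIN_WORDS = 200
PAID_MARKER = "ONLY AVAILABLE IN PAID PLANS"


def passes_quality_filters(text):
    """True when the article is long enough and is not a paid-plan placeholder."""
    if PAID_MARKER in text.upper():
        return False
    return len(text.split()) >= MIN_WORDS


def normalize_title(title):
    """Lowercase and collapse punctuation and whitespace (assumed duplicate key)."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def flag_duplicates(articles):
    """Mark every article after the first one that shares a normalized title."""
    seen = set()
    flagged = []
    for article in articles:
        key = normalize_title(article["title"])
        flagged.append({**article, "is_duplicate": key in seen})
        seen.add(key)
    return flagged
```

Flagging duplicates instead of dropping them keeps the data available for a show/hide toggle like the one the dashboard offers.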

Models

| Task | Model | Source | Why This Model? |
| --- | --- | --- | --- |
| Sentiment | ProsusAI/finbert | HuggingFace | Fine-tuned on financial news (better for economic articles than Twitter-based models) |
| Embeddings | all-MiniLM-L6-v2 | Sentence-Transformers | Fast, efficient semantic embeddings |
| Summarization | sshleifer/distilbart-cnn-12-6 | HuggingFace | Compressed BART trained on CNN news articles |
| Topic Modeling | BERTopic + HDBSCAN | Custom configuration | Unsupervised clustering with auto-generated labels |

🚢 Deployment

Railway (Dashboard)

# Automatic deployment on git push
# Uses: requirements.txt (lightweight)
# Environment variables: NEWS_API_KEY

GitHub Actions (Pipeline)

# Runs twice daily: 8:00 AM & 8:00 PM UTC
# Uses: requirements.full.txt (full NLP)
# Commits processed data back to repo

📊 Data Flow

NewsData.io API → fetch_articles.py → scrape_articles.py → preprocess_articles.py
                                                                        ↓
evaluate_pipeline.py ← summarize_articles.py ← discover_topics.py ← analyze_sentiment.py
         ↓
Dashboard (Streamlit)

🔬 Analysis Details

Why Pre-trained Models? (No Training Required)

This project uses transfer learning - applying pre-trained models rather than training from scratch. This approach is:

  1. Industry Standard: Pre-trained transformers (FinBERT, BART) are trained on billions of tokens
  2. More Accurate: FinBERT trained on 4.9M financial sentences vs. our 55 articles
  3. Practical: Training BERT from scratch requires 4 TPUs for 4 days (~$500-1000)
  4. Academic: Demonstrates proper use of state-of-the-art NLP (BERT, transformers)

Models Used:

  • FinBERT: Fine-tuned BERT for financial sentiment (Prosus AI)
  • DistilBART: Distilled BART for news summarization (trained on CNN/DailyMail)
  • BERTopic: Unsupervised topic discovery (no training needed)

Sentiment Analysis

  • Model: Fine-tuned RoBERTa with custom adapter (default)
    • Base: cardiffnlp/twitter-roberta-base-sentiment-latest
    • Adapter: Custom fine-tuned on domain-specific data
  • Output: Positive, Neutral, Negative (with confidence scores)
  • Inference: Batch processing on CPU
  • Advantages:
    • Domain-specific fine-tuning for better accuracy
    • 69% non-neutral classification (vs 35% for base model)
    • Better detection of subtle sentiment in news articles

Using the Adapter:

# Default: Uses adapter automatically if available
python src/analyze_sentiment.py

# Force base model
python src/analyze_sentiment.py --no-adapter

# Compare models
python compare_models.py --adapter-path ./models/sentiment_adapter_best

For more details, see README_ADAPTER.md

Topic Modeling

  • Algorithm: HDBSCAN clustering on sentence embeddings
  • Dimensionality Reduction: UMAP
  • Labels: Auto-generated using KeyBERT (unsupervised)
  • Dynamic: Discovers new topics as articles grow (no predefined categories)

Summarization

  • Model: DistilBART (compressed BART for efficiency)
  • Length: 30-130 tokens per summary
  • Quality: Requires 200+ word articles
  • Batch Processing: Handles 8 articles simultaneously
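
The settings above (30-130 tokens, batches of 8) map directly onto the Hugging Face `summarization` pipeline. The following is a sketch rather than the contents of `summarize_articles.py`: it assumes an installed `transformers` version that provides that pipeline, and `batched` and `summarize` are hypothetical helper names.

```python
def batched(items, size=8):
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def summarize(texts, batch_size=8):
    """Summarize texts batch by batch with the DistilBART checkpoint."""
    # Imported lazily because transformers and the model weights are large.
    from transformers import pipeline

    summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")
    summaries = []
    for batch in batched(texts, batch_size):
        # truncation=True clips inputs that exceed the model's maximum length.
        outputs = summarizer(batch, min_length=30, max_length=130, truncation=True)
        summaries.extend(output["summary_text"] for output in outputs)
    return summaries
```

Processing in fixed-size batches keeps memory use bounded on CPU while still avoiding one model call per article.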

📊 Results & Evaluation

Overall Pipeline Quality: 85/100 (GOOD)

Sentiment Analysis Results:

📉 Negative:  10 articles (18.2%)
⚪ Neutral:   28 articles (50.9%)
📈 Positive:  17 articles (30.9%)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Average Confidence: 76.0%

Key Improvement: Switched from Twitter-RoBERTa (68% confidence, 1.8% negative detection) to FinBERT (76% confidence, 18.2% negative detection) for better financial news understanding.

Topic Discovery Results:

Discovered 6 coherent topics:
1. Korea_Trump_China           - 16 articles (29%)
2. Economic_Sector_Sustainable - 9 articles (16%)
3. Inflation_Forecasts         - 8 articles (15%)
4. Tourism_Travel              - 7 articles (13%)
5. Reforms_Economy             - 6 articles (11%)
6. Business_Development        - 4 articles (7%)

Topic Quality Score: 100/100

Summarization Performance:

✓ Coverage: 100% (all articles summarized)
✓ Avg length: 51 words
✓ Compression ratio: 37.7x (from ~800 → 51 words)
✓ Processing time: ~5s per article (CPU)

Evaluation Metrics (6 categories):

  1. Data Quality (30% weight): 100/100
    • 100% completeness, proper filtering
  2. Sentiment Balance (15% weight): 0/100
    • Expected imbalance (economic news naturally more negative/neutral)
  3. Topic Quality (25% weight): 100/100
    • Coherent clusters, balanced distribution
  4. Summarization (30% weight): 100/100
    • Full coverage, appropriate compression
  5. Temporal Analysis: 2-day coverage with automated updates
  6. Confidence Tracking: Real-time monitoring of prediction reliability
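
The 85/100 overall score is consistent with a weighted average of the four weighted categories above. The sketch below reproduces that arithmetic. It assumes the overall score is computed this way (the two unweighted categories do not contribute), and the dictionary keys are illustrative names:

```python
# Category scores and weights as reported in the list above.
CATEGORY_SCORES = {
    "data_quality": (100, 0.30),
    "sentiment_balance": (0, 0.15),
    "topic_quality": (100, 0.25),
    "summarization": (100, 0.30),
}


def overall_score(categories):
    """Weighted average of (score, weight) pairs; weights must sum to 1."""
    total_weight = sum(weight for _, weight in categories.values())
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(score * weight for score, weight in categories.values())


print(round(overall_score(CATEGORY_SCORES), 1))  # 85.0
```

The 0/100 sentiment balance score costs exactly its 15% weight, which is why a pipeline with three perfect categories still lands at 85.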

Business Insights:

  • US-China relations dominate economic news (29%)
  • Sustainability emerging as major economic theme
  • Inflation concerns persist across multiple articles
  • Tourism sector showing recovery signals

🔮 Predictive Analytics

Overview

The predictive analytics module provides ML-based forecasting capabilities:

| Feature | Description |
| --- | --- |
| Weekly Forecaster | Predicts avg sentiment and article volume 7 days ahead |
| Spike Detector | Identifies anomalous news activity with probability scores |
| Feature Engineering | Automated time series feature generation |
| REST API | FastAPI endpoints for programmatic access |

Models

1. Weekly Forecaster (Dual-Model Approach)

  • Elastic Net (interpretable baseline): Linear model with L1+L2 regularization
  • XGBoost (high accuracy): Gradient boosting for complex patterns
  • Targets: avg_sentiment and total_articles
  • Horizon: 7 days

2. Spike Detector

  • Algorithm: XGBoost Classifier
  • SMOTE Balancing: Handles class imbalance (spikes are rare)
  • Definition: volume > mean + 2σ OR sentiment_change > 0.5
  • Output: Probability (0-1) + Risk level (MINIMAL/LOW/MEDIUM/HIGH)
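
The spike definition above can be turned into training labels with a few lines of standard-library Python. This sketch makes assumptions the README does not confirm: the mean and standard deviation are taken over the whole series (population standard deviation), and `sentiment_change` is the absolute day-over-day difference. `label_spikes` is a hypothetical name.

```python
from statistics import mean, pstdev


def label_spikes(volumes, sentiments, volume_k=2.0, sentiment_threshold=0.5):
    """Return one boolean per day following the spike definition above."""
    volume_cutoff = mean(volumes) + volume_k * pstdev(volumes)
    labels = []
    for day, volume in enumerate(volumes):
        # Assumption: sentiment_change is the absolute day-over-day difference.
        change = abs(sentiments[day] - sentiments[day - 1]) if day > 0 else 0.0
        labels.append(volume > volume_cutoff or change > sentiment_threshold)
    return labels
```

Labels produced this way are rare positives by construction, which is the class imbalance that SMOTE balancing is meant to address before the classifier is trained.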

Feature Engineering

Automatically generates 50+ features from daily aggregates:

| Category | Features |
| --- | --- |
| Lag Features | 1, 2, 3, 7, 14 day lags for sentiment/volume |
| Rolling Features | Mean, std, min, max over 3, 7, 14, 30 day windows |
| Calendar Features | Day of week, weekend, month, Croatian holidays |
| Trend Features | Momentum, acceleration, trend direction |
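
To make the lag and rolling categories concrete, here is a dependency-free sketch. It is not the `TimeSeriesFeatureEngineer` implementation: the feature names are hypothetical, and computing rolling statistics only from earlier days is an assumption chosen so that a feature never includes the value it is meant to predict.

```python
from statistics import mean


def lag(values, k):
    """Value observed k days earlier; None where history is too short."""
    return [values[i - k] if i >= k else None for i in range(len(values))]


def rolling_mean(values, window):
    """Mean of the previous `window` days, excluding the current day."""
    out = []
    for i in range(len(values)):
        history = values[max(0, i - window):i]
        out.append(mean(history) if len(history) == window else None)
    return out


daily_volume = [12, 15, 9, 20, 18, 11, 14]
features = {
    "volume_lag_1": lag(daily_volume, 1),
    "volume_lag_3": lag(daily_volume, 3),
    "volume_roll_mean_3": rolling_mean(daily_volume, 3),
}
```

The leading `None` entries mark days without enough history; such rows are typically dropped or imputed before training.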

API Endpoints

# Weekly predictions
GET /api/predictions/weekly

# Spike probability
GET /api/predictions/spike-probability

# Trend analysis
GET /api/analytics/trends?period=30

# Daily aggregates
GET /api/data/daily-aggregates?days=7

# Retrain models (protected)
POST /api/models/retrain

Start API Server:

uvicorn api.prediction_api:app --reload --port 8000
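
With the server running, the GET endpoints can be called from Python using only the standard library. This is a minimal client sketch: `build_url` and `get_json` are hypothetical helpers, and it assumes the endpoints return JSON, since the response schemas are not documented here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8000"  # port taken from the uvicorn command above


def build_url(path, **params):
    """Join the base URL, an endpoint path, and optional query parameters."""
    query = f"?{urlencode(params)}" if params else ""
    return f"{BASE_URL}{path}{query}"


def get_json(path, **params):
    """GET an endpoint and decode the body, assuming it returns JSON."""
    with urlopen(build_url(path, **params), timeout=10) as response:
        return json.load(response)
```

For example, `get_json("/api/data/daily-aggregates", days=7)` requests the last week of aggregates, and `get_json("/api/predictions/weekly")` fetches the weekly forecast.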

Training Pipeline

from models.predictive.model_trainer import ModelTrainer
from features.time_features import TimeSeriesFeatureEngineer
import pandas as pd

# Load data
df = pd.read_csv('data/processed/articles_with_sentiment.csv')

# Generate features
engineer = TimeSeriesFeatureEngineer()
features_df = engineer.create_all_features(df)

# Train all models
trainer = ModelTrainer(n_splits=5, use_optuna=True)
results = trainer.train_all_models(features_df)

# Save models
trainer.save_models('models/predictive/')

Time Series Validation

Uses walk-forward validation (TimeSeriesSplit) instead of random split:

Training: [████████████░░░░] → Test: [░░░░]
Training: [████████████████░░░░] → Test: [░░░░]
Training: [████████████████████░░░░] → Test: [░░░░]

This prevents data leakage from future observations.
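
The expanding-window scheme can be written out in plain Python to show exactly which indices land in each fold. This sketch mirrors the default behaviour of scikit-learn's `TimeSeriesSplit` (no gap, no maximum training size) but is a standalone illustration; `walk_forward_splits` is a hypothetical name.

```python
def walk_forward_splits(n_samples, n_splits=5):
    """Yield (train_indices, test_indices) with an expanding training window."""
    test_size = n_samples // (n_splits + 1)
    if test_size == 0:
        raise ValueError("not enough samples for the requested number of splits")
    first_test_start = n_samples - n_splits * test_size
    for test_start in range(first_test_start, n_samples, test_size):
        train = list(range(0, test_start))
        test = list(range(test_start, test_start + test_size))
        yield train, test


for train, test in walk_forward_splits(12, n_splits=3):
    print(f"train={train[0]}..{train[-1]}  test={test[0]}..{test[-1]}")
# train=0..2  test=3..5
# train=0..5  test=6..8
# train=0..8  test=9..11
```

Every test index is strictly later than every training index in its fold, which is the property a random split cannot guarantee.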

Dashboard Integration

The Streamlit dashboard includes a Predictive Analytics tab with:

  • 📊 Predicted vs Actual charts
  • 🔔 Spike probability gauge
  • 📈 Feature importance visualization
  • ⚠️ Real-time spike alerts

Configuration

All hyperparameters in config/config.yaml:

models:
  weekly_forecaster:
    forecast_horizon: 7
    elastic_net:
      alpha: 1.0
      l1_ratio: 0.5
    xgboost:
      n_estimators: 100
      max_depth: 6
      learning_rate: 0.1
  
  spike_detector:
    volume_std_threshold: 2.0
    sentiment_change_threshold: 0.5
    use_smote: true

🧪 Testing

Run the test suite:

# All tests
pytest tests/ -v

# Feature engineering tests
pytest tests/test_features.py -v

# Model tests
pytest tests/test_models.py -v

# API integration tests
pytest tests/test_api.py -v

🤝 Contributing

This is a university project, but suggestions are welcome!

  1. Fork the repository
  2. Create feature branch (git checkout -b feature/improvement)
  3. Commit changes (git commit -m 'Add feature')
  4. Push to branch (git push origin feature/improvement)
  5. Open Pull Request

📝 License

MIT License - feel free to use for educational purposes

👨‍💻 Author

Built as a university project demonstrating:

  • Automated NLP pipeline design
  • Real-time data visualization
  • Cloud deployment (Railway + GitHub Actions)
  • Modern Python best practices

🙏 Acknowledgments

  • NewsData.io for free news API access
  • HuggingFace for pre-trained transformer models
  • Streamlit for rapid dashboard development

⭐ Star this repo if you found it helpful for your own projects!
