Data Story

Website : https://radar-gamma.vercel.app/

Run information

Download the missing dataset files, and include them in the project directory following the file structure.

File	Description	Link	Size
`soc-redditHyperlinks-body.tsv`	Network of subreddit-to-subreddit hyperlinks extracted from hyperlinks in the body of the post.	SNAP	304 MB
`soc-redditHyperlinks-title.tsv`	Network of subreddit-to-subreddit hyperlinks extracted from hyperlinks in the title of the post.	SNAP	352 MB
`spreadspoke_scores.csv`	Scores of all NFL games since 1966	Scores

## Quickstart

# Clone the repository
git clone https://github.com/epfl-ada/ada-2025-project-radar.git
cd ada-2025-project-radar

# Create a virtual environment (Python 3.x)
python3 -m venv .venv

# Activate the virtual environment
# On Linux / macOS:
source .venv/bin/activate
# On Windows (PowerShell):
.venv\Scripts\Activate.ps1

# Upgrade pip
pip install --upgrade pip

# Install the dependencies from the requirements file
pip install -r pip_requirements.txt

Project Structure

The directory structure of the project looks like this:

├── data                        <- Project data files
│
├── src                         <- Source code
│   ├── data                            <- Data directory
│   ├── models                          <- Model directory
│   ├── utils                           <- Utility directory
│   ├── scripts                         <- Shell scripts
│
├── tests                       <- Tests of any kind
│
├── results.ipynb               <- a well-structured notebook showing the results
│
├── .gitignore                  <- List of files ignored by git
├── pip_requirements.txt        <- File for installing python dependencies
└── README.md

Reddit Event Sentiment Analysis

Abstract

In this project, we take on the role of Reddit sociologists aiming to understand how users respond to real-life events. Using the Reddit Hyperlink dataset, we seek to uncover hidden links between these events and the sentiment expressed across communities.

We will:

Correlate major events (e.g. financial crashes, elections) with changes in sentiment across different subreddit clusters.
Analyze specific communities, such as economic or political subreddits, to observe how sentiment evolves around relevant real-world events.
Reverse the perspective by testing whether changes in Reddit sentiment can predict upcoming real-life events, such as a sports team losing a game or stock prices dropping.

This project demonstrates how large-scale Reddit interactions can be used as a lens to study collective reactions to real-world events. By linking major societal, political, and economic events to shifts in sentiment across subreddit communities, we show that Reddit is not only a platform for discussion but also a dynamic reflection of public mood and attention. Our analysis highlights how different communities respond in distinct ways to different kinds of events, revealing the role of community structure and shared interests in shaping online discourse.

Moreover, testing whether sentiment changes can anticipate real-life outcomes lets us probe how closely Reddit sentiment and activity track real-life events, and whether collective behaviour reveals precursors that make an event detectable before it happens.

Research Questions

Can major real-life events be correlated to Reddit activity and post sentiment using the Reddit Hyperlink dataset?
Can we predict events of different natures (finance, politics, sports, …) based on sentiment changes in clusters of related subreddits?
Can different metrics from the dataset (fear, anger, number of uppercase characters, etc.) improve sentiment quantification depending on the community type?
Can financial indicators (e.g. S&P500, USD rate, Bitcoin price) be correlated to sentiment in financial subreddits?

Additional Datasets

Reddit Embeddings
Provides pre-computed numerical vectors (embeddings) for each subreddit. Used for clustering subreddits by theme (e.g. finance, politics). We only used this in exploratory analysis of the dataset.
S&P 500 and Bitcoin Prices
Used to verify whether spikes in negative sentiment (e.g. anxiety) on finance-related subreddits predict market downturns.
subreddit_list.txt
Parsed from the r/ListOfSubreddits wiki to identify the themes of 1600 subreddits. Used to label all subreddits after community clustering. We only used this in exploratory analysis of the dataset.
spreadscores.csv
Used to get the scores of all NFL games.
Mean_sentiment_nflteams.csv
Compiled from ConvoKit (a dataset compiling millions of posts on Reddit). We only used data from the subreddits of 11 NFL teams.
nflteams.txt
Compiled by hand; a dictionary mapping NFL team names to their subreddits.

Methods

General analysis

Correlating events through spikes in volume

We first identified major real-life events (e.g. NFL games, the election of Donald Trump) and looked for spikes in the volume of posts in related subreddits (e.g. r/news, r/politics) around the dates of these events.
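
A spike in daily post volume can be flagged in several ways. The sketch below is one possibility, not necessarily the rule used in the notebooks: it compares each day to the mean and standard deviation of the preceding days. The function name `find_spikes`, the 7-day window, and the threshold of 3 standard deviations are assumptions chosen for illustration, and the counts are made-up numbers.

```python
from statistics import mean, stdev

def find_spikes(daily_counts, window=7, threshold=3.0):
    """Return the indices of days whose post count exceeds the mean of the
    previous `window` days by more than `threshold` standard deviations."""
    spikes = []
    for i in range(window, len(daily_counts)):
        history = daily_counts[i - window:i]   # trailing days only, no look-ahead
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and (daily_counts[i] - mu) / sigma > threshold:
            spikes.append(i)
    return spikes

# Illustrative (made-up) daily post counts with one obvious burst on day 7.
counts = [10, 12, 11, 9, 10, 13, 11, 60, 12, 10]
print(find_spikes(counts))  # [7]
```

Using only trailing days keeps the detection causal, which matters if the spikes are later used as signals.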

Adding LIWC categories for sentiment analysis

We used the LIWC lexicon to extract additional features from the posts, such as anxiety, anger, sadness, religion or positive/negative emotions. We then correlated these features with real-life events.

Mixing volume and sentiment with hyperlinks

Using the hyperlink structure of the dataset, we examined interactions between left-leaning and right-leaning subreddits around the 2016 elections. We looked at both the volume and the sentiment of these interactions to see how they evolved around the elections, and whether they hinted at the results.

NFL

Identification of winning and losing teams

We looked at the winning percentage for all NFL teams across the 2015 and 2016 seasons using the spreadscores.csv file.

LIWC analysis

To highlight differences in the vocabulary used in subreddits of winning and losing teams, we chose to use the LIWC categories with the highest relative frequencies, i.e. the ones that deviated the most from the average over the entire dataset.
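
As a minimal sketch of this ranking, assuming "relative frequency" means the deviation of a group's mean from the dataset-wide mean, divided by the dataset-wide mean: the function name `top_deviating_categories` is hypothetical, and the category values below are made-up numbers, not results from the project.

```python
def top_deviating_categories(group_means, global_means, k=3):
    """Rank LIWC categories by how far a group's mean deviates, in relative
    terms, from the dataset-wide mean. Returns the k largest deviations."""
    deviations = {}
    for category, global_value in global_means.items():
        if global_value == 0:
            continue  # relative deviation is undefined for a zero baseline
        deviations[category] = (group_means[category] - global_value) / global_value
    ranked = sorted(deviations.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return ranked[:k]

# Illustrative values only.
global_means = {"LIWC_Anger": 0.010, "LIWC_Posemo": 0.040, "LIWC_Sad": 0.005}
winning_teams = {"LIWC_Anger": 0.008, "LIWC_Posemo": 0.060, "LIWC_Sad": 0.005}

ranked = top_deviating_categories(winning_teams, global_means, k=2)
print([category for category, _ in ranked])  # ['LIWC_Posemo', 'LIWC_Anger']
```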

Additional dataset for sentiment

Since we didn't have enough posts in the SNAP dataset for NFL teams, we compiled an additional dataset from the one provided by the ConvoKit study. Only 11 subreddits of NFL teams were available, but with significantly more posts. We ran a simple sentiment classifier on the text of each post published on the day of, the day before, or the day after an NFL game for the 11 teams. We then calculated the mean sentiment for each day and team.
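
The aggregation step can be sketched as follows, assuming each post has already been scored by a sentiment classifier. The tuple layout, the function name `mean_sentiment_around_games`, the team key, and the scores are all hypothetical and only illustrate the grouping by team, game, and day offset.

```python
from collections import defaultdict
from datetime import date

def mean_sentiment_around_games(posts, game_dates):
    """posts: iterable of (team, post_date, sentiment_score).
    game_dates: dict mapping team -> set of game dates.
    Returns {(team, game_date, offset): mean score}, where offset is
    -1 (day before), 0 (game day) or 1 (day after)."""
    buckets = defaultdict(list)
    for team, post_date, score in posts:
        for game_date in game_dates.get(team, ()):
            offset = (post_date - game_date).days
            if offset in (-1, 0, 1):
                buckets[(team, game_date, offset)].append(score)
    return {key: sum(scores) / len(scores) for key, scores in buckets.items()}

# Illustrative data: one game, three posts in the window, one outside it.
game = date(2015, 9, 13)
posts = [
    ("patriots", date(2015, 9, 12), 0.2),
    ("patriots", date(2015, 9, 12), 0.4),
    ("patriots", date(2015, 9, 13), -0.5),
    ("patriots", date(2015, 9, 20), 0.9),   # a week later, ignored
]
means = mean_sentiment_around_games(posts, {"patriots": {game}})
```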

Prediction Model

We chose to use a Random Tree Classifier to predict the results of NFL games using sentiment. We trained it on the sentiment before/after/on gamedays of the 2015 NFL season and tested it on the 2016 season.

Finance

Subreddit Filtering

We filtered the Reddit Hyperlink dataset to include only posts from or to finance and cryptocurrency related subreddits (50 subreddits total: 25 finance like investing, stocks, wallstreetbets, and 25 crypto like bitcoin, ethereum, dogecoin). This resulted in ~7,300 posts (2.5% of the dataset).
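
A minimal sketch of this filter, assuming the TSV header uses the `SOURCE_SUBREDDIT` and `TARGET_SUBREDDIT` column names and that a post is kept when either side is in the list. Only the six subreddits named above are included here; the full list of 50 is not reproduced, and `filter_rows` is a hypothetical helper.

```python
import csv
import io

# Subset of the subreddit list, limited to the names given in the text.
FINANCE_SUBS = {"investing", "stocks", "wallstreetbets"}
CRYPTO_SUBS = {"bitcoin", "ethereum", "dogecoin"}
KEEP = FINANCE_SUBS | CRYPTO_SUBS

def filter_rows(tsv_file, keep=KEEP):
    """Keep rows whose source or target subreddit is in `keep`."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return [row for row in reader
            if row["SOURCE_SUBREDDIT"] in keep or row["TARGET_SUBREDDIT"] in keep]

# In-memory stand-in for the real file, with made-up rows.
sample = io.StringIO(
    "SOURCE_SUBREDDIT\tTARGET_SUBREDDIT\tTIMESTAMP\n"
    "investing\tnews\t2015-01-02 10:00:00\n"
    "pics\tfunny\t2015-01-02 11:00:00\n"
    "askreddit\tbitcoin\t2015-01-03 09:00:00\n"
)
kept = filter_rows(sample)
print(len(kept))  # 2
```

With the real data, `sample` would be replaced by an open handle on one of the `soc-redditHyperlinks-*.tsv` files.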

Market Data

We downloaded S&P 500 and Bitcoin price data (Jan 2014 - April 2017) using the yfinance library to match the Reddit dataset time period. We calculated daily returns and 20-day rolling volatility for both markets.
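
The two derived series can be sketched in plain Python as below. This assumes simple (not log) returns and a non-annualized sample standard deviation; the notebooks may use different conventions. The prices are made-up numbers and the example uses a 3-day window only to keep the data short.

```python
from statistics import stdev

def daily_returns(prices):
    """Simple daily returns: p[t] / p[t-1] - 1."""
    return [prices[i] / prices[i - 1] - 1 for i in range(1, len(prices))]

def rolling_volatility(returns, window=20):
    """Sample standard deviation of the last `window` returns.
    Entries are None until a full window is available."""
    out = []
    for i in range(len(returns)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(stdev(returns[i + 1 - window:i + 1]))
    return out

prices = [100.0, 102.0, 101.0, 103.0, 104.0]   # illustrative closing prices
rets = daily_returns(prices)
vol = rolling_volatility(rets, window=3)        # the analysis uses window=20
```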

Reddit Activity vs Market Volatility

We aggregated daily Reddit activity (post count) from finance/crypto subreddits and computed Pearson correlations with market volatility:

S&P 500: r = -0.047 (no significant correlation)
Bitcoin: r = +0.133 (weak positive correlation, statistically significant)
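
For reference, the Pearson coefficient reported above can be computed as in the sketch below. The input series are made-up and perfectly proportional, so they do not reproduce the values above; significance testing is left out.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

# Illustrative series: volatility is exactly proportional to activity.
activity = [5, 8, 6, 12, 9]
volatility = [0.010, 0.016, 0.012, 0.024, 0.018]
r = pearson_r(activity, volatility)
```

In practice the two series must first be aligned on the same trading days, since Reddit activity exists on weekends while the S&P 500 does not trade.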

ML Price Direction Prediction

We trained several classifiers (Logistic Regression, Random Forest, Gradient Boosting, AdaBoost, Neural Network) to predict whether Bitcoin would go up or down the next day using Reddit sentiment features. Best accuracy achieved was ~61% (Logistic Regression), barely better than random guessing (50%).
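
Whatever classifier is used, the target has to be built so that day-t features are paired with the move from day t to day t+1. The sketch below shows one way to do this; the helper names and the feature values are hypothetical, and ties (unchanged price) are arbitrarily labeled as "down".

```python
def next_day_direction(prices):
    """Label day t with 1 if the price on day t+1 is higher, else 0.
    The last day has no label because its next-day price is unknown."""
    return [1 if prices[t + 1] > prices[t] else 0 for t in range(len(prices) - 1)]

def align_features(features, labels):
    """Pair day-t features with the day-t label, dropping the unlabeled last day."""
    return list(zip(features[:len(labels)], labels))

prices = [100, 105, 103, 103, 110]             # illustrative closing prices
sentiment = [0.1, -0.3, 0.0, 0.4, 0.2]         # illustrative daily sentiment
labels = next_day_direction(prices)
dataset = align_features(sentiment, labels)
print(labels)  # [1, 0, 0, 1]
```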

Strategy Backtesting

Since ML failed at direction prediction, we tested a simpler approach: using Reddit activity spikes as signals to exit the market during high volatility periods. We exhaustively searched 3,024 strategy combinations. The best Bitcoin strategy achieved 1,442% return vs 300% Buy & Hold.
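
A single strategy of this kind can be sketched as follows. This is a hypothetical rule written for illustration, not one of the 3,024 combinations actually searched: it stays out of the market on the day after Reddit activity exceeds a fixed threshold, and ignores fees and slippage. All numbers are made up.

```python
def backtest_exit_on_spike(returns, activity, threshold):
    """Compare an exit-on-spike strategy with Buy & Hold.
    returns[t]: market return on day t; activity[t]: post count on day t.
    The activity of day t-1 decides the position on day t (no look-ahead).
    Returns (strategy_total_return, buy_and_hold_total_return)."""
    strategy, buy_hold = 1.0, 1.0
    for t in range(1, len(returns)):
        buy_hold *= 1 + returns[t]
        if activity[t - 1] <= threshold:   # no spike yesterday: stay invested
            strategy *= 1 + returns[t]
    return strategy - 1, buy_hold - 1

returns = [0.0, 0.10, -0.20, 0.05]   # illustrative daily returns
activity = [5, 50, 6, 7]             # spike on day 1 precedes the drop on day 2
strat, bh = backtest_exit_on_spike(returns, activity, threshold=20)
```

In this toy example the strategy skips the -20% day and ends at about +15.5%, while Buy & Hold ends at about -7.6%.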

Overfitting Validation

To check if these results were due to overfitting, we split the data temporally (60% training, 40% test) and validated. Only 4/10 top strategies generalized to unseen data, suggesting ~60% of "winning" strategies were fitting noise.
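
The validation step can be sketched as below, assuming a strategy "generalizes" when its return on the held-out period still beats the baseline. That criterion, the helper names, and the returns are assumptions for illustration.

```python
def temporal_split(rows, train_fraction=0.6):
    """Split chronologically ordered rows without shuffling, so the test
    period lies strictly after the training period."""
    cut = round(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

def count_generalizing(test_returns, baseline_return):
    """Count strategies whose out-of-sample return beats the baseline."""
    return sum(1 for r in test_returns if r > baseline_return)

days = list(range(10))                 # stand-in for 10 chronological rows
train, test = temporal_split(days)     # first 60% / last 40%

# Illustrative out-of-sample returns for four strategies vs. a 10% baseline.
survivors = count_generalizing([0.5, -0.1, 0.3, 0.05], baseline_return=0.10)
print(len(train), len(test), survivors)  # 6 4 2
```

Strategies should be selected on the training period only; ranking them on the full history and then testing on part of it would leak information.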

Contribution within the team

Note : we only had 4 members working on the project

-Arthur : Building the website, writing the data story

-Cassio : General part of the analysis (Trump Election, Paris attacks, SpaceX, ...), writing the data story

-Néhémie : NFL part of the analysis, writing the data story

-Rached : Finance part of the analysis, writing the data story
