Website : https://radar-gamma.vercel.app/
Download the missing dataset files, and include them in the project directory following the file structure.

| File | Description | Link | Size |
|---|---|---|---|
| soc-redditHyperlinks-body.tsv | Network of subreddit-to-subreddit hyperlinks extracted from hyperlinks in the body of the post. | SNAP | 304 MB |
| soc-redditHyperlinks-title.tsv | Network of subreddit-to-subreddit hyperlinks extracted from hyperlinks in the title of the post. | SNAP | 352 MB |
| spreadspoke_scores.csv | Scores of all NFL games since 1966 | Scores | |
## Quickstart
# Clone the repository
git clone https://github.com/epfl-ada/ada-2025-project-radar.git
cd ada-2025-project-radar
# Create a virtual environment (Python 3.x)
python3 -m venv .venv
# Activate the virtual environment
# On Linux / macOS:
source .venv/bin/activate
# On Windows (PowerShell):
.venv\Scripts\Activate.ps1
# Upgrade pip
pip install --upgrade pip
# Install the dependencies from the requirements file
pip install -r pip_requirements.txt

The directory structure of the project looks like this:
├── data <- Project data files
│
├── src <- Source code
│ ├── data <- Data directory
│ ├── models <- Model directory
│ ├── utils <- Utility directory
│ ├── scripts <- Shell scripts
│
├── tests <- Tests of any kind
│
├── results.ipynb <- a well-structured notebook showing the results
│
├── .gitignore <- List of files ignored by git
├── pip_requirements.txt <- File for installing python dependencies
└── README.md
In this project, we take on the role of Reddit sociologists aiming to understand how users respond to real-life events. Using the Reddit Hyperlink dataset, we seek to uncover hidden links between these events and the sentiment expressed across communities.
We will:
- Correlate major events (e.g. financial crashes, elections) with changes in sentiment across different subreddit clusters.
- Analyze specific communities, such as economic or political subreddits, to observe how sentiment evolves around relevant real-world events.
- Reverse the perspective by testing whether changes in Reddit sentiment can predict upcoming real-life events, such as a sports team losing a game or stock prices dropping.
This project demonstrates how large-scale Reddit interactions can be used as a lens to study collective reactions to real-world events. By linking major societal, political, and economic events to shifts in sentiment across subreddit communities, we show that Reddit is not only a platform for discussion but also a dynamic reflection of public mood and attention. Our analysis highlights how different communities respond in distinct ways to different kinds of events, revealing the role of community structure and shared interests in shaping online discourse.
Moreover, exploring whether sentiment changes can anticipate real-life outcomes lets us test for an even stronger link between Reddit sentiment/activity and real-life events. It also shows how collective behaviour can surface early signs of an event, which may allow it to be detected before it happens.
- Can major real-life events be correlated with Reddit activity and post sentiment using the Reddit Hyperlink dataset?
- Can we predict events of different natures (finance, politics, sports, …) based on sentiment changes in clusters of related subreddits?
- Can different metrics from the dataset (fear, anger, number of uppercase characters, etc.) improve sentiment quantification depending on the community type?
- Can financial indicators (e.g. S&P 500, USD rate, Bitcoin price) be correlated with sentiment in financial subreddits?
- Reddit Embeddings: Provides pre-computed numerical vectors (embeddings) for each subreddit. Used for clustering subreddits by theme (e.g. finance, politics). We only used this in exploratory analysis of the dataset.
- S&P 500 and Bitcoin Prices: Used to verify whether spikes in negative sentiment (e.g. anxiety) on finance-related subreddits predict market downturns.
- subreddit_list.txt: Parsed from the r/ListOfSubreddits wiki to identify the themes of 1600 subreddits. Used to label all subreddits after community clustering. We only used this in exploratory analysis of the dataset.
- spreadscores.csv: Used to get the scores of all NFL games.
- Mean_sentiment_nflteams.csv: Compiled from ConvoKit (a dataset compiling millions of Reddit posts). We only used data from the subreddits of 11 NFL teams.
- nflteams.txt: Compiled by hand; a dictionary mapping NFL team names to their subreddits.
We first identified major real-life events (e.g. NFL games, the Trump election) and looked for spikes in the volume of posts in related subreddits (e.g. r/news, r/politics) around the dates of these events.
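As an illustration of the spike search, here is a minimal sketch that flags days whose post count sits far above the trailing average. The function name `find_spikes`, the 7-day window and the threshold of 3 standard deviations are assumptions made for this example, not the settings used in the project.

```python
from statistics import mean, stdev


def find_spikes(daily_counts, window=7, threshold=3.0):
    """Return the indices of days whose post count exceeds the trailing
    mean by more than `threshold` trailing standard deviations.

    `window` and `threshold` are illustrative defaults (assumptions).
    """
    spikes = []
    for i in range(window, len(daily_counts)):
        history = daily_counts[i - window:i]  # the `window` days before day i
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and (daily_counts[i] - mu) / sigma > threshold:
            spikes.append(i)
    return spikes


# Toy daily post counts for one subreddit; day 7 is an obvious burst.
counts = [10, 12, 11, 9, 10, 11, 12, 50, 11, 10]
spike_days = find_spikes(counts)
```

The flagged indices can then be compared with the dates of the events of interest.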
We used the LIWC lexicon to extract additional features from the posts, such as anxiety, anger, sadness, religion or positive/negative emotions. We then correlated these features with real-life events.
Using the hyperlink structure of the dataset, we examined interactions between left-leaning and right-leaning subreddits around the 2016 elections. We looked at both the volume and the sentiment of these interactions to see how they evolved around the elections (and whether they hinted at the results).
We computed the winning percentage of every NFL team across the 2015 and 2016 seasons using the spreadscores.csv file.
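The winning-percentage computation could look roughly like the sketch below. The column names (`schedule_season`, `team_home`, `score_home`, `score_away`, `team_away`) are assumed for this example and should be checked against the actual header of the scores file. Counting a tie as half a win is also an assumption.

```python
import csv
import io
from collections import defaultdict

# Tiny inline sample standing in for the scores file (hypothetical rows).
SAMPLE = """schedule_season,team_home,score_home,score_away,team_away
2015,Team A,24,17,Team B
2015,Team B,10,10,Team A
2016,Team A,28,21,Team B
2014,Team A,30,3,Team B
"""


def win_percentage(rows, seasons=(2015, 2016)):
    """Map each team to its share of games won over the given seasons."""
    wins = defaultdict(float)
    games = defaultdict(int)
    for row in rows:
        if int(row["schedule_season"]) not in seasons:
            continue
        home, away = row["team_home"], row["team_away"]
        home_score, away_score = int(row["score_home"]), int(row["score_away"])
        games[home] += 1
        games[away] += 1
        if home_score > away_score:
            wins[home] += 1
        elif away_score > home_score:
            wins[away] += 1
        else:
            # Assumption: a tie counts as half a win for each team.
            wins[home] += 0.5
            wins[away] += 0.5
    return {team: wins[team] / games[team] for team in games}


percentages = win_percentage(csv.DictReader(io.StringIO(SAMPLE)))
```

With the real file, `io.StringIO(SAMPLE)` would be replaced by an open file handle.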
To highlight differences in the vocabulary used in subreddits of winning and losing teams, we chose to use the LIWC categories with the highest relative frequencies, i.e. the ones that deviated the most from the average over the entire dataset.
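One possible reading of "relative frequency" is the ratio between a category's mean in a group of subreddits and its mean over the whole dataset. The sketch below ranks categories by how far that ratio is from 1. The function names and the toy numbers are hypothetical, and the exact definition used in the analysis may differ.

```python
def relative_frequencies(group_means, global_means):
    """Ratio of each category's group mean to its dataset-wide mean.

    Categories with a missing or zero global mean are skipped.
    """
    return {
        category: group_means[category] / global_means[category]
        for category in group_means
        if global_means.get(category)
    }


def top_deviating(group_means, global_means, k=3):
    """Return the k categories whose ratio is furthest from 1.0."""
    rel = relative_frequencies(group_means, global_means)
    return sorted(rel, key=lambda c: abs(rel[c] - 1.0), reverse=True)[:k]


# Toy LIWC-style category means (made-up values for illustration only).
winning_teams = {"anger": 0.04, "sad": 0.01, "posemo": 0.05}
whole_dataset = {"anger": 0.02, "sad": 0.01, "posemo": 0.04}
top_two = top_deviating(winning_teams, whole_dataset, k=2)
```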
Since we didn't have enough posts in the SNAP dataset for NFL teams, we compiled an additional dataset from the one provided by the ConvoKit study. Only 11 NFL team subreddits were available, but with significantly more posts. We ran a simple sentiment classifier on the text of each post published the day before, the day of, or the day after an NFL game for these 11 teams. We then calculated the mean sentiment for each day and team.
We chose to use a Random Tree Classifier to predict the results of NFL games using sentiment. We trained it on the sentiment before/after/on gamedays of the 2015 NFL season and tested it on the 2016 season.
We filtered the Reddit Hyperlink dataset to include only posts from or to finance and cryptocurrency related subreddits (50 subreddits total: 25 finance like investing, stocks, wallstreetbets, and 25 crypto like bitcoin, ethereum, dogecoin). This resulted in ~7,300 posts (2.5% of the dataset).
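A minimal sketch of this filtering step is shown below. It assumes the TSV header uses the column names `SOURCE_SUBREDDIT` and `TARGET_SUBREDDIT` (worth verifying against the downloaded files). The subreddit set lists only the six names mentioned above, not the full 50.

```python
import csv
import io

# Only the six subreddits named in the text; the real list has 50 entries.
FINANCE_CRYPTO = {
    "investing", "stocks", "wallstreetbets",
    "bitcoin", "ethereum", "dogecoin",
}


def filter_rows(tsv_file, subreddits):
    """Keep rows whose source OR target subreddit is in `subreddits`."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return [
        row for row in reader
        if row["SOURCE_SUBREDDIT"].lower() in subreddits
        or row["TARGET_SUBREDDIT"].lower() in subreddits
    ]


# Hypothetical sample rows standing in for the real TSV file.
sample_rows = [
    ("SOURCE_SUBREDDIT", "TARGET_SUBREDDIT", "POST_ID"),
    ("bitcoin", "news", "p1"),
    ("gaming", "movies", "p2"),
    ("politics", "stocks", "p3"),
]
sample_tsv = "\n".join("\t".join(r) for r in sample_rows) + "\n"
kept = filter_rows(io.StringIO(sample_tsv), FINANCE_CRYPTO)
```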
We downloaded S&P 500 and Bitcoin price data (Jan 2014 - April 2017) using the yfinance library to match the Reddit dataset time period. We calculated daily returns and 20-day rolling volatility for both markets.
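For reference, daily returns and rolling volatility can be computed as in the sketch below. It assumes simple (not log) returns and defines volatility as the sample standard deviation of the last `window` returns, without annualization. The project code may use a different convention.

```python
from statistics import stdev


def daily_returns(prices):
    """Simple day-over-day returns: (p_t - p_{t-1}) / p_{t-1}."""
    return [(p1 - p0) / p0 for p0, p1 in zip(prices, prices[1:])]


def rolling_volatility(returns, window=20):
    """Sample standard deviation over a trailing window of returns.

    Entries before the first full window are None.
    """
    out = [None] * len(returns)
    for i in range(window - 1, len(returns)):
        out[i] = stdev(returns[i - window + 1:i + 1])
    return out


# Toy closing prices (made-up values), with a short window for readability.
prices = [100.0, 110.0, 99.0]
rets = daily_returns(prices)
vol = rolling_volatility(rets, window=2)
```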
We aggregated daily Reddit activity (post count) from finance/crypto subreddits and computed Pearson correlations with market volatility:
- S&P 500: r = -0.047 (no significant correlation)
- Bitcoin: r = +0.133 (weak positive correlation, statistically significant)
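The Pearson coefficient reported above follows the standard definition, sketched here in plain Python. The function name `pearson_r` is chosen for this example, and the sketch does not compute the p-value needed to judge significance.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Toy series: daily post counts vs. rolling volatility (made-up values).
post_counts = [3, 5, 4, 8, 7]
volatility = [0.010, 0.014, 0.012, 0.020, 0.018]
r = pearson_r(post_counts, volatility)
```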
We trained several classifiers (Logistic Regression, Random Forest, Gradient Boosting, AdaBoost, Neural Network) to predict whether Bitcoin would go up or down the next day using Reddit sentiment features. Best accuracy achieved was ~61% (Logistic Regression), barely better than random guessing (50%).
Since ML failed at direction prediction, we tested a simpler approach: using Reddit activity spikes as signals to exit the market during high volatility periods. We exhaustively searched 3,024 strategy combinations. The best Bitcoin strategy achieved 1,442% return vs 300% Buy & Hold.
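To make the idea concrete, here is a minimal sketch of one "exit on activity spike" rule and its comparison with Buy & Hold. The parameters `threshold` and `exit_days` are hypothetical; the source does not specify the strategy grid that was searched. The signal observed on one day only affects the following days, to avoid look-ahead.

```python
def buy_and_hold(returns):
    """Total return from staying invested on every day."""
    equity = 1.0
    for r in returns:
        equity *= 1.0 + r
    return equity - 1.0


def exit_on_spike(returns, activity, threshold, exit_days):
    """Total return when leaving the market after an activity spike.

    returns[i]  : asset return on day i
    activity[i] : Reddit post count on day i
    If activity[i] > threshold, stay in cash for the next `exit_days` days.
    """
    equity = 1.0
    days_out = 0
    for r, a in zip(returns, activity):
        if days_out > 0:
            days_out -= 1          # in cash: today's return is skipped
        else:
            equity *= 1.0 + r      # invested: today's return is applied
        if a > threshold:
            days_out = exit_days   # today's signal acts from tomorrow on
    return equity - 1.0


# Toy data (made-up): a spike on day 0 precedes a crash on day 1.
toy_returns = [0.10, -0.50, 0.10]
toy_activity = [100, 5, 5]
strategy = exit_on_spike(toy_returns, toy_activity, threshold=50, exit_days=1)
benchmark = buy_and_hold(toy_returns)
```

An exhaustive search would call `exit_on_spike` once per parameter combination and rank the results, which is also why overfitting is a concern.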
To check if these results were due to overfitting, we split the data temporally (60% training, 40% test) and validated. Only 4/10 top strategies generalized to unseen data, suggesting ~60% of "winning" strategies were fitting noise.
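The validation step can be sketched as follows. The split is chronological (no shuffling). The criterion for "generalizing" used here, beating the Buy & Hold return on the test period, is an assumption, and the strategy names and scores are made up.

```python
def temporal_split(series, train_percent=60):
    """Split a chronologically ordered series into train and test parts."""
    cut = len(series) * train_percent // 100
    return series[:cut], series[cut:]


def count_generalizing(train_scores, test_scores, test_benchmark, top_k=10):
    """Count how many of the top_k strategies on the training period
    also beat the benchmark on the test period."""
    top = sorted(train_scores, key=train_scores.get, reverse=True)[:top_k]
    return sum(1 for name in top if test_scores[name] > test_benchmark)


# Chronological split of ten toy observations.
train, test = temporal_split(list(range(10)))

# Hypothetical returns per strategy on each period.
train_scores = {"s1": 5.0, "s2": 4.0, "s3": 3.0, "s4": 0.5}
test_scores = {"s1": 0.1, "s2": 0.9, "s3": 0.2, "s4": 2.0}
survivors = count_generalizing(train_scores, test_scores,
                               test_benchmark=0.5, top_k=3)
```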
Note: only 4 members worked on the project.
- Arthur: Building the website, writing the data story
- Cassio: General part of the analysis (Trump election, Paris attacks, SpaceX, ...), writing the data story
- Néhémie: NFL part of the analysis, writing the data story
- Rached: Finance part of the analysis, writing the data story