# Reddit Data Analysis Project for SFI Complexity Global School 2024
This project extracts more than 2 TB of Reddit data, preprocesses it, and analyzes it with machine learning models to understand discussions around AI and employment. The pipeline covers data collection, preprocessing, model training, and topic modeling, and was run on the Old Dominion University HPC as part of the SFI Complexity Global School (CGS) 2024.
- Due to API limitations, we used the Reddit data archive accessible via torrents.
- The archive spans 2005 through July 2024; the analysis focuses on posts from July 2022 to July 2024.
- Downloaded using academic torrents and processed using scripts from the PushshiftDumps repository.
- Total number of posts extracted: 7,616,585.
- Subreddits: anti-work, AskReddit, careerguidance, changemyview, Economics, Futurology, jobs, NoStupidQuestions, Showerthoughts, technology.
- Query: Filtered posts discussing AI's impact on jobs.
- Used scripts from the PushshiftDumps repository.
- Decompressed and iterated over zst-compressed files (a minimal streaming sketch follows below).
- Converted the compressed files into CSV and queried them for specific subreddits and months.
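The dumps are newline-delimited JSON compressed with zstandard, so they can be streamed line by line without decompressing to disk. A minimal sketch in that spirit (an illustration, not the PushshiftDumps code itself; the file name is a placeholder, and the enlarged decompression window is required for these dumps):

```python
import io
import json
import zstandard as zstd  # pip install zstandard

def iter_zst_lines(path):
    """Yield one JSON object per line from a zstandard-compressed dump."""
    with open(path, "rb") as fh:
        # Pushshift dumps are written with a large window, so raise the limit.
        dctx = zstd.ZstdDecompressor(max_window_size=2**31)
        reader = io.TextIOWrapper(dctx.stream_reader(fh),
                                  encoding="utf-8", errors="ignore")
        for line in reader:
            if line.strip():
                yield json.loads(line)

# Example: count one subreddit's posts in a single monthly dump.
n = sum(1 for post in iter_zst_lines("RS_2024-07.zst")
        if post.get("subreddit") == "careerguidance")
print(n)
```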
- Manually labeled 555 Reddit posts into three categories:
- C1 Work: Task-oriented discussions.
- C2 Worker: Impact on individual workers.
- C3 Workforce: Impact on large groups or sectors.
- Random Forest Classifier
- Support Vector Machine (SVM)
- Text data was cleaned and transformed with TF-IDF vectorization.
- Models were trained for multilabel classification (a sketch follows the results below).
- Random Forest performed better for C2 Worker category.
- SVM showed limited effectiveness across all categories.
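A minimal sketch of this training setup, assuming a CSV with a `text` column and one binary column per category; the file name, column names, and hyperparameters are placeholders rather than the project's actual configuration:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

df = pd.read_csv("labeled_posts.csv")  # hypothetical file name
X = TfidfVectorizer(stop_words="english", max_features=20000).fit_transform(df["text"])
y = df[["C1", "C2", "C3"]].values      # one binary (0/1) column per category

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# One binary classifier per label handles the multilabel setup.
for name, base in [("Random Forest", RandomForestClassifier(n_estimators=200)),
                   ("Linear SVM", LinearSVC())]:
    clf = OneVsRestClassifier(base).fit(X_tr, y_tr)
    print(name)
    print(classification_report(y_te, clf.predict(X_te),
                                target_names=["C1 Work", "C2 Worker", "C3 Workforce"]))
```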
- Time series analysis to identify trends over time.
- Sentiment analysis using TextBlob to assess emotional tone.
- Conducted using Latent Dirichlet Allocation (LDA) on various datasets (a sketch combining sentiment scoring and LDA follows below).
- Analyzed topics for the C1 Work, C2 Worker, and C3 Workforce categories.
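A minimal sketch of the sentiment and topic-modeling steps, combining TextBlob polarity scoring with scikit-learn's LDA (the file and column names are assumptions, and the project may have used a different LDA implementation):

```python
import pandas as pd
from textblob import TextBlob  # pip install textblob
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

df = pd.read_csv("labeled_posts.csv")  # hypothetical file; "text" column assumed
texts = df["text"].fillna("")

# Sentiment: TextBlob polarity ranges from -1 (negative) to +1 (positive).
df["polarity"] = texts.apply(lambda t: TextBlob(t).sentiment.polarity)

# Topic modeling: bag-of-words counts feed the LDA model.
vec = CountVectorizer(stop_words="english", max_df=0.9, min_df=5)
X = vec.fit_transform(texts)
lda = LatentDirichletAllocation(n_components=10, random_state=42)
lda.fit(X)

# Top words per topic, for manual inspection and labeling.
terms = vec.get_feature_names_out()
for k, comp in enumerate(lda.components_):
    top = [terms[i] for i in comp.argsort()[-10:][::-1]]
    print(f"Topic {k}: {', '.join(top)}")
```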
- Number of posts matching queries: 7,616,585.
- Posts labeled by Llama 3.1 7B: 18,159.
- Manual labeling identified 67 posts in C1, 192 in C2, and 80 in C3.
- AI's impact on employment is a central theme.
- Discussions of human vs. AI creativity, especially in art.
- Ethical and societal considerations are significant topics.
- 📄 Human Agency in Automated Futures - Full research report detailing methodology, findings, and conclusions.
- Use a torrent client to download the data (expected to be ~2 TB).
- Use the command `at-get 9c263fc85366c1ef8f5bb9da0203f4c8c8db75f4` to download the data from the 2005-2023 dataset.
- Datasets:
  - 2005-2023
  - All (until 2024-07)
- To split files on Windows, use split-win (https://github.com/anseki/split-win/tree/master); download its .cmd and .ps1 files into `System32/`.
- Use the command `split D:\REDDIT_DATA\reddit\submissions\RS_2023-12.zst -size 1gb` to split the files. The HPC's upload limit is 10 GB, and uploading 1 GB parts works best.
- On the HPC, create a folder named after the file (e.g. `RS_2023-12/`) and upload all the parts there.
- Use the command `cat /home/jmart130/GitHub/SFI_CGS_2024/data/reddit/submissions/RS_2023-12/* > /home/jmart130/GitHub/SFI_CGS_2024/data/reddit/submissions/RS_2023-12.zst` to join the parts on the HPC.
Use the following scripts to preprocess the data:
1. `single_file.py` decompresses and iterates over a single zst-compressed file.
2. `iterate_folder.py` does the same for every file in a folder.
3. Queries:
   - Getting one subreddit: `python combine_folder_multiprocess.py reddit/submissions --field subreddit --value careerguidance --output pushshift2 --processes 20 --file_filter "^RS_2024-07"`
   - Getting all subreddits: `python combine_folder_multiprocess.py reddit/submissions --field subreddit --value anti-work,AskReddit,careerguidance,changemyview,Economics,Futurology,jobs,NoStupidQuestions,Showerthoughts,technology --output all_subreddits_2024-04 --file_filter "^RS_2024-04"`
4. Use `filter_file.py` to convert the compressed .zst files into CSV.
5. Use `query_csv.py` to combine the per-month, per-subreddit CSV files into a single file (see the sketch after this list).
6. Use `get_sample.py` to draw a sample of the data, either by month or by subreddit (proportionally).
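As referenced in step 5, a minimal sketch of combining the per-month, per-subreddit CSVs with pandas. The glob pattern, file names, and the pandas approach are assumptions for illustration; `query_csv.py` in the repository is the authoritative implementation:

```python
import glob
import pandas as pd

# Gather every monthly/subreddit CSV produced in step 4 (placeholder pattern).
paths = sorted(glob.glob("csv/RS_*.csv"))

# Concatenate them into a single frame and write one combined file.
combined = pd.concat((pd.read_csv(p) for p in paths), ignore_index=True)
combined.to_csv("all_posts.csv", index=False)
print(f"{len(combined):,} rows written from {len(paths)} files")
```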
- Improve lemmatization and stopword reduction.
- Check error metrics (BIC, AIC) for topic merging.
- Tokenize using bigrams and trigrams (see the sketch below).
- Document hyperparameter selection for models.
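For the bigram/trigram item above, one possible approach (an assumption, not a settled design choice) is gensim's `Phrases`, chaining a bigram model with a trigram model trained on the bigrammed corpus:

```python
from gensim.models.phrases import Phrases, Phraser  # pip install gensim

# Toy corpus standing in for the cleaned post texts; real thresholds
# would be tuned on the full dataset.
docs = ["machine learning will change the job market",
        "machine learning models automate routine work",
        "the job market reacts to machine learning hype"]
tokenized = [d.lower().split() for d in docs]

# First pass learns bigrams; a second pass over the bigrammed corpus
# learns trigrams (bigram + unigram pairs).
bigram = Phraser(Phrases(tokenized, min_count=1, threshold=1))
trigram = Phraser(Phrases(bigram[tokenized], min_count=1, threshold=1))

# Apply both models to produce n-gram tokens for downstream LDA.
print([trigram[bigram[doc]] for doc in tokenized])
```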

