This project focuses on predicting the daily realized volatility of the S&P 500 Index using various econometric and machine learning models. The research compares the effectiveness of traditional models, such as GARCH and VAR, with more advanced machine learning techniques, including LSTM and RNN.
- Data Collection and Preprocessing: The dataset combines daily realized-volatility data with macro-financial indicators and sentiment analysis of financial news headlines related to S&P 500 companies.
- Model Development: Univariate and multivariate models were developed, including GARCH, VAR, Linear Regression, Ridge, Lasso, Random Forest, XGBoost, Simple RNN, and LSTM. The models were trained and tested using standard evaluation metrics: R², RMSE, MAE, and MAPE (see the metrics sketch after this list).
- Sentiment Analysis: The project incorporates sentiment scores derived from the VADER model applied to news headlines, enhancing the predictive power of the models.
- Results and Comparison: The project systematically compares the out-of-sample forecasting performance of the models, highlighting the superior performance of LSTM in capturing complex temporal patterns in financial time series data.
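The four metrics above can be computed with scikit-learn. A minimal sketch, assuming `y_true` and `y_pred` hold out-of-sample realized-volatility targets and forecasts (the values below are illustrative):

```python
# Hedged sketch: the evaluation metrics used for model comparison.
# y_true / y_pred are illustrative placeholders, not project data.
import numpy as np
from sklearn.metrics import (
    r2_score,
    mean_squared_error,
    mean_absolute_error,
    mean_absolute_percentage_error,
)

y_true = np.array([0.012, 0.015, 0.011, 0.020])  # realized volatility
y_pred = np.array([0.013, 0.014, 0.012, 0.018])  # model forecasts

r2 = r2_score(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
mae = mean_absolute_error(y_true, y_pred)
mape = mean_absolute_percentage_error(y_true, y_pred)
print(f"R²={r2:.3f}  RMSE={rmse:.4f}  MAE={mae:.4f}  MAPE={mape:.2%}")
```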
This research provides valuable insights into the application of machine learning in finance, particularly in volatility forecasting, and demonstrates the potential of integrating sentiment analysis with traditional financial modeling techniques.
All Python dependencies are listed in requirements.txt.
This project can be divided into four main parts:
- Download market-capitalization data from CRSP, accessed through Wharton Research Data Services (WRDS)
- Download indicators
- Sentiment analysis
- Models
Data was downloaded from these sources:
- CRSP for the market capitalization of the S&P 500 and of the first 50 firms of the index
- Yahoo Finance for indicators
- The Economic Policy Uncertainty website for additional indicators
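Indicator series from Yahoo Finance can be pulled programmatically. A minimal sketch with the `yfinance` package, where the ticker list and date range are illustrative assumptions rather than the exact set used in the project:

```python
# Hedged sketch: downloading daily indicator series from Yahoo Finance.
# The tickers and date range below are examples, not the project's exact set.
import yfinance as yf

tickers = ["^GSPC", "^VIX", "^TNX"]  # S&P 500, VIX, 10-year Treasury yield
data = yf.download(tickers, start="2010-01-01", end="2023-01-01")["Close"]
data.to_csv("Data/indicators.csv")  # saved under the Data/ directory (see below)
```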
The sentiment analysis can be divided into these main steps:
- Scrape all information regarding the S&P 500 from Wikipedia, then scrape all news headlines from Markets Insider. Everything is done in scraper.py
- Run the sentiment scoring on all scraped headlines using sentimental.py (see the VADER sketch after this list)
- Plot and analyze the data scraped in the steps above in plots_SP500.ipynb
- Analyze the sentiment scores in sentimental_and_plots.ipynb
- Compute the extra weight given to the first 50 firms of the S&P 500. The code is in market_capitalization_weights.ipynb
- Adjust the sentiment scores using the weights computed above by running compute_weighted_sentiments.ipynb
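The scoring and weighting steps might look like the following sketch, assuming the headlines sit in a pandas column; the file name `sp500_news.csv` and the columns `headline` and `mcap_weight` are hypothetical:

```python
# Hedged sketch of the VADER scoring and market-cap weighting steps.
# File and column names are hypothetical, not the project's exact ones.
import nltk
import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
analyzer = SentimentIntensityAnalyzer()

news = pd.read_csv("Data/sp500_news.csv")
# Compound score in [-1, 1], from most negative to most positive sentiment.
news["compound"] = news["headline"].apply(
    lambda text: analyzer.polarity_scores(str(text))["compound"]
)
# Scale each firm's score by its market-cap weight (hypothetical column).
news["weighted_compound"] = news["compound"] * news["mcap_weight"]
```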
Due to GitHub's storage limitations, some of the CSV files used in the project cannot be uploaded. However, they can still be visualized here.
To avoid any problems when running the code, all CSV files have to be saved in a directory called Data. For instance, the CSV file sp500_news_and_sentimental.csv must be located at Data\sp500_news_and_sentimental.csv.
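In code, that layout translates to reading every file relative to a local Data/ directory, e.g.:

```python
# Hedged sketch: all CSV inputs are read from the local Data/ directory.
from pathlib import Path
import pandas as pd

DATA_DIR = Path("Data")
sentiments = pd.read_csv(DATA_DIR / "sp500_news_and_sentimental.csv")
```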
Before running the financial econometrics and ML models, we need to scrape financial data online and merge it with the data from the Oxford-Man Institute and TwelveData. All steps can be visualized in RV dataset.ipynb. Data analysis is then carried out in Data Analysis and Visualization.ipynb.
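A minimal sketch of the merging step, where the file names, the date columns, and the `rv5` realized-variance column are assumptions about the layout in RV dataset.ipynb:

```python
# Hedged sketch: merging scraped indicators with a realized-volatility series.
# File and column names are assumptions, not the notebook's exact ones.
import pandas as pd

rv = pd.read_csv("Data/oxford_man_rv.csv", parse_dates=["date"])
indicators = pd.read_csv("Data/indicators.csv", parse_dates=["Date"])

dataset = rv.merge(indicators, left_on="date", right_on="Date", how="inner")
# Realized volatility is the square root of the realized variance (rv5).
dataset["rv"] = dataset["rv5"] ** 0.5
dataset.to_csv("Data/rv_dataset.csv", index=False)
```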
After the data analysis, we ran the models:
- Regression models: Linear, Lasso, Ridge. The code can be found in Regression.ipynb.
- Financial econometrics models (GARCH and VAR), with the testing of all their assumptions. The code can be found in GARCH_VAR.ipynb (see the GARCH sketch after this list).
- ML models (Random Forest, XGBoost, RNN, LSTM), executed in ML models.ipynb.
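For the econometric step, a GARCH(1,1) fit might look like the following sketch with the `arch` package; the input file and the `close` column used to build the return series are assumptions:

```python
# Hedged sketch: fitting a GARCH(1,1) model with the `arch` package.
# The input file and `close` column are assumptions, not the notebook's exact ones.
import pandas as pd
from arch import arch_model

prices = pd.read_csv("Data/rv_dataset.csv", parse_dates=["date"], index_col="date")
returns = 100 * prices["close"].pct_change().dropna()  # percent returns for numerical stability

model = arch_model(returns, vol="GARCH", p=1, q=1, mean="Constant", dist="normal")
result = model.fit(disp="off")
print(result.summary())

forecast = result.forecast(horizon=1)  # one-step-ahead conditional variance
```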
All model results are collected and compared in Results.ipynb.
For any question or curiosity, feel free to reach out.