This repository provides an end‑to‑end pipeline—data cleaning, feature engineering, classical econometrics, sentiment fusion, and deep learning—for NVIDIA (NVDA) price forecasting.
- Forecast NVDA prices with ARIMA, SARIMAX, GARCH, and a hybrid Random‑Forest ➔ Bidirectional LSTM + Attention network.
- Explain drivers such as S&P 500, Google, Microsoft, Intel, macro factors, and a news‑based ImpactScore.
- Compare each layer of model complexity against a simple T‑2 baseline (price from two days prior).
Below is a list of each data file contained in the Data directory:
- Backup.csv
- Database.csv.csv
- cleaned_data.csv
- merged_data.csv
- merged_data2.csv
- merged_with_impact.csv
- merged_with_impact_score.csv
- nvidia_events_filtered.csv
Each file is used at a different stage of the data processing and analysis pipeline. Refer to the relevant sections of the project documentation for details on how each file is used.

    conda create -n nvda_env python=3.9
    conda activate nvda_env
    pip install -r requirements.txt

Key libraries: pandas • numpy • scikit-learn • statsmodels • arch • tensorflow (keras)

    git clone https://github.com/YourUsername/NVIDIA-Forecasting.git
    cd NVIDIA-Forecasting
    mkdir -p data
    # place cleaned_data.csv and merged_with_impact_score.csv here
    python "Total Code.py"

| Step | Script | Core Techniques |
|---|---|---|
| ① Pre‑processing | src/data_prep.py | drop NA/∞, StandardScaler, add 20‑/80‑day MA |
| ② Feature ranking | src/feature_select.py | Pearson corr, PCA 95 %, LassoCV, Random Forest |
| ③ ARIMA | src/arima.py | ADF test → ARIMA(4,1,0) rolling forecast |
| ④ SARIMAX & GARCH | src/sarimax_garch.py | exog = [SP500_log, ImpactScore, INTC_ret] |
| ⑤ Sentiment scrape | src/news_sentiment.py | VADER / FinBERT → daily ImpactScore |
| ⑥ Deep model | src/lstm_attention.py | RF meta‑feature ➔ 2×50 bi‑LSTM + Attention |
| ⑦ Plots | src/visualization.py | saves all figures to docs/img/ |

We start by removing NaN/Inf rows and standardising all numerical columns.
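
A minimal sketch of this cleaning step follows. The column names (`Close`, `SP500`, `MA20`, `MA80`), the helper name `preprocess`, and the order of operations (moving averages first, then dropping rows, then scaling) are assumptions made for illustration; the synthetic frame stands in for the project data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler


def preprocess(df: pd.DataFrame, price_col: str = "Close") -> pd.DataFrame:
    """Add 20-/80-day moving averages, drop NaN/Inf rows, standardise numeric columns."""
    out = df.copy()
    # Moving averages are computed on raw prices, before scaling.
    out["MA20"] = out[price_col].rolling(20).mean()
    out["MA80"] = out[price_col].rolling(80).mean()
    # Treat +/-inf as missing, then drop incomplete rows
    # (this also removes the 79 warm-up rows of the 80-day average).
    out = out.replace([np.inf, -np.inf], np.nan).dropna()
    num_cols = out.select_dtypes(include="number").columns
    out[num_cols] = StandardScaler().fit_transform(out[num_cols])
    return out


# Synthetic demo data (hypothetical, not the repository's CSV files).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Close": 100 + rng.normal(0, 1, 200).cumsum(),
    "SP500": 4000 + rng.normal(0, 5, 200).cumsum(),
})
demo.loc[5, "SP500"] = np.inf
clean = preprocess(demo)
print(clean.shape)
```

In a forecasting setting the scaler would normally be fitted on the training split only and then applied to the test split, so that test-period statistics do not leak into training.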
Two complementary feature‑ranking tracks are applied:
| Technique | Purpose | Outcome |
|---|---|---|
| PCA (95 % var) | Orthogonalise & compress | 9 principal components retained |
| LassoCV + Random Forest | Sparse, non‑parametric importance | Top drivers: SP500, MSFT_Adj_Close, ImpactScore, 20‑/80‑day MA |
These features feed every downstream model to ensure consistency and avoid look‑ahead bias.
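
The two tracks in the table can be sketched with scikit-learn as below. This is an illustration on synthetic data, not the project's `src/feature_select.py`: the function name `rank_features`, the feature names `f0`..`f5`, and the hyperparameters (5-fold CV, 200 trees) are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV


def rank_features(X, y, names, seed=0):
    """Return (PCA components for 95% variance, Lasso-selected names, RF ranking)."""
    # A float n_components keeps the smallest number of components
    # whose cumulative explained variance reaches 95%.
    pca = PCA(n_components=0.95, svd_solver="full").fit(X)
    # LassoCV picks the L1 penalty by cross-validation; zero coefficients are dropped.
    lasso = LassoCV(cv=5, random_state=seed).fit(X, y)
    selected = [n for n, c in zip(names, lasso.coef_) if abs(c) > 1e-8]
    # Impurity-based importances from a random forest, sorted descending.
    rf = RandomForestRegressor(n_estimators=200, random_state=seed).fit(X, y)
    order = np.argsort(rf.feature_importances_)[::-1]
    return pca.n_components_, selected, [names[i] for i in order]


# Synthetic, already standardised features; only f0 and f1 drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.1, 300)
names = [f"f{i}" for i in range(6)]
k, selected, ranked = rank_features(X, y, names)
print(k, selected, ranked[:2])
```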
The ADF test rejects the unit‑root hypothesis after one round of differencing, leading to an ARIMA(4,1,0) specification chosen via AIC grid search.
The autocorrelation structure is visualised below:
Both plots confirm strong short‑memory up to three lags, justifying the AR term.
Seasonality (12‑month) and exogenous regressors—SP500_log, ImpactScore, INTC_ret—are introduced in a SARIMAX(1,0,1)(0,1,1,12) framework.
Left panel: fitted vs. observed shows tight tracking.
Right panel: standardised residuals & Q‑Q plot indicate near‑normality with mild tail risk—later captured by GARCH.
A GARCH(1,1) layer is fitted to the log‑return residuals to model the remaining volatility clustering, yielding a log‑likelihood of −2156 (↑ vs. ARIMA).
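
To make the GARCH(1,1) layer concrete, the sketch below implements only the conditional-variance recursion and the Gaussian log-likelihood in NumPy. The parameter values (`omega`, `alpha`, `beta`) are hypothetical; in practice they are estimated by maximum likelihood, for example with the `arch` package listed among the key libraries.

```python
import numpy as np


def garch11_variance(resid, omega, alpha, beta):
    """sigma2[t] = omega + alpha * resid[t-1]**2 + beta * sigma2[t-1]."""
    resid = np.asarray(resid, dtype=float)
    sigma2 = np.empty_like(resid)
    sigma2[0] = resid.var()  # a common initialisation: the sample variance
    for t in range(1, len(resid)):
        sigma2[t] = omega + alpha * resid[t - 1] ** 2 + beta * sigma2[t - 1]
    return sigma2


def gaussian_loglik(resid, sigma2):
    """Log-likelihood of residuals under N(0, sigma2[t])."""
    resid = np.asarray(resid, dtype=float)
    return float(-0.5 * np.sum(np.log(2 * np.pi * sigma2) + resid ** 2 / sigma2))


# Hypothetical residuals and parameters, for illustration only.
rng = np.random.default_rng(3)
resid = rng.normal(0, 1, 500)
sigma2 = garch11_variance(resid, omega=0.1, alpha=0.1, beta=0.8)
print(gaussian_loglik(resid, sigma2))
```

A large squared residual raises the next day's variance through `alpha`, and `beta` makes that increase decay slowly, which is how the model represents volatility clustering.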
Daily news headlines are scored by VADER and FinBERT; scores are averaged into an ImpactScore that enters SARIMAX and LSTM as a leading indicator.
Scatter illustrates a Pearson‑r = 0.43 between ImpactScore and next‑day return.
Event overlay shows price jumps aligning with major positive (green) and negative (red) news.
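
The aggregation into a daily ImpactScore and its correlation with the next-day return can be sketched with pandas as below. The column names (`date`, `vader`, `finbert`), the equal-weight average, and the assumption that both scores share a common scale are illustrative; the scraping and scoring themselves are not reproduced here.

```python
import pandas as pd


def daily_impact_score(headlines: pd.DataFrame) -> pd.Series:
    """Average the two sentiment scores per headline, then per calendar day."""
    per_headline = headlines[["vader", "finbert"]].mean(axis=1)
    return per_headline.groupby(headlines["date"]).mean().rename("ImpactScore")


def next_day_correlation(impact: pd.Series, close: pd.Series) -> float:
    """Pearson r between today's ImpactScore and tomorrow's return."""
    next_ret = close.pct_change().shift(-1)  # return realised on the following day
    return float(impact.corr(next_ret))


# Hypothetical pre-scored headlines.
headlines = pd.DataFrame({
    "date": ["2024-01-02", "2024-01-02", "2024-01-03"],
    "vader": [0.5, -0.1, 0.2],
    "finbert": [0.7, 0.1, 0.4],
})
score = daily_impact_score(headlines)

# Toy series in which the next-day return is exactly 1% per unit of impact.
impact = pd.Series([1.0, 2.0, 3.0, 0.0])
close = pd.Series([100.0, 101.0, 103.02, 106.1106])
r = next_day_correlation(impact, close)
print(score.to_dict(), round(r, 3))
```

Shifting the return by one day, rather than the score, is what makes the score a leading indicator in this calculation.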
- Random Forest predicts next‑day price to create a meta‑feature.
- Neural net architecture:
Input (20 features, T‑2 window)
└─ bi‑LSTM (50) ─┐
└─ bi‑LSTM (50) ─┘ → Attention → LSTM (50) → Dense(1)
- Regularisation: Dropout 0.3, L2 0.01, EarlyStopping (patience 10).
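
One possible reading of the architecture above, written with Keras, is sketched below. Several details are assumptions, since the text does not fix them: the two bi‑LSTM layers are stacked, the attention is dot-product self-attention (`layers.Attention`), dropout follows each bi‑LSTM, L2 is applied to the kernel weights, and the model is trained with Adam on an MAE loss. The actual `src/lstm_attention.py` may differ on any of these points.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, regularizers


def build_model(time_steps=2, n_features=20, units=50, dropout=0.3, l2=0.01):
    """2 x bi-LSTM(50) -> self-attention -> LSTM(50) -> Dense(1)."""
    reg = regularizers.l2(l2)
    inputs = layers.Input(shape=(time_steps, n_features))
    x = layers.Bidirectional(
        layers.LSTM(units, return_sequences=True, kernel_regularizer=reg))(inputs)
    x = layers.Dropout(dropout)(x)
    x = layers.Bidirectional(
        layers.LSTM(units, return_sequences=True, kernel_regularizer=reg))(x)
    x = layers.Dropout(dropout)(x)
    # Dot-product self-attention: the sequence attends to itself (query = value).
    x = layers.Attention()([x, x])
    x = layers.LSTM(units, kernel_regularizer=reg)(x)
    outputs = layers.Dense(1)(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mae")
    return model


# Stops training after 10 epochs without validation improvement.
early_stop = tf.keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)
model = build_model()
dummy_out = model(np.zeros((4, 2, 20), dtype="float32"))
print(model.output_shape, dummy_out.shape)
```

Training would then pass `callbacks=[early_stop]` together with a validation split to `model.fit`, with the Random Forest prediction included as one of the 20 input features.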
The plot compares Actual (blue), LSTM (red), and the naive T‑2 baseline (green).
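
For reference, the naive T‑2 baseline and its MAE can be computed as in the sketch below. The function names and the toy price list are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import numpy as np


def t2_baseline(prices):
    """Predict each day's price with the price from two days earlier."""
    prices = np.asarray(prices, dtype=float)
    return prices[:-2], prices[2:]  # (predictions, targets)


def mae(pred, actual):
    """Mean absolute error."""
    return float(np.mean(np.abs(np.asarray(pred) - np.asarray(actual))))


prices = [100.0, 102.0, 101.0, 105.0, 107.0]
pred, actual = t2_baseline(prices)
baseline_mae = mae(pred, actual)
print(baseline_mae)
```

Here the absolute errors are 1, 3, and 6, so the baseline MAE is 10/3. Any model score should be evaluated on the same target dates as the baseline for the comparison to be fair.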
| Model | Inputs | Best Test Metric | T‑2 Baseline |
|---|---|---|---|
| ARIMA(4,1,0) | Close | MSE 34.2 | — |
| SARIMAX | Close + exog | MSE 22.5 | — |
| GARCH(1,1) | log‑σ² | LLH ↑ −2156 | — |
| LSTM‑Attention | 20 features, time_steps = 2 | MAE 2.42 | 3.44 |
- Roughly 30 % MAE reduction from the T‑2 baseline (3.44) to the LSTM (2.42).
- SARIMAX cuts ARIMA's MSE by about a third (34.2 → 22.5) by injecting macro + sentiment.
- Residual heavy tails in SARIMAX are mostly neutralised after the GARCH layer.
- Xiao, Q., & Ihnaini, B. (2023). Stock trend prediction using sentiment analysis. PeerJ Computer Science.
- Yahoo Finance – NVDA
- Mnih, V. et al. (2016). Asynchronous Methods for Deep Reinforcement Learning. arXiv:1602.01783.
Thank you for visiting this project. If you have any questions or suggestions, feel free to open an issue or submit a pull request.






