diff --git a/docs/machine-learning/machine-learning-core/model-evaluation/validation-techniques/k-fold-cross-validation.mdx b/docs/machine-learning/machine-learning-core/model-evaluation/validation-techniques/k-fold-cross-validation.mdx
index e69de29..5cf6d0a 100644
--- a/docs/machine-learning/machine-learning-core/model-evaluation/validation-techniques/k-fold-cross-validation.mdx
+++ b/docs/machine-learning/machine-learning-core/model-evaluation/validation-techniques/k-fold-cross-validation.mdx
@@ -0,0 +1,120 @@
+---
+title: K-Fold Cross-Validation
+sidebar_label: K-Fold Cross-Validation
+description: "Mastering robust model evaluation by rotating training and testing sets to maximize data utility."
+tags: [machine-learning, model-evaluation, cross-validation, k-fold, generalization]
+---
+
+While a [Train-Test Split](./train-test-split) is a great starting point, it has a major weakness: your results can vary significantly depending on which specific rows end up in the test set.
+
+**K-Fold Cross-Validation** solves this by repeating the split process multiple times and averaging the results, ensuring every single data point gets to be part of the "test set" exactly once.
+
+## 1. How the Algorithm Works
+
+The process follows a simple rotation logic (sketched in code below the list):
+1. **Split** the data into **K** equal-sized "folds" (usually $K=5$ or $K=10$).
+2. **Iterate:** For each fold $i$:
+   * Treat Fold $i$ as the **Test Set**.
+   * Treat the remaining $K-1$ folds as the **Training Set**.
+   * Train the model and record the score.
+3. **Aggregate:** Calculate the mean and standard deviation of all $K$ scores.
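+
+To make the rotation concrete, here is a minimal NumPy-only sketch of that loop. The data, the fixed seed, and the tiny least-squares "model" are hypothetical stand-ins; in practice you would plug in your own estimator (Section 4 shows the Scikit-Learn version).
+
+```python
+import numpy as np
+
+# Illustrative data: 20 samples, 1 feature (made up for demonstration)
+rng = np.random.default_rng(42)
+X = rng.normal(size=(20, 1))
+y = 3 * X[:, 0] + rng.normal(scale=0.5, size=20)
+
+K = 5
+indices = rng.permutation(len(X))    # shuffle once
+folds = np.array_split(indices, K)   # K roughly equal folds
+
+scores = []
+for i in range(K):
+    test_idx = folds[i]              # fold i plays the test set
+    train_idx = np.concatenate([folds[j] for j in range(K) if j != i])
+
+    # Stand-in "model": a 1-D least-squares line fit on the training folds
+    slope, intercept = np.polyfit(X[train_idx, 0], y[train_idx], deg=1)
+    y_pred = slope * X[test_idx, 0] + intercept
+
+    scores.append(np.mean((y[test_idx] - y_pred) ** 2))  # MSE on the held-out fold
+
+print(f"MSE per fold: {np.round(scores, 3)}")
+print(f"Mean: {np.mean(scores):.3f} | Std: {np.std(scores):.3f}")
+```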
+
+## 2. Visualizing the Process
+
+```mermaid
+graph TB
+    TITLE["$$\text{K-Fold Cross-Validation}$$"]
+
+    %% Dataset
+    TITLE --> DATA["$$\text{Full Dataset}$$"]
+
+    %% Folds
+    DATA --> F1["$$\text{Fold 1}$$"]
+    DATA --> F2["$$\text{Fold 2}$$"]
+    DATA --> F3["$$\text{Fold 3}$$"]
+    DATA --> Fk["$$\text{Fold } k$$"]
+
+    %% Iterations
+    F1 --> I1["$$\text{Iteration 1}$$
$$\text{Validation: Fold 1}$$
$$\text{Training: Others}$$"]
+    F2 --> I2["$$\text{Iteration 2}$$
$$\text{Validation: Fold 2}$$
$$\text{Training: Others}$$"]
+    F3 --> I3["$$\text{Iteration 3}$$
$$\text{Validation: Fold 3}$$
$$\text{Training: Others}$$"]
+    Fk --> Ik["$$\text{Iteration } k$$
$$\text{Validation: Fold } k$$
$$\text{Training: Others}$$"]
+
+    %% Model Training & Evaluation
+    I1 --> M1["$$\text{Train Model}$$"]
+    I2 --> M2["$$\text{Train Model}$$"]
+    I3 --> M3["$$\text{Train Model}$$"]
+    Ik --> Mk["$$\text{Train Model}$$"]
+
+    M1 --> S1["$$\text{Score}_1$$"]
+    M2 --> S2["$$\text{Score}_2$$"]
+    M3 --> S3["$$\text{Score}_3$$"]
+    Mk --> Sk["$$\text{Score}_k$$"]
+
+    %% Final Result
+    S1 --> AVG["$$\text{Average Score}$$"]
+    S2 --> AVG
+    S3 --> AVG
+    Sk --> AVG
+
+    AVG --> PERF["$$\text{Cross-Validated Performance}$$"]
+
+```
+
+## 3. Why Use K-Fold?
+
+### A. Reliability (Reducing Variance)
+
+By averaging $K$ different test scores, you get a much more stable estimate of how the model will perform on new data. It eliminates the "luck of the draw."
+
+### B. Maximum Data Utility
+
+In a standard split, 20% of your data is never used for training. In K-Fold, every data point is used for training $K-1$ times and for testing exactly once. This is especially vital for small datasets.
+
+### C. Hyperparameter Tuning
+
+K-Fold is the foundation for **Grid Search**. It helps you find the best settings for your model (like the depth of a tree) without overfitting to one specific validation set.
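+
+Since Grid Search is mentioned above, here is a brief sketch of that pairing with `GridSearchCV`; the synthetic dataset and the parameter grid are illustrative assumptions, not part of the original example.
+
+```python
+from sklearn.datasets import make_classification
+from sklearn.ensemble import RandomForestClassifier
+from sklearn.model_selection import GridSearchCV, KFold
+
+# Synthetic data, for demonstration only
+X, y = make_classification(n_samples=200, n_features=10, random_state=42)
+
+# Hypothetical parameter grid -- tune whatever matters for your model
+param_grid = {"max_depth": [3, 5, None], "n_estimators": [50, 100]}
+
+# Every candidate setting is scored with the same 5-fold rotation
+grid = GridSearchCV(
+    RandomForestClassifier(random_state=42),
+    param_grid,
+    cv=KFold(n_splits=5, shuffle=True, random_state=42),
+    scoring="accuracy",
+)
+grid.fit(X, y)
+
+print(f"Best parameters: {grid.best_params_}")
+print(f"Best cross-validated accuracy: {grid.best_score_:.4f}")
+```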
+
+## 4. Implementation with Scikit-Learn
+
+```python
+from sklearn.model_selection import cross_val_score, KFold
+from sklearn.ensemble import RandomForestClassifier
+
+# 1. Initialize model and data
+# (Assume X contains the features and y contains the target)
+model = RandomForestClassifier()
+
+# 2. Define the K-Fold strategy
+kf = KFold(n_splits=5, shuffle=True, random_state=42)
+
+# 3. Perform Cross-Validation
+# This returns an array of 5 scores
+scores = cross_val_score(model, X, y, cv=kf, scoring='accuracy')
+
+print(f"Scores for each fold: {scores}")
+print(f"Mean Accuracy: {scores.mean():.4f}")
+print(f"Standard Deviation: {scores.std():.4f}")
+```
+
+## 5. Variations of Cross-Validation
+
+* **Stratified K-Fold:** Used for imbalanced data. It ensures each fold has the same percentage of samples for each class as the whole dataset.
+* **Leave-One-Out (LOOCV):** An extreme case where $K$ equals the total number of samples ($N$). Extremely computationally expensive, but it uses the most data possible.
+* **Time-Series Split:** Unlike random K-Fold, this respects the chronological order of data (training on the past, testing on the future).
+
+## 6. Pros and Cons
+
+| Advantages | Disadvantages |
+| --- | --- |
+| **Robustness:** Provides a more accurate measure of model generalization. | **Computationally Expensive:** Training the model $K$ times takes $K$ times longer. |
+| **Confidence:** The standard deviation tells you how "stable" the model is. | **Not for Big Data:** If your model takes 10 hours to train, doing it 10 times is often impractical. |
+
+## References
+
+* **Scikit-Learn:** [Cross-Validation Guide](https://scikit-learn.org/stable/modules/cross_validation.html)
+* **StatQuest:** [K-Fold Cross-Validation Explained](https://www.youtube.com/watch?v=fSytzGwwBVw)
+
+---
+
+**Now that you have a robust way to validate your model, how do you handle data where the classes are heavily skewed (e.g., 99% vs 1%)?**
\ No newline at end of file
diff --git a/docs/machine-learning/machine-learning-core/model-evaluation/validation-techniques/loocv.mdx b/docs/machine-learning/machine-learning-core/model-evaluation/validation-techniques/loocv.mdx
index e69de29..e0a8b8b 100644
--- a/docs/machine-learning/machine-learning-core/model-evaluation/validation-techniques/loocv.mdx
+++ b/docs/machine-learning/machine-learning-core/model-evaluation/validation-techniques/loocv.mdx
@@ -0,0 +1,129 @@
+---
+title: "Leave-One-Out Cross-Validation (LOOCV)"
+sidebar_label: LOOCV
+description: "The most exhaustive validation technique: training on N-1 samples and testing on a single observation."
+tags: [machine-learning, model-evaluation, loocv, cross-validation, small-data]
+---
+
+**Leave-One-Out Cross-Validation (LOOCV)** is an extreme case of [K-Fold Cross-Validation](./k-fold-cross-validation). Instead of splitting the data into 5 or 10 groups, LOOCV sets $K$ equal to $N$, the total number of data points in your set.
+
+In each iteration, the model is trained on every data point except **one**, which is used as the test set.
+
+## 1. How the Algorithm Works
+
+If you have a dataset with $n$ samples:
+1. **Select** the first sample to be the test set.
+2. **Train** the model on the remaining $n-1$ samples.
+3. **Evaluate** the model on the single test sample and record the error.
+4. **Repeat** this process $n$ times, so that each sample serves as the test set exactly once.
+5. **Average** the $n$ resulting errors to get the final performance metric.
+
+## 2. Mathematical Representation
+
+The LOOCV estimate of the test error is the average of these $n$ test errors:
+
+$$
+CV_{(n)} = \frac{1}{n} \sum_{i=1}^{n} Err_i
+$$
+
+Where $Err_i$ is the error (e.g., Mean Squared Error or Misclassification) calculated on the $i^{th}$ observation when the model was fit using all data except that observation.
+
+```mermaid
+graph TB
+    TITLE["$$\text{Leave-One-Out Cross-Validation (LOOCV)}$$"]
+
+    %% Dataset
+    TITLE --> DATA["$$\text{Dataset with } n \text{ Observations}$$"]
+
+    %% Leaving One Out
+    DATA --> L1["$$\text{Hold Out Observation } 1$$"]
+    DATA --> L2["$$\text{Hold Out Observation } 2$$"]
+    DATA --> Li["$$\text{Hold Out Observation } i$$"]
+    DATA --> Ln["$$\text{Hold Out Observation } n$$"]
+
+    %% Training
+    L1 --> T1["$$\text{Train on } n-1 \text{ samples}$$"]
+    L2 --> T2["$$\text{Train on } n-1 \text{ samples}$$"]
+    Li --> Ti["$$\text{Train on } n-1 \text{ samples}$$"]
+    Ln --> Tn["$$\text{Train on } n-1 \text{ samples}$$"]
+
+    %% Error Computation
+    T1 --> E1["$$Err_1$$"]
+    T2 --> E2["$$Err_2$$"]
+    Ti --> Ei["$$Err_i$$"]
+    Tn --> En["$$Err_n$$"]
+
+    %% Averaging Errors
+    E1 --> AVG["$$CV_{(n)} = \frac{1}{n}\sum_{i=1}^{n} Err_i$$"]
+    E2 --> AVG
+    Ei --> AVG
+    En --> AVG
+
+    AVG --> EST["$$\text{Estimated Test Error}$$"]
+
+```
+
+## 3. When to Use LOOCV?
+
+### Small Datasets
+
+When you only have 20 or 50 samples, a standard 80/20 split would leave you with very little data for training. LOOCV allows you to use $n-1$ samples for training, maximizing the model's ability to learn the underlying patterns.
+
+### Bias vs. Variance
+
+* **Low Bias:** Since we use almost all the data for training in each step, the model behaves very similarly to how it would if trained on the full dataset.
+* **High Variance:** Because the training sets in each iteration are almost identical (overlapping by $n-2$ samples), the outputs are highly correlated. This can lead to a higher variance in the final error estimate compared to K-Fold.
+
+## 4. Implementation with Scikit-Learn
+
+```python
+from sklearn.model_selection import LeaveOneOut, cross_val_score
+from sklearn.linear_model import LinearRegression
+import numpy as np
+
+# 1. Initialize data and model
+X = np.array([[1], [2], [3], [4]])
+y = np.array([2, 3.9, 6.1, 8.2])
+model = LinearRegression()
+
+# 2. Initialize LOOCV
+loo = LeaveOneOut()
+
+# 3. Perform Cross-Validation
+# This will run 4 times because we have 4 samples
+scores = cross_val_score(model, X, y, cv=loo, scoring='neg_mean_squared_error')
+
+print(f"MSE for each iteration: {np.abs(scores)}")
+print(f"Average MSE: {np.abs(scores).mean():.4f}")
+```
+
+## 5. LOOCV vs. K-Fold Cross-Validation
+
+| Feature | LOOCV | K-Fold ($K=10$) |
+| --- | --- | --- |
+| **Computations** | $N$ (Total samples) | 10 |
+| **Computational Cost** | Very High | Moderate |
+| **Bias** | Extremely Low | Higher than LOOCV |
+| **Variance** | High | Low |
+| **Best For** | Small datasets ($N < 100$) | Large/Standard datasets |
+
+## 6. The "Shortcut" for Linear Regression
+
+For certain models like **Linear Regression**, you don't actually have to train the model $n$ times. There is a mathematical identity that allows you to calculate the LOOCV error with a single model fit:
+
+$$
+CV_{(n)} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{1 - h_i} \right)^2
+$$
+
+Where $h_i$ is the leverage (diagonal of the hat matrix). This makes LOOCV as fast as a single training session for linear models!
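+
+As a sanity check of this identity, the following sketch (with made-up data) computes the leverage-based shortcut by hand and compares it to the explicit `LeaveOneOut` loop. The design-matrix construction is an illustrative derivation, not a library API.
+
+```python
+import numpy as np
+from sklearn.linear_model import LinearRegression
+from sklearn.model_selection import LeaveOneOut, cross_val_score
+
+# Made-up data for illustration
+X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
+y = np.array([1.9, 4.2, 5.8, 8.1, 9.9])
+
+# Explicit LOOCV: n separate model fits
+loo_mse = -cross_val_score(LinearRegression(), X, y, cv=LeaveOneOut(),
+                           scoring='neg_mean_squared_error').mean()
+
+# Shortcut: a single fit plus the leverages h_i from the hat matrix
+X_design = np.column_stack([np.ones(len(X)), X])   # add intercept column
+H = X_design @ np.linalg.inv(X_design.T @ X_design) @ X_design.T
+h = np.diag(H)                                     # leverage of each point
+residuals = y - LinearRegression().fit(X, y).predict(X)
+shortcut_mse = np.mean((residuals / (1 - h)) ** 2)
+
+print(f"Explicit LOOCV MSE: {loo_mse:.6f}")
+print(f"Shortcut LOOCV MSE: {shortcut_mse:.6f}")   # matches the explicit value
+```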
+
+## References
+
+* **An Introduction to Statistical Learning (ISLR):** Chapter 5.1.2 covers LOOCV in depth.
+* **Scikit-Learn:** [LeaveOneOut Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.LeaveOneOut.html)
+
+---
+
+**LOOCV is great for small data, but what if your classes are imbalanced (e.g., 99% vs 1%)? Standard LOOCV might struggle to capture the minority class.**
diff --git a/docs/machine-learning/machine-learning-core/model-evaluation/validation-techniques/train-test-split.mdx b/docs/machine-learning/machine-learning-core/model-evaluation/validation-techniques/train-test-split.mdx
index e69de29..89bd15f 100644
--- a/docs/machine-learning/machine-learning-core/model-evaluation/validation-techniques/train-test-split.mdx
+++ b/docs/machine-learning/machine-learning-core/model-evaluation/validation-techniques/train-test-split.mdx
@@ -0,0 +1,94 @@
+---
+title: Train-Test Split
+sidebar_label: Train-Test Split
+description: "Mastering the data partitioning process to ensure unbiased model evaluation."
+tags: [machine-learning, model-evaluation, training, testing, generalization]
+---
+
+The **Train-Test Split** is a technique used to evaluate the performance of a machine learning algorithm. It involves taking your primary dataset and partitioning it into two separate subsets: one to build the model and another to validate its predictions.
+
+## 1. Why do we split data?
+
+In Machine Learning, we don't care how well a model remembers the past; we care how well it predicts the **future**.
+
+If we train our model on the *entire* dataset, we have no way of knowing whether the model actually learned the underlying patterns or simply memorized the noise in that specific data. Testing on the same data used for training is a "cardinal sin" known as **Data Leakage**, and it produces hopelessly optimistic scores.
+
+## 2. The Partitioning Logic
+
+Typically, the data is split into two (or sometimes three) parts:
+
+1. **Training Set (70-80%):** This is the data used by the algorithm to learn the relationships between features and targets.
+2. **Test Set (20-30%):** This data is kept in a "vault." The model never sees it during training. It is used only at the very end to provide an unbiased evaluation.
+
+```mermaid
+graph TB
+    TITLE["$$\text{Data Partitioning Logic}$$"]
+
+    %% Full Dataset
+    TITLE --> DATA["$$\text{Full Dataset (100\%)}$$"]
+
+    %% Split
+    DATA --> TRAIN["$$\text{Training Set}$$
$$70\% \text{ to } 80\%$$"]
+    DATA --> TEST["$$\text{Test Set}$$
$$20\% \text{ to } 30\%$$"]
+
+    %% Training Path
+    TRAIN --> LEARN["$$\text{Model Learning}$$"]
+    LEARN --> FIT["$$\text{Learns Patterns and Relationships}$$"]
+
+    %% Test Path
+    TEST --> VAULT["$$\text{Evaluation Vault}$$"]
+    VAULT --> LOCK["$$\text{Never Seen During Training}$$"]
+    LOCK --> EVAL["$$\text{Final Unbiased Evaluation}$$"]
+
+    %% Emphasis
+    FIT -.->|"$$\text{Training Only}$$"| TRAIN
+    EVAL -.->|"$$\text{Used Once at the End}$$"| TEST
+
+```
+
+## 3. Important Considerations
+
+### Randomness and Reproducibility
+
+When splitting data, we use a random process. However, for scientific consistency, we fix a **Random State** (seed). This ensures that every time you run your code, you get the exact same split, making your experiments reproducible.
+
+### Stratification
+
+If you are working with imbalanced classes (e.g., 90% "Healthy", 10% "Sick"), a simple random split might accidentally put all the "Sick" cases in the training set and none in the test set.
+**Stratified Splitting** ensures that the proportion of classes is preserved in both the training and testing subsets, as the short sketch below illustrates.
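+
+Here is a small illustrative comparison of the two behaviours, assuming hypothetical 90/10 labels; only `train_test_split` and its `stratify` argument come from the example below, the rest is made up for demonstration.
+
+```python
+from collections import Counter
+
+import numpy as np
+from sklearn.model_selection import train_test_split
+
+# Hypothetical imbalanced labels: 90% "Healthy", 10% "Sick"
+y = np.array(["Healthy"] * 90 + ["Sick"] * 10)
+X = np.arange(len(y)).reshape(-1, 1)  # dummy feature column
+
+# Plain random split vs. stratified split
+_, _, _, y_test_plain = train_test_split(X, y, test_size=0.2, random_state=7)
+_, _, _, y_test_strat = train_test_split(X, y, test_size=0.2, random_state=7, stratify=y)
+
+# The stratified test set always keeps the 90/10 ratio; the plain one may drift
+print("Random split test classes:    ", Counter(y_test_plain))
+print("Stratified split test classes:", Counter(y_test_strat))
+```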
+
+## 4. Implementation with Scikit-Learn
+
+```python
+from sklearn.model_selection import train_test_split
+
+# Assume X contains features and y contains the target
+X_train, X_test, y_train, y_test = train_test_split(
+    X,
+    y,
+    test_size=0.2,     # 20% for testing
+    random_state=42,   # For reproducibility
+    stratify=y         # Keep class proportions equal
+)
+
+print(f"Training samples: {len(X_train)}")
+print(f"Testing samples: {len(X_test)}")
+```
+
+## 5. Pros and Cons
+
+| Advantages | Disadvantages |
+| --- | --- |
+| **Simplicity:** Very easy to understand and implement. | **High Variance:** If the dataset is small, a different random split can lead to very different results. |
+| **Speed:** Fast to compute, as the model is only trained once. | **Waste of Data:** A portion of your valuable data is never used to train the model. |
+| **Standard Practice:** The universal starting point for any ML project. | **Not for Time-Series:** Random splitting ruins data where order matters (e.g., stock prices). |
+
+## References
+
+* **Scikit-Learn:** [train_test_split Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)
+* **Google ML Crash Course:** [Splitting Data](https://developers.google.com/machine-learning/crash-course/training-and-test-sets/splitting-data)
+
+---
+
+**A single split is a good start, but what if your "random" test set happens to be particularly easy or hard? To solve this, we use a more robust technique.**
\ No newline at end of file