---
title: K-Fold Cross-Validation
sidebar_label: K-Fold Cross-Validation
description: "Mastering robust model evaluation by rotating training and testing sets to maximize data utility."
tags: [machine-learning, model-evaluation, cross-validation, k-fold, generalization]
---

While a [Train-Test Split](./train-test-split) is a great starting point, it has a major weakness: your results can vary significantly depending on which specific rows end up in the test set.

**K-Fold Cross-Validation** solves this by repeating the split process multiple times and averaging the results, ensuring every single data point gets to be part of the "test set" exactly once.

## 1. How the Algorithm Works

The process follows a simple rotation logic (sketched in code after this list):
1. **Split** the data into **K** equal-sized "folds" (usually $K=5$ or $K=10$).
2. **Iterate:** For each fold $i$:
   * Treat Fold $i$ as the **Test Set**.
   * Treat the remaining $K-1$ folds as the **Training Set**.
   * Train the model and record the score.
3. **Aggregate:** Calculate the mean and standard deviation of all $K$ scores.
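
To make the rotation concrete, here is a minimal sketch of the loop written out by hand. The synthetic dataset from `make_classification` and the `LogisticRegression` model are assumptions purely for illustration; any estimator and dataset would work the same way.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

# Synthetic data, purely for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []

for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    # The current fold is the test set; the remaining K-1 folds form the training set
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    score = accuracy_score(y[test_idx], model.predict(X[test_idx]))
    scores.append(score)
    print(f"Fold {fold}: accuracy = {score:.4f}")

print(f"Mean = {np.mean(scores):.4f}, Std = {np.std(scores):.4f}")
```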

## 2. Visualizing the Process

```mermaid
graph TB
TITLE["$$\text{K-Fold Cross-Validation}$$"]

%% Dataset
TITLE --> DATA["$$\text{Full Dataset}$$"]

%% Folds
DATA --> F1["$$\text{Fold 1}$$"]
DATA --> F2["$$\text{Fold 2}$$"]
DATA --> F3["$$\text{Fold 3}$$"]
DATA --> Fk["$$\text{Fold } k$$"]

%% Iterations
F1 --> I1["$$\text{Iteration 1}$$<br/>$$\text{Validation: Fold 1}$$<br/>$$\text{Training: Others}$$"]
F2 --> I2["$$\text{Iteration 2}$$<br/>$$\text{Validation: Fold 2}$$<br/>$$\text{Training: Others}$$"]
F3 --> I3["$$\text{Iteration 3}$$<br/>$$\text{Validation: Fold 3}$$<br/>$$\text{Training: Others}$$"]
Fk --> Ik["$$\text{Iteration } k$$<br/>$$\text{Validation: Fold } k$$<br/>$$\text{Training: Others}$$"]

%% Model Training & Evaluation
I1 --> M1["$$\text{Train Model}$$"]
I2 --> M2["$$\text{Train Model}$$"]
I3 --> M3["$$\text{Train Model}$$"]
Ik --> Mk["$$\text{Train Model}$$"]

M1 --> S1["$$\text{Score}_1$$"]
M2 --> S2["$$\text{Score}_2$$"]
M3 --> S3["$$\text{Score}_3$$"]
Mk --> Sk["$$\text{Score}_k$$"]

%% Final Result
S1 --> AVG["$$\text{Average Score}$$"]
S2 --> AVG
S3 --> AVG
Sk --> AVG

AVG --> PERF["$$\text{Cross-Validated Performance}$$"]

```

## 3. Why Use K-Fold?

### A. Reliability (Reducing Variance)

By averaging $K$ different test scores, you get a much more stable estimate of how the model will perform on new data, greatly reducing the "luck of the draw" of a single split.

### B. Maximum Data Utility

In a standard 80/20 split, 20% of your data is never used for training. In K-Fold, every data point is used for training $K-1$ times and for testing exactly once. This is especially valuable for small datasets.

### C. Hyperparameter Tuning

K-Fold is the foundation for **Grid Search**. It helps you find the best settings for your model (like the depth of a tree) without overfitting to one specific validation set.
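
As a sketch of how this fits together (the parameter grid and synthetic data below are assumptions for illustration), `GridSearchCV` re-runs the same K-Fold rotation for every hyperparameter combination and keeps the best average score:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic data, purely for illustration
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

param_grid = {"max_depth": [3, 5, None], "n_estimators": [50, 100]}
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Each of the 6 parameter combinations is scored with the same 5-fold rotation
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=kf, scoring="accuracy")
search.fit(X, y)

print(f"Best parameters: {search.best_params_}")
print(f"Best cross-validated accuracy: {search.best_score_:.4f}")
```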

## 4. Implementation with Scikit-Learn

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, KFold

# 1. Initialize model and data (synthetic data so the example runs end-to-end)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
model = RandomForestClassifier(random_state=42)

# 2. Define the K-Fold strategy
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# 3. Perform Cross-Validation
# This returns an array of 5 scores
scores = cross_val_score(model, X, y, cv=kf, scoring='accuracy')

print(f"Scores for each fold: {scores}")
print(f"Mean Accuracy: {scores.mean():.4f}")
print(f"Standard Deviation: {scores.std():.4f}")

```

## 5. Variations of Cross-Validation

* **Stratified K-Fold:** Used for imbalanced data. It ensures each fold has the same percentage of samples from each class as the whole dataset (see the sketch after this list).
* **Leave-One-Out (LOOCV):** An extreme case where $K$ equals the total number of samples ($N$). Extremely computationally expensive, but it uses the most data possible for training.
* **Time-Series Split:** Unlike random K-Fold, this respects the chronological order of the data (training on the past, testing on the future).
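
A minimal sketch of the first and last variations, using tiny made-up arrays (the 75/25 class split and the 20-point series are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 15 + [1] * 5)  # imbalanced labels: 75% vs 25%

# Stratified K-Fold: every test fold keeps the 75/25 class ratio
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in skf.split(X, y):
    print("Class counts in test fold:", np.bincount(y[test_idx]))

# Time-Series Split: each test block comes strictly after its training block
tss = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tss.split(X):
    print(f"Train indices 0..{train_idx[-1]}, test indices {test_idx[0]}..{test_idx[-1]}")
```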

## 6. Pros and Cons

| Advantages | Disadvantages |
| --- | --- |
| **Robustness:** Provides a more accurate measure of model generalization. | **Computationally Expensive:** Training the model $K$ times takes $K$ times longer. |
| **Confidence:** The standard deviation tells you how "stable" the model is. | **Not for Big Data:** If your model takes 10 hours to train, doing it 10 times is often impractical. |

## References

* **Scikit-Learn:** [Cross-Validation Guide](https://scikit-learn.org/stable/modules/cross_validation.html)
* **StatQuest:** [K-Fold Cross-Validation Explained](https://www.youtube.com/watch?v=fSytzGwwBVw)

---

**Now that you have a robust way to validate your model, how do you handle data where the classes are heavily skewed (e.g., 99% vs 1%)?**
---
title: "Leave-One-Out Cross-Validation (LOOCV)"
sidebar_label: LOOCV
description: "The most exhaustive validation technique: training on N-1 samples and testing on a single observation."
tags: [machine-learning, model-evaluation, loocv, cross-validation, small-data]
---

**Leave-One-Out Cross-Validation (LOOCV)** is an extreme case of [K-Fold Cross-Validation](./k-fold-cross-validation). Instead of splitting the data into 5 or 10 groups, LOOCV sets $K$ equal to $N$, the total number of data points in your set.

In each iteration, the model is trained on every data point except **one**, which is used as the test set.

## 1. How the Algorithm Works

If you have a dataset with $n$ samples:
1. **Select** the first sample to be the test set.
2. **Train** the model on the remaining $n-1$ samples.
3. **Evaluate** the model on the single test sample and record the error.
4. **Repeat** this process $n$ times, so that each sample serves as the test set exactly once.
5. **Average** the $n$ resulting errors to get the final performance metric (the loop is sketched in code below).
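
A minimal sketch of this loop written out explicitly; the four-point regression dataset and `LinearRegression` model mirror the Scikit-Learn example further below and are purely illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny illustrative dataset (same values as the Scikit-Learn example below)
X = np.array([[1], [2], [3], [4]])
y = np.array([2, 3.9, 6.1, 8.2])

errors = []
for i in range(len(X)):
    # Hold out observation i, train on the remaining n-1 samples
    mask = np.arange(len(X)) != i
    model = LinearRegression().fit(X[mask], y[mask])
    prediction = model.predict(X[i].reshape(1, -1))[0]
    errors.append((y[i] - prediction) ** 2)  # squared error on the held-out point

print(f"Squared error per iteration: {errors}")
print(f"LOOCV estimate (average MSE): {np.mean(errors):.4f}")
```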

## 2. Mathematical Representation

The LOOCV estimate of the test error is the average of these $n$ test errors:

$$
CV_{(n)} = \frac{1}{n} \sum_{i=1}^{n} Err_i
$$

Where $Err_i$ is the error (e.g., Mean Squared Error or Misclassification) calculated on the $i^{th}$ observation when the model was fit using all data except that observation.

```mermaid
graph TB
TITLE["$$\text{Leave-One-Out Cross-Validation (LOOCV)}$$"]

%% Dataset
TITLE --> DATA["$$\text{Dataset with } n \text{ Observations}$$"]

%% Leaving One Out
DATA --> L1["$$\text{Hold Out Observation } 1$$"]
DATA --> L2["$$\text{Hold Out Observation } 2$$"]
DATA --> Li["$$\text{Hold Out Observation } i$$"]
DATA --> Ln["$$\text{Hold Out Observation } n$$"]

%% Training
L1 --> T1["$$\text{Train on } n-1 \text{ samples}$$"]
L2 --> T2["$$\text{Train on } n-1 \text{ samples}$$"]
Li --> Ti["$$\text{Train on } n-1 \text{ samples}$$"]
Ln --> Tn["$$\text{Train on } n-1 \text{ samples}$$"]

%% Error Computation
T1 --> E1["$$Err_1$$"]
T2 --> E2["$$Err_2$$"]
Ti --> Ei["$$Err_i$$"]
Tn --> En["$$Err_n$$"]

%% Averaging Errors
E1 --> AVG["$$CV_{(n)} = \frac{1}{n}\sum_{i=1}^{n} Err_i$$"]
E2 --> AVG
Ei --> AVG
En --> AVG

AVG --> EST["$$\text{Estimated Test Error}$$"]

```

## 3. When to Use LOOCV?

### Small Datasets

When you only have 20 or 50 samples, a standard 80/20 split both shrinks an already-tiny training set and leaves a test set too small to give a reliable estimate. LOOCV lets you use $n-1$ samples for training, maximizing the model's ability to learn the underlying patterns.

### Bias vs. Variance

* **Low Bias:** Since we use almost all the data for training in each step, the model behaves very similarly to how it would if trained on the full dataset.
* **High Variance:** Because the training sets in each iteration are almost identical (overlapping by $n-2$ samples), the outputs are highly correlated. This can lead to a higher variance in the final error estimate compared to K-Fold.

## 4. Implementation with Scikit-Learn

```python
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LinearRegression
import numpy as np

# 1. Initialize data and model
X = np.array([[1], [2], [3], [4]])
y = np.array([2, 3.9, 6.1, 8.2])
model = LinearRegression()

# 2. Initialize LOOCV
loo = LeaveOneOut()

# 3. Perform Cross-Validation
# This will run 4 times because we have 4 samples
scores = cross_val_score(model, X, y, cv=loo, scoring='neg_mean_squared_error')

print(f"MSE for each iteration: {np.abs(scores)}")
print(f"Average MSE: {np.abs(scores).mean():.4f}")

```

## 5. LOOCV vs. K-Fold Cross-Validation

| Feature | LOOCV | K-Fold ($K=10$) |
| --- | --- | --- |
| **Computations** | $N$ (Total samples) | 10 |
| **Computational Cost** | Very High | Moderate |
| **Bias** | Extremely Low | Higher than LOOCV |
| **Variance** | High | Low |
| **Best For** | Small datasets ($N < 100$) | Large/Standard datasets |


## 6. The "Shortcut" for Linear Regression

For certain models, like **Linear Regression**, you don't actually have to train the model $n$ times. There is a mathematical identity that allows you to calculate the LOOCV error from a single model fit:

$$
CV_{(n)} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{1 - h_i} \right)^2
$$

Where $h_i$ is the leverage of observation $i$ (the $i^{th}$ diagonal element of the hat matrix) and $\hat{y}_i$ is the fitted value from the model trained on **all** $n$ observations. This makes LOOCV as fast as a single training session for linear models!
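
A minimal sketch of this identity using plain NumPy (the design matrix with an intercept column is an assumption of ordinary least squares); with the toy data from section 4, the result should match the average MSE produced by the explicit `LeaveOneOut` loop:

```python
import numpy as np

# Same toy data as the Scikit-Learn example above
X = np.array([[1], [2], [3], [4]], dtype=float)
y = np.array([2, 3.9, 6.1, 8.2])

# Ordinary least squares with an intercept column
A = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
y_hat = A @ beta

# Leverages h_i are the diagonal of the hat matrix H = A (A^T A)^{-1} A^T
H = A @ np.linalg.inv(A.T @ A) @ A.T
h = np.diag(H)

# LOOCV error from a single fit, via the shortcut formula
cv_n = np.mean(((y - y_hat) / (1 - h)) ** 2)
print(f"LOOCV MSE via the leverage shortcut: {cv_n:.4f}")
```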

## References

* **An Introduction to Statistical Learning (ISLR):** Chapter 5.1.2 covers LOOCV in depth.
* **Scikit-Learn:** [LeaveOneOut Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.LeaveOneOut.html)

---

**LOOCV is great for small data, but what if your classes are unbalanced (e.g., 99% vs 1%)? Standard LOOCV might struggle to capture the minority class.**
---
title: Train-Test Split
sidebar_label: Train-Test Split
description: "Mastering the data partitioning process to ensure unbiased model evaluation."
tags: [machine-learning, model-evaluation, training, testing, generalization]
---

The **Train-Test Split** is a technique used to evaluate the performance of a machine learning algorithm. It involves taking your primary dataset and partitioning it into two separate subsets: one to build the model and another to validate its predictions.

## 1. Why do we split data?

In Machine Learning, we don't care how well a model remembers the past; we care how well it predicts the **future**.

If we train our model on the *entire* dataset, we have no way of knowing if the model actually learned the underlying patterns or if it simply memorized the noise in that specific data. Evaluating on the same data used for training is a "cardinal sin": the score says nothing about generalization, and more broadly, any situation where information from the evaluation data influences training is known as **Data Leakage**.

## 2. The Partitioning Logic

Typically, the data is split into two (or sometimes three) parts:

1. **Training Set (70-80%):** This is the data used by the algorithm to learn the relationships between features and targets.
2. **Test Set (20-30%):** This data is kept in a "vault." The model never sees it during training. It is used only at the very end to provide an unbiased evaluation.

```mermaid
graph TB
TITLE["$$\text{Data Partitioning Logic}$$"]

%% Full Dataset
TITLE --> DATA["$$\text{Full Dataset (100\%)}$$"]

%% Split
DATA --> TRAIN["$$\text{Training Set}$$<br/>$$70\% \text{ to } 80\%$$"]
DATA --> TEST["$$\text{Test Set}$$<br/>$$20\% \text{ to } 30\%$$"]

%% Training Path
TRAIN --> LEARN["$$\text{Model Learning}$$"]
LEARN --> FIT["$$\text{Learns Patterns and Relationships}$$"]

%% Test Path
TEST --> VAULT["$$\text{Evaluation Vault}$$"]
VAULT --> LOCK["$$\text{Never Seen During Training}$$"]
LOCK --> EVAL["$$\text{Final Unbiased Evaluation}$$"]

%% Emphasis
FIT -.->|"$$\text{Training Only}$$"| TRAIN
EVAL -.->|"$$\text{Used Once at the End}$$"| TEST

```

## 3. Important Considerations

### Randomness and Reproducibility

When splitting data, we use a random process. However, for scientific consistency, we use a **Random State** (seed). This ensures that every time you run your code, you get the exact same split, making your experiments reproducible.

### Stratification

If you are working with imbalanced classes (e.g., 90% "Healthy", 10% "Sick"), a simple random split might accidentally put all the "Sick" cases in the training set and none in the test set.
**Stratified Splitting** ensures that the proportion of classes is preserved in both the training and testing subsets.
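
A minimal sketch of the effect, using made-up labels with a 90/10 imbalance (the data and the `random_state` value are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 90% class 0, 10% class 1
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

# Plain random split: the minority share in the test set can drift
_, _, _, y_test_plain = train_test_split(X, y, test_size=0.2, random_state=0)

# Stratified split: the 90/10 ratio is preserved in the test set
_, _, _, y_test_strat = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)

print(f"Minority share without stratify: {y_test_plain.mean():.2f}")
print(f"Minority share with stratify:    {y_test_strat.mean():.2f}")
```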

## 4. Implementation with Scikit-Learn

```python
from sklearn.model_selection import train_test_split

# Assume X contains features and y contains the target
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,    # 20% for testing
    random_state=42,  # For reproducibility
    stratify=y        # Preserve the class proportions
)

print(f"Training samples: {len(X_train)}")
print(f"Testing samples: {len(X_test)}")

```

## 5. Pros and Cons

| Advantages | Disadvantages |
| --- | --- |
| **Simplicity:** Very easy to understand and implement. | **High Variance:** If the dataset is small, a different random split can lead to very different results. |
| **Speed:** Fast to compute, as the model is only trained once. | **Waste of Data:** A portion of your valuable data is never used to train the model. |
| **Standard Practice:** The universal starting point for any ML project. | **Not for Time-Series:** Random splitting ruins data where order matters (e.g., Stock prices). |

## References

* **Scikit-Learn:** [train_test_split Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)
* **Google ML Crash Course:** [Splitting Data](https://developers.google.com/machine-learning/crash-course/training-and-test-sets/splitting-data)

---

**A single split is a good start, but what if your "random" test set happens to be particularly easy or hard? To solve this, we use a more robust technique.**