---
title: Decision Trees
sidebar_label: Decision Trees
description: "Understanding recursive partitioning, Entropy, Gini Impurity, and how to prevent overfitting in tree-based models."
tags: [machine-learning, supervised-learning, classification, decision-trees, cart]
---

A **Decision Tree** is a non-parametric supervised learning method used for both classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.

Think of a Decision Tree as a flow chart where each internal node represents a "test" on an attribute, each branch represents the outcome of that test, and each leaf node holds the prediction: a class label for classification or a numeric value for regression.

## 1. Anatomy of a Tree

* **Root Node:** The very top node, representing the entire dataset. The first split happens here.
* **Internal Node:** A point where the data is split based on a specific feature.
* **Leaf Node:** The final output nodes that contain the prediction. No further splits occur here.
* **Branches:** The paths connecting nodes based on the outcome of a decision.

## 2. How the Tree Decides to Split

The algorithm aims to split the data into subsets that are as "pure" as possible. A subset is pure if all data points in it belong to the same class.

### Gini Impurity

This is the default splitting criterion in Scikit-Learn. It measures the probability of misclassifying a randomly chosen sample if it were labeled at random according to the class distribution in the node.

$$
Gini = 1 - \sum_{i=1}^{n} (p_i)^2
$$

**Where:**

* $p_i$ is the proportion of samples in the node that belong to class $i$.

### Information Gain (Entropy)

Entropy, rooted in information theory, measures the "disorder" or uncertainty in the data. **Information Gain** is the reduction in entropy produced by a split; the tree chooses the split with the highest gain.

$$
H(S) = -\sum_{i=1}^{n} p_i \log_2(p_i)
$$

**Where:**

* $p_i$ is the proportion of instances in class $i$.
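
To make the two criteria concrete, here is a minimal sketch (our own helper functions, not part of Scikit-Learn) that computes the Gini impurity and entropy of a single node from its class counts:

```python
import numpy as np

def gini_impurity(class_counts):
    # Gini = 1 - sum(p_i^2), where p_i are the class proportions in the node
    p = np.asarray(class_counts, dtype=float)
    p = p / p.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(class_counts):
    # H(S) = -sum(p_i * log2(p_i)); empty classes contribute nothing
    p = np.asarray(class_counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# A node with 8 samples of class A and 2 of class B
print(gini_impurity([8, 2]))   # 0.32
print(entropy([8, 2]))         # ~0.72

# A perfectly pure node scores 0 under both metrics
print(gini_impurity([10, 0]))  # 0.0
print(entropy([10, 0]))        # 0.0
```

The tree evaluates each candidate split by the weighted impurity of the resulting child nodes and picks the split that lowers it the most.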

## 3. The Problem of Overfitting

Decision Trees are notorious for **overfitting**. Left unchecked, a tree will continue to split until every single data point has its own leaf, essentially "memorizing" the training data rather than finding patterns.

**How to stop the tree from growing too much:**
* **max_depth:** Limit how "tall" the tree can get.
* **min_samples_split:** The minimum number of samples required to split an internal node.
* **min_samples_leaf:** The minimum number of samples required to be at a leaf node.
* **Pruning:** Removing branches that provide little power to classify instances.

```mermaid
graph LR
X["$$X$$ (Training Data)"] --> ODT["Overfitted Decision Tree"]

ODT --> O1["$$\text{Very Deep Tree}$$"]
O1 --> O2["$$\text{Many Splits}$$"]
O2 --> O3["$$\text{Memorizes Noise}$$"]
O3 --> O4["$$\text{Low Bias,\ High Variance}$$"]
O4 --> O5["$$\text{Training Accuracy} \approx 100\%$$"]
O5 --> O6["$$\text{Poor Generalization}$$"]

X --> PDT["Pruned Decision Tree"]

PDT --> P1["$$\text{Limited Depth}$$"]
P1 --> P2["$$\text{Fewer Splits}$$"]
P2 --> P3["$$\text{Removes Irrelevant Branches}$$"]
P3 --> P4["$$\text{Balanced Bias–Variance}$$"]
P4 --> P5["$$\text{Better Test Accuracy}$$"]
P5 --> P6["$$\text{Good Generalization}$$"]

O6 -.->|"$$\text{Comparison}$$"| P6
```

In this diagram, we see two paths from the same training data: one leading to an overfitted decision tree and the other to a pruned decision tree. The overfitted tree has very low bias but high variance, resulting in nearly perfect training accuracy but poor generalization to new data. In contrast, the pruned tree balances bias and variance, leading to better test accuracy and generalization.
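
The gap is easy to reproduce. The sketch below (synthetic data and illustrative settings only) trains an unconstrained tree next to a depth-limited one; the unconstrained tree typically reaches near-perfect training accuracy but a noticeably lower test score:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some label noise, so memorizing the training set hurts
X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for name, tree in [
    ("Unconstrained", DecisionTreeClassifier(random_state=0)),
    ("Constrained  ", DecisionTreeClassifier(max_depth=4, min_samples_leaf=5, random_state=0)),
]:
    tree.fit(X_train, y_train)
    print(f"{name} | train: {tree.score(X_train, y_train):.2f} | test: {tree.score(X_test, y_test):.2f}")
```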

## 4. Implementation with Scikit-Learn

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

# 0. Example data (Iris), split into train and test sets
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)
feature_cols = iris.feature_names

# 1. Initialize with constraints to prevent overfitting
model = DecisionTreeClassifier(max_depth=3, criterion='gini')

# 2. Train
model.fit(X_train, y_train)

# 3. Visualize the tree
plt.figure(figsize=(12, 8))
plot_tree(model, filled=True, feature_names=feature_cols, class_names=list(iris.target_names))
plt.show()
```

## 5. Pros and Cons

| Advantages | Disadvantages |
| --- | --- |
| **Interpretable:** Easy to explain to non-technical stakeholders. | **High Variance:** Small changes in data can result in a completely different tree. |
| **No Scaling Required:** Does not require feature normalization or standardization. | **Overfitting:** Extremely prone to capturing noise in the data. |
| Handles both numerical and categorical data. | **Bias:** Can create biased trees if some classes dominate. |

## 6. Visualising a Decision Path

```mermaid
graph TD
A[Is Income > $50k?] -->|Yes| B[Is Credit Score > 700?]
A -->|No| C[Reject Loan]
B -->|Yes| D[Approve Loan]
B -->|No| E[Reject Loan]

style A fill:#f3e5f5,stroke:#7b1fa2,color:#333
style D fill:#e8f5e9,stroke:#2e7d32,color:#333
style C fill:#ffebee,stroke:#c62828,color:#333
style E fill:#ffebee,stroke:#c62828,color:#333

```

## References for More Details

* **[Scikit-Learn Tree Module](https://scikit-learn.org/stable/modules/tree.html):** Understanding the algorithmic implementation (CART).

---

**On their own, Decision Trees are unstable, high-variance learners. To build a truly robust model, we combine hundreds of trees into a "forest."**
---
title: "Gradient Boosting: Learning from Mistakes"
sidebar_label: Gradient Boosting
description: "Exploring the power of Sequential Ensemble Learning, Gradient Descent, and popular frameworks like XGBoost and LightGBM."
tags: [machine-learning, supervised-learning, classification, boosting, xgboost]
---

**Gradient Boosting** is an ensemble technique that builds models sequentially. Unlike [Random Forest](./random-forest), which builds trees independently in parallel, Gradient Boosting builds one tree at a time, where each new tree attempts to correct the errors (residuals) made by the previous trees.

## 1. How Boosting Works

The core idea is **Additive Modeling**. We start with a very simple model and keep adding "corrective" models until the error is minimized.

1. **Base Model:** Start with a simple prediction (usually the mean of the target values).
2. **Calculate Residuals:** Find the difference between the actual values and the current prediction.
3. **Train on Errors:** Fit a new "weak" decision tree to predict those residuals (errors), not the actual target.
4. **Update Prediction:** Add the new tree's prediction to the previous model's prediction.
5. **Repeat:** Continue this process for $N$ iterations.

## 2. Gradient Descent in Boosting

Gradient Boosting gets its name because it uses the **Gradient Descent** algorithm to minimize the loss function.

In each step, the algorithm asks in which direction the loss decreases most rapidly (the negative gradient, which for squared error is simply the residual) and adds a new tree that moves the model in that direction.

$$
F_{m}(x) = F_{m-1}(x) + \nu \cdot h_m(x)
$$

* $F_{m}(x)$: The updated model.
* $F_{m-1}(x)$: The model from the previous step.
* $\nu$ (Nu): The **Learning Rate** (Shrinkage). It scales the contribution of each tree to prevent overfitting.
* $h_m(x)$: The new tree trained on residuals.
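
For intuition, here is a minimal from-scratch sketch of this update rule for regression with squared loss, where the negative gradient is just the residual. The function names are our own and purely illustrative; this is not how a production library implements boosting.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_fit(X, y, n_estimators=100, learning_rate=0.1, max_depth=3):
    # F_0(x): the base model is simply the mean of the targets
    f0 = y.mean()
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(n_estimators):
        residuals = y - pred                             # negative gradient of squared loss
        h = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        pred = pred + learning_rate * h.predict(X)       # F_m = F_{m-1} + nu * h_m
        trees.append(h)
    return f0, trees

def boost_predict(X, f0, trees, learning_rate=0.1):
    pred = np.full(X.shape[0], f0)
    for h in trees:
        pred = pred + learning_rate * h.predict(X)
    return pred
```

Each tree only nudges the running prediction; the learning rate $\nu$ keeps any single tree from dominating, which is why lower rates usually pair with more trees.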

## 3. Key Hyperparameters

* **learning_rate:** Determines how much each tree contributes to the final result. Lower values usually require more trees but lead to better generalization.
* **n_estimators:** The number of sequential trees to be modeled.
* **subsample:** The fraction of samples to be used for fitting the individual base learners. Using less than 1.0 leads to **Stochastic Gradient Boosting**.
* **max_depth:** Limits the complexity of each individual tree (usually kept shallow, e.g., 3-5).
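
Because `learning_rate` and `n_estimators` pull against each other, they are usually tuned together. Below is a small sketch of a cross-validated grid search; the synthetic data and grid values are placeholders, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic example data, only to make the sketch runnable
X, y = make_classification(n_samples=500, n_features=20, random_state=42)

param_grid = {
    "learning_rate": [0.01, 0.05, 0.1],
    "n_estimators": [100, 300],
    "max_depth": [2, 3],      # each weak learner stays shallow
    "subsample": [0.8, 1.0],  # < 1.0 switches to stochastic gradient boosting
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid,
    cv=5,
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```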

## 4. Popular Implementations

While Scikit-Learn has a `GradientBoostingClassifier`, the data science community often uses specialized libraries for better speed and performance:

1. **XGBoost (Extreme Gradient Boosting):** Optimized for speed and performance; includes built-in regularization.
2. **LightGBM:** Uses a "leaf-wise" growth strategy; extremely fast and memory-efficient for large datasets.
3. **CatBoost:** Specifically designed to handle categorical features automatically without manual encoding.

```mermaid
graph LR
subgraph LGBM["Leaf-wise Tree Growth (LightGBM)"]
A1["Root"] --> B1["Leaf A"]
A1 --> C1["Leaf B"]

B1 --> D1["Split on<br/>$$\Delta\text{Loss}_{max}$$"]
D1 --> E1["Deeper Branch"]

C1 --> F1["Unsplit Leaf"]

E1 --> G1["Complex Boundary<br/>$$\text{Low Bias}$$"]
G1 --> H1["$$\text{Risk of Overfitting}$$"]
end

subgraph XGB["Level-wise Tree Growth (XGBoost)"]
A2["Root"] --> B2["Level 1 – Left"]
A2 --> C2["Level 1 – Right"]

B2 --> D2["Level 2 – Left"]
B2 --> E2["Level 2 – Right"]
C2 --> F2["Level 2 – Left"]
C2 --> G2["Level 2 – Right"]

D2 --> H2["Balanced Tree"]
E2 --> H2
F2 --> H2
G2 --> H2

H2 --> I2["Stable Boundary<br/>$$\text{Lower Variance}$$"]
end

H1 -.->|"$$\text{Bias–Variance Tradeoff}$$"| I2

```
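
For comparison with the Scikit-Learn code below, here is a sketch of the same workflow using XGBoost's scikit-learn-style wrapper (it assumes the `xgboost` package is installed; the parameter values are illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Synthetic example data, only to make the sketch runnable
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

xgb = XGBClassifier(
    n_estimators=300,
    learning_rate=0.05,
    max_depth=4,
    subsample=0.8,    # stochastic boosting
    reg_lambda=1.0,   # built-in L2 regularization
    random_state=42,
)
xgb.fit(X_train, y_train)
print("Test accuracy:", xgb.score(X_test, y_test))
```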

## 5. Implementation with Scikit-Learn

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# 0. Example data (synthetic binary classification), split into train and test sets
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 1. Initialize the Gradient Booster
# Note: learning_rate and n_estimators trade off against each other
gbc = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    random_state=42
)

# 2. Train the model
gbc.fit(X_train, y_train)

# 3. Predict
y_pred = gbc.predict(X_test)
print("Test accuracy:", gbc.score(X_test, y_test))
```

## 6. Pros and Cons

| Advantages | Disadvantages |
| --- | --- |
| **State-of-the-art Accuracy:** Often wins Kaggle competitions for tabular data. | **Sequential Training:** Slower to train than Random Forest because trees cannot be built in parallel. |
| **Flexibility:** Can optimize almost any differentiable loss function. | **Hyperparameter Sensitive:** Requires careful tuning of learning rate and tree counts to avoid overfitting. |
| **Handles Non-linearities:** Captures complex interactions between features. | **Black Box:** Much harder to interpret than a single Decision Tree. |

## References for More Details

* **[XGBoost Documentation](https://xgboost.readthedocs.io/):** Learning about advanced regularization and hardware acceleration.
---
title: "Random Forest: Strength in Numbers"
sidebar_label: Random Forest
description: "Understanding Ensemble Learning, Bagging, and how Random Forests reduce variance to build robust classifiers."
tags: [machine-learning, supervised-learning, classification, ensemble-learning, random-forest]
---

A **Random Forest** is an **Ensemble Learning** method that operates by constructing a multitude of [Decision Trees](./decision-trees) during training. For classification tasks, the output of the random forest is the class selected by most trees (majority voting).

The fundamental philosophy of Random Forest is that **a group of "weak learners" can come together to form a "strong learner."**

## 1. The Core Mechanism: Bagging

Random Forest uses a technique called **Bootstrap Aggregating**, or **Bagging**, to ensure that the trees in the forest are different from one another.

1. **Bootstrapping:** The algorithm creates multiple random subsets of the training data. It does this by sampling with replacement (meaning the same row can appear multiple times in one subset).
2. **Feature Randomness:** When splitting a node, the algorithm doesn't look at *all* available features. Instead, it picks a random subset of features. This ensures the trees aren't all looking at the same "obvious" patterns.
3. **Aggregating:** Each tree makes a prediction. The forest takes all those predictions and picks the most popular one.
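
The toy sketch below (illustrative code only, not the Scikit-Learn internals) shows both sources of randomness followed by majority voting. For simplicity it draws one feature subset per tree, whereas a real Random Forest draws a fresh subset at every split.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
rng = np.random.default_rng(0)

trees, feature_sets = [], []
for _ in range(25):
    rows = rng.choice(len(X), size=len(X), replace=True)   # bootstrap: sample rows with replacement
    cols = rng.choice(X.shape[1], size=3, replace=False)   # feature randomness: a random column subset
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[rows][:, cols], y[rows]))
    feature_sets.append(cols)

# Aggregating: majority vote across the 25 trees (labels are 0/1 here)
votes = np.array([t.predict(X[:, cols]) for t, cols in zip(trees, feature_sets)])
ensemble_pred = (votes.mean(axis=0) > 0.5).astype(int)
print("Ensemble accuracy on the training data:", (ensemble_pred == y).mean())
```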

## 2. Why is Random Forest Better than a Single Tree?

A single Decision Tree is highly sensitive to the specific data it was trained on (High Variance). If you change the data slightly, the tree might look completely different.

Random Forest solves this by **combining many de-correlated trees**. Individual trees still overfit the noise in their own bootstrap samples, but that noise largely cancels out when the votes of 100+ trees are aggregated, leaving the true underlying pattern.

```mermaid
graph LR
subgraph DT["Single Decision Tree"]
A1["$$x_1$$"] --> B1["$$x_2 > t_1$$"]
B1 -->|Yes| C1["Region 1<br/>$$\text{Class A}$$"]
B1 -->|No| D1["$$x_1 > t_2$$"]
D1 -->|Yes| E1["Region 2<br/>$$\text{Class B}$$"]
D1 -->|No| F1["Region 3<br/>$$\text{Class A}$$"]
end

subgraph RF["Random Forest"]
A2["$$x_1$$"] --> T1["Tree 1"]
A2 --> T2["Tree 2"]
A2 --> T3["Tree 3"]

T1 --> R1["$$\text{Boundary}_1$$"]
T2 --> R2["$$\text{Boundary}_2$$"]
T3 --> R3["$$\text{Boundary}_3$$"]

R1 --> V["$$\text{Voting / Averaging}$$"]
R2 --> V
R3 --> V

V --> RFinal["Smooth Combined Boundary<br/>$$\text{Lower Variance}$$"]
end

C1 -.->|"$$\text{Blocky / Axis-Aligned}$$"| RFinal

```

## 3. Key Hyperparameters

* **n_estimators:** The number of trees in the forest. Generally, more trees are better, but they increase computational cost.
* **max_features:** The size of the random subsets of features to consider when splitting a node.
* **bootstrap:** Whether to use bootstrap samples or the entire dataset to build trees.
* **oob_score:** "Out-of-Bag" score. This allows the model to be validated using the data points that were *not* picked during the bootstrapping process for a specific tree.
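
As an example of the out-of-bag idea, the short sketch below (synthetic data, illustrative settings) trains a forest with `oob_score=True` and reads the resulting `oob_score_` attribute, a built-in validation estimate that needs no separate hold-out set:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic example data, only to make the sketch runnable
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

rf = RandomForestClassifier(
    n_estimators=200,
    max_features="sqrt",  # size of the random feature subset at each split
    bootstrap=True,       # required for out-of-bag estimation
    oob_score=True,
    random_state=42,
)
rf.fit(X, y)
print("Out-of-bag accuracy estimate:", rf.oob_score_)
```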

## 4. Feature Importance

One of the most useful features of Random Forest is its ability to tell you which variables mattered most for the predictions. For each feature, it measures how much that feature's splits reduce Gini impurity, averaged over all trees (the "mean decrease in impurity").

## 5. Implementation with Scikit-Learn

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# 0. Example data (Iris), split into train and test sets
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

# 1. Initialize the Forest
# n_estimators=100 is a common starting point
rf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)

# 2. Train the ensemble
rf.fit(X_train, y_train)

# 3. Predict
y_pred = rf.predict(X_test)

# 4. Check Feature Importance (mean decrease in Gini impurity)
importances = rf.feature_importances_
for name, score in sorted(zip(iris.feature_names, importances), key=lambda t: -t[1]):
    print(f"{name}: {score:.3f}")
```

## 6. Pros and Cons

| Advantages | Disadvantages |
| --- | --- |
| **Robustness:** Highly resistant to overfitting compared to single trees. | **Complexity:** Harder to visualize or explain than a single tree (the "Black Box" problem). |
| **Handles Missing Data:** Can maintain accuracy even when a large proportion of data is missing. | **Performance:** Can be slow to train on very large datasets with thousands of trees. |
| **No Scaling Needed:** Like Decision Trees, it is scale-invariant. | **Size:** The model files can become quite large in memory. |

## References for More Details

* **[Scikit-Learn Ensemble Module](https://scikit-learn.org/stable/modules/ensemble.html#forests-of-randomized-trees):** Learning about variations like `ExtraTreesClassifier`.

---

**Random Forests use "Bagging" to build trees in parallel. But what if we built trees one after another, with each tree learning from the mistakes of the previous one?**