---
title: Decision Trees
sidebar_label: Decision Trees
description: "Understanding recursive partitioning, Entropy, Gini Impurity, and how to prevent overfitting in tree-based models."
tags: [machine-learning, supervised-learning, classification, decision-trees, cart]
---

A **Decision Tree** is a non-parametric supervised learning method used for both classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.

Think of a Decision Tree as a flow chart where each internal node represents a "test" on an attribute, each branch represents the outcome of the test, and each leaf node represents a class label.

## 1. Anatomy of a Tree

* **Root Node:** The very top node that represents the entire dataset. It is the first split.
* **Internal Node:** A point where the data is split based on a specific feature.
* **Leaf Node:** A terminal node that holds the final prediction. No further splits occur here.
* **Branches:** The paths connecting nodes based on the outcome of a decision.

## 2. How the Tree Decides to Split

The algorithm aims to split the data into subsets that are as "pure" as possible. A subset is pure if all data points in it belong to the same class.

### Gini Impurity

This is the default splitting criterion in Scikit-Learn. It measures how often a randomly chosen sample from a node would be misclassified if it were labeled at random according to the class distribution in that node.

$$
Gini = 1 - \sum_{i=1}^{n} (p_i)^2
$$

**Where:**

* $p_i$ is the proportion of samples in the node that belong to class $i$.
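
Gini impurity is easy to verify by hand. The sketch below is a minimal illustration with made-up class counts (not Scikit-Learn's internal code):

```python
import numpy as np

def gini_impurity(class_counts):
    """Gini impurity of a node, given the number of samples per class."""
    p = np.asarray(class_counts, dtype=float)
    p = p / p.sum()                # class proportions p_i
    return 1.0 - np.sum(p ** 2)

print(gini_impurity([10, 0]))      # 0.0 -> a pure node has zero impurity
print(gini_impurity([5, 5]))       # 0.5 -> a perfectly mixed binary node is worst
```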

### Information Gain (Entropy)

Entropy, a concept from information theory, measures the "disorder" or uncertainty in the data. **Information Gain** is the reduction in entropy produced by a split; at each node, the tree picks the split that yields the largest gain.

$$
H(S) = -\sum_{i=1}^{n} p_i \log_2(p_i)
$$

**Where:**

* $p_i$ is the proportion of instances in class $i$.
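
Entropy alone only scores a single node. To compare candidate splits, the gain is usually written as the parent's entropy minus the weighted entropy of the children:

$$
IG(S, A) = H(S) - \sum_{v \in \text{values}(A)} \frac{|S_v|}{|S|} H(S_v)
$$

where $S_v$ is the subset of $S$ for which feature $A$ takes value $v$. The sketch below is a small illustration with made-up class counts, not library code:

```python
import numpy as np

def entropy(class_counts):
    """Shannon entropy H(S) of a node, given the number of samples per class."""
    p = np.asarray(class_counts, dtype=float)
    p = p[p > 0] / p.sum()         # drop empty classes, normalize to proportions
    return -np.sum(p * np.log2(p))

def information_gain(parent_counts, children_counts):
    """Reduction in entropy achieved by splitting the parent into the given children."""
    total = sum(sum(child) for child in children_counts)
    children_entropy = sum(sum(child) / total * entropy(child) for child in children_counts)
    return entropy(parent_counts) - children_entropy

# Parent node: 10 vs. 10 samples (entropy = 1 bit).
# This candidate split produces two fairly pure children, so the gain is high.
print(information_gain([10, 10], [[9, 1], [1, 9]]))   # ~0.53
```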

## 3. The Problem of Overfitting

Decision Trees are notorious for **overfitting**. Left unchecked, a tree will continue to split until every single data point has its own leaf, essentially "memorizing" the training data rather than finding patterns.

**How to stop the tree from growing too much** (the sketch after this list shows how these map onto Scikit-Learn parameters):
* **max_depth:** Limit how "tall" the tree can get.
* **min_samples_split:** The minimum number of samples required to split an internal node.
* **min_samples_leaf:** The minimum number of samples required to be at a leaf node.
* **Pruning:** Removing branches that provide little power to classify instances.
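
A minimal sketch of how these constraints map onto Scikit-Learn's `DecisionTreeClassifier` (the Iris data and the numeric values are illustrative placeholders, not tuned recommendations):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)   # any labeled dataset works here

# Pre-pruning: stop the tree from growing too far in the first place
constrained = DecisionTreeClassifier(
    max_depth=4,            # limit how "tall" the tree can get
    min_samples_split=20,   # a node needs at least 20 samples before it may be split
    min_samples_leaf=10,    # every leaf must keep at least 10 samples
    random_state=42,
).fit(X, y)

# Post-pruning: cost-complexity pruning removes branches that add little value.
# cost_complexity_pruning_path suggests candidate ccp_alpha values to try.
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X, y)
pruned = DecisionTreeClassifier(ccp_alpha=path.ccp_alphas[-2],   # a heavily pruned tree near the end of the path
                                random_state=42).fit(X, y)
```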

```mermaid
graph LR
    X["$$X$$ (Training Data)"] --> ODT["Overfitted Decision Tree"]

    ODT --> O1["$$\text{Very Deep Tree}$$"]
    O1 --> O2["$$\text{Many Splits}$$"]
    O2 --> O3["$$\text{Memorizes Noise}$$"]
    O3 --> O4["$$\text{Low Bias,\ High Variance}$$"]
    O4 --> O5["$$\text{Training Accuracy} \approx 100\%$$"]
    O5 --> O6["$$\text{Poor Generalization}$$"]

    X --> PDT["Pruned Decision Tree"]

    PDT --> P1["$$\text{Limited Depth}$$"]
    P1 --> P2["$$\text{Fewer Splits}$$"]
    P2 --> P3["$$\text{Removes Irrelevant Branches}$$"]
    P3 --> P4["$$\text{Balanced Bias–Variance}$$"]
    P4 --> P5["$$\text{Better Test Accuracy}$$"]
    P5 --> P6["$$\text{Good Generalization}$$"]

    O6 -.->|"$$\text{Comparison}$$"| P6
```

In this diagram, we see two paths from the same training data: one leading to an overfitted decision tree and the other to a pruned decision tree. The overfitted tree has very low bias but high variance, resulting in nearly perfect training accuracy but poor generalization to new data. In contrast, the pruned tree balances bias and variance, leading to better test accuracy and generalization.
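
To see this in numbers, the sketch below (synthetic data and illustrative settings) trains an unconstrained tree and a depth-limited tree on the same data. Exact figures will vary, but the unconstrained tree typically scores near 100% on the training set while generalizing worse than the constrained one:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# A synthetic, slightly noisy classification problem
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Unconstrained tree: grows until the training data is memorized
overfit = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Constrained ("pruned") tree: limited depth and a minimum leaf size
pruned = DecisionTreeClassifier(max_depth=5, min_samples_leaf=20,
                                random_state=42).fit(X_train, y_train)

for name, model in [("overfitted", overfit), ("pruned", pruned)]:
    print(f"{name}: train = {model.score(X_train, y_train):.2f}, "
          f"test = {model.score(X_test, y_test):.2f}")
```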

## 4. Implementation with Scikit-Learn

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

# Example data: the Iris dataset stands in for any feature matrix and labels
iris = load_iris()
feature_cols = iris.feature_names
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# 1. Initialize with constraints to prevent overfitting
model = DecisionTreeClassifier(max_depth=3, criterion='gini')

# 2. Train
model.fit(X_train, y_train)

# 3. Visualize the Tree
plt.figure(figsize=(12, 8))
plot_tree(model, filled=True, feature_names=feature_cols)
plt.show()
```
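
As a follow-up to the snippet above (it reuses `model`, `X_test`, `y_test`, and `feature_cols` defined there), the fitted tree can be scored on held-out data and its learned rules printed as plain text, which is what makes trees so easy to explain:

```python
from sklearn.tree import export_text

# Accuracy on data the tree has never seen
print("Test accuracy:", model.score(X_test, y_test))

# Human-readable if/else rules learned by the tree
print(export_text(model, feature_names=feature_cols))
```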

## 5. Pros and Cons

| Advantages | Disadvantages |
| --- | --- |
| **Interpretable:** Easy to explain to non-technical stakeholders. | **High Variance:** Small changes in data can result in a completely different tree. |
| **No Scaling Required:** Does not require feature normalization or standardization. | **Overfitting:** Extremely prone to capturing noise in the data. |
| Handles both numerical and categorical data. | **Bias:** Can create biased trees if some classes dominate. |

## 6. Visualising a Decision Tree

```mermaid
graph TD
    A["Is Income > $50k?"] -->|Yes| B["Is Credit Score > 700?"]
    A -->|No| C["Reject Loan"]
    B -->|Yes| D["Approve Loan"]
    B -->|No| E["Reject Loan"]

    style A fill:#f3e5f5,stroke:#7b1fa2,color:#333
    style D fill:#e8f5e9,stroke:#2e7d32,color:#333
    style C fill:#ffebee,stroke:#c62828,color:#333
    style E fill:#ffebee,stroke:#c62828,color:#333
```

## References for More Details

* **[Scikit-Learn Tree Module](https://scikit-learn.org/stable/modules/tree.html):** Understanding the algorithmic implementation (CART).

---

**A single Decision Tree is an unstable, high-variance learner. To build a truly robust model, we combine hundreds of trees into a "forest."**