diff --git a/docs/machine-learning/machine-learning-core/model-evaluation/metrics/accuracy.mdx b/docs/machine-learning/machine-learning-core/model-evaluation/metrics/accuracy.mdx index e69de29..dce4642 100644 --- a/docs/machine-learning/machine-learning-core/model-evaluation/metrics/accuracy.mdx +++ b/docs/machine-learning/machine-learning-core/model-evaluation/metrics/accuracy.mdx @@ -0,0 +1,128 @@ +--- +title: "Accuracy: The Intuitive Metric" +sidebar_label: Accuracy +description: "Understanding the most common evaluation metric, its formula, and its fatal flaws in imbalanced datasets." +tags: [machine-learning, model-evaluation, metrics, classification] +--- + +**Accuracy** is the most basic and intuitive metric used to evaluate a classification model. In simple terms, it answers the question: *"Out of all the predictions made, how many were correct?"* + +## 1. The Mathematical Formula + +Accuracy is calculated by dividing the number of correct predictions by the total number of input samples. + +Using the components of a [Confusion Matrix](./confusion-matrix), the formula is: + +$$ +\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} +$$ + +Where: + +* **TP (True Positives):** Correctly predicted positive samples. +* **TN (True Negatives):** Correctly predicted negative samples. +* **FP (False Positives):** Incorrectly predicted as positive. +* **FN (False Negatives):** Incorrectly predicted as negative. + +**Example:** + +Imagine you have a dataset of 100 emails, where 80 are spam and 20 are not spam. Your model makes the following predictions: + +| Actual \ Predicted | Spam | Not Spam | +| --- | --- | --- | +| **Spam** | 70 (TP) | 10 (FN) | +| **Not Spam** | 5 (FP) | 15 (TN) | + +Using the formula: + +$$ +\text{Accuracy} = \frac{70 + 15}{70 + 15 + 5 + 10} = \frac{85}{100} = 0.85 \text{ or } 85\% +$$ + +This means your model correctly identified 85% of the emails. + +## 2. When Accuracy Works Best + +Accuracy is a reliable metric **only** when your dataset is **balanced**. + +* **Example:** You are building a model to classify images as either "Cats" or "Dogs." Your dataset has 500 cats and 500 dogs. +* If your model gets an accuracy of 90%, you can be confident that it is performing well across both categories. + +## 3. The "Accuracy Paradox" (Imbalanced Data) + +Accuracy becomes highly misleading when one class significantly outweighs the other. This is known as the **Accuracy Paradox**. + +### The Scenario: + +Imagine a Rare Disease test where only **1%** of the population is actually sick. + +1. If a "lazy" model is programmed to simply say **"Healthy"** for every single patient... +2. It will be **99% accurate**. + +```mermaid +graph LR + POP["$$\text{Population (100\%)}$$"] + + POP --> H["$$99\% \ \text{Healthy}$$"] + POP --> S["$$1\% \ \text{Sick (Rare Disease)}$$"] + + %% Lazy Model + H --> PH["$$\text{Predicted: Healthy}$$"] + S --> PS["$$\text{Predicted: Healthy}$$"] + + PH --> ACC1["$$\text{True Negatives (99\%)}$$"] + PS --> ERR1["$$\text{False Negatives (1\%)}$$"] + + ACC1 --> MET["$$\text{Accuracy} = \frac{99}{100} = 99\%$$"] + + ERR1 --> FAIL["$$\text{❌ All Sick Patients Missed}$$"] + + MET -.->|"$$\text{Accuracy Paradox}$$"| FAIL + +``` + +**The problem?** Even though the accuracy is 99%, the model failed to find the 1% of people who actually need help. In high-stakes fields like medicine or fraud detection, accuracy is often the least important metric. + +## 4. 
Implementation with Scikit-Learn + +```python +from sklearn.metrics import accuracy_score + +# Actual target values +y_true = [0, 1, 1, 0, 1, 1] + +# Model predictions +y_pred = [0, 1, 0, 0, 1, 1] + +# Calculate Accuracy +score = accuracy_score(y_true, y_pred) + +print(f"Accuracy: {score * 100:.2f}%") +# Output: Accuracy: 83.33% + +``` + +## 5. Pros and Cons + +| Advantages | Disadvantages | +| --- | --- | +| **Simple to understand:** Easy to explain to non-technical stakeholders. | **Useless for Imbalance:** Can hide poor performance on minority classes. | +| **Single Number:** Provides a quick, high-level overview of model health. | **Ignores Probability:** Doesn't tell you how confident the model was in its choice. | +| **Standardized:** Used across almost every classification project. | **Cost Blind:** Treats "False Positives" and "False Negatives" as equally bad. | + +## 6. How to move beyond Accuracy? + +To get a true picture of your model's performance—especially if your data is "skewed"—you should look at Accuracy alongside: + +* **Precision:** How many of the predicted positives were actually positive? +* **Recall:** How many of the actual positives did we successfully find? +* **F1-Score:** The harmonic mean of Precision and Recall. + +## References + +* **Google Developers:** [Classification: Accuracy](https://developers.google.com/machine-learning/crash-course/classification/accuracy) +* **StatQuest:** [Accuracy, Precision, and Recall](https://www.youtube.com/watch?v=Kdsp6soqA7o) + +--- + +**If Accuracy isn't enough to catch rare diseases or credit card fraud, what is?** Stay tuned for our next chapter on **Precision & Recall** to find out! \ No newline at end of file diff --git a/docs/machine-learning/machine-learning-core/model-evaluation/metrics/confusion-matrix.mdx b/docs/machine-learning/machine-learning-core/model-evaluation/metrics/confusion-matrix.mdx index e69de29..63deaff 100644 --- a/docs/machine-learning/machine-learning-core/model-evaluation/metrics/confusion-matrix.mdx +++ b/docs/machine-learning/machine-learning-core/model-evaluation/metrics/confusion-matrix.mdx @@ -0,0 +1,143 @@ +--- +title: The Confusion Matrix +sidebar_label: Confusion Matrix +description: "The foundation of classification evaluation: True Positives, False Positives, True Negatives, and False Negatives." +tags: [machine-learning, model-evaluation, metrics, classification, confusion-matrix] +--- + +A **Confusion Matrix** is a table used to describe the performance of a classification model. While "Accuracy" tells you how often the model is correct, the Confusion Matrix tells you exactly **how** it is failing and which classes are being swapped. + +## 1. The 2x2 Layout + +For a binary classification (Yes/No, Spam/Ham), the matrix consists of four quadrants: + +| | Predicted: **Negative** | Predicted: **Positive** | +| :--- | :--- | :--- | +| **Actual: Negative** | **True Negative (TN)** | **False Positive (FP)** | +| **Actual: Positive** | **False Negative (FN)** | **True Positive (TP)** | + +### Breaking Down the Quadrants: +* **True Positive (TP):** You predicted positive, and it was true. (e.g., You predicted a patient has cancer, and they do). +* **True Negative (TN):** You predicted negative, and it was true. (e.g., You predicted a patient is healthy, and they are). +* **False Positive (FP):** You predicted positive, but it was false. (Also known as a **Type I Error** or a "False Alarm"). +* **False Negative (FN):** You predicted negative, but it was positive. 
(Also known as a **Type II Error** or a "Miss"). + +## 2. Type I vs. Type II Errors + +The "cost" of these errors depends entirely on your specific problem. + +```mermaid +graph TB + TITLE["$$\text{Type I vs. Type II Errors}$$"] + + %% Ground Truth + TITLE --> TRUTH["$$\text{Actual Condition}$$"] + TRUTH --> POS["$$\text{Positive (Condition Present)}$$"] + TRUTH --> NEG["$$\text{Negative (Condition Absent)}$$"] + + %% Model Decisions + POS --> TP["$$\text{True Positive}$$"] + POS --> FN["$$\text{Type II Error}$$
$$\text{False Negative}$$"] + + NEG --> TN["$$\text{True Negative}$$"] + NEG --> FP["$$\text{Type I Error}$$
$$\text{False Positive}$$"] + + %% Costs + FP --> COST1["$$\text{Cost Depends on Context}$$"] + FN --> COST2["$$\text{Cost Depends on Context}$$"] + + %% Examples + COST1 --> EX1["$$\text{Example: Spam Filter}$$
$$\text{Important Email Blocked}$$"] + COST2 --> EX2["$$\text{Example: Medical Test}$$
$$\text{Disease Missed}$$"] + + %% Emphasis + EX1 -.->|"$$\text{Type I Cost High}$$"| FP + EX2 -.->|"$$\text{Type II Cost High}$$"| FN + +``` + +* **In Cancer Detection:** A **Type II Error (FN)** is much worse because a sick patient goes untreated. +* **In Spam Filtering:** A **Type I Error (FP)** is worse because an important work email is hidden in the trash. + +## 3. Implementation with Scikit-Learn + +```python +from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay +import matplotlib.pyplot as plt + +# Actual values and Model predictions +y_true = [0, 1, 0, 1, 0, 1, 1, 0] +y_pred = [0, 1, 1, 1, 0, 0, 1, 0] + +# 1. Generate the matrix +cm = confusion_matrix(y_true, y_pred) + +# 2. Visualize it +disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Negative', 'Positive']) +disp.plot(cmap=plt.cm.Blues) +plt.show() + +``` + +## 4. Multi-Class Confusion Matrices + +The matrix isn't just for binary problems. If you are classifying "Cat," "Dog," and "Bird," your matrix will be 3x3. The diagonal line from top-left to bottom-right represents correct predictions. Any numbers off that diagonal show you which animals the model is confusing. + +```mermaid +graph TB + TITLE["$$\text{Multi-Class Confusion Matrix (3×3)}$$"] + + %% Axes + TITLE --> ACT["$$\text{Actual Class}$$"] + TITLE --> PRED["$$\text{Predicted Class}$$"] + + ACT --> CAT_A["$$\text{Cat}$$"] + ACT --> DOG_A["$$\text{Dog}$$"] + ACT --> BIRD_A["$$\text{Bird}$$"] + + PRED --> CAT_P["$$\text{Cat}$$"] + PRED --> DOG_P["$$\text{Dog}$$"] + PRED --> BIRD_P["$$\text{Bird}$$"] + + %% Diagonal (Correct Predictions) + CAT_A --> CAT_P["$$\text{Cat → Cat}$$
$$\text{Correct}$$"] + DOG_A --> DOG_P["$$\text{Dog → Dog}$$
$$\text{Correct}$$"] + BIRD_A --> BIRD_P["$$\text{Bird → Bird}$$
$$\text{Correct}$$"] + + %% Off-Diagonal (Confusions) + CAT_A --> DOG_P["$$\text{Cat → Dog}$$
$$\text{Confusion}$$"] + CAT_A --> BIRD_P["$$\text{Cat → Bird}$$
$$\text{Confusion}$$"] + + DOG_A --> CAT_P["$$\text{Dog → Cat}$$
$$\text{Confusion}$$"] + DOG_A --> BIRD_P["$$\text{Dog → Bird}$$
$$\text{Confusion}$$"] + + BIRD_A --> CAT_P["$$\text{Bird → Cat}$$
$$\text{Confusion}$$"] + BIRD_A --> DOG_P["$$\text{Bird → Dog}$$
$$\text{Confusion}$$"] + + %% Emphasis + CAT_P -.->|"$$\text{Diagonal}$$"| GOOD["$$\text{Correct Predictions}$$"] + DOG_P -.->|"$$\text{Diagonal}$$"| GOOD + BIRD_P -.->|"$$\text{Diagonal}$$"| GOOD + + DOG_P -.->|"$$\text{Off-Diagonal}$$"| BAD["$$\text{Model Confusion}$$"] + BIRD_P -.->|"$$\text{Off-Diagonal}$$"| BAD + +``` + +## 5. Summary: What can we calculate from here? + +The Confusion Matrix is the "mother" of all classification metrics. From these four numbers, we derive: + +* **Accuracy:** +* **Precision:** +* **Recall:** +* **F1-Score:** The balance between Precision and Recall. + +## References + +* **StatQuest:** [Confusion Matrices Explained](https://www.youtube.com/watch?v=Kdsp6soqA7o) +* **Scikit-Learn:** [Confusion Matrix API](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) + +--- + +**Now that you can see where the model is making mistakes, let's learn how to turn those mistakes into a single score.** \ No newline at end of file diff --git a/docs/machine-learning/machine-learning-core/model-evaluation/metrics/f1-score.mdx b/docs/machine-learning/machine-learning-core/model-evaluation/metrics/f1-score.mdx index e69de29..7a780a1 100644 --- a/docs/machine-learning/machine-learning-core/model-evaluation/metrics/f1-score.mdx +++ b/docs/machine-learning/machine-learning-core/model-evaluation/metrics/f1-score.mdx @@ -0,0 +1,104 @@ +--- +title: "F1-Score: The Balanced Metric" +sidebar_label: F1-Score +description: "Mastering the harmonic mean of Precision and Recall to evaluate models on imbalanced datasets." +tags: [machine-learning, model-evaluation, metrics, f1-score, classification] +--- + +The **F1-Score** is a single metric that combines [Precision](./precision) and [Recall](./recall) into a single value. It is particularly useful when you have an imbalanced dataset and you need to find an optimal balance between "False Positives" and "False Negatives." + +## 1. The Mathematical Formula + +The F1-Score is the **harmonic mean** of Precision and Recall. Unlike a simple average, the harmonic mean punishes extreme values. If either Precision or Recall is very low, the F1-Score will also be low. + +$$ +F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} +$$ + + +### Why use the Harmonic Mean? + +If we used a standard arithmetic average, a model with 1.0 Precision and 0.0 Recall would have a "decent" score of 0.5. However, such a model is useless. The harmonic mean ensures that if one metric is 0, the total score is 0. + +## 2. When to Use the F1-Score + +F1-Score is the best choice when: + +1. **Imbalanced Classes:** You have a large number of "Negative" samples and few "Positive" ones (e.g., Fraud detection). +2. **Equal Importance:** You care equally about minimizing False Positives (Precision) and False Negatives (Recall). + +## 3. Visualizing the Balance + +Think of the F1-Score as a "balance scale." If you tilt too far toward catching everyone (Recall), your precision drops. If you tilt too far toward being perfectly accurate (Precision), you miss people. The F1-Score is highest when these two are in equilibrium. + +```mermaid +graph TB + SCALE["$$\text{F1-Score}$$
$$\text{Balance Scale}$$"] + + %% Precision Side + SCALE --> P["$$\text{Precision}$$"] + P --> P1["$$\text{Few False Positives}$$"] + P1 --> P2["$$\text{Strict Threshold}$$"] + P2 --> P3["$$\text{Misses True Positives}$$"] + P3 --> P4["$$\text{Low Recall}$$"] + + %% Recall Side + SCALE --> R["$$\text{Recall}$$"] + R --> R1["$$\text{Few False Negatives}$$"] + R1 --> R2["$$\text{Loose Threshold}$$"] + R2 --> R3["$$\text{Many False Positives}$$"] + R3 --> R4["$$\text{Low Precision}$$"] + + %% Balance Point + P4 -.->|"$$\text{Too Strict}$$"| UNBAL["$$\text{Unbalanced Model}$$"] + R4 -.->|"$$\text{Too Loose}$$"| UNBAL + + P --> BAL["$$\text{Equilibrium}$$"] + R --> BAL + + BAL --> F1["$$\text{F1} = 2 \cdot \frac{P \cdot R}{P + R}$$"] + F1 --> OPT["$$\text{Maximum F1-Score}$$"] + +``` + +## 4. Implementation with Scikit-Learn + +```python +from sklearn.metrics import f1_score + +# Actual target values +y_true = [0, 1, 1, 0, 1, 1, 0] + +# Model predictions +y_pred = [0, 1, 0, 0, 1, 1, 1] + +# Calculate F1-Score +score = f1_score(y_true, y_pred) + +print(f"F1-Score: {score:.2f}") +# Output: F1-Score: 0.75 + +``` + +## 5. Summary Table: Which Metric to Trust? + +| Scenario | Best Metric | Why? | +| --- | --- | --- | +| **Balanced Data** | **Accuracy** | Simple and representative. | +| **Spam Filter** | **Precision** | False Positives (real mail in spam) are very bad. | +| **Cancer Screen** | **Recall** | False Negatives (missing a sick patient) are fatal. | +| **Fraud Detection** | **F1-Score** | Need to catch thieves (Recall) without blocking everyone (Precision). | + +## 6. Beyond Binary: Macro vs. Weighted F1 + +If you have more than two classes (Multi-class classification), you'll see these options: + +* **Macro F1:** Calculates F1 for each class and takes the unweighted average. Treats all classes as equal. +* **Weighted F1:** Calculates F1 for each class but weights them by the number of samples in that class. + +## References + +* **Scikit-Learn:** [F1 Score Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html) +* **Towards Data Science:** [The F1 Score Paradox](https://towardsdatascience.com/the-f1-score-2236378a31). + +**The F1-Score gives us a snapshot at a single threshold. But how do we evaluate a model's performance across ALL possible thresholds?** \ No newline at end of file diff --git a/docs/machine-learning/machine-learning-core/model-evaluation/metrics/log-loss.mdx b/docs/machine-learning/machine-learning-core/model-evaluation/metrics/log-loss.mdx index e69de29..03742fa 100644 --- a/docs/machine-learning/machine-learning-core/model-evaluation/metrics/log-loss.mdx +++ b/docs/machine-learning/machine-learning-core/model-evaluation/metrics/log-loss.mdx @@ -0,0 +1,125 @@ +--- +title: "Log Loss (Logarithmic Loss): The Probability Penalty" +sidebar_label: Log Loss +description: "Understanding cross-entropy loss and why it is the gold standard for evaluating probability-based classifiers." +tags: [machine-learning, model-evaluation, metrics, log-loss, classification, probability] +--- + +**Log Loss**, also known as **Cross-Entropy Loss**, is a performance metric that evaluates a classification model based on its **predicted probabilities**. Unlike [Accuracy](./accuracy), which only looks at the final label, Log Loss punishes models that are "confidently wrong." + +:::note +**Prerequisites:** Familiarity with basic classification concepts like **predicted probabilities** and **binary labels** (0 and 1). 
If you're new to these concepts, consider reviewing the [Confusion Matrix](./confusion-matrix) documentation first. +::: + +## 1. The Core Intuition: The Penalty System + +Log Loss measures the "closeness" of a prediction probability to the actual binary label ($0$ or $1$). + +* If the actual label is **1** and the model predicts **0.99**, the Log Loss is very low. +* If the actual label is **1** and the model predicts **0.01**, the Log Loss is extremely high. + +**Crucially:** Log Loss penalizes wrong predictions exponentially. It is better to be "unsure" (0.5) than to be "confidently wrong" (0.01 when the answer is 1). + +## 2. The Mathematical Formula + +For a binary classification problem, the Log Loss is calculated as: + +$$ +\text{Log Loss} = -\frac{1}{N} \sum_{i=1}^{N} [y_i \log(p_i) + (1 - y_i) \log(1 - p_i)] +$$ + +**Where:** + +* **$N$:** Total number of samples. +* **$y_i$:** Actual label (0 or 1). +* **$p_i$:** Predicted probability of the sample belonging to class 1. +* **$\log$:** The natural logarithm (base $e$). + +**Breaking it down:** +* If the actual label $y_i$ is **1**, the formula simplifies to $-\log(p_i)$. The closer $p_i$ is to 1, the lower the loss. +* If the actual label $y_i$ is **0**, the formula simplifies to $-\\log(1 - p_i)$. The closer $p_i$ is to 0, the lower the loss. + + +## 3. Comparison with Accuracy + +Imagine two models predicting a single sample where the true label is **1**. + +| Model | Predicted Probability | Prediction (Threshold 0.5) | Accuracy | Log Loss | +| :--- | :--- | :--- | :--- | :--- | +| **Model A** | **0.95** | Correct | 100% | **Low** (Good) | +| **Model B** | **0.51** | Correct | 100% | **High** (Weak) | + +Even though both models have the same **Accuracy**, Log Loss tells us that **Model A** is superior because it is more certain about the correct answer. + +## 4. Implementation with Scikit-Learn + +To calculate Log Loss, you must use `predict_proba()` to get the raw probabilities. + +```python +from sklearn.metrics import log_loss + +# Actual labels +y_true = [1, 0, 1, 1] + +# Predicted probabilities for the '1' class +y_probs = [0.9, 0.1, 0.8, 0.4] + +# Calculate Log Loss +score = log_loss(y_true, y_probs) + +print(f"Log Loss: {score:.4f}") +# Output: Log Loss: 0.3522 + +``` + +## 5. Pros and Cons + +| Advantages | Disadvantages | +| --- | --- | +| **Probability-Focused:** Captures the nuances of model confidence. | **Non-Intuitive:** A value of "0.21" is harder to explain to a business client than "90% accuracy." | +| **Optimizable:** This is the loss function used to train models like Logistic Regression and Neural Networks. | **Sensitive to Outliers:** A single prediction of 0% probability for a class that turns out to be true will result in a Log Loss of **infinity**. | + +--- + +## 6. Key Takeaway: The "Sureness" Factor + +A "perfect" model has a Log Loss of 0. A baseline model that simply predicts a 50/50 chance for every sample will have a Log Loss of approximately 0.693 ($-\ln(0.5)$). If your model's Log Loss is higher than 0.693, it is performing worse than a random guess! 
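As a quick sanity check, here is a minimal sketch (with hypothetical labels and probabilities) that reproduces the 50/50 baseline of roughly 0.693 and shows how harshly a single confidently wrong prediction is penalized:

```python
from sklearn.metrics import log_loss

# Hypothetical labels: a "coin flip" model that predicts 0.5 for everyone
y_true = [1, 1, 0, 1, 0]
baseline = log_loss(y_true, [0.5] * len(y_true))   # -ln(0.5) is about 0.693

# A single positive sample: unsure vs. confidently wrong
unsure = log_loss([1], [0.50], labels=[0, 1])       # -ln(0.50) is about 0.693
wrong  = log_loss([1], [0.01], labels=[0, 1])       # -ln(0.01) is about 4.605

print(f"Baseline (50/50):  {baseline:.3f}")
print(f"Unsure (p=0.50):   {unsure:.3f}")
print(f"Confidently wrong: {wrong:.3f}")
```

A single 0.01 prediction on a true positive already costs more than six coin-flip guesses combined, and a prediction of exactly 0 would push the loss toward infinity.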
+ +```mermaid +graph TB + TITLE["$$\text{Log Loss: The Sureness Factor}$$"] + + TITLE --> SCALE["$$\text{Lower Log Loss Means Higher Confidence}$$"] + + %% Perfect Model + SCALE --> PERFECT["$$\text{Perfect Model}$$"] + PERFECT --> L0["$$\text{Log Loss} = 0$$"] + L0 --> C0["$$\text{Always Correct with Full Confidence}$$"] + + %% Good Model + SCALE --> GOOD["$$\text{Good Model}$$"] + GOOD --> LG["$$\text{Log Loss less than } 0.693$$"] + LG --> CG["$$\text{Mostly Correct and Well-Calibrated}$$"] + + %% Random Guess Baseline + SCALE --> RAND["$$\text{Random Guess Baseline}$$"] + RAND --> L05["$$-\ln(0.5) \approx 0.693$$"] + L05 --> C05["$$\text{Predicts 50/50 for Every Sample}$$"] + + %% Worse Than Random + SCALE --> WORSE["$$\text{Worse Than Random}$$"] + WORSE --> LW["$$\text{Log Loss greater than } 0.693$$"] + LW --> CW["$$\text{Confident but Frequently Wrong}$$"] + + LW -.->|"$$\text{Danger Zone}$$"| WARN["$$\text{Model is Misleading}$$"] + +``` + +## References + +* **Scikit-Learn:** [Log Loss Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html) +* **Machine Learning Mastery:** [A Gentle Introduction to Cross-Entropy](https://machinelearningmastery.com/cross-entropy-for-machine-learning/) + +--- + +**We have explored almost every way to evaluate a classifier. Now, let's switch gears and look at how we measure errors in numbers and values.** \ No newline at end of file diff --git a/docs/machine-learning/machine-learning-core/model-evaluation/metrics/precision.mdx b/docs/machine-learning/machine-learning-core/model-evaluation/metrics/precision.mdx index e69de29..17ed6aa 100644 --- a/docs/machine-learning/machine-learning-core/model-evaluation/metrics/precision.mdx +++ b/docs/machine-learning/machine-learning-core/model-evaluation/metrics/precision.mdx @@ -0,0 +1,102 @@ +--- +title: "Precision: The Quality Metric for Positive Predictions" +sidebar_label: Precision +description: "Understanding Precision, its mathematical foundation, and why it is vital for minimizing False Positives." +tags: [machine-learning, model-evaluation, metrics, classification, precision] +--- + +**Precision** (also known as Positive Predictive Value) measures the accuracy of the model's positive predictions. It answers the question: *"Of all the times the model predicted 'Positive', how many were actually 'Positive'?"* + +## 1. The Mathematical Formula + +Precision is calculated by taking the number of correctly predicted positive results and dividing it by the total number of positive predictions made by the model. + +$$ +\text{Precision} = \frac{TP}{TP + FP} +$$ + +Where: + +* **TP (True Positives):** Correctly predicted positive samples. +* **FP (False Positives):** The "False Alarms"—cases where the model predicted positive, but it was actually negative. + +## 2. When Precision is the Priority + +You should prioritize Precision when the **cost of a False Positive is high**. In other words, you want to be very sure when you cry "wolf." + +### Real-World Example: Spam Filtering + +* **Positive Class:** An email is Spam. +* **False Positive:** A legitimate email from your boss is marked as "Spam." +* **The Goal:** We want high Precision. It is better to let a few spam emails into the Inbox (low Recall) than to accidentally hide an important work email (low Precision). 
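To connect the spam example back to the formula, here is a minimal sketch that recovers Precision directly from the confusion-matrix counts. It reuses the same hypothetical labels as the Scikit-Learn example further down (1 = Spam):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels (1 = Spam, 0 = Legitimate), matching the example below
y_true = [0, 1, 0, 0, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 1, 0]

# For binary labels, confusion_matrix returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)
print(f"TP={tp}, FP={fp} -> Precision = {precision:.2f}")
# Output: TP=2, FP=1 -> Precision = 0.67
```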
+ +```mermaid +graph LR + MAIL["$$\text{Incoming Emails}$$"] + + MAIL --> LEGIT["$$\text{Legitimate Emails}$$"] + MAIL --> SPAM["$$\text{Spam Emails}$$"] + + %% Classifier Decisions + LEGIT --> FP["$$\text{❌ False Positive}$$
$$\text{Legit → Spam}$$"] + SPAM --> TP["$$\text{✅ True Positive}$$
$$\text{Spam → Spam}$$"] + + %% Inbox / Spam Folder + FP --> SPAMBOX["$$\text{Spam Folder}$$"] + TP --> SPAMBOX + + LEGIT --> INBOX["$$\text{Inbox}$$"] + SPAM --> FN["$$\text{False Negative}$$
$$\text{Spam → Inbox}$$"] + FN --> INBOX + + %% Precision Highlight + TP --> PREC["$$\text{Precision} = \frac{TP}{TP + FP}$$"] + FP --> PREC + + FP -.->|"$$\text{High Cost Error}$$"| ALERT["$$\text{Important Email Missed!}$$"] + +``` + +## 3. The Precision-Recall Trade-off + +There is usually a tug-of-war between Precision and Recall. + +* If you make your model extremely "picky" (only predicting positive when it is 99.9% certain), your **Precision will increase**, but you will miss many actual positive cases (**Recall will decrease**). +* Conversely, if your model is very "sensitive" and flags everything that looks remotely suspicious, your **Recall will increase**, but you will get many false alarms (**Precision will decrease**). + +## 4. Implementation with Scikit-Learn + +```python +from sklearn.metrics import precision_score + +# Actual target values (e.g., 1 = Spam, 0 = Inbox) +y_true = [0, 1, 0, 0, 1, 1, 0] + +# Model predictions +y_pred = [0, 0, 1, 0, 1, 1, 0] + +# Calculate Precision +# The 'pos_label' parameter specifies which class is considered "Positive" +score = precision_score(y_true, y_pred) + +print(f"Precision Score: {score:.2f}") +# Output: Precision Score: 0.67 +# (Out of 3 'Spam' predictions, only 2 were actually Spam) + +``` + +## 5. Pros and Cons + +| Advantages | Disadvantages | +| --- | --- | +| **Minimizes False Alarms:** Crucial for user trust (e.g., avoiding wrong medical diagnoses). | **Ignores Missed Cases:** Doesn't care about the positive cases the model missed completely. | +| **High Specificity:** Focuses purely on the quality of the positive class predictions. | **Can be Manipulated:** A model can have 100% precision by only making one single, very safe prediction. | + +## References + +* **Scikit-Learn Documentation:** [Precision-Recall-F1](https://scikit-learn.org/stable/modules/model_evaluation.html#precision-recall-f-measure-metrics) +* **StatQuest:** [Precision and Recall Clearly Explained](https://www.youtube.com/watch?v=Kdsp6soqA7o) + +--- + +**Precision tells us how "reliable" our positive predictions are. But what about the cases we missed entirely?** \ No newline at end of file diff --git a/docs/machine-learning/machine-learning-core/model-evaluation/metrics/recall.mdx b/docs/machine-learning/machine-learning-core/model-evaluation/metrics/recall.mdx index e69de29..be1b0e1 100644 --- a/docs/machine-learning/machine-learning-core/model-evaluation/metrics/recall.mdx +++ b/docs/machine-learning/machine-learning-core/model-evaluation/metrics/recall.mdx @@ -0,0 +1,110 @@ +--- +title: "Recall: The Sensitivity Metric" +sidebar_label: Recall +description: "Understanding Recall, its mathematical definition, and why it is critical for minimizing False Negatives." +tags: [machine-learning, model-evaluation, metrics, classification, recall] +--- + +**Recall**, also known as **Sensitivity** or **True Positive Rate (TPR)**, measures the model's ability to find all the positive samples in a dataset. It answers the question: *"Of all the actual positive cases that exist, how many did the model correctly identify?"* + +## 1. The Mathematical Formula + +Recall is calculated by dividing the number of correctly predicted positive results by the total number of actual positives (those we caught + those we missed). + +$$ +\text{Recall} = \frac{TP}{TP + FN} +$$ + +Where: + +* **TP (True Positives):** Correctly predicted positive samples. +* **FN (False Negatives):** The "Misses"—cases that were actually positive, but the model incorrectly labeled them as negative. 
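Before reaching for the library call, here is a minimal sketch that derives Recall from the raw counts. It reuses the same hypothetical labels as the Scikit-Learn example further down (1 = Sick):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels (1 = Sick, 0 = Healthy), matching the example below
y_true = [1, 1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0]

# For binary labels, confusion_matrix returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

recall = tp / (tp + fn)
print(f"TP={tp}, FN={fn} -> Recall = {recall:.2f}")
# Output: TP=3, FN=2 -> Recall = 0.60
```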
+ +## 2. When Recall is the Top Priority + +You should prioritize Recall when the **cost of a False Negative is extremely high**. In these scenarios, it is better to have a few "false alarms" than to miss a single positive case. + +### Real-World Example: Cancer Detection + +Cancer Detection + +* **Positive Class:** Patient has cancer. +* **False Negative:** The patient has cancer, but the model says they are "Healthy." +* **The Consequence:** The patient does not receive treatment, which could be fatal. +* **The Goal:** We want **100% Recall**. We would rather tell a healthy person they need more tests (False Positive) than tell a sick person they are fine (False Negative). + +```mermaid +graph LR + POP["$$\text{Patients}$$"] + + POP --> C["$$\text{Has Cancer (Positive)}$$"] + POP --> H["$$\text{Healthy (Negative)}$$"] + + %% Model Predictions + C --> FN["$$\text{❌ False Negative}$$
$$\text{Cancer → Healthy}$$"] + C --> TP["$$\text{✅ True Positive}$$
$$\text{Cancer → Cancer}$$"] + + H --> FP["$$\text{False Positive}$$
$$\text{Healthy → Cancer}$$"] + H --> TN["$$\text{True Negative}$$
$$\text{Healthy → Healthy}$$"] + + %% Consequences + FN --> RISK["$$\text{🚨 No Treatment Given}$$"] + RISK --> FATAL["$$\text{Potentially Fatal Outcome}$$"] + + FP --> SAFE["$$\text{Extra Tests / Monitoring}$$"] + SAFE --> OK["$$\text{Acceptable Cost}$$"] + + %% Recall Highlight + TP --> REC["$$\text{Recall} = \frac{TP}{TP + FN}$$"] + FN --> REC + + FN -.->|"$$\text{Most Dangerous Error}$$"| FATAL + +``` + +## 3. The Precision-Recall Inverse Relationship + +As you saw in the [Precision module](./precision), there is an inherent trade-off. + +* **To increase Recall:** You can make your model "less strict." If a bank flags *every* transaction as potentially fraudulent, it will have 100% Recall (it caught every thief), but its Precision will be terrible (it blocked every honest customer too). +* **To increase Precision:** You make the model "more strict," which inevitably leads to missing some positive cases, thereby lowering Recall. + +## 4. Implementation with Scikit-Learn + +```python +from sklearn.metrics import recall_score + +# Actual target values (1 = Sick, 0 = Healthy) +y_true = [1, 1, 1, 0, 1, 0, 1] + +# Model predictions +y_pred = [1, 0, 1, 0, 1, 0, 0] + +# Calculate Recall +score = recall_score(y_true, y_pred) + +print(f"Recall Score: {score:.2f}") +# Output: Recall Score: 0.60 +# (We found 3 out of 5 sick people; we missed 2) + +``` + +## 5. Summary Table: Precision vs. Recall + +| Metric | Focus | Goal | Failure Mode | +| --- | --- | --- | --- | +| **Precision** | Quality | "Don't cry wolf." | High Precision misses many real cases. | +| **Recall** | Quantity | "Leave no one behind." | High Recall creates many false alarms. | + +## 6. How to Balance Both? + +If you need a single number that accounts for both the "False Alarms" of Precision and the "Misses" of Recall, you need the **F1-Score**. + +## References + +* **Google Machine Learning Crash Course:** [Recall Metric](https://developers.google.com/machine-learning/crash-course/classification/precision-and-recall) +* **Wikipedia:** [Sensitivity and Specificity](https://en.wikipedia.org/wiki/Sensitivity_and_specificity) + +--- + +**Is your model struggling to choose between Precision and Recall? Let's look at the "middle ground" metric.** \ No newline at end of file diff --git a/docs/machine-learning/machine-learning-core/model-evaluation/metrics/roc-auc.mdx b/docs/machine-learning/machine-learning-core/model-evaluation/metrics/roc-auc.mdx index e69de29..007b1e7 100644 --- a/docs/machine-learning/machine-learning-core/model-evaluation/metrics/roc-auc.mdx +++ b/docs/machine-learning/machine-learning-core/model-evaluation/metrics/roc-auc.mdx @@ -0,0 +1,123 @@ +--- +title: ROC Curve and AUC +sidebar_label: ROC & AUC +description: "Evaluating classifier performance across all thresholds using the Receiver Operating Characteristic and Area Under the Curve." +tags: [machine-learning, model-evaluation, metrics, roc, auc, classification] +--- + +In classification tasks, especially binary classification, it's crucial to evaluate how well a model distinguishes between the positive and negative classes. The **ROC Curve (Receiver Operating Characteristic Curve)** and **AUC (Area Under the Curve)** are powerful tools for this purpose. + +> While metrics like **Accuracy** and **F1-Score** evaluate a model based on a single "cut-off" or threshold (usually 0.5), **ROC** and **AUC** help us evaluate how well the model separates the two classes across **all possible thresholds**. 
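To make that idea concrete, here is a minimal sketch (with made-up probability scores) showing that every threshold produces its own confusion matrix; ROC and AUC summarize the model across all of these snapshots at once:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical probability scores for the positive class
y_true  = [0, 0, 1, 1, 0, 1]
y_probs = [0.2, 0.4, 0.45, 0.6, 0.7, 0.9]

# Each threshold yields a different confusion matrix (a different "snapshot")
for threshold in (0.3, 0.5, 0.7):
    y_pred = [int(p >= threshold) for p in y_probs]
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(f"threshold={threshold}: TP={tp} FP={fp} FN={fn} TN={tn}")

# Output:
# threshold=0.3: TP=3 FP=2 FN=0 TN=1
# threshold=0.5: TP=2 FP=1 FN=1 TN=2
# threshold=0.7: TP=1 FP=1 FN=2 TN=2
```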
+ +:::note +**Prerequisites:** Familiarity with basic classification metrics like **True Positives (TP)**, **False Positives (FP)**, **True Negatives (TN)**, and **False Negatives (FN)**. If you're new to these concepts, consider reviewing the [Confusion Matrix](./confusion-matrix) documentation first. +::: + +## 1. Defining the Terms + +To understand the ROC curve, we need to look at two specific rates: + +1. **True Positive Rate (TPR) / Recall:** $TPR = \frac{TP}{TP + FN}$ + (How many of the actual positives did we catch?) +2. **False Positive Rate (FPR):** $FPR = \frac{FP}{FP + TN}$ + (How many of the actual negatives did we incorrectly flag as positive?) + +## 2. The ROC Curve (Receiver Operating Characteristic) + +The ROC curve is a plot of **TPR (Y-axis)** against **FPR (X-axis)**. + +* **Each point** on the curve represents a TPR/FPR pair corresponding to a particular decision threshold. +* **A "Perfect" Classifier** would have a curve that goes straight up the Y-axis and then across (covering the top-left corner). +* **A Random Classifier** (like flipping a coin) is represented by a 45-degree diagonal line. + +## 3. AUC (Area Under the Curve) + +The **AUC** is the literal area under the ROC curve. It provides an aggregate measure of performance across all possible classification thresholds. + +* **AUC = 1.0:** A perfect model (100% correct predictions). +* **AUC = 0.5:** A useless model (no better than random guessing). +* **AUC = 0.0:** A model that is perfectly wrong (it predicts the exact opposite of the truth). + +**Interpretation:** AUC can be thought of as the probability that the model will rank a randomly chosen positive instance higher than a randomly chosen negative one. + +## 4. Why use ROC-AUC? + +1. **Scale-Invariant:** It measures how well predictions are ranked, rather than their absolute values. +2. **Threshold-Invariant:** It evaluates the model's performance without having to choose a specific threshold. This is great if you haven't decided yet how "picky" the model should be. +3. **Balanced Evaluation:** It is highly effective for comparing different models against each other on the same dataset. + +## 5. Implementation with Scikit-Learn + +To calculate AUC, you usually need the **prediction probabilities** rather than the hard class labels. + +```python +from sklearn.metrics import roc_curve, roc_auc_score +import matplotlib.pyplot as plt + +# 1. Get probability scores from the model +# (Assume model is already trained) +y_probs = model.predict_proba(X_test)[:, 1] + +# 2. Calculate AUC +auc_value = roc_auc_score(y_test, y_probs) +print(f"AUC Score: {auc_value:.4f}") + +# 3. Generate the curve points +fpr, tpr, thresholds = roc_curve(y_test, y_probs) + +# 4. Plotting +plt.plot(fpr, tpr, label=f'AUC = {auc_value:.2f}') +plt.plot([0, 1], [0, 1], 'k--') # Diagonal random line +plt.xlabel('False Positive Rate') +plt.ylabel('True Positive Rate') +plt.title('ROC Curve') +plt.legend() +plt.show() + +``` + +## 6. The Logic of Overlap + +The higher the AUC, the less the "Positive" and "Negative" probability distributions overlap. When the overlap is zero, the model can perfectly distinguish between the two. + +```mermaid +graph TB + TITLE["$$\text{ROC–AUC: Logic of Overlap}$$"] + + %% Distributions + TITLE --> NEG["$$\text{Negative Class}$$
$$P(\hat{y}\mid y=0)$$"] + TITLE --> POS["$$\text{Positive Class}$$
$$P(\hat{y}\mid y=1)$$"] + + %% High Overlap + NEG --> HO["$$\text{High Overlap}$$"] + POS --> HO + HO --> AUC1["$$\text{AUC} \approx 0.5$$"] + AUC1 --> R1["$$\text{Random Guessing}$$"] + + %% Medium Overlap + NEG --> MO["$$\text{Partial Overlap}$$"] + POS --> MO + MO --> AUC2["$$0.7 \le \text{AUC} \le 0.85$$"] + AUC2 --> R2["$$\text{Useful but Imperfect Separation}$$"] + + %% Zero Overlap + NEG --> ZO["$$\text{Zero Overlap}$$"] + POS --> ZO + ZO --> AUC3["$$\text{AUC} = 1.0$$"] + AUC3 --> R3["$$\text{Perfect Classifier}$$"] + + %% Threshold Intuition + R1 -.->|"$$\text{Any Threshold Fails}$$"| TH["$$\text{Decision Threshold}$$"] + R2 -.->|"$$\text{Some Errors}$$"| TH + R3 -.->|"$$\text{Perfect Split}$$"| TH + +``` + +## References + +* **Google Machine Learning Crash Course:** [ROC Curve and AUC](https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc) +* **StatQuest:** [ROC and AUC Clearly Explained](https://www.youtube.com/watch?v=4jRBRDbJemM) + +--- + +**We have mastered classification metrics. But how do we evaluate a model that predicts continuous numbers, like house prices or stock trends?** \ No newline at end of file diff --git a/docs/machine-learning/machine-learning-core/model-evaluation/why-evaluation-matters.mdx b/docs/machine-learning/machine-learning-core/model-evaluation/why-evaluation-matters.mdx index e69de29..10f3e2f 100644 --- a/docs/machine-learning/machine-learning-core/model-evaluation/why-evaluation-matters.mdx +++ b/docs/machine-learning/machine-learning-core/model-evaluation/why-evaluation-matters.mdx @@ -0,0 +1,78 @@ +--- +title: Why Model Evaluation Matters +sidebar_label: Why Evaluation Matters +description: "Understanding the difference between training performance and real-world reliability." +tags: [machine-learning, model-evaluation, metrics, generalization] +--- + +Building a machine learning model is only half the battle. The most dangerous mistake a Data Scientist can make is assuming that a model with **99% accuracy** on the training data will perform just as well in the real world. + +**Model Evaluation** is the process of using different metrics and validation strategies to understand how well your model generalizes to data it has never seen before. + +## 1. The Trap of "Memorization" (Overfitting) + +If you give a student the exact same questions from their textbook on their final exam, they might get a 100% just by memorizing the answers. However, if you give them a new problem and they fail, they haven't actually learned the subject. + +In Machine Learning, this is called **Overfitting**. + +* **Training Error:** How well the model performs on the data it studied. +* **Generalization Error:** How well the model performs on new, unseen data. + +**The Goal:** We want to minimize the Generalization Error, not just the Training Error. + +## 2. The Bias-Variance Tradeoff + +Every model's error can be broken down into two main components: + +### Bias (Underfitting) +The error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs. +* *Analogy:* Trying to fit a straight line through a curved set of points. + +### Variance (Overfitting) + +The error from sensitivity to small fluctuations in the training set. High variance can cause an algorithm to model the random noise in the training data. +* *Analogy:* Following every single data point so closely that the model becomes "wiggly." 
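To see this trade-off in numbers, here is a minimal, self-contained sketch on synthetic data (all values are illustrative) comparing an underfit straight line with an overfit high-degree polynomial. The underfit model tends to show high error on both splits, while the overfit model tends to score well on the training set but noticeably worse on held-out data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic curved data with noise (illustrative only)
rng = np.random.RandomState(42)
X = np.sort(rng.uniform(0, 1, 60)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=60)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# degree=1 -> high bias (underfits), degree=15 -> high variance (overfits)
for degree in (1, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree:>2}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
```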
+ +```mermaid +graph LR + A[Low Bias / High Variance] --- B((Optimal Balance)) + B --- C[High Bias / Low Variance] + + style B fill:#e8f5e9,stroke:#2e7d32,stroke-width:4px,color:#333 + +``` + +## 3. The Evaluation Workflow + +To evaluate a model properly, we never use the same data for training and testing. We typically split our dataset into three parts: + +| Split | Purpose | +| --- | --- | +| **Training Set** | Used to teach the model (The "Textbook"). | +| **Validation Set** | Used to tune hyperparameters and pick the best model version. | +| **Test Set** | The "Final Exam." Used only once at the very end to see real-world performance. | + +## 4. Why Accuracy Isn't Enough + +Imagine a model designed to detect a very rare disease that only affects 1 in 1,000 people. +If the model simply predicts **"Healthy"** for everyone, it will be **99.9% accurate**. + +However, it is a **useless model** because it failed to find the 1 person who was actually sick. This is why we need more advanced metrics like: + +* **Precision & Recall** (For Classification) +* **Mean Absolute Error** (For Regression) +* **F1-Score** (For Imbalanced Data) + +## 5. The Evaluation Roadmap + +In the upcoming chapters, we will dive deep into specific evaluation tools: + +1. **Confusion Matrices:** Seeing exactly where your classifier is getting confused. +2. **ROC & AUC:** Understanding the trade-off between sensitivity and specificity. +3. **Cross-Validation:** Making the most of limited data. +4. **Regression Metrics:** Measuring the "distance" between reality and prediction. + +## References + +* **Google Machine Learning Crash Course:** [Generalization](https://developers.google.com/machine-learning/crash-course/generalization/video-lecture) +* **StatQuest:** [Bias and Variance](https://www.youtube.com/watch?v=EuBBz3bI-aA) \ No newline at end of file diff --git a/static/img/tutorials/ml/cancer-detection.jpg b/static/img/tutorials/ml/cancer-detection.jpg new file mode 100644 index 0000000..6f4cba3 Binary files /dev/null and b/static/img/tutorials/ml/cancer-detection.jpg differ