diff --git a/docs/machine-learning/machine-learning-core/model-evaluation/metrics/accuracy.mdx b/docs/machine-learning/machine-learning-core/model-evaluation/metrics/accuracy.mdx index e69de29..dce4642 100644 --- a/docs/machine-learning/machine-learning-core/model-evaluation/metrics/accuracy.mdx +++ b/docs/machine-learning/machine-learning-core/model-evaluation/metrics/accuracy.mdx @@ -0,0 +1,128 @@ +--- +title: "Accuracy: The Intuitive Metric" +sidebar_label: Accuracy +description: "Understanding the most common evaluation metric, its formula, and its fatal flaws in imbalanced datasets." +tags: [machine-learning, model-evaluation, metrics, classification] +--- + +**Accuracy** is the most basic and intuitive metric used to evaluate a classification model. In simple terms, it answers the question: *"Out of all the predictions made, how many were correct?"* + +## 1. The Mathematical Formula + +Accuracy is calculated by dividing the number of correct predictions by the total number of input samples. + +Using the components of a [Confusion Matrix](./confusion-matrix), the formula is: + +$$ +\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} +$$ + +Where: + +* **TP (True Positives):** Correctly predicted positive samples. +* **TN (True Negatives):** Correctly predicted negative samples. +* **FP (False Positives):** Incorrectly predicted as positive. +* **FN (False Negatives):** Incorrectly predicted as negative. + +**Example:** + +Imagine you have a dataset of 100 emails, where 80 are spam and 20 are not spam. Your model makes the following predictions: + +| Actual \ Predicted | Spam | Not Spam | +| --- | --- | --- | +| **Spam** | 70 (TP) | 10 (FN) | +| **Not Spam** | 5 (FP) | 15 (TN) | + +Using the formula: + +$$ +\text{Accuracy} = \frac{70 + 15}{70 + 15 + 5 + 10} = \frac{85}{100} = 0.85 \text{ or } 85\% +$$ + +This means your model correctly identified 85% of the emails. + +## 2. When Accuracy Works Best + +Accuracy is a reliable metric **only** when your dataset is **balanced**. + +* **Example:** You are building a model to classify images as either "Cats" or "Dogs." Your dataset has 500 cats and 500 dogs. +* If your model gets an accuracy of 90%, you can be confident that it is performing well across both categories. + +## 3. The "Accuracy Paradox" (Imbalanced Data) + +Accuracy becomes highly misleading when one class significantly outweighs the other. This is known as the **Accuracy Paradox**. + +### The Scenario: + +Imagine a Rare Disease test where only **1%** of the population is actually sick. + +1. If a "lazy" model is programmed to simply say **"Healthy"** for every single patient... +2. It will be **99% accurate**. + +```mermaid +graph LR + POP["$$\text{Population (100\%)}$$"] + + POP --> H["$$99\% \ \text{Healthy}$$"] + POP --> S["$$1\% \ \text{Sick (Rare Disease)}$$"] + + %% Lazy Model + H --> PH["$$\text{Predicted: Healthy}$$"] + S --> PS["$$\text{Predicted: Healthy}$$"] + + PH --> ACC1["$$\text{True Negatives (99\%)}$$"] + PS --> ERR1["$$\text{False Negatives (1\%)}$$"] + + ACC1 --> MET["$$\text{Accuracy} = \frac{99}{100} = 99\%$$"] + + ERR1 --> FAIL["$$\text{❌ All Sick Patients Missed}$$"] + + MET -.->|"$$\text{Accuracy Paradox}$$"| FAIL + +``` + +**The problem?** Even though the accuracy is 99%, the model failed to find the 1% of people who actually need help. In high-stakes fields like medicine or fraud detection, accuracy is often the least important metric. + +## 4. 
Implementation with Scikit-Learn + +```python +from sklearn.metrics import accuracy_score + +# Actual target values +y_true = [0, 1, 1, 0, 1, 1] + +# Model predictions +y_pred = [0, 1, 0, 0, 1, 1] + +# Calculate Accuracy +score = accuracy_score(y_true, y_pred) + +print(f"Accuracy: {score * 100:.2f}%") +# Output: Accuracy: 83.33% + +``` + +## 5. Pros and Cons + +| Advantages | Disadvantages | +| --- | --- | +| **Simple to understand:** Easy to explain to non-technical stakeholders. | **Useless for Imbalance:** Can hide poor performance on minority classes. | +| **Single Number:** Provides a quick, high-level overview of model health. | **Ignores Probability:** Doesn't tell you how confident the model was in its choice. | +| **Standardized:** Used across almost every classification project. | **Cost Blind:** Treats "False Positives" and "False Negatives" as equally bad. | + +## 6. How to move beyond Accuracy? + +To get a true picture of your model's performance—especially if your data is "skewed"—you should look at Accuracy alongside: + +* **Precision:** How many of the predicted positives were actually positive? +* **Recall:** How many of the actual positives did we successfully find? +* **F1-Score:** The harmonic mean of Precision and Recall. + +## References + +* **Google Developers:** [Classification: Accuracy](https://developers.google.com/machine-learning/crash-course/classification/accuracy) +* **StatQuest:** [Accuracy, Precision, and Recall](https://www.youtube.com/watch?v=Kdsp6soqA7o) + +--- + +**If Accuracy isn't enough to catch rare diseases or credit card fraud, what is?** Stay tuned for our next chapter on **Precision & Recall** to find out! \ No newline at end of file diff --git a/docs/machine-learning/machine-learning-core/model-evaluation/metrics/confusion-matrix.mdx b/docs/machine-learning/machine-learning-core/model-evaluation/metrics/confusion-matrix.mdx index e69de29..63deaff 100644 --- a/docs/machine-learning/machine-learning-core/model-evaluation/metrics/confusion-matrix.mdx +++ b/docs/machine-learning/machine-learning-core/model-evaluation/metrics/confusion-matrix.mdx @@ -0,0 +1,143 @@ +--- +title: The Confusion Matrix +sidebar_label: Confusion Matrix +description: "The foundation of classification evaluation: True Positives, False Positives, True Negatives, and False Negatives." +tags: [machine-learning, model-evaluation, metrics, classification, confusion-matrix] +--- + +A **Confusion Matrix** is a table used to describe the performance of a classification model. While "Accuracy" tells you how often the model is correct, the Confusion Matrix tells you exactly **how** it is failing and which classes are being swapped. + +## 1. The 2x2 Layout + +For a binary classification (Yes/No, Spam/Ham), the matrix consists of four quadrants: + +| | Predicted: **Negative** | Predicted: **Positive** | +| :--- | :--- | :--- | +| **Actual: Negative** | **True Negative (TN)** | **False Positive (FP)** | +| **Actual: Positive** | **False Negative (FN)** | **True Positive (TP)** | + +### Breaking Down the Quadrants: +* **True Positive (TP):** You predicted positive, and it was true. (e.g., You predicted a patient has cancer, and they do). +* **True Negative (TN):** You predicted negative, and it was true. (e.g., You predicted a patient is healthy, and they are). +* **False Positive (FP):** You predicted positive, but it was false. (Also known as a **Type I Error** or a "False Alarm"). +* **False Negative (FN):** You predicted negative, but it was positive. 
(Also known as a **Type II Error** or a "Miss"). + +## 2. Type I vs. Type II Errors + +The "cost" of these errors depends entirely on your specific problem. + +```mermaid +graph TB + TITLE["$$\text{Type I vs. Type II Errors}$$"] + + %% Ground Truth + TITLE --> TRUTH["$$\text{Actual Condition}$$"] + TRUTH --> POS["$$\text{Positive (Condition Present)}$$"] + TRUTH --> NEG["$$\text{Negative (Condition Absent)}$$"] + + %% Model Decisions + POS --> TP["$$\text{True Positive}$$"] + POS --> FN["$$\text{Type II Error}$$
$$\text{False Negative}$$"] + + NEG --> TN["$$\text{True Negative}$$"] + NEG --> FP["$$\text{Type I Error}$$
$$\text{False Positive}$$"] + + %% Costs + FP --> COST1["$$\text{Cost Depends on Context}$$"] + FN --> COST2["$$\text{Cost Depends on Context}$$"] + + %% Examples + COST1 --> EX1["$$\text{Example: Spam Filter}$$
$$\text{Important Email Blocked}$$"] + COST2 --> EX2["$$\text{Example: Medical Test}$$
$$\text{Disease Missed}$$"] + + %% Emphasis + EX1 -.->|"$$\text{Type I Cost High}$$"| FP + EX2 -.->|"$$\text{Type II Cost High}$$"| FN + +``` + +* **In Cancer Detection:** A **Type II Error (FN)** is much worse because a sick patient goes untreated. +* **In Spam Filtering:** A **Type I Error (FP)** is worse because an important work email is hidden in the trash. + +## 3. Implementation with Scikit-Learn + +```python +from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay +import matplotlib.pyplot as plt + +# Actual values and Model predictions +y_true = [0, 1, 0, 1, 0, 1, 1, 0] +y_pred = [0, 1, 1, 1, 0, 0, 1, 0] + +# 1. Generate the matrix +cm = confusion_matrix(y_true, y_pred) + +# 2. Visualize it +disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Negative', 'Positive']) +disp.plot(cmap=plt.cm.Blues) +plt.show() + +``` + +## 4. Multi-Class Confusion Matrices + +The matrix isn't just for binary problems. If you are classifying "Cat," "Dog," and "Bird," your matrix will be 3x3. The diagonal line from top-left to bottom-right represents correct predictions. Any numbers off that diagonal show you which animals the model is confusing. + +```mermaid +graph TB + TITLE["$$\text{Multi-Class Confusion Matrix (3×3)}$$"] + + %% Axes + TITLE --> ACT["$$\text{Actual Class}$$"] + TITLE --> PRED["$$\text{Predicted Class}$$"] + + ACT --> CAT_A["$$\text{Cat}$$"] + ACT --> DOG_A["$$\text{Dog}$$"] + ACT --> BIRD_A["$$\text{Bird}$$"] + + PRED --> CAT_P["$$\text{Cat}$$"] + PRED --> DOG_P["$$\text{Dog}$$"] + PRED --> BIRD_P["$$\text{Bird}$$"] + + %% Diagonal (Correct Predictions) + CAT_A --> CAT_P["$$\text{Cat → Cat}$$
$$\text{Correct}$$"] + DOG_A --> DOG_P["$$\text{Dog → Dog}$$
$$\text{Correct}$$"] + BIRD_A --> BIRD_P["$$\text{Bird → Bird}$$
$$\text{Correct}$$"] + + %% Off-Diagonal (Confusions) + CAT_A --> DOG_P["$$\text{Cat → Dog}$$
$$\text{Confusion}$$"] + CAT_A --> BIRD_P["$$\text{Cat → Bird}$$
$$\text{Confusion}$$"] + + DOG_A --> CAT_P["$$\text{Dog → Cat}$$
$$\text{Confusion}$$"] + DOG_A --> BIRD_P["$$\text{Dog → Bird}$$
$$\text{Confusion}$$"] + + BIRD_A --> CAT_P["$$\text{Bird → Cat}$$
$$\text{Confusion}$$"] + BIRD_A --> DOG_P["$$\text{Bird → Dog}$$
$$\text{Confusion}$$"] + + %% Emphasis + CAT_P -.->|"$$\text{Diagonal}$$"| GOOD["$$\text{Correct Predictions}$$"] + DOG_P -.->|"$$\text{Diagonal}$$"| GOOD + BIRD_P -.->|"$$\text{Diagonal}$$"| GOOD + + DOG_P -.->|"$$\text{Off-Diagonal}$$"| BAD["$$\text{Model Confusion}$$"] + BIRD_P -.->|"$$\text{Off-Diagonal}$$"| BAD + +``` + +## 5. Summary: What can we calculate from here? + +The Confusion Matrix is the "mother" of all classification metrics. From these four numbers, we derive: + +* **Accuracy:** +* **Precision:** +* **Recall:** +* **F1-Score:** The balance between Precision and Recall. + +## References + +* **StatQuest:** [Confusion Matrices Explained](https://www.youtube.com/watch?v=Kdsp6soqA7o) +* **Scikit-Learn:** [Confusion Matrix API](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) + +--- + +**Now that you can see where the model is making mistakes, let's learn how to turn those mistakes into a single score.** \ No newline at end of file diff --git a/docs/machine-learning/machine-learning-core/model-evaluation/metrics/f1-score.mdx b/docs/machine-learning/machine-learning-core/model-evaluation/metrics/f1-score.mdx index e69de29..7a780a1 100644 --- a/docs/machine-learning/machine-learning-core/model-evaluation/metrics/f1-score.mdx +++ b/docs/machine-learning/machine-learning-core/model-evaluation/metrics/f1-score.mdx @@ -0,0 +1,104 @@ +--- +title: "F1-Score: The Balanced Metric" +sidebar_label: F1-Score +description: "Mastering the harmonic mean of Precision and Recall to evaluate models on imbalanced datasets." +tags: [machine-learning, model-evaluation, metrics, f1-score, classification] +--- + +The **F1-Score** is a single metric that combines [Precision](./precision) and [Recall](./recall) into a single value. It is particularly useful when you have an imbalanced dataset and you need to find an optimal balance between "False Positives" and "False Negatives." + +## 1. The Mathematical Formula + +The F1-Score is the **harmonic mean** of Precision and Recall. Unlike a simple average, the harmonic mean punishes extreme values. If either Precision or Recall is very low, the F1-Score will also be low. + +$$ +F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} +$$ + + +### Why use the Harmonic Mean? + +If we used a standard arithmetic average, a model with 1.0 Precision and 0.0 Recall would have a "decent" score of 0.5. However, such a model is useless. The harmonic mean ensures that if one metric is 0, the total score is 0. + +## 2. When to Use the F1-Score + +F1-Score is the best choice when: + +1. **Imbalanced Classes:** You have a large number of "Negative" samples and few "Positive" ones (e.g., Fraud detection). +2. **Equal Importance:** You care equally about minimizing False Positives (Precision) and False Negatives (Recall). + +## 3. Visualizing the Balance + +Think of the F1-Score as a "balance scale." If you tilt too far toward catching everyone (Recall), your precision drops. If you tilt too far toward being perfectly accurate (Precision), you miss people. The F1-Score is highest when these two are in equilibrium. + +```mermaid +graph TB + SCALE["$$\text{F1-Score}$$
$$\text{Balance Scale}$$"] + + %% Precision Side + SCALE --> P["$$\text{Precision}$$"] + P --> P1["$$\text{Few False Positives}$$"] + P1 --> P2["$$\text{Strict Threshold}$$"] + P2 --> P3["$$\text{Misses True Positives}$$"] + P3 --> P4["$$\text{Low Recall}$$"] + + %% Recall Side + SCALE --> R["$$\text{Recall}$$"] + R --> R1["$$\text{Few False Negatives}$$"] + R1 --> R2["$$\text{Loose Threshold}$$"] + R2 --> R3["$$\text{Many False Positives}$$"] + R3 --> R4["$$\text{Low Precision}$$"] + + %% Balance Point + P4 -.->|"$$\text{Too Strict}$$"| UNBAL["$$\text{Unbalanced Model}$$"] + R4 -.->|"$$\text{Too Loose}$$"| UNBAL + + P --> BAL["$$\text{Equilibrium}$$"] + R --> BAL + + BAL --> F1["$$\text{F1} = 2 \cdot \frac{P \cdot R}{P + R}$$"] + F1 --> OPT["$$\text{Maximum F1-Score}$$"] + +``` + +## 4. Implementation with Scikit-Learn + +```python +from sklearn.metrics import f1_score + +# Actual target values +y_true = [0, 1, 1, 0, 1, 1, 0] + +# Model predictions +y_pred = [0, 1, 0, 0, 1, 1, 1] + +# Calculate F1-Score +score = f1_score(y_true, y_pred) + +print(f"F1-Score: {score:.2f}") +# Output: F1-Score: 0.75 + +``` + +## 5. Summary Table: Which Metric to Trust? + +| Scenario | Best Metric | Why? | +| --- | --- | --- | +| **Balanced Data** | **Accuracy** | Simple and representative. | +| **Spam Filter** | **Precision** | False Positives (real mail in spam) are very bad. | +| **Cancer Screen** | **Recall** | False Negatives (missing a sick patient) are fatal. | +| **Fraud Detection** | **F1-Score** | Need to catch thieves (Recall) without blocking everyone (Precision). | + +## 6. Beyond Binary: Macro vs. Weighted F1 + +If you have more than two classes (Multi-class classification), you'll see these options: + +* **Macro F1:** Calculates F1 for each class and takes the unweighted average. Treats all classes as equal. +* **Weighted F1:** Calculates F1 for each class but weights them by the number of samples in that class. + +## References + +* **Scikit-Learn:** [F1 Score Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html) +* **Towards Data Science:** [The F1 Score Paradox](https://towardsdatascience.com/the-f1-score-2236378a31). + +**The F1-Score gives us a snapshot at a single threshold. But how do we evaluate a model's performance across ALL possible thresholds?** \ No newline at end of file diff --git a/docs/machine-learning/machine-learning-core/model-evaluation/metrics/log-loss.mdx b/docs/machine-learning/machine-learning-core/model-evaluation/metrics/log-loss.mdx index e69de29..03742fa 100644 --- a/docs/machine-learning/machine-learning-core/model-evaluation/metrics/log-loss.mdx +++ b/docs/machine-learning/machine-learning-core/model-evaluation/metrics/log-loss.mdx @@ -0,0 +1,125 @@ +--- +title: "Log Loss (Logarithmic Loss): The Probability Penalty" +sidebar_label: Log Loss +description: "Understanding cross-entropy loss and why it is the gold standard for evaluating probability-based classifiers." +tags: [machine-learning, model-evaluation, metrics, log-loss, classification, probability] +--- + +**Log Loss**, also known as **Cross-Entropy Loss**, is a performance metric that evaluates a classification model based on its **predicted probabilities**. Unlike [Accuracy](./accuracy), which only looks at the final label, Log Loss punishes models that are "confidently wrong." + +:::note +**Prerequisites:** Familiarity with basic classification concepts like **predicted probabilities** and **binary labels** (0 and 1). 
If you're new to these concepts, consider reviewing the [Confusion Matrix](./confusion-matrix) documentation first. +::: + +## 1. The Core Intuition: The Penalty System + +Log Loss measures the "closeness" of a prediction probability to the actual binary label ($0$ or $1$). + +* If the actual label is **1** and the model predicts **0.99**, the Log Loss is very low. +* If the actual label is **1** and the model predicts **0.01**, the Log Loss is extremely high. + +**Crucially:** Log Loss penalizes wrong predictions exponentially. It is better to be "unsure" (0.5) than to be "confidently wrong" (0.01 when the answer is 1). + +## 2. The Mathematical Formula + +For a binary classification problem, the Log Loss is calculated as: + +$$ +\text{Log Loss} = -\frac{1}{N} \sum_{i=1}^{N} [y_i \log(p_i) + (1 - y_i) \log(1 - p_i)] +$$ + +**Where:** + +* **$N$:** Total number of samples. +* **$y_i$:** Actual label (0 or 1). +* **$p_i$:** Predicted probability of the sample belonging to class 1. +* **$\log$:** The natural logarithm (base $e$). + +**Breaking it down:** +* If the actual label $y_i$ is **1**, the formula simplifies to $-\log(p_i)$. The closer $p_i$ is to 1, the lower the loss. +* If the actual label $y_i$ is **0**, the formula simplifies to $-\\log(1 - p_i)$. The closer $p_i$ is to 0, the lower the loss. + + +## 3. Comparison with Accuracy + +Imagine two models predicting a single sample where the true label is **1**. + +| Model | Predicted Probability | Prediction (Threshold 0.5) | Accuracy | Log Loss | +| :--- | :--- | :--- | :--- | :--- | +| **Model A** | **0.95** | Correct | 100% | **Low** (Good) | +| **Model B** | **0.51** | Correct | 100% | **High** (Weak) | + +Even though both models have the same **Accuracy**, Log Loss tells us that **Model A** is superior because it is more certain about the correct answer. + +## 4. Implementation with Scikit-Learn + +To calculate Log Loss, you must use `predict_proba()` to get the raw probabilities. + +```python +from sklearn.metrics import log_loss + +# Actual labels +y_true = [1, 0, 1, 1] + +# Predicted probabilities for the '1' class +y_probs = [0.9, 0.1, 0.8, 0.4] + +# Calculate Log Loss +score = log_loss(y_true, y_probs) + +print(f"Log Loss: {score:.4f}") +# Output: Log Loss: 0.3522 + +``` + +## 5. Pros and Cons + +| Advantages | Disadvantages | +| --- | --- | +| **Probability-Focused:** Captures the nuances of model confidence. | **Non-Intuitive:** A value of "0.21" is harder to explain to a business client than "90% accuracy." | +| **Optimizable:** This is the loss function used to train models like Logistic Regression and Neural Networks. | **Sensitive to Outliers:** A single prediction of 0% probability for a class that turns out to be true will result in a Log Loss of **infinity**. | + +--- + +## 6. Key Takeaway: The "Sureness" Factor + +A "perfect" model has a Log Loss of 0. A baseline model that simply predicts a 50/50 chance for every sample will have a Log Loss of approximately 0.693 ($-\ln(0.5)$). If your model's Log Loss is higher than 0.693, it is performing worse than a random guess! 
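As a quick sanity check, here is a minimal sketch (with hypothetical labels and probabilities) that reproduces the 50/50 baseline of roughly 0.693 and shows how harshly a single confidently wrong prediction is penalized:

```python
from sklearn.metrics import log_loss

# Hypothetical labels: a "coin flip" model that predicts 0.5 for everyone
y_true = [1, 1, 0, 1, 0]
baseline = log_loss(y_true, [0.5] * len(y_true))   # -ln(0.5) is about 0.693

# A single positive sample: unsure vs. confidently wrong
unsure = log_loss([1], [0.50], labels=[0, 1])       # -ln(0.50) is about 0.693
wrong  = log_loss([1], [0.01], labels=[0, 1])       # -ln(0.01) is about 4.605

print(f"Baseline (50/50):  {baseline:.3f}")
print(f"Unsure (p=0.50):   {unsure:.3f}")
print(f"Confidently wrong: {wrong:.3f}")
```

A single 0.01 prediction on a true positive already costs more than six coin-flip guesses combined, and a prediction of exactly 0 would push the loss toward infinity.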
+ +```mermaid +graph TB + TITLE["$$\text{Log Loss: The Sureness Factor}$$"] + + TITLE --> SCALE["$$\text{Lower Log Loss Means Higher Confidence}$$"] + + %% Perfect Model + SCALE --> PERFECT["$$\text{Perfect Model}$$"] + PERFECT --> L0["$$\text{Log Loss} = 0$$"] + L0 --> C0["$$\text{Always Correct with Full Confidence}$$"] + + %% Good Model + SCALE --> GOOD["$$\text{Good Model}$$"] + GOOD --> LG["$$\text{Log Loss less than } 0.693$$"] + LG --> CG["$$\text{Mostly Correct and Well-Calibrated}$$"] + + %% Random Guess Baseline + SCALE --> RAND["$$\text{Random Guess Baseline}$$"] + RAND --> L05["$$-\ln(0.5) \approx 0.693$$"] + L05 --> C05["$$\text{Predicts 50/50 for Every Sample}$$"] + + %% Worse Than Random + SCALE --> WORSE["$$\text{Worse Than Random}$$"] + WORSE --> LW["$$\text{Log Loss greater than } 0.693$$"] + LW --> CW["$$\text{Confident but Frequently Wrong}$$"] + + LW -.->|"$$\text{Danger Zone}$$"| WARN["$$\text{Model is Misleading}$$"] + +``` + +## References + +* **Scikit-Learn:** [Log Loss Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html) +* **Machine Learning Mastery:** [A Gentle Introduction to Cross-Entropy](https://machinelearningmastery.com/cross-entropy-for-machine-learning/) + +--- + +**We have explored almost every way to evaluate a classifier. Now, let's switch gears and look at how we measure errors in numbers and values.** \ No newline at end of file diff --git a/docs/machine-learning/machine-learning-core/model-evaluation/metrics/precision.mdx b/docs/machine-learning/machine-learning-core/model-evaluation/metrics/precision.mdx index e69de29..17ed6aa 100644 --- a/docs/machine-learning/machine-learning-core/model-evaluation/metrics/precision.mdx +++ b/docs/machine-learning/machine-learning-core/model-evaluation/metrics/precision.mdx @@ -0,0 +1,102 @@ +--- +title: "Precision: The Quality Metric for Positive Predictions" +sidebar_label: Precision +description: "Understanding Precision, its mathematical foundation, and why it is vital for minimizing False Positives." +tags: [machine-learning, model-evaluation, metrics, classification, precision] +--- + +**Precision** (also known as Positive Predictive Value) measures the accuracy of the model's positive predictions. It answers the question: *"Of all the times the model predicted 'Positive', how many were actually 'Positive'?"* + +## 1. The Mathematical Formula + +Precision is calculated by taking the number of correctly predicted positive results and dividing it by the total number of positive predictions made by the model. + +$$ +\text{Precision} = \frac{TP}{TP + FP} +$$ + +Where: + +* **TP (True Positives):** Correctly predicted positive samples. +* **FP (False Positives):** The "False Alarms"—cases where the model predicted positive, but it was actually negative. + +## 2. When Precision is the Priority + +You should prioritize Precision when the **cost of a False Positive is high**. In other words, you want to be very sure when you cry "wolf." + +### Real-World Example: Spam Filtering + +* **Positive Class:** An email is Spam. +* **False Positive:** A legitimate email from your boss is marked as "Spam." +* **The Goal:** We want high Precision. It is better to let a few spam emails into the Inbox (low Recall) than to accidentally hide an important work email (low Precision). 
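To connect the spam example back to the formula, here is a minimal sketch that recovers Precision directly from the confusion-matrix counts. It reuses the same hypothetical labels as the Scikit-Learn example further down (1 = Spam):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels (1 = Spam, 0 = Legitimate), matching the example below
y_true = [0, 1, 0, 0, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 1, 0]

# For binary labels, confusion_matrix returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)
print(f"TP={tp}, FP={fp} -> Precision = {precision:.2f}")
# Output: TP=2, FP=1 -> Precision = 0.67
```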
+ +```mermaid +graph LR + MAIL["$$\text{Incoming Emails}$$"] + + MAIL --> LEGIT["$$\text{Legitimate Emails}$$"] + MAIL --> SPAM["$$\text{Spam Emails}$$"] + + %% Classifier Decisions + LEGIT --> FP["$$\text{❌ False Positive}$$
$$\text{Legit → Spam}$$"] + SPAM --> TP["$$\text{✅ True Positive}$$
$$\text{Spam → Spam}$$"] + + %% Inbox / Spam Folder + FP --> SPAMBOX["$$\text{Spam Folder}$$"] + TP --> SPAMBOX + + LEGIT --> INBOX["$$\text{Inbox}$$"] + SPAM --> FN["$$\text{False Negative}$$
$$\text{Spam → Inbox}$$"] + FN --> INBOX + + %% Precision Highlight + TP --> PREC["$$\text{Precision} = \frac{TP}{TP + FP}$$"] + FP --> PREC + + FP -.->|"$$\text{High Cost Error}$$"| ALERT["$$\text{Important Email Missed!}$$"] + +``` + +## 3. The Precision-Recall Trade-off + +There is usually a tug-of-war between Precision and Recall. + +* If you make your model extremely "picky" (only predicting positive when it is 99.9% certain), your **Precision will increase**, but you will miss many actual positive cases (**Recall will decrease**). +* Conversely, if your model is very "sensitive" and flags everything that looks remotely suspicious, your **Recall will increase**, but you will get many false alarms (**Precision will decrease**). + +## 4. Implementation with Scikit-Learn + +```python +from sklearn.metrics import precision_score + +# Actual target values (e.g., 1 = Spam, 0 = Inbox) +y_true = [0, 1, 0, 0, 1, 1, 0] + +# Model predictions +y_pred = [0, 0, 1, 0, 1, 1, 0] + +# Calculate Precision +# The 'pos_label' parameter specifies which class is considered "Positive" +score = precision_score(y_true, y_pred) + +print(f"Precision Score: {score:.2f}") +# Output: Precision Score: 0.67 +# (Out of 3 'Spam' predictions, only 2 were actually Spam) + +``` + +## 5. Pros and Cons + +| Advantages | Disadvantages | +| --- | --- | +| **Minimizes False Alarms:** Crucial for user trust (e.g., avoiding wrong medical diagnoses). | **Ignores Missed Cases:** Doesn't care about the positive cases the model missed completely. | +| **High Specificity:** Focuses purely on the quality of the positive class predictions. | **Can be Manipulated:** A model can have 100% precision by only making one single, very safe prediction. | + +## References + +* **Scikit-Learn Documentation:** [Precision-Recall-F1](https://scikit-learn.org/stable/modules/model_evaluation.html#precision-recall-f-measure-metrics) +* **StatQuest:** [Precision and Recall Clearly Explained](https://www.youtube.com/watch?v=Kdsp6soqA7o) + +--- + +**Precision tells us how "reliable" our positive predictions are. But what about the cases we missed entirely?** \ No newline at end of file diff --git a/docs/machine-learning/machine-learning-core/model-evaluation/metrics/recall.mdx b/docs/machine-learning/machine-learning-core/model-evaluation/metrics/recall.mdx index e69de29..be1b0e1 100644 --- a/docs/machine-learning/machine-learning-core/model-evaluation/metrics/recall.mdx +++ b/docs/machine-learning/machine-learning-core/model-evaluation/metrics/recall.mdx @@ -0,0 +1,110 @@ +--- +title: "Recall: The Sensitivity Metric" +sidebar_label: Recall +description: "Understanding Recall, its mathematical definition, and why it is critical for minimizing False Negatives." +tags: [machine-learning, model-evaluation, metrics, classification, recall] +--- + +**Recall**, also known as **Sensitivity** or **True Positive Rate (TPR)**, measures the model's ability to find all the positive samples in a dataset. It answers the question: *"Of all the actual positive cases that exist, how many did the model correctly identify?"* + +## 1. The Mathematical Formula + +Recall is calculated by dividing the number of correctly predicted positive results by the total number of actual positives (those we caught + those we missed). + +$$ +\text{Recall} = \frac{TP}{TP + FN} +$$ + +Where: + +* **TP (True Positives):** Correctly predicted positive samples. +* **FN (False Negatives):** The "Misses"—cases that were actually positive, but the model incorrectly labeled them as negative. 
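Before reaching for the library call, here is a minimal sketch that derives Recall from the raw counts. It reuses the same hypothetical labels as the Scikit-Learn example further down (1 = Sick):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels (1 = Sick, 0 = Healthy), matching the example below
y_true = [1, 1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0]

# For binary labels, confusion_matrix returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

recall = tp / (tp + fn)
print(f"TP={tp}, FN={fn} -> Recall = {recall:.2f}")
# Output: TP=3, FN=2 -> Recall = 0.60
```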
+ +## 2. When Recall is the Top Priority + +You should prioritize Recall when the **cost of a False Negative is extremely high**. In these scenarios, it is better to have a few "false alarms" than to miss a single positive case. + +### Real-World Example: Cancer Detection + +Cancer Detection + +* **Positive Class:** Patient has cancer. +* **False Negative:** The patient has cancer, but the model says they are "Healthy." +* **The Consequence:** The patient does not receive treatment, which could be fatal. +* **The Goal:** We want **100% Recall**. We would rather tell a healthy person they need more tests (False Positive) than tell a sick person they are fine (False Negative). + +```mermaid +graph LR + POP["$$\text{Patients}$$"] + + POP --> C["$$\text{Has Cancer (Positive)}$$"] + POP --> H["$$\text{Healthy (Negative)}$$"] + + %% Model Predictions + C --> FN["$$\text{❌ False Negative}$$
$$\text{Cancer → Healthy}$$"] + C --> TP["$$\text{✅ True Positive}$$
$$\text{Cancer → Cancer}$$"] + + H --> FP["$$\text{False Positive}$$
$$\text{Healthy → Cancer}$$"] + H --> TN["$$\text{True Negative}$$
$$\text{Healthy → Healthy}$$"] + + %% Consequences + FN --> RISK["$$\text{🚨 No Treatment Given}$$"] + RISK --> FATAL["$$\text{Potentially Fatal Outcome}$$"] + + FP --> SAFE["$$\text{Extra Tests / Monitoring}$$"] + SAFE --> OK["$$\text{Acceptable Cost}$$"] + + %% Recall Highlight + TP --> REC["$$\text{Recall} = \frac{TP}{TP + FN}$$"] + FN --> REC + + FN -.->|"$$\text{Most Dangerous Error}$$"| FATAL + +``` + +## 3. The Precision-Recall Inverse Relationship + +As you saw in the [Precision module](./precision), there is an inherent trade-off. + +* **To increase Recall:** You can make your model "less strict." If a bank flags *every* transaction as potentially fraudulent, it will have 100% Recall (it caught every thief), but its Precision will be terrible (it blocked every honest customer too). +* **To increase Precision:** You make the model "more strict," which inevitably leads to missing some positive cases, thereby lowering Recall. + +## 4. Implementation with Scikit-Learn + +```python +from sklearn.metrics import recall_score + +# Actual target values (1 = Sick, 0 = Healthy) +y_true = [1, 1, 1, 0, 1, 0, 1] + +# Model predictions +y_pred = [1, 0, 1, 0, 1, 0, 0] + +# Calculate Recall +score = recall_score(y_true, y_pred) + +print(f"Recall Score: {score:.2f}") +# Output: Recall Score: 0.60 +# (We found 3 out of 5 sick people; we missed 2) + +``` + +## 5. Summary Table: Precision vs. Recall + +| Metric | Focus | Goal | Failure Mode | +| --- | --- | --- | --- | +| **Precision** | Quality | "Don't cry wolf." | High Precision misses many real cases. | +| **Recall** | Quantity | "Leave no one behind." | High Recall creates many false alarms. | + +## 6. How to Balance Both? + +If you need a single number that accounts for both the "False Alarms" of Precision and the "Misses" of Recall, you need the **F1-Score**. + +## References + +* **Google Machine Learning Crash Course:** [Recall Metric](https://developers.google.com/machine-learning/crash-course/classification/precision-and-recall) +* **Wikipedia:** [Sensitivity and Specificity](https://en.wikipedia.org/wiki/Sensitivity_and_specificity) + +--- + +**Is your model struggling to choose between Precision and Recall? Let's look at the "middle ground" metric.** \ No newline at end of file diff --git a/docs/machine-learning/machine-learning-core/model-evaluation/metrics/roc-auc.mdx b/docs/machine-learning/machine-learning-core/model-evaluation/metrics/roc-auc.mdx index e69de29..007b1e7 100644 --- a/docs/machine-learning/machine-learning-core/model-evaluation/metrics/roc-auc.mdx +++ b/docs/machine-learning/machine-learning-core/model-evaluation/metrics/roc-auc.mdx @@ -0,0 +1,123 @@ +--- +title: ROC Curve and AUC +sidebar_label: ROC & AUC +description: "Evaluating classifier performance across all thresholds using the Receiver Operating Characteristic and Area Under the Curve." +tags: [machine-learning, model-evaluation, metrics, roc, auc, classification] +--- + +In classification tasks, especially binary classification, it's crucial to evaluate how well a model distinguishes between the positive and negative classes. The **ROC Curve (Receiver Operating Characteristic Curve)** and **AUC (Area Under the Curve)** are powerful tools for this purpose. + +> While metrics like **Accuracy** and **F1-Score** evaluate a model based on a single "cut-off" or threshold (usually 0.5), **ROC** and **AUC** help us evaluate how well the model separates the two classes across **all possible thresholds**. 
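To make that idea concrete, here is a minimal sketch (with made-up probability scores) showing that every threshold produces its own confusion matrix; ROC and AUC summarize the model across all of these snapshots at once:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical probability scores for the positive class
y_true  = [0, 0, 1, 1, 0, 1]
y_probs = [0.2, 0.4, 0.45, 0.6, 0.7, 0.9]

# Each threshold yields a different confusion matrix (a different "snapshot")
for threshold in (0.3, 0.5, 0.7):
    y_pred = [int(p >= threshold) for p in y_probs]
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(f"threshold={threshold}: TP={tp} FP={fp} FN={fn} TN={tn}")

# Output:
# threshold=0.3: TP=3 FP=2 FN=0 TN=1
# threshold=0.5: TP=2 FP=1 FN=1 TN=2
# threshold=0.7: TP=1 FP=1 FN=2 TN=2
```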
+ +:::note +**Prerequisites:** Familiarity with basic classification metrics like **True Positives (TP)**, **False Positives (FP)**, **True Negatives (TN)**, and **False Negatives (FN)**. If you're new to these concepts, consider reviewing the [Confusion Matrix](./confusion-matrix) documentation first. +::: + +## 1. Defining the Terms + +To understand the ROC curve, we need to look at two specific rates: + +1. **True Positive Rate (TPR) / Recall:** $TPR = \frac{TP}{TP + FN}$ + (How many of the actual positives did we catch?) +2. **False Positive Rate (FPR):** $FPR = \frac{FP}{FP + TN}$ + (How many of the actual negatives did we incorrectly flag as positive?) + +## 2. The ROC Curve (Receiver Operating Characteristic) + +The ROC curve is a plot of **TPR (Y-axis)** against **FPR (X-axis)**. + +* **Each point** on the curve represents a TPR/FPR pair corresponding to a particular decision threshold. +* **A "Perfect" Classifier** would have a curve that goes straight up the Y-axis and then across (covering the top-left corner). +* **A Random Classifier** (like flipping a coin) is represented by a 45-degree diagonal line. + +## 3. AUC (Area Under the Curve) + +The **AUC** is the literal area under the ROC curve. It provides an aggregate measure of performance across all possible classification thresholds. + +* **AUC = 1.0:** A perfect model (100% correct predictions). +* **AUC = 0.5:** A useless model (no better than random guessing). +* **AUC = 0.0:** A model that is perfectly wrong (it predicts the exact opposite of the truth). + +**Interpretation:** AUC can be thought of as the probability that the model will rank a randomly chosen positive instance higher than a randomly chosen negative one. + +## 4. Why use ROC-AUC? + +1. **Scale-Invariant:** It measures how well predictions are ranked, rather than their absolute values. +2. **Threshold-Invariant:** It evaluates the model's performance without having to choose a specific threshold. This is great if you haven't decided yet how "picky" the model should be. +3. **Balanced Evaluation:** It is highly effective for comparing different models against each other on the same dataset. + +## 5. Implementation with Scikit-Learn + +To calculate AUC, you usually need the **prediction probabilities** rather than the hard class labels. + +```python +from sklearn.metrics import roc_curve, roc_auc_score +import matplotlib.pyplot as plt + +# 1. Get probability scores from the model +# (Assume model is already trained) +y_probs = model.predict_proba(X_test)[:, 1] + +# 2. Calculate AUC +auc_value = roc_auc_score(y_test, y_probs) +print(f"AUC Score: {auc_value:.4f}") + +# 3. Generate the curve points +fpr, tpr, thresholds = roc_curve(y_test, y_probs) + +# 4. Plotting +plt.plot(fpr, tpr, label=f'AUC = {auc_value:.2f}') +plt.plot([0, 1], [0, 1], 'k--') # Diagonal random line +plt.xlabel('False Positive Rate') +plt.ylabel('True Positive Rate') +plt.title('ROC Curve') +plt.legend() +plt.show() + +``` + +## 6. The Logic of Overlap + +The higher the AUC, the less the "Positive" and "Negative" probability distributions overlap. When the overlap is zero, the model can perfectly distinguish between the two. + +```mermaid +graph TB + TITLE["$$\text{ROC–AUC: Logic of Overlap}$$"] + + %% Distributions + TITLE --> NEG["$$\text{Negative Class}$$
$$P(\hat{y}\mid y=0)$$"] + TITLE --> POS["$$\text{Positive Class}$$
$$P(\hat{y}\mid y=1)$$"] + + %% High Overlap + NEG --> HO["$$\text{High Overlap}$$"] + POS --> HO + HO --> AUC1["$$\text{AUC} \approx 0.5$$"] + AUC1 --> R1["$$\text{Random Guessing}$$"] + + %% Medium Overlap + NEG --> MO["$$\text{Partial Overlap}$$"] + POS --> MO + MO --> AUC2["$$0.7 \le \text{AUC} \le 0.85$$"] + AUC2 --> R2["$$\text{Useful but Imperfect Separation}$$"] + + %% Zero Overlap + NEG --> ZO["$$\text{Zero Overlap}$$"] + POS --> ZO + ZO --> AUC3["$$\text{AUC} = 1.0$$"] + AUC3 --> R3["$$\text{Perfect Classifier}$$"] + + %% Threshold Intuition + R1 -.->|"$$\text{Any Threshold Fails}$$"| TH["$$\text{Decision Threshold}$$"] + R2 -.->|"$$\text{Some Errors}$$"| TH + R3 -.->|"$$\text{Perfect Split}$$"| TH + +``` + +## References + +* **Google Machine Learning Crash Course:** [ROC Curve and AUC](https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc) +* **StatQuest:** [ROC and AUC Clearly Explained](https://www.youtube.com/watch?v=4jRBRDbJemM) + +--- + +**We have mastered classification metrics. But how do we evaluate a model that predicts continuous numbers, like house prices or stock trends?** \ No newline at end of file diff --git a/docs/machine-learning/machine-learning-core/model-evaluation/why-evaluation-matters.mdx b/docs/machine-learning/machine-learning-core/model-evaluation/why-evaluation-matters.mdx index e69de29..10f3e2f 100644 --- a/docs/machine-learning/machine-learning-core/model-evaluation/why-evaluation-matters.mdx +++ b/docs/machine-learning/machine-learning-core/model-evaluation/why-evaluation-matters.mdx @@ -0,0 +1,78 @@ +--- +title: Why Model Evaluation Matters +sidebar_label: Why Evaluation Matters +description: "Understanding the difference between training performance and real-world reliability." +tags: [machine-learning, model-evaluation, metrics, generalization] +--- + +Building a machine learning model is only half the battle. The most dangerous mistake a Data Scientist can make is assuming that a model with **99% accuracy** on the training data will perform just as well in the real world. + +**Model Evaluation** is the process of using different metrics and validation strategies to understand how well your model generalizes to data it has never seen before. + +## 1. The Trap of "Memorization" (Overfitting) + +If you give a student the exact same questions from their textbook on their final exam, they might get a 100% just by memorizing the answers. However, if you give them a new problem and they fail, they haven't actually learned the subject. + +In Machine Learning, this is called **Overfitting**. + +* **Training Error:** How well the model performs on the data it studied. +* **Generalization Error:** How well the model performs on new, unseen data. + +**The Goal:** We want to minimize the Generalization Error, not just the Training Error. + +## 2. The Bias-Variance Tradeoff + +Every model's error can be broken down into two main components: + +### Bias (Underfitting) +The error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs. +* *Analogy:* Trying to fit a straight line through a curved set of points. + +### Variance (Overfitting) + +The error from sensitivity to small fluctuations in the training set. High variance can cause an algorithm to model the random noise in the training data. +* *Analogy:* Following every single data point so closely that the model becomes "wiggly." 
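To see this trade-off in numbers, here is a minimal, self-contained sketch on synthetic data (all values are illustrative) comparing an underfit straight line with an overfit high-degree polynomial. The underfit model tends to show high error on both splits, while the overfit model tends to score well on the training set but noticeably worse on held-out data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic curved data with noise (illustrative only)
rng = np.random.RandomState(42)
X = np.sort(rng.uniform(0, 1, 60)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=60)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# degree=1 -> high bias (underfits), degree=15 -> high variance (overfits)
for degree in (1, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree:>2}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
```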
+ +```mermaid +graph LR + A[Low Bias / High Variance] --- B((Optimal Balance)) + B --- C[High Bias / Low Variance] + + style B fill:#e8f5e9,stroke:#2e7d32,stroke-width:4px,color:#333 + +``` + +## 3. The Evaluation Workflow + +To evaluate a model properly, we never use the same data for training and testing. We typically split our dataset into three parts: + +| Split | Purpose | +| --- | --- | +| **Training Set** | Used to teach the model (The "Textbook"). | +| **Validation Set** | Used to tune hyperparameters and pick the best model version. | +| **Test Set** | The "Final Exam." Used only once at the very end to see real-world performance. | + +## 4. Why Accuracy Isn't Enough + +Imagine a model designed to detect a very rare disease that only affects 1 in 1,000 people. +If the model simply predicts **"Healthy"** for everyone, it will be **99.9% accurate**. + +However, it is a **useless model** because it failed to find the 1 person who was actually sick. This is why we need more advanced metrics like: + +* **Precision & Recall** (For Classification) +* **Mean Absolute Error** (For Regression) +* **F1-Score** (For Imbalanced Data) + +## 5. The Evaluation Roadmap + +In the upcoming chapters, we will dive deep into specific evaluation tools: + +1. **Confusion Matrices:** Seeing exactly where your classifier is getting confused. +2. **ROC & AUC:** Understanding the trade-off between sensitivity and specificity. +3. **Cross-Validation:** Making the most of limited data. +4. **Regression Metrics:** Measuring the "distance" between reality and prediction. + +## References + +* **Google Machine Learning Crash Course:** [Generalization](https://developers.google.com/machine-learning/crash-course/generalization/video-lecture) +* **StatQuest:** [Bias and Variance](https://www.youtube.com/watch?v=EuBBz3bI-aA) \ No newline at end of file diff --git a/static/img/tutorials/ml/cancer-detection.jpg b/static/img/tutorials/ml/cancer-detection.jpg new file mode 100644 index 0000000..6f4cba3 Binary files /dev/null and b/static/img/tutorials/ml/cancer-detection.jpg differ