---
title: "Dimensionality Reduction: PCA & LDA"
sidebar_label: Dimensionality Reduction
description: "Reducing feature complexity while preserving information: Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA)."
tags: [data-science, dimensionality-reduction, pca, lda, feature-selection, unsupervised-learning]
---

In Machine Learning, more data isn't always better. The **Curse of Dimensionality** refers to the phenomenon where, as the number of features (dimensions) increases, the volume of the space increases so fast that the available data becomes sparse. This leads to overfitting and massive computational costs.

Dimensionality Reduction aims to project high-dimensional data into a lower-dimensional space while retaining as much meaningful information as possible.

## 1. Why Reduce Dimensions?

1. **Visualization:** We cannot visualize data in 10 dimensions. Reducing it to 2D or 3D allows us to see clusters and patterns.
2. **Performance:** Fewer features mean faster training and lower memory usage.
3. **Noise Reduction:** By removing "redundant" features, we help the model focus on the most important signals.
4. **Multicollinearity:** It helps handle features that are highly correlated with each other.

## 2. Principal Component Analysis (PCA)

PCA is an **unsupervised** technique that finds the directions (Principal Components) where the variance of the data is maximized.

* **Principal Component 1 (PC1):** The direction that captures the most spread in the data.
* **Principal Component 2 (PC2):** The direction perpendicular to PC1 that captures the next most spread.

**Key Concept: Explained Variance**
In PCA, we often look at the "Scree Plot" to decide how many dimensions to keep. We typically aim to keep enough components to explain **95%** of the total variance.

$$
\mathrm{Var}(PC_1) > \mathrm{Var}(PC_2) > \cdots > \mathrm{Var}(PC_n)
$$
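
As a minimal sketch of how this choice is made in practice (assuming a standardized feature matrix `X_scaled`, as in the implementation section below), scikit-learn can either report the per-component ratios for a scree plot or select the number of components for a target variance directly:

```python
import numpy as np
from sklearn.decomposition import PCA

# Fit PCA with all components to inspect the variance profile (scree plot data)
pca_full = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
n_components_95 = np.argmax(cumulative >= 0.95) + 1
print(f"Components needed for 95% variance: {n_components_95}")

# Equivalent shortcut: pass the target variance ratio directly
pca_95 = PCA(n_components=0.95)
X_reduced = pca_95.fit_transform(X_scaled)
```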

## 3. Linear Discriminant Analysis (LDA)

While PCA cares about *variance*, LDA is a **supervised** technique that cares about **separability**.

* **Goal:** Project data onto a new axis that maximizes the distance between the means of different classes and minimizes the variance within each class.
* **Usage:** Often used as a preprocessing step for classification tasks.

## 4. PCA vs. LDA: A Comparison

| Feature | PCA | LDA |
| :--- | :--- | :--- |
| **Type** | Unsupervised (Ignores labels) | Supervised (Uses labels) |
| **Objective** | Maximize variance | Maximize class separability |
| **Application** | Feature compression, visualization | Preprocessing for classification |
| **Limit** | Max components = Total features | Max components = Number of classes - 1 |

```mermaid
graph LR
subgraph Goal_PCA [PCA Objective]
V[Max Variance]
end
subgraph Goal_LDA [LDA Objective]
S[Max Class Separation]
end
Data[High Dimensional Data] --> PCA
Data --> LDA
PCA --> Goal_PCA
LDA --> Goal_LDA

```

## 5. Implementation with Scikit-Learn

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

# Example data for illustration: the Iris dataset, standardized first
X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# 1. PCA: Reducing to 2 dimensions (unsupervised, ignores y)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print(f"Explained Variance: {pca.explained_variance_ratio_}")

# 2. LDA: Reducing based on target 'y' (supervised, max components = n_classes - 1)
lda = LDA(n_components=1)
X_lda = lda.fit_transform(X_scaled, y)
```

:::warning Critical Note
Always perform **Feature Scaling** (Standardization) before applying PCA. Because PCA maximizes variance, a feature with a large scale (like 'Salary') will dominate the components even if it isn't the most important.
:::

## 6. Other Notable Techniques

* **t-SNE (t-Distributed Stochastic Neighbor Embedding):** Excellent for 2D/3D visualization of non-linear clusters.
* **UMAP (Uniform Manifold Approximation and Projection):** Faster and often preserves more global structure than t-SNE.
* **Autoencoders:** A type of Neural Network used to learn "bottleneck" representations of data.
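
To make the t-SNE bullet concrete, here is a minimal sketch using scikit-learn's `sklearn.manifold.TSNE`, assuming the same standardized matrix `X_scaled` from the implementation section; note that t-SNE is a visualization tool, not a general-purpose preprocessing step:

```python
from sklearn.manifold import TSNE

# Project to 2D purely for visualization; perplexity controls the neighborhood size
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_embedded = tsne.fit_transform(X_scaled)
print(X_embedded.shape)  # (n_samples, 2)
```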

## References for More Details

* **[StatQuest - PCA Clearly Explained](https://www.youtube.com/watch?v=FgakZw6K1QQ):** For visual learners who want to understand the intuition behind the math.

* **[Scikit-Learn - Decomposition Module](https://scikit-learn.org/stable/modules/decomposition.html):** Technical documentation on PCA, Factor Analysis, and Dictionary Learning.

---

**You have now completed the Data Engineering and Preprocessing journey! You have learned how to collect data, clean it, engineer features, and compress them. You are finally ready to build and train your first Machine Learning model.**
---
title: The Art of Feature Engineering
sidebar_label: Feature Engineering
description: "A comprehensive guide to creating, transforming, and selecting features to maximize Machine Learning model performance."
tags: [feature-engineering, data-science, preprocessing, python, pandas]
---

:::note
"Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." — **Andrew Ng**
:::

Feature engineering is the process of using domain knowledge to extract new variables from raw data that help machine learning algorithms learn faster and predict more accurately.

## 1. Transforming Numerical Features

Numerical data often needs to be reshaped to satisfy the mathematical assumptions of algorithms like Linear Regression or Neural Networks.

### A. Scaling (Normalization & Standardization)
Most models are sensitive to the magnitude of numbers. If one feature is "Salary" ($50,000$) and another is "Age" ($25$), the model might think Salary is $2,000$ times more important simply because the numbers are larger.

* **Standardization (Z-score):** Centers data at $\mu = 0$ with $\sigma = 1$.
* **Normalization (Min-Max):** Rescales data to a fixed range, usually $[0, 1]$.

### B. Binning (Discretization)
Sometimes the exact value isn't as important as the "group" it belongs to.
* **Example:** Converting "Age" into "Child," "Teen," "Adult," and "Senior."
* **Why?** It can help handle outliers and capture non-linear relationships.
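
As an illustrative sketch (the column name `age` and the bin edges are assumptions, not a prescription), pandas' `pd.cut` implements exactly this kind of discretization:

```python
import pandas as pd

df = pd.DataFrame({"age": [4, 15, 36, 70, 52]})

# Bin a continuous age into ordered groups; the edges are illustrative
df["age_group"] = pd.cut(
    df["age"],
    bins=[0, 12, 19, 64, 120],
    labels=["Child", "Teen", "Adult", "Senior"],
)
print(df)
```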

## 2. Encoding Categorical Features

Machine Learning models are mathematical equations; they cannot multiply a weight by "London" or "Paris." We must convert text into numbers.

### A. One-Hot Encoding
Creates a new binary column ($0$ or $1$) for every unique category.
* **Best for:** Nominal data (no inherent order, like "Color" or "City").

### B. Ordinal Encoding
Assigns an integer to each category based on rank.
* **Best for:** Ordinal data (where order matters, like "Low," "Medium," "High").
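
A minimal sketch of both encoders with pandas and scikit-learn (the column names and category order are illustrative assumptions):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    "city": ["London", "Paris", "London"],
    "priority": ["Low", "High", "Medium"],
})

# One-Hot Encoding for nominal data: one binary column per category
df = pd.get_dummies(df, columns=["city"])

# Ordinal Encoding for ranked data: the category order must be stated explicitly
encoder = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
df["priority"] = encoder.fit_transform(df[["priority"]]).ravel()
print(df)
```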

## 3. Creating New Features (Feature Construction)

This is where domain expertise shines. You combine existing columns to create a more powerful "signal."

* **Interaction Features:** If you have `Width` and `Length`, creating `Area = Width * Length` might be more predictive for housing prices.
* **Ratios:** In finance, `Debt-to-Income Ratio` is often more useful than having `Debt` and `Income` as separate features.
* **Polynomial Features:** Creating $x^2$ or $x^3$ to capture curved relationships in the data.

```mermaid
graph LR
A[Feature A: Price] --> C{Logic}
B[Feature B: SqFt] --> C
C --> New[New Feature: Price_per_SqFt]
style New fill:#f3e5f5,stroke:#7b1fa2,color:#333

```
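
A brief sketch of the constructions above, echoing the bullets (column names are illustrative); `PolynomialFeatures` is scikit-learn's helper for the polynomial case:

```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

df = pd.DataFrame({
    "width": [10, 12], "length": [20, 15],
    "debt": [500, 900], "income": [4000, 3000],
})

# Interaction feature: combine two raw columns into a stronger signal
df["area"] = df["width"] * df["length"]

# Ratio feature: often more informative than its two parts
df["debt_to_income"] = df["debt"] / df["income"]

# Polynomial features: x and x^2 to capture curved relationships
poly = PolynomialFeatures(degree=2, include_bias=False)
width_poly = poly.fit_transform(df[["width"]])
```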

## 4. Handling DateTime Features

Raw timestamps (e.g., `2023-10-27 14:30:00`) are not directly usable by most models. We must extract the cyclical and categorical patterns hidden inside them:

* **Time of Day:** Morning, Afternoon, Evening, Night.
* **Day of Week:** Is it a weekend? (Useful for retail/traffic prediction).
* **Seasonality:** Month or Quarter (Useful for sales forecasting).
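
A minimal pandas sketch of this extraction (the column name `timestamp` is an assumption):

```python
import pandas as pd

df = pd.DataFrame({"timestamp": ["2023-10-27 14:30:00", "2023-10-29 08:05:00"]})
df["timestamp"] = pd.to_datetime(df["timestamp"])

# Pull out the patterns a model can actually learn from
df["hour"] = df["timestamp"].dt.hour
df["day_of_week"] = df["timestamp"].dt.dayofweek      # Monday = 0
df["is_weekend"] = df["day_of_week"].isin([5, 6])
df["month"] = df["timestamp"].dt.month
```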

## 5. Text Feature Engineering (NLP Basics)

To turn "Natural Language" into features, we use techniques like:

1. **Bag of Words (BoW):** Counting the frequency of each word.
2. **TF-IDF:** Weighting words by how unique they are to a specific document.
3. **Word Embeddings:** Converting words into dense vectors that capture meaning (e.g., Word2Vec).
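
To make the first two techniques concrete, here is a hedged scikit-learn sketch (the toy corpus is made up purely for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["the cat sat on the mat", "the dog chased the cat"]

# Bag of Words: raw term counts per document
bow = CountVectorizer().fit_transform(corpus)

# TF-IDF: counts reweighted by how distinctive each word is across documents
tfidf = TfidfVectorizer().fit_transform(corpus)
print(tfidf.shape)  # (n_documents, n_unique_terms)
```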

## 6. Feature Selection: "Less is More"

Having too many features leads to the **Curse of Dimensionality**, causing the model to overfit on noise.

* **Filter Methods:** Using statistical tests (like Correlation) to drop irrelevant features.
* **Wrapper Methods:** Training the model on different subsets of features to find the best combo (e.g., Recursive Feature Elimination).
* **Embedded Methods:** Models that perform feature selection during training (e.g., LASSO Regression uses regularization to zero out useless weights).
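
As one hedged sketch of an embedded method, scikit-learn's `Lasso` zeroes out uninformative weights; the synthetic dataset from `make_regression` below is used purely for illustration:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data: 20 features, only 5 of which are informative
X, y = make_regression(n_samples=200, n_features=20, n_informative=5, random_state=0)

# L1 regularization drives the weights of useless features to exactly zero
lasso = Lasso(alpha=1.0).fit(X, y)
kept = np.flatnonzero(lasso.coef_)
print(f"Features kept: {kept}")
```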

## 7. The Golden Rules of Feature Engineering

1. **Don't Leak Information:** Never use the `Target` variable to create a feature (this is called Data Leakage).
2. **Think Cyclically:** For time or angles, use circular transforms ($\sin$ and $\cos$) so the model knows that hour 23 is close to hour 0 (see the sketch after this list).
3. **Visualize First:** Use scatter plots to see if a feature actually correlates with your target before spending hours engineering it.
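
A minimal sketch of the cyclical transform from rule 2 (the 24-hour period is the assumption here; use 7 for day of week, 12 for month, and so on):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"hour": [0, 6, 12, 23]})

# Map the hour onto a circle so hour 23 and hour 0 end up close together
df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)
```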

## References for More Details

* **[Feature Engineering for Machine Learning (Alice Zheng)](https://www.oreilly.com/library/view/feature-engineering-for/9781491953235/):** Deep mathematical intuition.

* **[Scikit-Learn Preprocessing Module](https://scikit-learn.org/stable/modules/preprocessing.html):** Practical code implementation for scaling and encoding.

---

**Now that your features are engineered and ready, we need to ensure the data is mathematically balanced so no single feature dominates the learning process.**
---
title: "Feature Scaling: Normalization & Standardization"
sidebar_label: Feature Scaling
description: "Mastering the techniques used to harmonize feature scales, ensuring faster convergence and better model accuracy."
tags: [data-cleaning, preprocessing, scaling, normalization, standardization, machine-learning]
---

Imagine you are training a model to predict house prices. You have two features:
1. **Number of Bedrooms:** Range 1–5
2. **Square Footage:** Range 500–5000

Because 500 is much larger than 5, a model might "think" square footage is 100 times more important than bedrooms. **Feature Scaling** levels the playing field so that the model treats all features fairly based on their information, not their magnitude.

## 1. Why do we scale?

Scaling is mandatory for specific types of algorithms:
* **Distance-Based Algorithms:** KNN, K-Means, and SVM rely on Euclidean distance. Larger scales distort these distances.
* **Gradient Descent-Based Algorithms:** Neural Networks and Logistic Regression converge (find the answer) much faster when the "loss landscape" is spherical rather than elongated.
* **Principal Component Analysis (PCA):** Features with higher variance will dominate the principal components.

## 2. Standardization (Z-Score Normalization)

Standardization transforms data so that it has a **mean of 0** and a **standard deviation of 1**.

**Formula:**

$$
z = \frac{x - \mu}{\sigma}
$$

* **$\mu$:** Mean of the feature.
* **$\sigma$:** Standard deviation of the feature.

**When to use:** Use this when your data follows a **Gaussian (Normal) Distribution**. It is less sensitive to outliers than Min-Max scaling and is the default choice for most ML algorithms (SVM, Linear Regression).

## 3. Normalization (Min-Max Scaling)

Normalization rescales the data into a fixed range, usually **[0, 1]**.

**Formula:**

$$
x_{norm} = \frac{x - x_{min}}{x_{max} - x_{min}}
$$

**When to use:** Use this when you do **not** know the distribution of your data or when you know there are no significant outliers. It is highly used in **Image Processing** (scaling pixel values from 0–255 to 0–1) and **Neural Networks**.

:::warning
Min-Max scaling is very sensitive to outliers. A single outlier at 1,000,000 will "squash" all your normal data points into a tiny range near 0.
:::

## 4. Robust Scaling

If your dataset contains many outliers that you cannot remove, use the **Robust Scaler**. Instead of using the mean and standard deviation, it uses the **Median** and the **Interquartile Range (IQR)**.

**Formula:**

$$
x_{robust} = \frac{x - \text{median}}{Q_3 - Q_1}
$$


## 5. Comparison Table

| Method | Range | Distribution | Outlier Sensitivity |
| :--- | :--- | :--- | :--- |
| **Standardization** | $\approx$ [-3, 3] | Becomes $\mu=0, \sigma=1$ | Moderate |
| **Normalization** | [0, 1] or [-1, 1] | Squashed into range | **High** |
| **Robust Scaling** | Varies | Median centered at 0 | **Very Low** |

## 6. Implementation with Scikit-Learn

```python
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Toy data: two features on very different scales
data = [[100, 0.001], [8, 0.05], [50, 0.005], [88, 0.07]]

# 1. Standardization: each feature gets mean 0 and standard deviation 1
std_scaler = StandardScaler()
std_data = std_scaler.fit_transform(data)

# 2. Normalization: each feature is rescaled to the [0, 1] range
min_max = MinMaxScaler()
norm_data = min_max.fit_transform(data)
```

## 7. The Golden Rule: Fit on Train, Transform on Test

One of the most common mistakes in Data Engineering is "Data Leakage." When scaling, you must:

1. **Fit** the scaler only on the **Training Set**.
2. **Transform** the **Test Set** using the parameters (e.g., $\mu$ and $\sigma$) learned from the Training Set.

```mermaid
graph TD
Data[Full Dataset] --> Split{Split}
Split --> Train[Training Set]
Split --> Test[Test Set]
Train --> Fit[Scaler.fit]
Fit --> Trans1[Scaler.transform Train]
Fit --> Trans2[Scaler.transform Test]
style Fit fill:#f3e5f5,stroke:#7b1fa2,color:#333

```
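
A short sketch of this rule with scikit-learn, reusing the `data` list from section 6 (the split ratio is illustrative):

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test = train_test_split(data, test_size=0.25, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mu and sigma from training data only
X_test_scaled = scaler.transform(X_test)        # reuse those same parameters on the test set
```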

## References for More Details

* **[Scikit-Learn Preprocessing](https://scikit-learn.org/stable/modules/preprocessing.html#standardization-or-mean-removal-and-variance-scaling):** Implementation details and alternative scalers (MaxAbsScaler).

* **[About Feature Scaling (Article)](https://sebastianraschka.com/Articles/2014_about_feature_scaling.html):** A deep mathematical dive into why scaling matters for specific algorithms.


---

**Now that your features are cleaned, engineered, and scaled, you have a high-quality dataset. But before you train a model, you need to ensure you haven't given it too much information or too little.**