
Commit 0d47ba8

added more content for ml docs

1 parent 0650088 commit 0d47ba8

File tree

9 files changed: +1012 -0 lines changed
Lines changed: 112 additions & 0 deletions
---
title: "Loading Data in Scikit-Learn"
sidebar_label: Data Loading
description: "How to use Scikit-Learn's built-in datasets, fetchers, and external loaders to prepare data for modeling."
tags: [scikit-learn, data-loading, python, machine-learning, datasets]
---

Before you can train a model, you need to get your data into a format that Scikit-Learn understands. Scikit-Learn works primarily with **NumPy arrays** or **Pandas DataFrames**, but it also provides built-in tools to help you get started quickly.

## 1. The Scikit-Learn Data Format

Regardless of how you load your data, Scikit-Learn expects two main components:

1. **The Feature Matrix ($X$):** A 2D array of shape `(n_samples, n_features)`.
2. **The Target Vector ($y$):** A 1D array of shape `(n_samples,)` containing the labels or values you want to predict.

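As a quick sanity check, here is what that contract looks like with made-up numbers (a minimal sketch, not tied to any real dataset):

```python
import numpy as np

# Hypothetical dataset: 3 samples, 2 features
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])      # shape (3, 2) -> (n_samples, n_features)
y = np.array([0, 1, 0])         # shape (3,)   -> (n_samples,)

print(X.shape, y.shape)  # (3, 2) (3,)
```

Every estimator's `.fit(X, y)` call expects exactly this pairing of shapes.
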
## 2. Built-in "Toy" Datasets

Scikit-Learn comes bundled with small datasets that require no internet connection. These are perfect for testing your code or learning new algorithms.

* `load_iris()`: Classic classification dataset (flowers).
* `load_diabetes()`: Regression dataset.
* `load_digits()`: Classification dataset (handwritten digits).

```python
from sklearn.datasets import load_iris

# Load the dataset
iris = load_iris()

# Access data and labels
X = iris.data
y = iris.target

print(f"Features: {iris.feature_names}")
print(f"Target Names: {iris.target_names}")
```

## 3. Fetching Large Real-World Datasets

For larger datasets, Scikit-Learn provides "fetchers" that download data from the internet and cache it locally in your `~/scikit_learn_data` folder.

* `fetch_california_housing()`: Predict median house values.
* `fetch_20newsgroups()`: Text dataset for NLP.
* `fetch_lfw_people()`: Labeled Faces in the Wild (for face recognition).

```python
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing()
print(f"Dataset shape: {housing.data.shape}")
```

## 4. Loading from External Sources

In a professional environment, you will rarely use the built-in datasets. You will likely load data from **CSVs**, **SQL databases**, or **Pandas DataFrames**.

### From Pandas to Scikit-Learn

Scikit-Learn is designed to be "Pandas-friendly." You can pass DataFrames directly into models.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Load your own CSV
df = pd.read_csv('my_data.csv')

# Split into X and y
X = df[['feature1', 'feature2']]  # Select specific feature columns
y = df['target_column']

# Train model directly
model = LinearRegression().fit(X, y)
```

## 5. Generating Synthetic Data

Sometimes you need to create "fake" data to test how an algorithm handles specific scenarios (like high noise or non-linear patterns).

```python
from sklearn.datasets import make_blobs, make_moons

# Create 3 distinct clusters for a classification task
X, y = make_blobs(n_samples=100, centers=3, n_features=2, random_state=42)

# Create two interleaving half-moons for a non-linear classification task
X_moons, y_moons = make_moons(n_samples=100, noise=0.1, random_state=42)
```

## 6. Understanding the "Bunch" Object

When you use `load_*` or `fetch_*`, Scikit-Learn returns a **`Bunch` object**. This is essentially a dictionary that contains:

* `.data`: The feature matrix.
* `.target`: The labels.
* `.feature_names`: The names of the columns.
* `.DESCR`: A full text description of where the data came from.

:::tip
Use `as_frame=True` in your loader to get the data returned as a Pandas DataFrame immediately: `data = load_iris(as_frame=True).frame`
:::

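Putting that together, a short sketch of inspecting a `Bunch` and using the `as_frame` shortcut:

```python
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)      # (150, 4) feature matrix
print(iris.target_names)    # ['setosa' 'versicolor' 'virginica']
print(iris.DESCR[:200])     # start of the dataset description

# as_frame=True returns pandas objects; .frame holds features + target together
df = load_iris(as_frame=True).frame
print(df.head())
```
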
## References for More Details

* **[Sklearn Dataset Loading Guide](https://scikit-learn.org/stable/datasets.html):** Exploring all 20+ available fetchers and loaders.
* **[OpenML Integration](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_openml.html):** Accessing thousands of community-uploaded datasets via `fetch_openml`.

---

**Now that you can load data, the next step is to ensure it's in the right shape and split correctly for training and testing.**
Lines changed: 120 additions & 0 deletions
---
title: Data Preparation in Scikit-Learn
sidebar_label: Data Preparation
description: "Transforming raw data into model-ready features using Scikit-Learn's preprocessing and imputation tools."
tags: [scikit-learn, preprocessing, encoding, scaling, imputation]
---

Before feeding data into an algorithm, it must be cleaned and transformed. Scikit-Learn provides a robust suite of **Transformers**—classes that follow a standard `.fit()` and `.transform()` API—to automate this work.

## 1. Handling Missing Values

Machine Learning models cannot handle `NaN` (Not a Number) or `null` values. The `SimpleImputer` class helps fill these gaps.

```python
from sklearn.impute import SimpleImputer
import numpy as np

# Sample data with missing values
X = [[1, 2], [np.nan, 3], [7, 6]]

# strategy='mean', 'median', 'most_frequent', or 'constant'
imputer = SimpleImputer(strategy='mean')
X_filled = imputer.fit_transform(X)
```

## 2. Encoding Categorical Data

Computers understand numbers, not words. If you have a column for "City" (New York, Paris, Tokyo), you must encode it.

### A. One-Hot Encoding (Nominal)

Creates a new binary column for each category. Best for data without a natural order.

```python
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse_output=False)
cities = [['New York'], ['Paris'], ['Tokyo']]
encoded_cities = encoder.fit_transform(cities)
```

### B. Ordinal Encoding (Ranked)

Converts categories into integers (e.g., 0, 1, 2). Use this when the order matters (e.g., Small, Medium, Large).

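A minimal sketch with `OrdinalEncoder`, passing an explicit category order so the integers respect the ranking (the size values here are just an example):

```python
from sklearn.preprocessing import OrdinalEncoder

# Explicit ordering: Small < Medium < Large maps to 0 < 1 < 2
encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
sizes = [['Small'], ['Large'], ['Medium']]
encoded_sizes = encoder.fit_transform(sizes)  # [[0.], [2.], [1.]]
```
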
## 3. Feature Scaling

As discussed in our [Data Engineering module](/tutorial/machine-learning/data-engineering-basics/data-cleaning-and-preprocessing/feature-scaling), scaling ensures that features with large ranges (like Salary) don't overpower features with small ranges (like Age).

### Standardization (`StandardScaler`)

Rescales data to have a mean of 0 and a standard deviation of 1.

$$
z = \frac{x - \mu}{\sigma}
$$

### Normalization (`MinMaxScaler`)

Rescales data to a fixed range, usually [0, 1].

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_filled)
```

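For the `MinMaxScaler` variant, a matching sketch (reusing the imputed `X_filled` from earlier):

```python
from sklearn.preprocessing import MinMaxScaler

# Rescale every feature into the [0, 1] range
minmax = MinMaxScaler()
X_normalized = minmax.fit_transform(X_filled)
```
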
## 4. The `fit` vs `transform` Rule

One of the most important concepts in Scikit-Learn is the distinction between these two methods:

* **`.fit()`**: The transformer calculates the parameters (e.g., the mean and standard deviation of your data). **Only do this on Training data.**
* **`.transform()`**: The transformer applies those calculated parameters to the data.
* **`.fit_transform()`**: Does both in one step.

```mermaid
graph TD
    Train[Training Data] --> Fit[Fit: Learn Mean/Std]
    Fit --> TransTrain[Transform Training Data]
    Fit --> TransTest[Transform Test Data]

    style Fit fill:#f3e5f5,stroke:#7b1fa2,color:#333
```

:::warning
Never `fit` on your Test data. This leads to **Data Leakage**, where your model "cheats" by seeing the distribution of the test set during training.
:::

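In code, the rule looks like this (a short sketch assuming `X_train` and `X_test` already exist from a train/test split):

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Learn the mean/std from the training data only
X_train_scaled = scaler.fit_transform(X_train)

# Apply the same learned parameters to the test data -- never refit here
X_test_scaled = scaler.transform(X_test)
```
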
## 5. ColumnTransformer: Selective Processing

In real datasets, you have a mix of types: some columns need scaling, others need encoding, and some need nothing. `ColumnTransformer` allows you to apply different prep steps to different columns simultaneously.

```python
from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['age', 'income']),
        ('cat', OneHotEncoder(), ['city', 'gender'])
    ])

# X_processed = preprocessor.fit_transform(df)
```

---

## References for More Details

* **[Scikit-Learn Preprocessing Guide](https://scikit-learn.org/stable/modules/preprocessing.html):** Discovering advanced transformers like `PowerTransformer` or `PolynomialFeatures`.
* **[Imputing Missing Values](https://scikit-learn.org/stable/modules/impute.html):** Learning about `IterativeImputer` (MICE) and `KNNImputer`.

---

**Manual data preparation can get messy and hard to replicate. To solve this, Scikit-Learn provides a powerful tool to chain all these steps together into a single object.**
Lines changed: 117 additions & 0 deletions
---
title: Hyperparameter Tuning
sidebar_label: Hyperparameter Tuning
description: "Optimizing model performance using GridSearchCV, RandomizedSearchCV, and Halving techniques."
tags: [scikit-learn, hyperparameter-tuning, grid-search, optimization, model-selection]
---

In Machine Learning, there is a crucial difference between **Parameters** and **Hyperparameters**:

* **Parameters:** Learned by the model during training (e.g., the coefficients of a regression or the weights of a neural network).
* **Hyperparameters:** Set by the engineer *before* training starts (e.g., the depth of a Decision Tree or the number of neighbors in KNN).

**Hyperparameter Tuning** is the automated search for the best combination of these settings to minimize error.

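To make the distinction concrete, a tiny sketch (the estimator and the hyperparameter values are arbitrary choices):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Hyperparameters: chosen by you *before* training
model = LogisticRegression(C=0.5, max_iter=1000)

# Parameters: learned by the model *during* training
model.fit(X, y)
print(model.coef_.shape)  # learned coefficients
print(model.intercept_)   # learned intercepts
```
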
## 1. Why Tune Hyperparameters?

Most algorithms come with default settings that work reasonably well, but they are rarely optimal for your specific data. Proper tuning can often bridge the gap between a mediocre model and a state-of-the-art one.

## 2. GridSearchCV: The Exhaustive Search

`GridSearchCV` takes a predefined list of values for each hyperparameter and tries **every possible combination**.

* **Pros:** Guaranteed to find the best combination within the provided grid.
* **Cons:** Computationally expensive. If you have 5 parameters with 5 values each, there are $5^5 = 3,125$ combinations to evaluate, and each one is trained once per cross-validation fold.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5]
}

# X_train and y_train are assumed to come from an earlier train/test split
grid_search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
grid_search.fit(X_train, y_train)

print(f"Best Parameters: {grid_search.best_params_}")
```

## 3. RandomizedSearchCV: The Efficient Alternative

Instead of trying every combination, `RandomizedSearchCV` picks a fixed number of random combinations from a distribution.

* **Pros:** Much faster than GridSearch. It often finds a result almost as good as GridSearch in a fraction of the time.
* **Cons:** Not guaranteed to find the absolute best "peak" in the parameter space.

```python
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_dist = {
    'n_estimators': randint(50, 500),
    'max_depth': [None, 10, 20, 30, 40, 50],
}

# n_iter=20 means only 20 randomly sampled combinations are evaluated
random_search = RandomizedSearchCV(RandomForestClassifier(), param_dist, n_iter=20, cv=5)
random_search.fit(X_train, y_train)
```

## 4. Advanced: Successive Halving

For massive datasets, even Random Search is slow. Scikit-Learn offers **`HalvingGridSearchCV`**. It trains all combinations on a small amount of data, throws away the bottom 50%, and keeps the "promising" candidates for the next round with more data.

```mermaid
graph TD
    S1[Round 1: 100 candidates, 10% data] --> S2[Round 2: 50 candidates, 20% data]
    S2 --> S3[Round 3: 25 candidates, 40% data]
    S3 --> S4[Final Round: Best candidates, 100% data]

    style S1 fill:#fff3e0,stroke:#ef6c00,color:#333
    style S4 fill:#e8f5e9,stroke:#2e7d32,color:#333
```

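A minimal sketch of turning it on (the halving estimators sit behind an experimental import, and the grid here reuses `param_grid` from the GridSearchCV example above):

```python
# Halving search is experimental and must be enabled explicitly
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.ensemble import RandomForestClassifier

halving_search = HalvingGridSearchCV(
    RandomForestClassifier(),
    param_grid,   # same grid as in the GridSearchCV example
    factor=2,     # keep the top half of candidates each round
    cv=5,
)
# halving_search.fit(X_train, y_train)
```
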
## 5. Avoiding the Validation Trap

If you tune your hyperparameters using the **Test Set**, you are "leaking" information. The model will look great on that test set, but fail on new data.

**The Solution:** Use **Nested Cross-Validation**, or ensure that your `GridSearchCV` only uses the **Training Set** (it will internally split the training data into smaller validation folds).

```mermaid
graph LR
    FullData[Full Dataset] --> Split{Initial Split}
    Split --> Train[Training Set]
    Split --> Test[Hold-out Test Set]

    subgraph Optimization [GridSearch with Internal CV]
        Train --> CV1[Fold 1]
        Train --> CV2[Fold 2]
        Train --> CV3[Fold 3]
    end

    Optimization --> BestModel[Best Hyperparameters]
    BestModel --> FinalEval[Final Evaluation on Test Set]
```

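The safe workflow in code, as a sketch (assuming `X` and `y` are already loaded):

```python
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Hold out a test set that the search never sees
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# GridSearchCV tunes using internal CV folds on the training set only
search = GridSearchCV(RandomForestClassifier(), {'max_depth': [None, 10, 20]}, cv=5)
search.fit(X_train, y_train)

# One final evaluation on the untouched test set at the very end
print(search.score(X_test, y_test))
```
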
## 6. Tuning Strategy Summary

| Method | Best for... | Resource Usage |
| --- | --- | --- |
| **Manual Tuning** | Initial exploration / small models | Low |
| **GridSearch** | Small number of parameters | High |
| **RandomSearch** | Many parameters / large search space | Moderate |
| **Halving Search** | Large datasets / expensive training | Low-Moderate |

## References for More Details

* **[Sklearn Tuning Guide](https://scikit-learn.org/stable/modules/grid_search.html):** Deep dive into `HalvingGridSearchCV` and custom scoring.

---

**Now that your model is fully optimized and tuned, it's time to evaluate its performance using metrics that go beyond simple "Accuracy."**
