---
title: Hyperparameter Tuning
sidebar_label: Hyperparameter Tuning
description: "Optimizing model performance using GridSearchCV, RandomizedSearchCV, and Halving techniques."
tags: [scikit-learn, hyperparameter-tuning, grid-search, optimization, model-selection]
---

In Machine Learning, there is a crucial difference between **Parameters** and **Hyperparameters**:

* **Parameters:** Learned by the model during training (e.g., the coefficients of a linear regression or the weights of a neural network).
* **Hyperparameters:** Set by the engineer *before* training starts (e.g., the depth of a Decision Tree or the number of neighbors in KNN); the short sketch below makes the distinction concrete.

**Hyperparameter Tuning** is the automated search for the best combination of these settings to minimize error.
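
A minimal sketch of the distinction, using standard scikit-learn estimators on a tiny made-up dataset (the arrays here are purely illustrative):

```python
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier

# Hyperparameter: chosen by the engineer BEFORE training starts
tree = DecisionTreeClassifier(max_depth=3)

# Parameters: learned by the model DURING training
X = [[1], [2], [3], [4]]
y = [2.0, 4.1, 5.9, 8.2]
reg = LinearRegression().fit(X, y)
print(reg.coef_, reg.intercept_)  # learned slope and intercept, never set by hand
```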

## 1. Why Tune Hyperparameters?

Most algorithms come with default settings that work reasonably well, but they are rarely optimal for your specific data. Proper tuning can often bridge the gap between a mediocre model and a state-of-the-art one.

## 2. GridSearchCV: The Exhaustive Search

`GridSearchCV` takes a predefined list of values for each hyperparameter and tries **every possible combination**.

* **Pros:** Guaranteed to find the best combination within the provided grid.
* **Cons:** Computationally expensive. If you have 5 parameters with 5 values each, you must evaluate $5^5 = 3,125$ combinations, and each combination is refit once per cross-validation fold.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Grid of candidate values: every combination is tried (3 * 3 * 2 = 18 here)
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5]
}

# cv=5 -> each combination is evaluated with 5-fold cross-validation on the training set
grid_search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)  # X_train, y_train come from your earlier train/test split

print(f"Best Parameters: {grid_search.best_params_}")
print(f"Best CV Score:   {grid_search.best_score_:.3f}")
```

## 3. RandomizedSearchCV: The Efficient Alternative

Instead of trying every combination, `RandomizedSearchCV` samples a fixed number of random combinations (set by `n_iter`) from the lists or distributions you provide.

* **Pros:** Much faster than an exhaustive grid search, and it often finds a result that is nearly as good in a fraction of the time.
* **Cons:** Not guaranteed to find the absolute best "peak" in the parameter space.

```python
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from scipy.stats import randint

# Distributions and lists to sample from (randint draws integers in [50, 500))
param_dist = {
    'n_estimators': randint(50, 500),
    'max_depth': [None, 10, 20, 30, 40, 50],
}

# n_iter=20 -> only 20 random combinations are evaluated, each with 5-fold CV
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42), param_dist, n_iter=20, cv=5, random_state=42
)
random_search.fit(X_train, y_train)

print(f"Best Parameters: {random_search.best_params_}")
```

## 4. Advanced: Successive Halving

For massive datasets, even a random search is slow. Scikit-Learn offers **HalvingGridSearchCV** (and its randomized counterpart, **HalvingRandomSearchCV**). It trains all candidate combinations on a small amount of data, throws away the lowest-scoring half (the exact fraction is controlled by the `factor` parameter), and keeps the "promising" candidates for the next round with more data.

```mermaid
graph TD
    S1[Round 1: 100 candidates, 10% data] --> S2[Round 2: 50 candidates, 20% data]
    S2 --> S3[Round 3: 25 candidates, 40% data]
    S3 --> S4[Final Round: Best candidates, 100% data]

    style S1 fill:#fff3e0,stroke:#ef6c00,color:#333
    style S4 fill:#e8f5e9,stroke:#2e7d32,color:#333
```
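
A sketch of how this looks in code. Note that the halving searches are still marked experimental in scikit-learn, so they must be enabled explicitly before import; the grid below is only an example:

```python
# The halving searches are experimental: this import enables them
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
}

# factor=2 -> keep the best half of the candidates at each round,
# while the number of training samples grows round by round
halving_search = HalvingGridSearchCV(
    RandomForestClassifier(random_state=42), param_grid, factor=2, cv=5
)
halving_search.fit(X_train, y_train)

print(halving_search.best_params_)
```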

## 5. Avoiding the Validation Trap

If you tune your hyperparameters against the **Test Set**, you are "leaking" information: the chosen settings are tailored to that particular test set, so the model will look great on it but fail on genuinely new data.

**The Solution:** Use **Nested Cross-Validation**, or ensure that your `GridSearchCV` only ever sees the **Training Set** (it will internally split the training data into smaller validation folds).
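
A minimal sketch of the safe workflow, assuming a generic `X`, `y` dataset and the `param_grid` from above: split first, tune only on the training portion, and touch the test set exactly once at the end.

```python
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# 1. Carve out a hold-out test set BEFORE any tuning happens
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 2. Tune on the training set only (internal CV handles validation)
grid_search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
grid_search.fit(X_train, y_train)

# 3. One final, honest evaluation on data the search never saw
print(f"Test accuracy: {grid_search.score(X_test, y_test):.3f}")
```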

```mermaid
graph LR
    FullData[Full Dataset] --> Split{Initial Split}
    Split --> Train[Training Set]
    Split --> Test[Hold-out Test Set]

    subgraph Optimization [GridSearch with Internal CV]
        Train --> CV1[Fold 1]
        Train --> CV2[Fold 2]
        Train --> CV3[Fold 3]
    end

    Optimization --> BestModel[Best Hyperparameters]
    BestModel --> FinalEval[Final Evaluation on Test Set]
```
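
For a stricter performance estimate, Nested Cross-Validation wraps the whole search in an outer CV loop. A short sketch, reusing the `param_grid` from above with a generic `X`, `y`:

```python
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.ensemble import RandomForestClassifier

# Inner loop: hyperparameter search; outer loop: unbiased performance estimate
inner_search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=3)
outer_scores = cross_val_score(inner_search, X, y, cv=5)

print(f"Nested CV accuracy: {outer_scores.mean():.3f} +/- {outer_scores.std():.3f}")
```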

## 6. Tuning Strategy Summary

| Method | Best for... | Resource Usage |
| --- | --- | --- |
| **Manual Tuning** | Initial exploration / small models | Low |
| **GridSearch** | Small number of parameters | High |
| **RandomSearch** | Many parameters / large search space | Moderate |
| **Halving Search** | Large datasets / expensive training | Low-Moderate |

## References for More Details

* **[Sklearn Tuning Guide](https://scikit-learn.org/stable/modules/grid_search.html):** Deep dive into `HalvingGridSearchCV` and custom scoring.

---

**Now that your model is fully optimized and tuned, it's time to evaluate its performance using metrics that go beyond simple "Accuracy."**