
🌳 Decision Tree Classifier from Scratch


📌 Project Overview

This repository contains a pure Python implementation of a Decision Tree Classifier built from scratch, without relying on high-level ML libraries for the core logic.

The project demonstrates a deep understanding of tree-based algorithms by manually implementing:

  • Splitting Criteria: Entropy (Information Gain) and Gini Impurity.
  • Tree Construction: Recursive partitioning for both categorical and numerical features (see the sketch after this list).
  • Prediction Logic: Traversing the learned tree structure to classify new samples.
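
The sketch below illustrates those three pieces end to end: an entropy-based split search, recursive construction, and leaf-wise prediction. It is a minimal illustration under assumed names (`best_split`, `build_tree`, `predict_one`) and a dict-based node layout, not the repository's actual code, and it covers numerical features only.

```python
import numpy as np

def entropy(y):
    # E(S) = -sum(p_i * log2(p_i)) over the class proportions p_i.
    p = np.unique(y, return_counts=True)[1] / len(y)
    return -np.sum(p * np.log2(p))

def best_split(X, y):
    # Exhaustive search over (feature, threshold) pairs for the split
    # with the highest information gain.
    parent, best_gain, best = entropy(y), 0.0, None
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            left = X[:, f] <= t
            if left.all() or not left.any():
                continue
            child = left.mean() * entropy(y[left]) + (~left).mean() * entropy(y[~left])
            if parent - child > best_gain:
                best_gain, best = parent - child, (f, t)
    return best, best_gain

def build_tree(X, y, depth=0, max_depth=5, min_samples_split=2, min_information_gain=0.0):
    # Stop when the node is pure, too small, or too deep -> majority-class leaf.
    classes, counts = np.unique(y, return_counts=True)
    if len(classes) == 1 or len(y) < min_samples_split or depth >= max_depth:
        return classes[np.argmax(counts)]
    split, gain = best_split(X, y)
    if split is None or gain < min_information_gain:
        return classes[np.argmax(counts)]
    f, t = split
    left = X[:, f] <= t
    kwargs = dict(max_depth=max_depth, min_samples_split=min_samples_split,
                  min_information_gain=min_information_gain)
    return {"feature": f, "threshold": t,
            "left": build_tree(X[left], y[left], depth + 1, **kwargs),
            "right": build_tree(X[~left], y[~left], depth + 1, **kwargs)}

def predict_one(node, x):
    # Walk the tree until a leaf (a plain class label) is reached.
    while isinstance(node, dict):
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return node
```

Note how `predict_one` mirrors the prediction-logic item: classifying a sample is just a root-to-leaf traversal of the learned structure.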

It also includes a detailed Manual Tracing Report comparing the custom implementation against sklearn.tree.DecisionTreeClassifier.
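
A benchmark in that spirit can be reproduced in a few lines. The scikit-learn side below uses the real `DecisionTreeClassifier` API; the toy dataset and settings are illustrative rather than taken from the report.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Reference model trained with the same criterion the custom tree uses.
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = DecisionTreeClassifier(criterion="entropy", max_depth=5, random_state=0)
clf.fit(X_tr, y_tr)

print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
print(confusion_matrix(y_te, clf.predict(X_te)))
# The custom tree's predictions on X_te would be scored the same way,
# and the two sets of metrics compared side by side.
```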

โš™๏ธ Core Features

  • Custom Split Logic: Finds the optimal split by maximizing Information Gain or minimizing Gini Impurity.
  • Support for Mixed Data: Handles both continuous (numerical) and categorical features automatically.
  • Configurable Hyperparameters:
    • max_depth: Limits tree growth to prevent overfitting.
    • min_samples_split: Controls the minimum size of a node to attempt a split.
    • min_information_gain: Threshold for valid splits.
  • Performance Metrics: Includes a custom confusion matrix evaluation function.
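
As an illustration of the last bullet, a confusion matrix needs only a few lines of NumPy. This is a generic sketch, not necessarily the repository's evaluation function.

```python
import numpy as np

def confusion_matrix(y_true, y_pred):
    # matrix[i, j] counts samples of actual class i predicted as class j.
    labels = np.unique(np.concatenate([y_true, y_pred]))
    index = {label: i for i, label in enumerate(labels)}
    matrix = np.zeros((len(labels), len(labels)), dtype=int)
    for t, p in zip(y_true, y_pred):
        matrix[index[t], index[p]] += 1
    return matrix

# One misclassification: an actual 1 predicted as 0.
print(confusion_matrix(np.array([0, 1, 1, 0]), np.array([0, 1, 0, 0])))
# [[2 0]
#  [1 1]]
```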

🧮 Mathematical Foundations

The implementation is based on the following concepts (detailed in docs/Manual_Calculation_Report.pdf):

1. Entropy

$$E(S) = \sum_{i=1}^{c} -p_i \log_2 p_i$$
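
For example, a node with 9 positive and 5 negative samples has

$$E = -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.940$$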

2. Gini Impurity

$$Gini = 1 - \sum_{i=1}^{c} (p_i)^2$$
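
The same 9/5 node gives

$$Gini = 1 - \left(\tfrac{9}{14}\right)^2 - \left(\tfrac{5}{14}\right)^2 \approx 0.459$$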

3. Information Gain

$$Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)$$
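
Continuing the example: an attribute that splits those 14 samples into subsets of 8 (entropy 0.811) and 6 (entropy 1.000) yields

$$Gain = 0.940 - \tfrac{8}{14}(0.811) - \tfrac{6}{14}(1.000) \approx 0.048$$

so this attribute would only be chosen if no alternative offered a higher gain.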

🚀 How to Run

  1. Clone the repository:
    git clone https://github.com/mariamashraf731/Decision-Tree-From-Scratch.git
  2. Install requirements:
    pip install pandas numpy scikit-learn
  3. Run the script:
    python src/decision_tree.py

๐Ÿ‘จโ€๐Ÿ’ป Technologies Used

  • Python: Core logic.
  • NumPy & Pandas: Efficient data manipulation.
  • Scikit-Learn: Used only for benchmarking and confusion matrix calculation.
