API documentation
pymurtree is implemented as a thin Python wrapper around the main C++ MurTree application. The main functionality of MurTree is exposed in pymurtree via the OptimalDecisionTreeClassifier class. Utility functions to load training datasets and to export the tree in text and DOT formats are also included in the Python package.
OptimalDecisionTreeClassifier.py
Methods
-
constructor: initialize the parameters of the model -
fit: fit a decision tree classifier to the given training dataset -
predict: predict the labels for a set of features -
score: return the misclassification score of the tree -
depth: return the depth of the tree -
num_nodes: return the number of nodes of the tree -
export_text: export decision tree in text format -
export_dot: export decision tree in DOT format
Construct the object and initialize the model parameters.
Parameters
-
time: int, default = 600
Maximum runtime given in seconds -
max_depth: int, default = 3 Maximum allowed depth of the tree, where the depth is defined as the largest number of decision/feature nodes from the root to any leaf.
Depth greater than four is usually time consuming. -
max_num_nodes: int, default = 7
Maximum number of decision/feature nodes allowed. Note that a tree with k feature nodes has k+1 leaf nodes. -
sparse_coefficient: float, default = 0.0
Assigns the penalty for using decision/feature nodes.
Large sparse coefficients will result in smaller trees. -
verbose: bool, default = True
Determines if the solver should print logging information to the standard output. -
all_trees: bool, default = True
Instructs the algorithm to compute trees using all allowed combinations of max-depth and max-num-nodes. Used to stress-test the algorithm. -
incremental_frequency: bool, default = True
Activate incremental frequency computation, which takes into account previously computed trees when recomputing the frequency.
In our experiments it proved effective on all datasets. -
similarity_lower_bound: bool, default = True
Activate similarity-based lower bounding. Disabling this option may be better for some benchmarks, but on most of our tested datasets keeping this on was beneficial. -
node_selection: int, default = 0
Node selection strategy used to decide on whether the algorithm should examine the left or right child node first.
0 = dynamic
1 = post-order -
feature_ordering: int, default = 0
Feature ordering strategy used to determine the order in which features will be inspected in each node.
0 = in-order
1 = random
2 = gini -
random_seed: int, default = 3
Random seed used only if the feature-ordering is set to random.
A seed of -1 assigns the seed based on the current time. -
cache_type: int, default = 0
Cache type used to store computed subtrees. Needs to be determined experimentally.
0 = dataset
1 = branch
2 = closure
'Dataset' is more powerful than 'branch' but may require more computational time.
'Closure' is experimental and typically slower than the other options. -
duplicate_factor: int, default = 1
Duplicates each instance the given number of times. Used for stress-testing the algorithm, not a practical parameter.
Returns
- None
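The two size parameters interact: a binary tree of depth d can have at most 2^d - 1 decision/feature nodes, and a tree with k decision nodes always has k + 1 leaves. A small sketch of this arithmetic (plain Python, independent of pymurtree) illustrates why the defaults max_depth = 3 and max_num_nodes = 7 go together:

```python
def max_decision_nodes(depth: int) -> int:
    # A binary tree of depth d has at most 2**d - 1 decision/feature nodes.
    return 2 ** depth - 1

def num_leaves(decision_nodes: int) -> int:
    # A tree with k decision nodes has k + 1 leaf nodes.
    return decision_nodes + 1

print(max_decision_nodes(3))  # 7, matching the default max_num_nodes for max_depth = 3
print(num_leaves(7))          # 8
```

Setting max_num_nodes above 2^max_depth - 1 therefore has no effect, while setting it lower (or using sparse_coefficient) encourages smaller trees.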
Build a decision tree classifier from the training set (x, y).
Parameters
-
x: numpy.ndarray
The training features. -
y: numpy.ndarray
The class labels for the training features. -
time: int, default = 600
Maximum runtime given in seconds -
max_depth: int, default = 3
Maximum allowed depth of the tree, where the depth is defined as the largest number of decision/feature nodes from the root to any leaf.
Depth greater than four is usually time consuming. -
max_num_nodes: int, default = 7
Maximum number of decision/feature nodes allowed. Note that a tree with k feature nodes has k+1 leaf nodes. -
sparse_coefficient: float, default = 0.0
Assigns the penalty for using decision/feature nodes.
Large sparse coefficients will result in smaller trees. -
verbose: bool, default = True
Determines if the solver should print logging information to the standard output. -
all_trees: bool, default = True
Instructs the algorithm to compute trees using all allowed combinations of max-depth and max-num-nodes. Used to stress-test the algorithm. -
incremental_frequency: bool, default = True
Activate incremental frequency computation, which takes into account previously computed trees when recomputing the frequency.
In our experiments it proved effective on all datasets. -
similarity_lower_bound: bool, default = True
Activate similarity-based lower bounding. Disabling this option may be better for some benchmarks, but on most of our tested datasets keeping this on was beneficial. -
node_selection: int, default = 0
Node selection strategy used to decide on whether the algorithm should examine the left or right child node first.
0 = dynamic
1 = post-order -
feature_ordering: int, default = 0
Feature ordering strategy used to determine the order in which features will be inspected in each node.
0 = in-order
1 = random
2 = gini -
random_seed: int, default = 3
Random seed used only if the feature-ordering is set to random.
A seed of -1 assigns the seed based on the current time. -
cache_type: int, default = 0
Cache type used to store computed subtrees. Needs to be determined experimentally.
0 = dataset
1 = branch
2 = closure
'Dataset' is more powerful than 'branch' but may require more computational time.
'Closure' is experimental and typically slower than the other options. -
duplicate_factor: int, default = 1
Duplicates each instance the given number of times. Used for stress-testing the algorithm, not a practical parameter.
Returns
- None
Example
>>> import pymurtree
>>> import numpy
# Create training data
>>> x = numpy.array([[0, 1, 0, 1], [1, 0, 0, 1], [1, 1, 0, 0]])
>>> y = numpy.array([5, 5, 4])
# Build tree classifier
>>> model = pymurtree.OptimalDecisionTreeClassifier()
>>> model.fit(x, y)
Predict the labels for a set of features.
Parameters
-
x: numpy.ndarray
1D or 2D array with the set of features. Each row corresponds to an instance, and each column corresponds to a feature.
Returns
-
numpy.ndarray
1D array with the labels of the input features. The i-th element holds the label of the i-th instance in x.
Example
>>> import pymurtree
>>> import numpy
# Create training data
>>> x = numpy.array([[0, 1, 0, 1], [1, 0, 0, 1], [1, 1, 0, 0]])
>>> y = numpy.array([5, 5, 4])
# Build tree classifier
>>> model = pymurtree.OptimalDecisionTreeClassifier()
>>> model.fit(x, y)
# Predict labels for a new set of features
>>> ft = numpy.array([[1, 1, 0, 1], [0, 0, 1, 1], [1, 0, 1, 0]])
>>> labels = model.predict(ft)
>>> print(labels)
[5 5 4]
Return the misclassification score of the tree.
Parameters
- None
Returns
-
int
Misclassification score of the tree.
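For intuition, the misclassification score counts the instances whose predicted label differs from the true label. A minimal sketch with numpy (illustrative only; the labels here are hypothetical, not produced by pymurtree):

```python
import numpy as np

# Hypothetical true and predicted labels for four instances.
y_true = np.array([5, 5, 4, 4])
y_pred = np.array([5, 4, 4, 4])

# Count the positions where prediction and truth disagree.
misclassified = int(np.sum(y_true != y_pred))
print(misclassified)  # 1
```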
Example
>>> import pymurtree
>>> clf = pymurtree.OptimalDecisionTreeClassifier()
...
>>> clf.score()
112
Return the depth of the decision tree.
Parameters
- None
Returns
-
int
The maximum depth of the tree.
Example
>>> import pymurtree
>>> clf = pymurtree.OptimalDecisionTreeClassifier()
...
>>> clf.depth()
4
Return the number of nodes in the tree.
Parameters
- None
Returns
-
int
Number of nodes in the tree.
Example
>>> import pymurtree
>>> clf = pymurtree.OptimalDecisionTreeClassifier()
...
>>> clf.num_nodes()
15
Create a text representation of all the rules in the decision tree. Text is written to out_file if given, otherwise it is displayed on screen (standard output).
Parameters
-
out_file: str, optional
Name of the output file. If None, the result is displayed in the standard output. -
feature_names: numpy.ndarray, optional
1D Numpy array with the names of the features. -
class_names: Dict[int, str], optional
Dictionary with int keys and str values that represent the class names.
Returns
- None
Example
import pymurtree
import numpy
# Create training data
x = numpy.array([[1, 0, 0, 1], [1, 0, 1, 1], [0, 0, 1, 0]]) # features
y = numpy.array([5, 5, 4]) # labels
# Build tree classifier
clf = pymurtree.OptimalDecisionTreeClassifier()
clf.fit(x, y)
# Export tree in text format
ftnames = numpy.array(['has legs', 'has hair', 'can swim', 'can fly']) # feature names
clnames = {
0 : 'snake',
1 : 'cat',
2 : 'crocodile',
3 : 'horse',
4 : 'fish',
5 : 'bird'
} # class names
clf.export_text(feature_names=ftnames, class_names=clnames)
#|---has legs is true
#| |---bird
#|---has legs is false
#| |---fish
# Save tree in tree.txt
clf.export_text("tree.txt", ftnames, clnames)
Export the decision tree in DOT format. The tree is written to out_file if given, otherwise it is displayed on screen (standard output). This function generates a GraphViz representation of the decision tree. Graphical renderings can be generated using the Graphviz library:
dot -Tpdf tree.dot -o tree.pdf (PDF format)
dot -Tpng tree.dot -o tree.png (PNG format)
Parameters
-
out_file: str, optional
Name of the output file. If None, the result is displayed in the standard output. -
feature_names: numpy.ndarray, optional
1D Numpy array with the names of the features. -
class_names: Dict[int, str], optional
Dictionary with int keys and str values that represent the class names.
Returns
- None
Example
import pymurtree
import numpy
x = numpy.array([[1, 0, 0, 1], [1, 0, 1, 1], [0, 0, 1, 0]]) # features
y = numpy.array([5, 5, 4]) # labels
# Build tree classifier
clf = pymurtree.OptimalDecisionTreeClassifier()
clf.fit(x, y)
# Export tree in DOT format
ftnames = numpy.array(['has legs', 'has hair', 'can swim', 'can fly']) # feature names
clnames = {
0 : 'snake',
1 : 'cat',
2 : 'crocodile',
3 : 'horse',
4 : 'fish',
5 : 'bird'
} # class names
clf.export_dot(feature_names=ftnames, class_names=clnames)
# digraph Tree {
# node [shape=box, style="filled, rounded", fontname="helvetica", fontsize="8"] ;
# edge [fontname="helvetica", fontsize="6"] ;
# 0 [label=<has legs>, color="#8CB77F", fillcolor="#8CB77F"] ;
# 1 [label=<fish>, color="#B77F8C", fillcolor="#B77F8C"] ;
# 0 -> 1 [label=" 0 "] ;
# 2 [label=<bird>, color="#B77F8C", fillcolor="#B77F8C"] ;
# 0 -> 2 [label=" 1 "] ;
# }
# Save tree in tree.dot
clf.export_dot("tree.dot", ftnames, clnames)
# On the command line:
# Render tree in png format using Graphviz
dot -Tpng tree.dot -o tree.png
open tree.png
Utility functions
-
load_murtree_dataset_to_pandas_dataframes: read features and labels from file into a pandas.DataFrame and a pandas.Series -
load_murtree_dataset_to_numpy_arrays: read features and labels from file into numpy.ndarrays
Both functions read datasets from a plain text file with the following format:
- The first column contains the target variable (y), and the remaining columns contain the features (x).
- All values should be space-separated integers.
- All features must be 0 or 1, and class labels must be greater than or equal to zero.
- No headers or footers allowed.
These utility functions can be used to easily load the datasets in the murtree-data repository.
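For illustration, the format described above can also be parsed with plain numpy. The following is a hypothetical re-implementation sketch, not the package's own loader:

```python
import io
import numpy as np

# A MurTree dataset line is "label f1 f2 ... fn", all space-separated integers.
# io.StringIO stands in for a real file on disk.
sample = io.StringIO("1 0 1 1\n0 1 0 0\n1 1 1 0\n")
data = np.loadtxt(sample, dtype=int)
y = data[:, 0]   # first column: class labels
x = data[:, 1:]  # remaining columns: binary features
print(x.shape, y.shape)  # (3, 3) (3,)
```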
Read a dataset of features and labels from a plain text file into a pandas.DataFrame (features) and a pandas.Series (labels)
Parameters
-
path: str
Path to the dataset.
Returns
-
tuple:
x (pandas.DataFrame): all columns from dataset except the first
y (pandas.Series): first column of dataset
Example
```bash
git clone https://github.com/MurTree/murtree-data.git
```
import pymurtree
from pymurtree.readdata import *
# Load training dataset
(x, y) = load_murtree_dataset_to_pandas_dataframes("murtree-data/DL/anneal.txt")
# Build classifier
clf = pymurtree.OptimalDecisionTreeClassifier()
clf.fit(x.to_numpy(), y.to_numpy())
Read a dataset of features and labels from a plain text file into a 2D numpy.ndarray (features) and a 1D numpy.ndarray (labels).
Parameters
-
path: str
Path to the dataset.
Returns
-
tuple:
x (numpy.ndarray): 2D array with all columns except the first
y (numpy.ndarray): 1D array with the first column
Example
```bash
git clone https://github.com/MurTree/murtree-data.git
```
import pymurtree
from pymurtree.readdata import *
# Load training dataset
(x, y) = load_murtree_dataset_to_numpy_arrays("murtree-data/DL/anneal.txt")
# Build classifier
clf = pymurtree.OptimalDecisionTreeClassifier()
clf.fit(x, y)