Decision Tree Tests
This is an example of using my implementation of a Decision Tree. My implementation is based off of this example and modified to use pandas for data processing.
This example uses banknote data to classify whether a note is real or fake. The features are sensor data and the target is the class we are trying to predict.
import pandas as pd
pd.read_csv("datasets/data_banknote_authentication.csv", names=['f1', 'f2', 'f3', 'f4', 'target'])
|  | f1 | f2 | f3 | f4 | target |
|---|---|---|---|---|---|
| 0 | 3.62160 | 8.66610 | -2.8073 | -0.44699 | 0 |
| 1 | 4.54590 | 8.16740 | -2.4586 | -1.46210 | 0 |
| 2 | 3.86600 | -2.63830 | 1.9242 | 0.10645 | 0 |
| 3 | 3.45660 | 9.52280 | -4.0112 | -3.59440 | 0 |
| 4 | 0.32924 | -4.45520 | 4.5718 | -0.98880 | 0 |
| ... | ... | ... | ... | ... | ... |
| 1367 | 0.40614 | 1.34920 | -1.4501 | -0.55949 | 1 |
| 1368 | -1.38870 | -4.87730 | 6.4774 | 0.34179 | 1 |
| 1369 | -3.75030 | -13.45860 | 17.5932 | -2.77710 | 1 |
| 1370 | -3.56370 | -8.38270 | 12.3930 | -1.28230 | 1 |
| 1371 | -2.54190 | -0.65804 | 2.6842 | 1.19520 | 1 |

1372 rows × 5 columns
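Before training, it can also help to check that the two classes are reasonably balanced. A quick sketch of that check (using a small stand-in frame here; in the notebook you would call `value_counts` on the real `target` column of the loaded DataFrame):

```python
import pandas as pd

# Stand-in for the banknote DataFrame loaded above
df = pd.DataFrame({'target': [0, 0, 0, 1, 1]})

# Counts and proportions of each class in the target column
print(df['target'].value_counts())
print(df['target'].value_counts(normalize=True))  # 0 -> 0.6, 1 -> 0.4
```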
import random
from decision_tree import DecisionTree
random.seed(1)
df = pd.read_csv("./datasets/data_banknote_authentication.csv", header=None)
df = df.sample(frac=1, random_state=1).reset_index(drop=True) # data is shuffled
Xs = df.iloc[:, :-1]
ys = df.iloc[:, -1].to_frame()
Xs
|  | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | -3.55100 | 1.89550 | 0.186500 | -2.440900 |
| 1 | 1.31140 | 4.54620 | 2.293500 | 0.225410 |
| 2 | -4.01730 | -8.31230 | 12.454700 | -1.437500 |
| 3 | -5.11900 | 6.64860 | -0.049987 | -6.520600 |
| 4 | 3.62890 | 0.81322 | 1.627700 | 0.776270 |
| ... | ... | ... | ... | ... |
| 1367 | 3.49160 | 8.57090 | -3.032600 | -0.591820 |
| 1368 | 0.74521 | 3.63570 | -4.404400 | -4.141400 |
| 1369 | -4.36670 | 6.06920 | 0.572080 | -5.466800 |
| 1370 | 2.04660 | 2.03000 | 2.176100 | -0.083634 |
| 1371 | -2.31470 | 3.66680 | -0.696900 | -1.247400 |

1372 rows × 4 columns
ys
|  | 4 |
|---|---|
| 0 | 1 |
| 1 | 0 |
| 2 | 1 |
| 3 | 1 |
| 4 | 0 |
| ... | ... |
| 1367 | 0 |
| 1368 | 1 |
| 1369 | 1 |
| 1370 | 0 |
| 1371 | 1 |

1372 rows × 1 columns
Next, the data is split into train and test sets.
splitInd = int(len(Xs) * 0.8)
trainX = Xs.iloc[0:splitInd]
testX = Xs.iloc[splitInd:]
trainY = ys.iloc[0:splitInd]
testY = ys.iloc[splitInd:]
trainX, testY
(          0        1          2         3
 0    -3.5510  1.89550   0.186500 -2.440900
 1     1.3114  4.54620   2.293500  0.225410
 2    -4.0173 -8.31230  12.454700 -1.437500
 3    -5.1190  6.64860  -0.049987 -6.520600
 4     3.6289  0.81322   1.627700  0.776270
 ...      ...      ...        ...       ...
 1092 -1.3887 -4.87730   6.477400  0.341790
 1093  1.5701  7.91290   0.290180 -2.195300
 1094  1.0135  8.45510  -1.672000 -2.081500
 1095  0.3798  0.70980   0.757200 -0.444400
 1096 -1.8219 -6.88240   5.468100  0.057313

 [1097 rows x 4 columns],
       4
 1097  0
 1098  1
 1099  0
 1100  0
 1101  0
 ...  ..
 1367  0
 1368  1
 1369  1
 1370  0
 1371  1

 [275 rows x 1 columns])
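The manual 80/20 slicing above can be wrapped in a small helper; here is a sketch (the function name is hypothetical, and it assumes the data was already shuffled, as in this notebook):

```python
import pandas as pd

def split_train_test(Xs, ys, train_frac=0.8):
    """Slice already-shuffled features and labels into train/test sets,
    mirroring the manual iloc slicing in the notebook."""
    split_ind = int(len(Xs) * train_frac)
    return (Xs.iloc[:split_ind], Xs.iloc[split_ind:],
            ys.iloc[:split_ind], ys.iloc[split_ind:])

# Tiny synthetic frame just to illustrate the resulting shapes
X = pd.DataFrame({'a': range(10), 'b': range(10)})
y = pd.DataFrame({'t': [0, 1] * 5})
trX, teX, trY, teY = split_train_test(X, y)
print(len(trX), len(teX))  # 8 2
```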
Next, the tree is created and fit to the training data.
dt = DecisionTree()
dt.fit(trainX, trainY)
dt.printTree()
C:\Users\sam_m\Documents\machine-learning-trees\decision_tree.py:15: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.Xs['$target$'] = self.ys.iloc[:, 0]
[X0 < 0.322] [X1 < 7.627] [X0 < -0.398] [X2 < 6.220] [X0 < -3.551] [1.0] [1.0] [X1 < -4.606] [1.0] [0.0] [X1 < 5.897] [X2 < 3.114] [1.0] [0.0] [0.0] [X0 < -4.286] [X0 < -5.490] [1.0] [1.0] [X0 < -1.180] [X0 < -1.327] [0.0] [0.0] [X0 < -1.180] [0.0] [0.0] [X2 < -4.413] [X0 < 4.407] [X0 < 2.392] [X0 < 0.816] [1.0] [1.0] [1.0] [0.0] [X0 < 1.594] [X2 < -2.272] [X1 < 7.638] [1.0] [0.0] [X3 < 0.097] [0.0] [0.0] [X0 < 2.042] [X2 < -2.339] [1.0] [0.0] [X0 < 3.629] [0.0] [0.0]
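Each bracketed node above is a threshold test on a single feature (leaves show the predicted class). A CART-style tree picks these thresholds by minimizing an impurity measure, commonly Gini impurity. A sketch of that criterion follows; this is an assumption about how the splits are scored, since the actual `DecisionTree` implementation may use a different measure:

```python
import pandas as pd

def gini_impurity(labels):
    """Gini impurity of a collection of class labels:
    1 - sum of squared class proportions. 0.0 means a pure node."""
    proportions = pd.Series(labels).value_counts(normalize=True)
    return 1.0 - (proportions ** 2).sum()

print(gini_impurity([0, 0, 1, 1]))  # 0.5 for a perfectly mixed node
print(gini_impurity([1, 1, 1, 1]))  # 0.0 for a pure node
```

A split is chosen to minimize the weighted Gini impurity of the two child nodes it produces.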
Now a prediction is performed on the test set and the accuracy is calculated.
p = dt.predict(testX)
c = 0
testY.reset_index(drop=True, inplace=True)
for i, r in testY.iterrows():  # count correct predictions
    if p[i] == testY.iat[i, 0]:
        c += 1
print("Accuracy:", c / len(p))
Accuracy: 0.9709090909090909
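The counting loop could also be written as a single vectorized comparison. A sketch with small stand-in arrays (the real `p` and `testY` from the notebook would be used in practice):

```python
import numpy as np
import pandas as pd

# Stand-in predictions and labels, shaped like the notebook's p and testY
p = np.array([0, 1, 1, 0])
testY = pd.DataFrame({'target': [0, 1, 0, 0]})

# Elementwise comparison, then mean of the boolean matches
accuracy = (p == testY.iloc[:, 0].to_numpy()).mean()
print("Accuracy:", accuracy)  # 0.75 on this toy data
```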
On this data, the decision tree is about 97% accurate, so this dataset lends itself well to classification with a decision tree.
This project was a great way for me to learn about decision trees and to expand my knowledge of machine learning.