Decision Tree Tests

This is an example of using my implementation of a decision tree. My implementation is based off of this example, modified to use pandas for data processing.

This example uses banknote authentication data to classify whether a note is genuine or forged. The features (f1–f4) are measurements extracted from images of the notes, and the target is the class we are trying to predict.

In [1]:
import pandas as pd
pd.read_csv("datasets/data_banknote_authentication.csv", names=['f1', 'f2', 'f3', 'f4', 'target'])
Out[1]:
f1 f2 f3 f4 target
0 3.62160 8.66610 -2.8073 -0.44699 0
1 4.54590 8.16740 -2.4586 -1.46210 0
2 3.86600 -2.63830 1.9242 0.10645 0
3 3.45660 9.52280 -4.0112 -3.59440 0
4 0.32924 -4.45520 4.5718 -0.98880 0
... ... ... ... ... ...
1367 0.40614 1.34920 -1.4501 -0.55949 1
1368 -1.38870 -4.87730 6.4774 0.34179 1
1369 -3.75030 -13.45860 17.5932 -2.77710 1
1370 -3.56370 -8.38270 12.3930 -1.28230 1
1371 -2.54190 -0.65804 2.6842 1.19520 1

1372 rows × 5 columns

In [2]:
import random
from decision_tree import DecisionTree

random.seed(1)
df = pd.read_csv("./datasets/data_banknote_authentication.csv", header=None)
df = df.sample(frac=1, random_state=1).reset_index(drop=True) # data is shuffled
Xs = df.iloc[:, :-1]
ys = df.iloc[:, -1].to_frame()
Xs
Out[2]:
0 1 2 3
0 -3.55100 1.89550 0.186500 -2.440900
1 1.31140 4.54620 2.293500 0.225410
2 -4.01730 -8.31230 12.454700 -1.437500
3 -5.11900 6.64860 -0.049987 -6.520600
4 3.62890 0.81322 1.627700 0.776270
... ... ... ... ...
1367 3.49160 8.57090 -3.032600 -0.591820
1368 0.74521 3.63570 -4.404400 -4.141400
1369 -4.36670 6.06920 0.572080 -5.466800
1370 2.04660 2.03000 2.176100 -0.083634
1371 -2.31470 3.66680 -0.696900 -1.247400

1372 rows × 4 columns

In [3]:
ys
Out[3]:
4
0 1
1 0
2 1
3 1
4 0
... ...
1367 0
1368 1
1369 1
1370 0
1371 1

1372 rows × 1 columns

Next, the data is split into training and test sets with an 80/20 split.

In [4]:
splitInd = int(len(Xs) * 0.8)

trainX = Xs.iloc[0:splitInd]
testX  = Xs.iloc[splitInd:]
trainY = ys.iloc[0:splitInd]
testY  = ys.iloc[splitInd:]
In [5]:
trainX, testY
Out[5]:
(           0        1          2         3
 0    -3.5510  1.89550   0.186500 -2.440900
 1     1.3114  4.54620   2.293500  0.225410
 2    -4.0173 -8.31230  12.454700 -1.437500
 3    -5.1190  6.64860  -0.049987 -6.520600
 4     3.6289  0.81322   1.627700  0.776270
 ...      ...      ...        ...       ...
 1092 -1.3887 -4.87730   6.477400  0.341790
 1093  1.5701  7.91290   0.290180 -2.195300
 1094  1.0135  8.45510  -1.672000 -2.081500
 1095  0.3798  0.70980   0.757200 -0.444400
 1096 -1.8219 -6.88240   5.468100  0.057313
 
 [1097 rows x 4 columns],
       4
 1097  0
 1098  1
 1099  0
 1100  0
 1101  0
 ...  ..
 1367  0
 1368  1
 1369  1
 1370  0
 1371  1
 
 [275 rows x 1 columns])

Next, the tree is created and fit to the training data.
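A CART-style tree like the one in the linked tutorial typically chooses each split by minimizing the weighted Gini impurity of the resulting branches. A minimal sketch of that scoring function (an illustration of the technique, not the project's actual implementation):

```python
def gini_index(groups, classes):
    """Weighted Gini impurity of a candidate split (as in CART).

    groups:  one list of class labels per branch of the split.
    classes: the distinct class values in the dataset.
    """
    n_total = sum(len(group) for group in groups)
    score = 0.0
    for group in groups:
        if not group:
            continue  # an empty branch contributes nothing
        # impurity of this branch: 1 minus the sum of squared class proportions
        impurity = 1.0 - sum((group.count(c) / len(group)) ** 2 for c in classes)
        # weight the branch impurity by its relative size
        score += impurity * len(group) / n_total
    return score

print(gini_index([[0, 0], [1, 1]], [0, 1]))  # 0.0: a perfect split
print(gini_index([[0, 1], [0, 1]], [0, 1]))  # 0.5: the worst split
```

The split chosen at each node is the feature/threshold pair whose branches score lowest, which is where thresholds like `[X0 < 0.322]` in the printed tree come from.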

In [6]:
dt = DecisionTree()
dt.fit(trainX, trainY)
dt.printTree()
C:\Users\sam_m\Documents\machine-learning-trees\decision_tree.py:15: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.Xs['$target$'] = self.ys.iloc[:, 0]
[X0 < 0.322]
 [X1 < 7.627]
  [X0 < -0.398]
   [X2 < 6.220]
    [X0 < -3.551]
     [1.0]
     [1.0]
    [X1 < -4.606]
     [1.0]
     [0.0]
   [X1 < 5.897]
    [X2 < 3.114]
     [1.0]
     [0.0]
    [0.0]
  [X0 < -4.286]
   [X0 < -5.490]
    [1.0]
    [1.0]
   [X0 < -1.180]
    [X0 < -1.327]
     [0.0]
     [0.0]
    [X0 < -1.180]
     [0.0]
     [0.0]
 [X2 < -4.413]
  [X0 < 4.407]
   [X0 < 2.392]
    [X0 < 0.816]
     [1.0]
     [1.0]
    [1.0]
   [0.0]
  [X0 < 1.594]
   [X2 < -2.272]
    [X1 < 7.638]
     [1.0]
     [0.0]
    [X3 < 0.097]
     [0.0]
     [0.0]
   [X0 < 2.042]
    [X2 < -2.339]
     [1.0]
     [0.0]
    [X0 < 3.629]
     [0.0]
     [0.0]
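The SettingWithCopyWarning in the output above comes from assigning a new column to a DataFrame that is a slice of another frame. A common fix is to take an explicit `.copy()` before adding the helper column; a sketch of the pandas pattern (stand-in data, not a patch to `decision_tree.py`):

```python
import pandas as pd

df = pd.DataFrame({'f1': [1.0, 2.0], 'target': [0, 1]})
features = df.iloc[:, :-1]           # a slice: may be a view of df
safe = features.copy()               # explicit copy: writes stay local
safe['$target$'] = df.iloc[:, -1]    # no SettingWithCopyWarning
print(list(safe.columns))  # ['f1', '$target$']
```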

Now predictions are made on the test set and the accuracy is calculated.

In [7]:
p = dt.predict(testX)

c = 0
testY.reset_index(drop=True, inplace=True)
for i, r in testY.iterrows():
    if p[i] == testY.iat[i, 0]:
        c += 1

print("Accuracy:", c / len(p))
Accuracy: 0.9709090909090909
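The counting loop above can also be written as a single vectorized comparison; a sketch with stand-in values (`p` and `testY` here are hypothetical data, not the tree's actual output):

```python
import pandas as pd

p = pd.Series([0, 1, 1, 0])              # stand-in predictions
testY = pd.DataFrame({4: [0, 1, 0, 0]})  # stand-in labels

# element-wise equality gives booleans; their mean is the accuracy
accuracy = (p.to_numpy() == testY.iloc[:, 0].to_numpy()).mean()
print(accuracy)  # 0.75
```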

On this data, the decision tree is about 97% accurate, so this dataset is well suited to classification with a decision tree.
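A single 80/20 split gives only one accuracy estimate; k-fold cross-validation would average over several train/test splits for a more robust figure. A minimal index-folding helper (a hypothetical sketch, not part of this project):

```python
def kfold_indices(n, k):
    """Yield (train, test) index lists for k roughly equal folds."""
    fold = n // k
    for i in range(k):
        start = i * fold
        stop = (i + 1) * fold if i < k - 1 else n  # last fold takes the remainder
        test = list(range(start, stop))
        test_set = set(test)
        train = [j for j in range(n) if j not in test_set]
        yield train, test

folds = list(kfold_indices(1372, 5))
print(len(folds))         # 5
print(len(folds[0][1]))   # 274
print(len(folds[-1][1]))  # 276
```

Each `(train, test)` pair would be used to fit and score a fresh tree, with the mean of the fold accuracies reported.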

This project was a great way for me to learn about decision trees and to expand my knowledge of machine learning.