# Decision Trees

There might be multiple decision trees for deciding the same thing from different conditions. To decide which is best, we use Gini Impurity

$$\text{Gini Impurity}=1-(\text{the probability of Yes})^2-(\text{the Probability of No})^2$$

A weighted average should be used if the sample size is different

The lower the value the better

From a raw table of data to a decision tree:

1. Calculate all the Gini Impurity values

2. If a node itself has the lowest value, leave it as a Leaf node, don’t further separate it

3. If separating the data results in an improvement, then pick the separation with the lowest Gini impurity value

## Numeric Data

To get impurities

1. Sort the values lowest to highest

2. Calculate the average for adjacent values

3. Calculate the impurity values for each average weight

• For each average, look at the yes and no instances on the greater than and less than sections, use these for the probabilities

To Build a tree:

1. Yes/no questions at each step

2. Numeric data, like patient weight

## Ranked Data and Multiple Choice Data

Ranked Data Multiple Choices Data ## Missing data

Options for boolean:

• Choose the most common value in the column

• Find another column that has the highest correlation with the feature and use that as a guide

Options for numbers:

• Use mean

• Use linear regression with another column with a good correlation

# Random Forests

Why Random Forests:

• Decision Trees are easy to build, use and interpret, but not flexible when classifying new samples

• Random forests combine the simplicity of decision trees with flexibility for better accuracy

## How to build a random forest

Step 1 - Create a “bootstrapped” dataset:

• Same size as the original dataset

• Randomly selected samples from the original dataset

• Samples can be selected more than once

Step 2 - Build a decision tree using “bootstrapped” dataset, but only use a random subset of variables, e.g. 2

Step 3 - Go back to step 1 and repeat: make a new bootstrap dataset and build a tree considering a subset of variables at each step (ideally 100’s of times)

• Using a bootstrapped sample and considering only a subset of the variables at each steps results in a wide variety of trees

• The variety makes random forests more effective than individual Decision Trees

## How to use a random forest

• Take the data and run it down the first tree we built

• Keep track of the result

• Then run the next data down the second tree

• Then run the next data down all the trees and what the majority of the trees choose is the outcome

Bagging
Bootstrapping the data plus using the aggregate to make a decision

## Performance

Out of bag dataset
Data that was not used in the bootstrapped dataset
• Use the data that doesn’t end up in the bootstrapped dataset for testing

• Run the data on the trees and see if the outcome is correctly predicted

• Use the number that correctly predict vs incorrectly predict as the measure

• Repeat for all samples and trees

Out of bag error
The proportion of out of bag samples that were incorrectly classified