Several different decision trees can be built to decide the same thing, each splitting on different conditions. To decide which split is best, we use Gini impurity:

$$\text{Gini Impurity}=1-(\text{the probability of Yes})^2-(\text{the Probability of No})^2$$

A weighted average of the leaf impurities should be used when the leaves contain different numbers of samples

The lower the value, the better the split
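The formula above can be sketched in a few lines; the leaf representation (counts of "Yes" and "No" samples) and function names here are illustrative.

```python
# Minimal sketch of Gini impurity for a leaf, plus the weighted average
# across leaves of different sizes described above.

def gini_impurity(n_yes, n_no):
    """1 - P(Yes)^2 - P(No)^2 for a single leaf."""
    total = n_yes + n_no
    if total == 0:
        return 0.0
    p_yes = n_yes / total
    p_no = n_no / total
    return 1 - p_yes**2 - p_no**2

def weighted_gini(leaves):
    """Average of leaf impurities, weighted by leaf size."""
    total = sum(n_yes + n_no for n_yes, n_no in leaves)
    return sum((n_yes + n_no) / total * gini_impurity(n_yes, n_no)
               for n_yes, n_no in leaves)

# A pure leaf has impurity 0; a 50/50 leaf has the maximum, 0.5.
print(gini_impurity(10, 0))   # 0.0
print(gini_impurity(5, 5))    # 0.5
```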

From a raw table of data to a decision tree:

Calculate the Gini impurity for every candidate split

If the node itself already has the lowest impurity, leave it as a leaf node and don't split it further

If splitting the data improves the impurity, pick the split with the lowest Gini impurity value

To get impurities for a numeric feature:

Sort the values lowest to highest

Calculate the average of each pair of adjacent values

Calculate the impurity of a split at each of those averages

- For each average, count the yes and no instances in the less-than and greater-than groups, and use those counts for the probabilities
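The threshold-scanning steps above can be sketched as follows; the data and function names are illustrative, and labels are booleans (yes/no).

```python
# Sketch: sort a numeric feature, average adjacent values, and score a
# split at each average using the weighted Gini impurity.

def gini_impurity(n_yes, n_no):
    total = n_yes + n_no
    if total == 0:
        return 0.0
    return 1 - (n_yes / total) ** 2 - (n_no / total) ** 2

def best_numeric_split(values, labels):
    """Return (threshold, weighted_gini) for the best 'value < threshold' split."""
    pairs = sorted(zip(values, labels))
    best = None
    for i in range(len(pairs) - 1):
        threshold = (pairs[i][0] + pairs[i + 1][0]) / 2
        left = [lab for v, lab in pairs if v < threshold]
        right = [lab for v, lab in pairs if v >= threshold]
        score = (len(left) * gini_impurity(left.count(True), left.count(False))
                 + len(right) * gini_impurity(right.count(True), right.count(False))
                 ) / len(pairs)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best

# Example: patient weights with a clean class boundary between 180 and 205
print(best_numeric_split([155, 180, 205, 220], [False, False, True, True]))
# (192.5, 0.0) — a perfect split
```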

To build a tree, each node asks a question about the data. Features can be:

Yes/no (boolean) questions

Numeric data, like patient weight

Ranked data

Multiple-choice data

Options for filling in missing boolean values:

Choose the most common value in the column

Find another column that correlates most strongly with the feature and use it as a guide

Options for filling in missing numeric values:

Use the column mean

Fit a linear regression against another well-correlated column and predict the missing value
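The two simple imputation options (most common value for booleans, mean for numbers) can be sketched with the standard library; the column contents here are made up.

```python
# Sketch of simple missing-value imputation, with None marking a gap.

from statistics import mean, mode

def impute_boolean(column):
    """Replace None with the most common value in the column."""
    fill = mode(v for v in column if v is not None)
    return [fill if v is None else v for v in column]

def impute_numeric(column):
    """Replace None with the mean of the observed values."""
    fill = mean(v for v in column if v is not None)
    return [fill if v is None else v for v in column]

print(impute_boolean([True, True, None, False]))   # [True, True, True, False]
print(impute_numeric([160.0, 180.0, None]))        # [160.0, 180.0, 170.0]
```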

Why Random Forests:

Decision trees are easy to build, use, and interpret, but they are not flexible when classifying new samples

Random forests combine the simplicity of decision trees with flexibility for better accuracy

**Step 1** - Create a “bootstrapped” dataset:

Same size as the original dataset

Randomly selected samples from the original dataset

Samples can be selected more than once

**Step 2** - Build a decision tree using the “bootstrapped” dataset, but
only consider a random subset of variables (e.g. 2) at each split

**Step 3** - Go back to step 1 and repeat: make a new bootstrap dataset
and build a tree considering a subset of variables at each step (ideally
100’s of times)
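Steps 1 and 2 can be sketched with the standard library's `random` module; the rows and variable names below are illustrative.

```python
# Sketch: draw a bootstrapped dataset (rows sampled with replacement,
# same size as the original) and a random subset of candidate variables.

import random

def bootstrap(rows):
    """Same size as the original; rows drawn with replacement."""
    return [random.choice(rows) for _ in rows]

def random_variables(all_vars, k=2):
    """Random subset of k candidate variables for a split."""
    return random.sample(all_vars, k)

rows = [("yes", 167), ("no", 180), ("yes", 210), ("no", 125)]
print(bootstrap(rows))   # some rows repeat, some are left out
print(random_variables(["chest_pain", "blood_circ", "blocked_arteries", "weight"]))
```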

Using a bootstrapped sample and considering only a subset of the variables at each step results in a wide variety of trees

The variety makes random forests more effective than individual Decision Trees

To classify a new sample, run it down the first tree we built

Keep track of the result

Then run the same sample down the second tree

Repeat for all the trees; whatever the majority of the trees choose is the outcome
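The majority vote over trees can be sketched as below; the "trees" are stubbed out as plain functions here (in practice each would be a decision tree trained on its own bootstrapped dataset), and all names are illustrative.

```python
# Sketch: run one sample down every tree and take the majority vote.

from collections import Counter

def forest_predict(trees, sample):
    votes = Counter(tree(sample) for tree in trees)
    return votes.most_common(1)[0][0]

# Three stub "trees" voting on whether a patient has heart disease
trees = [
    lambda s: s["weight"] > 180,
    lambda s: s["blocked_arteries"],
    lambda s: s["chest_pain"],
]
sample = {"weight": 170, "blocked_arteries": True, "chest_pain": True}
print(forest_predict(trees, sample))   # True (2 of 3 trees vote yes)
```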

Bagging

Bootstrapping the data plus aggregating the trees' votes to make a decision (bootstrap + aggregating = bagging)

Out of bag dataset

Data that was not used in the bootstrapped dataset

Use the data that doesn’t end up in the bootstrapped dataset for testing

Run each out-of-bag sample through the trees that were built without it and see if the outcome is correctly predicted

Use the proportion of correct vs. incorrect predictions as the measure

Repeat for all samples and trees

Out of bag error

The proportion of out of bag samples that were incorrectly classified
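The out-of-bag error can be sketched as follows; the trees are stubbed as plain functions and the bookkeeping (`in_bag` sets of sample indices) is an assumed representation, not a standard API.

```python
# Sketch: for each sample, only the trees that did NOT see it during
# bootstrapping vote on it; the OOB error is the fraction of samples
# that this vote misclassifies.

from collections import Counter

def oob_error(samples, labels, trees, in_bag):
    """in_bag[i] is the set of sample indices tree i was trained on."""
    wrong = 0
    counted = 0
    for i, (sample, label) in enumerate(zip(samples, labels)):
        votes = Counter(tree(sample) for tree, bag in zip(trees, in_bag)
                        if i not in bag)
        if not votes:
            continue  # every tree saw this sample; skip it
        counted += 1
        if votes.most_common(1)[0][0] != label:
            wrong += 1
    return wrong / counted

# Tiny illustrative setup: each tree misses one sample from its bootstrap
trees = [lambda s: s > 5, lambda s: s > 4]
in_bag = [{0}, {1}]        # tree 0 trained on sample 0, tree 1 on sample 1
samples = [4, 6]
labels = [False, True]
print(oob_error(samples, labels, trees, in_bag))   # 0.0
```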