Theory

Building Regression Trees

  1. What does it mean to segment the data? What do we call this in MATH 456?
    Splitting observations into different groups (e.g., freshmen, sophomores, juniors, seniors). In MATH 456, we call this stratification.

  2. Are tree based methods supervised or unsupervised methods?
    They are supervised because there is an outcome variable (y).

  3. What does CART stand for?
    Classification and Regression Trees

  4. What determines if a record goes to the left or right of a split? (i.e. what records go left, what go right?)
    Each record is checked against the splitting condition at that node (for example, is the predictor value less than the cutpoint?); records that satisfy the condition go to the left branch and the rest go to the right.

  5. What do the values in the internal and terminal nodes represent?
    The internal nodes show the variable and cutpoint used to partition the data at that point, and the terminal nodes show the predicted values (the mean outcome of the training observations that end up in that region).

  6. How can you tell which variables are most ‘important’ in a tree model?
    The most important variable was Years because it was used for the first partition. When building the tree, each split is chosen to minimize the residual sum of squares, so the first split gives the largest improvement in fit. In general, the higher up a variable appears in the tree, the more important it is.

  7. What is the feature space? Compare this to the term predictor space. Do this in your own words. You may want to look into other sources of information (outside the book) for something that fits into your mental model.
    Feature space and predictor space mean the same thing: the set of all possible values the x variables (the predictors) can take on. The features are the measured characteristics of each observation, so the feature space is every combination of values those characteristics could take.

  8. Explain how recursive binary splitting works without using equations.
    This is a top-down, greedy approach to building a tree. It begins at the top (or root) of the tree and builds down to the terminal nodes. At each step, the ‘best’ split is made for that particular step (instead of looking ahead to pick a potentially better split later). A small R sketch of how a single split is chosen appears after this list.

  9. Explain why this process has this name. That is, what about this process is recursive? (don’t use the word in the definition either), why binary?
    The process repeatedly performs the same action (splitting the data to get the lowest RSS) on each resulting region until a stopping criterion is reached. It is binary because every split divides the observations into exactly two groups.

  10. Why is it considered top-down and greedy?
    It starts at the top of the tree and splits the observations one node at a time working downward (top down). It's greedy because it doesn't search for the split that would produce the best overall tree; it only finds the best split at the immediate location where it is looking.

  11. What is a stopping criterion? Give an example of one.
    A stopping criterion tells the algorithm when to stop. For example, we need to know when to stop splitting the data, otherwise it will split all the way down to 1 observation per group. Examples for recursive binary splitting are stopping when the reduction in RSS falls below a specific number, or stopping once each terminal node would contain fewer than some minimum number of observations (e.g., 5).

  12. Why are decision trees considered a statistical learning model? Where does the algorithmic learning occur?
    Each split adds a rule in classifying the outcome of a group of x’s. By using these rules (following each branch of a tree), the decision tree is able to make classifications. There is no human input telling the computer what rules to make.
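
To make the recursive binary splitting idea from question 8 concrete, here is a minimal R sketch of what happens at a single node for one numeric predictor: every candidate cutpoint is tried and the one with the smallest total RSS is kept. This is only an illustration, not the implementation used by rpart.

best_split <- function(x, y) {
  # candidate cutpoints: midpoints between consecutive observed values of x
  cuts <- sort(unique(x))
  cuts <- (head(cuts, -1) + tail(cuts, -1)) / 2
  # total RSS of the two resulting groups for each candidate cutpoint
  rss <- sapply(cuts, function(s) {
    left  <- y[x <  s]
    right <- y[x >= s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  cuts[which.min(rss)]   # the cutpoint that minimizes the total RSS
}
best_split(mtcars$hp, mtcars$mpg)   # e.g., the best single split of mpg on horsepower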

Pruning Trees

  1. How are predictions of new values made?
    You follow the splits down the tree until you reach a terminal node; the prediction is the average of the outcome for all the training records that fall in that terminal node (i.e., all the records with that specific pattern of predictor values).

  2. Why is pruning necessary? Why can’t we just build a short tree by stopping early?
    There would be a lot of overfitting in models, which would be tuned to that specific data. Stopping too early prevents us from exploring certain branches which may lead to a better split later. This is fixed by creating a larger tree and going back to prune later.

  3. How is Cost complexity pruning similar to LASSO?
    There is a minimized sum of squares with a penalty term. We want the size of the tree to be small to prevent overfitting.

  4. What is the penalty function? Report this in symbols, not words, but define each symbol in words.
    The penalty function is \(\boldsymbol{\alpha |T|}\), where \(\boldsymbol{|T|}\) is the number of terminal nodes in the tree and \(\boldsymbol{\alpha}\) is the tuning parameter that balances the fit against the size of the tree.

  5. How is \(\boldsymbol{\alpha}\) chosen?
    Using a validation set or cross-validation/resampling: we fit the model for a grid of candidate values of the tuning parameter, and the \(\boldsymbol{\alpha}\) with the lowest validation (or cross-validated) MSE is the optimized value. A short rpart-based sketch of choosing and pruning to the best value appears after this list.

  6. What information do you get out of Figure 8.5? If you were to build this plot as part of your analysis (just the black and green lines), what information can you draw from it?
    We can see the point where the mean squared error is minimized in the training set and in the cross-validation set. The training error keeps getting better as the tree grows because the tree is being fit specifically to that data, but the cross-validation MSE is minimized at around 3 splits and then starts rising because of overfitting.
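
Here is the rpart sketch referenced in question 5: grow a deliberately large tree, look at the cross-validated error for each value of the complexity parameter (rpart's cp plays the role of \(\boldsymbol{\alpha}\), rescaled), and prune back to the value with the smallest cross-validated error. The formula and data frame df are placeholders, not the wine data used later.

library(rpart)
full.tree <- rpart(y ~ ., data = df, control = rpart.control(cp = 0, minsplit = 5))  # no complexity penalty while growing
printcp(full.tree)                 # cross-validated error (xerror) for each cp value
best.cp <- full.tree$cptable[which.min(full.tree$cptable[, "xerror"]), "CP"]
pruned.tree <- prune(full.tree, cp = best.cp)   # prune back to the best subtree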

Classification Trees

  1. How do you make predictions from a classification tree?
    Follow the splits down to a terminal node; the predicted class is the most commonly occurring class among the training observations in that node.

  2. What is the classification error rate?
    The classification error rate is the fraction of observations in a particular training region that do not belong to the most common class in that region.

  3. What does node purity mean? Why is it important (it may be helpful to read through the example before answering this question).
    Node purity is the measure of how “pure” a node is. For example, if every observation that follows a specific path down the tree can only be classified as one outcome, rather than a mix of outcomes, then we say that node is a pure node.

  4. Is a large or small Gini index preferred?
    A small Gini index is preferred because we are looking for “pure” nodes, and the smaller the number, the more one class dominates the others in that branch. For a completely pure node this number is zero (a small sketch of the calculation appears after this list).

  5. If prediction accuracy is the goal of tree building, what measure should be used to prune the tree?
    The classification error rate, since it directly measures prediction accuracy; the Gini index and cross-entropy are more sensitive to node purity and are better suited to growing the tree.
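
To illustrate node purity from question 4, here is a tiny helper (hypothetical, just for intuition) that computes the Gini index of a single node from its class labels:

gini <- function(classes) {
  p <- table(classes) / length(classes)   # proportion of each class in the node
  sum(p * (1 - p))                        # Gini index: 0 for a perfectly pure node
}
gini(c("A", "A", "A", "A"))   # 0   -- pure node
gini(c("A", "A", "B", "B"))   # 0.5 -- most impure possible two-class node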

Advantages and Disadvantages of Trees

  1. Name one advantage and disadvantage of trees when compared to classical methods.
    One advantage to using trees as opposed to classical methods is that they mimic a normal decision-making process, so they are easier to understand and interpret. One disadvantage is that they don't perform as well as many of the regression models that we use today.

Bagging, Random Forests, Boosting (ISLR 8.2)

Bagging

  1. Bagging is a general-purpose procedure for reducing (bias or variance?) of a statistical learning method
    Variance

  2. Explain how bagging works.
    We use the bootstrap to repeatedly sample with replacement from our original sample. For each of these bootstrap samples, we build a tree (without pruning it). At the end, we take the average of the predictions from each tree and use that as our prediction. A short sketch appears after this list.

  3. What is a majority vote in the context of classification?
    Whichever class appears the most in a given prediction. This will be used as the overall prediction for the node.

  4. What is an out of bag error estimate? How does it compare to k-fold CV/LOOCV?
    For a given observation, we take the predictions from all the trees whose bootstrap samples did not include that observation (each bootstrap sample uses roughly 2/3 of the data, so about 1/3 of the trees are “out of bag” for any observation) and average them, or take a majority vote. When B is large, this out-of-bag error is essentially equivalent to leave-one-out cross-validation, but it is obtained without any extra model fitting.
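
Here is the sketch referenced in question 2: a minimal bagging fit with its out-of-bag error, using the randomForest package (a random forest that considers every predictor at every split is exactly bagging). The data frame df with factor response y is a placeholder.

library(randomForest)
bag.fit <- randomForest(y ~ ., data = df,
                        mtry = ncol(df) - 1,   # consider all p predictors at each split
                        ntree = 500)
bag.fit$err.rate[500, "OOB"]   # out-of-bag error estimate after 500 trees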

Random Forests

  1. What’s the main difference between building a Random Forest vs simply bagged trees?
    At each split we pick a random subset of the predictors (m, typically about the square root of p) to consider instead of all of the predictors as in bagging. Because there are lots of different trees, this works out well: every variable still gets plenty of chances to be used, and the trees are less correlated with each other.

  2. How is the Gini index used to generate a summary measure of importance for each predictor in a Random Forest?
    For every tree in the random forest, we record the total decrease in the Gini index caused by splits on a given predictor. These decreases are summed over all the trees and divided by the total number of trees to get an average decrease for each predictor. Predictors with a large average decrease are the important ones; only the relative sizes matter, so the scale of this average is irrelevant.

  3. In the gene expression example (video), what criteria do they use to pre-screen variables to go into the model, and why?
    They chose the genes having the highest variance across the training set, because a gene with very little variance is probably not very predictive. This pre-screening is allowed because it is unsupervised rather than supervised (it never looks at the outcome), so it does not bias the error estimates.

  4. Why can we say that Random Forests is equivalent to bagging trees when m = p?
    When m = p, every split considers all of the possible predictors instead of a random subset, which is exactly what bagging does.

  5. Can you overfit a Random Forest model by constructing too many trees?
    No; adding more trees does not cause a random forest to overfit. At some point, adding more trees will stop improving the error, but it will never hurt the error rate.

Boosting

  1. How is boosting a sequential method?
    Each new tree is added to improve on the performance of the trees already in the ensemble: it is fit to the residuals left over by the current collection of trees, so each tree depends on the ones built before it.

  2. In boosting, why are the trees grown on the residuals?
    Each tree is grown on the residuals left over from the previously added trees. That way, each additional tree helps to correct the errors left over from the previous collection of trees. To avoid overfitting, we learn slowly by shrinking each new tree before adding it to the ensemble; small trees fit this way are very effective.

  3. What are the three tuning parameters for boosting?

    1. Number of trees: The number of trees we create with this process
    2. Number of splits in each tree (d): The number of times a tree is split
    3. Shrinkage parameter (\(\boldsymbol\lambda\)): The fraction by which we shrink back each tree before adding it (0.01 or 0.001 are most common). A short sketch using all three parameters follows this list.
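
Here is the sketch referenced above: a minimal gbm call showing where each of the three tuning parameters goes. The data frame df with numeric response y is a placeholder, and the specific values are only illustrative.

library(gbm)
boost.fit <- gbm(y ~ ., data = df,
                 distribution = "gaussian",
                 n.trees = 5000,          # 1. number of trees (B)
                 interaction.depth = 2,   # 2. number of splits in each tree (d)
                 shrinkage = 0.01,        # 3. shrinkage parameter (lambda)
                 cv.folds = 5)            # cross-validation to choose how many trees to keep
best.B <- gbm.perf(boost.fit, method = "cv")   # CV-estimated optimal number of trees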

Application

The goal of this project is to use decision trees and random forests to make accurate models to predict the class of wine based on various ingredients and characteristics.

Data description

The dataset I will be using is from the UCI Machine Learning Repository and involves classifying wines based on measurements of their chemical constituents. It has 14 variables (the class plus 13 measurements) and 177 observations. The original dataset has 3 classes of wines (labeled Class1, Class2, and Class3). No other descriptions are given about the classes themselves.

Data preparation

First, let’s read in our data and load the packages we will use.

library(dplyr)
library(ggplot2)
library(rpart)
library(caret)
library(pander)   # used below to format the results tables
data <- read.csv('http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data', stringsAsFactors = FALSE)

Let’s change the column names to make them easier to understand.

colnames(data) <- c("Class", "Alcohol", "Malic_Acid", "Ash", "Alcalinity_of_Ash", "Magnesium", "Total_phenols", "Flavanoids", "Nonflavanoid_Phenols", "Proanthocyanins", "Color_intensity", "Hue", "OD280/OD315", "Proline")
data$Class <- factor(data$Class, labels=(c("Class1", "Class2", "Class3")))
colnames(data) <- make.names(colnames(data))
colnames(data)
##  [1] "Class"                "Alcohol"              "Malic_Acid"          
##  [4] "Ash"                  "Alcalinity_of_Ash"    "Magnesium"           
##  [7] "Total_phenols"        "Flavanoids"           "Nonflavanoid_Phenols"
## [10] "Proanthocyanins"      "Color_intensity"      "Hue"                 
## [13] "OD280.OD315"          "Proline"

Making, Plotting, and Pruning a Decision Tree

First, we create a training and testing set

set.seed(12321)
train.idx <- createDataPartition(data$Class, p=.6, list = FALSE)
train.dta <- data[train.idx,]
test.dta <- data[-train.idx,]

fitControl <- trainControl(method="cv", number = 10)
preproc <- c("center", "scale")

Building the Tree

Using the training set, we will now build the tree.

tree1 <- train(Class ~ .,
              data=train.dta,
              method="rpart",
              tuneLength = 3,
              trControl = fitControl, 
              preProcess = preproc)
t1.out <- tree1$results
pander(t1.out)
  cp       Accuracy   Kappa    AccuracySD   KappaSD
  0.0625   0.867      0.8002   0.1562       0.2314
  0.3281   0.8038     0.7013   0.1627       0.2465
  0.5156   0.4654     0.1149   0.1019       0.1898

This table shows how well our tree does at each of 3 candidate values of the complexity parameter cp (3 is the default value of tuneLength, which controls how many candidate values are tried).

The complexity parameter cp is the tuning parameter used for pruning. By default caret tries 3 candidate values, but here I'll change tuneLength to 7 to evaluate more of them.

tree2 <- train(Class ~ .,
              data=train.dta,
              method="rpart",
              tuneLength= 7,
              trControl = fitControl, 
              preProcess = preproc)
t2.out <- tree2$results
pander(t2.out)
  cp        Accuracy   Kappa    AccuracySD   KappaSD
  0         0.8961     0.8439   0.05495      0.08151
  0.08594   0.8877     0.8305   0.05773      0.08698
  0.1719    0.8677     0.8006   0.08238      0.123
  0.2578    0.8677     0.8006   0.08238      0.123
  0.3438    0.7995     0.6927   0.1529       0.2341
  0.4297    0.7473     0.6051   0.1804       0.285
  0.5156    0.4588     0.1183   0.1182       0.2033

We can see how increasing tuneLength to 7 increases the number of candidate cp values (rows) in this table.
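
If we wanted to control which cp values are tried rather than letting caret pick them, we could pass an explicit grid instead of tuneLength. This is just a sketch; the particular sequence of cp values here is my own choice, not something used in the analysis above.

tree.grid <- train(Class ~ .,
                   data = train.dta,
                   method = "rpart",
                   tuneGrid = data.frame(cp = seq(0, 0.5, length.out = 7)),  # explicit grid of cp values
                   trControl = fitControl,
                   preProcess = preproc)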

Plotting the Tree

I want to see the results for each of these tuning grids. This first plot is from the model tuned over the default 3 candidate values of the complexity parameter.

plot(tree1)

We can see how evaluating only 3 candidate cp values causes this line graph to have a very long drop, from about 0.80 down to about 0.47. There are no measurements between the cp values of 0.33 and 0.52, and only one measurement before that, so the curve changes very abruptly.
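
To see which cp value caret actually selected for the final model, we could print the stored best tuning value (a quick check, not part of the original output):

tree1$bestTune   # the cp value with the highest cross-validated accuracy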

Here is a simple form of what the tree looks like.

plot(tree1$finalModel,  uniform=TRUE)
text(tree1$finalModel, use.n = TRUE, all=TRUE, cex=.8)

Instances with a Proline value (after centering and scaling) lower than 0.01629 will be classified as Class 1 without looking at the other attributes. Otherwise, we look at Flavanoids: if that value is less than -0.7092, the instance will be classified as Class 3, otherwise Class 2.

This next plot is from the model tuned over 7 candidate cp values.

plot(tree2)

We can see that evaluating a finer grid of cp values makes this graph look quite a bit better. We are now examining more candidate complexity parameters, so the accuracy doesn't begin to drop quickly until a cp of about 0.43.

Here is a more beautified and formatted version of what this tree would look like.

rpart.plot::rpart.plot(tree2$finalModel,  uniform=TRUE)

We can see we have more conditions in our tree from the finer tuning grid. We still have the same first and most important condition, on Proline, but now we have secondary conditions for both branches of the Proline split to consider in the classification, which will likely lead to higher accuracy.

Predicting With A Decision Tree

Here, we will use the trees we created to make predictions of the classes for our testing dataset. First is the model tuned over 3 candidate cp values.

test.dta$pred.class <- tree1 %>% predict(test.dta)
confusionMatrix(test.dta$Class, test.dta$pred.class)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Class1 Class2 Class3
##     Class1     21      2      0
##     Class2      2     26      0
##     Class3      1      6     12
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8429          
##                  95% CI : (0.7362, 0.9189)
##     No Information Rate : 0.4857          
##     P-Value [Acc > NIR] : 5.44e-10        
##                                           
##                   Kappa : 0.7569          
##                                           
##  Mcnemar's Test P-Value : 0.0719          
## 
## Statistics by Class:
## 
##                      Class: Class1 Class: Class2 Class: Class3
## Sensitivity                 0.8750        0.7647        1.0000
## Specificity                 0.9565        0.9444        0.8793
## Pos Pred Value              0.9130        0.9286        0.6316
## Neg Pred Value              0.9362        0.8095        1.0000
## Prevalence                  0.3429        0.4857        0.1714
## Detection Rate              0.3000        0.3714        0.1714
## Detection Prevalence        0.3286        0.4000        0.2714
## Balanced Accuracy           0.9158        0.8546        0.9397

There is an accuracy of 84.29% with a total of 11/70 misclassified instances.
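
As a quick check (not shown in the original output), the misclassification count and accuracy can be read off directly:

sum(test.dta$pred.class != test.dta$Class)   # number of misclassified test instances
mean(test.dta$pred.class == test.dta$Class)  # test accuracy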

Here are our results using the model tuned over the larger grid of 7 cp values.

test.dta$pred.class <- tree2 %>% predict(test.dta)
cm <- confusionMatrix(test.dta$Class, test.dta$pred.class)
dt.acc <- cm$overall[["Accuracy"]]
cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Class1 Class2 Class3
##     Class1     21      2      0
##     Class2      1     26      1
##     Class3      0      6     13
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8571          
##                  95% CI : (0.7529, 0.9293)
##     No Information Rate : 0.4857          
##     P-Value [Acc > NIR] : 9.222e-11       
##                                           
##                   Kappa : 0.7796          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Class1 Class: Class2 Class: Class3
## Sensitivity                 0.9545        0.7647        0.9286
## Specificity                 0.9583        0.9444        0.8929
## Pos Pred Value              0.9130        0.9286        0.6842
## Neg Pred Value              0.9787        0.8095        0.9804
## Prevalence                  0.3143        0.4857        0.2000
## Detection Rate              0.3000        0.3714        0.1857
## Detection Prevalence        0.3286        0.4000        0.2714
## Balanced Accuracy           0.9564        0.8546        0.9107

The overall test accuracy is 85.71%. We misclassified 1 instance from Class1, 8 instances from Class2, and 1 instance from Class3, for a total of 10/70 cases. Compared to our previous model, the accuracy increased by only 0.0142, which is not a meaningful improvement in predictive ability. However, we did classify one more instance correctly in this model than in the previous one.

Growing, Plotting and Interpreting a Random Forest

Build the Random Forest

Here, I’ll create the random forest model.

rf.tree <- train(Class ~ .,
                 data=train.dta,
                 method="rf",
                 trControl = fitControl,
                 preProcess = preproc)
rf.tree
## Random Forest 
## 
## 107 samples
##  13 predictor
##   3 classes: 'Class1', 'Class2', 'Class3' 
## 
## Pre-processing: centered (13), scaled (13) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 96, 96, 96, 96, 97, 97, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.9909091  0.9862500
##    7    0.9818182  0.9725000
##   13    0.9636364  0.9453395
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.

Plot The Accuracy

Here, I’ll plot the accuracy similar to the accuracy graph in the previous section.

plot(rf.tree)

We can see a clear trend in this graph: a higher number of randomly selected predictors yielded lower accuracy.

Plot the Important Variables

Here, I’ll plot the important variables from the reference tree.

varImp(rf.tree)
## rf variable importance
## 
##                        Overall
## Proline              100.00000
## Flavanoids            91.57635
## Alcohol               89.74064
## Color_intensity       83.83261
## Hue                   61.51626
## Total_phenols         45.94456
## OD280.OD315           45.28047
## Malic_Acid            38.14085
## Proanthocyanins       19.06022
## Nonflavanoid_Phenols   2.23826
## Alcalinity_of_Ash      0.89973
## Magnesium              0.05264
## Ash                    0.00000

This shows how important each variable was in making the predictions the random forest will make.
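
To visualize these importances instead of reading the table, one option (a sketch, not part of the original analysis) is caret's plot method for variable importance:

plot(varImp(rf.tree), top = 10)   # dotplot of the 10 most important predictors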

Testing

Now, we will use the test data on our random forest model to see how accurately it can classify instances using data it hasn’t seen.

test.dta$pred2.class <- rf.tree %>% predict(test.dta)
cm <- confusionMatrix(test.dta$Class, test.dta$pred2.class)
rf.acc <- cm$overall[['Accuracy']]
cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Class1 Class2 Class3
##     Class1     23      0      0
##     Class2      1     26      1
##     Class3      0      0     19
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9714          
##                  95% CI : (0.9006, 0.9965)
##     No Information Rate : 0.3714          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9568          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Class1 Class: Class2 Class: Class3
## Sensitivity                 0.9583        1.0000        0.9500
## Specificity                 1.0000        0.9545        1.0000
## Pos Pred Value              1.0000        0.9286        1.0000
## Neg Pred Value              0.9787        1.0000        0.9804
## Prevalence                  0.3429        0.3714        0.2857
## Detection Rate              0.3286        0.3714        0.2714
## Detection Prevalence        0.3286        0.4000        0.2714
## Balanced Accuracy           0.9792        0.9773        0.9750

We can see here that the accuracy has increased substantially by moving from the decision tree model to the random forest model, from 85.7% to 97.1%. We misclassified one instance from Class 1 and one instance from Class 3.

Compared to previous models

Here, I’ll create a table to compare these models with some other models that worked with three classes.

results <- data.frame(Model = c("DecisionTree", "RandomForest", "LDA-RepeatedCV", "LDA-CV"),
                      # rep.cv.acc and cv.acc were computed for the LDA models in the previous assignment
                      Accuracy = c(dt.acc, rf.acc, rep.cv.acc, cv.acc))
results <- mutate(results, Test_Error = 1 - Accuracy)
results <- results %>% arrange(desc(Accuracy))
pander(results)
  Model            Accuracy   Test_Error
  LDA-RepeatedCV   0.9813     0.01871
  LDA-CV           0.9805     0.01952
  RandomForest     0.9714     0.02857
  DecisionTree     0.8571     0.1429

Because I had to collapse my classes in the previous assignment in order to get many of those models to work, I only had two models that I could compare with. It looks like the LDA with repeated cross-validation still has the highest predictive accuracy for this data, followed by regular LDA with cross-validation. The random forest performs almost as well as the LDA models, while the single decision tree does not perform very well in comparison.