Differences between LDA and QDA

  1. If the Bayes decision boundary is linear, do we expect LDA or QDA to perform better on the training set? On the test set?
    QDA should perform better on the training set, because its higher flexibility lets it fit the training data more closely. On the test set, however, we expect LDA to perform better: when the true boundary is linear, QDA's extra flexibility buys no reduction in bias and only adds variance, so it tends to overfit.

  2. If the Bayes decision boundary is non-linear, do we expect LDA or QDA to perform better on the training set? On the test set?
    We expect QDA to perform better on both the training and test sets, since its quadratic boundary can approximate the non-linear Bayes boundary while LDA's linear boundary cannot.

  3. In general, as the sample size n increases, do we expect the test prediction accuracy of QDA relative to LDA to improve, decline, or be unchanged? Why?
    As the sample size increases, we expect the test accuracy of QDA relative to LDA to improve. QDA's main drawback is its higher variance, which shrinks as more data become available, while its lower bias (it does not assume a common covariance matrix across classes) remains an advantage.

  4. True or False: Even if the Bayes decision boundary for a given problem is linear, we will probably achieve a superior test error rate using QDA rather than LDA because QDA is flexible enough to model a linear decision boundary. Justify your answer.
    False. When the Bayes boundary is linear, QDA's additional flexibility buys no reduction in bias; estimating a separate covariance matrix for each class only adds variance. LDA will therefore tend to achieve a lower test error rate, especially when the training set is small. With a very large sample the two may perform similarly, but QDA is unlikely to be superior.
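The intuition behind these answers can be checked with a quick simulation (not part of the original assignment); the class means, shared identity covariance, and sample sizes below are illustrative choices, and exact numbers will vary with the seed.

```r
# Sketch: linear Bayes boundary (equal covariances), so LDA's assumptions
# hold exactly and QDA's extra flexibility only adds variance.
library(MASS)

set.seed(1)
make_data <- function(n) {
  y <- factor(rep(c("A", "B"), each = n / 2))
  x <- rbind(mvrnorm(n / 2, mu = c(0, 0), Sigma = diag(2)),  # class A
             mvrnorm(n / 2, mu = c(1, 1), Sigma = diag(2)))  # class B
  data.frame(x1 = x[, 1], x2 = x[, 2], y = y)
}
tr <- make_data(100)    # small training set
te <- make_data(5000)   # large test set

lda.err <- mean(predict(lda(y ~ x1 + x2, data = tr), te)$class != te$y)
qda.err <- mean(predict(qda(y ~ x1 + x2, data = tr), te)$class != te$y)
c(lda = lda.err, qda = qda.err)  # QDA's test error is typically slightly higher
```

Increasing the training-set size in this sketch should shrink the gap between the two methods, in line with the answer to question 3.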

ISLR Ch 4 # 10a-h

Data Introduction

A data frame with 1089 weekly observations on the following 9 variables:

  - Year: The year that the observation was recorded
  - Lag1: Percentage return for previous week
  - Lag2: Percentage return for 2 weeks previous
  - Lag3: Percentage return for 3 weeks previous
  - Lag4: Percentage return for 4 weeks previous
  - Lag5: Percentage return for 5 weeks previous
  - Volume: Volume of shares traded (average number of daily shares traded, in billions)
  - Today: Percentage return for this week
  - Direction: A factor with levels Down and Up indicating whether the market had a positive or negative return on a given week

  1. Let’s first see which variables have the highest correlation.
library(ISLR)    # provides the Weekly data set
data <- Weekly
cor(data[,-9])
##               Year         Lag1        Lag2        Lag3         Lag4
## Year    1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
## Lag1   -0.03228927  1.000000000 -0.07485305  0.05863568 -0.071273876
## Lag2   -0.03339001 -0.074853051  1.00000000 -0.07572091  0.058381535
## Lag3   -0.03000649  0.058635682 -0.07572091  1.00000000 -0.075395865
## Lag4   -0.03112792 -0.071273876  0.05838153 -0.07539587  1.000000000
## Lag5   -0.03051910 -0.008183096 -0.07249948  0.06065717 -0.075675027
## Volume  0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
## Today  -0.03245989 -0.075031842  0.05916672 -0.07124364 -0.007825873
##                Lag5      Volume        Today
## Year   -0.030519101  0.84194162 -0.032459894
## Lag1   -0.008183096 -0.06495131 -0.075031842
## Lag2   -0.072499482 -0.08551314  0.059166717
## Lag3    0.060657175 -0.06928771 -0.071243639
## Lag4   -0.075675027 -0.06107462 -0.007825873
## Lag5    1.000000000 -0.05851741  0.011012698
## Volume -0.058517414  1.00000000 -0.033077783
## Today   0.011012698 -0.03307778  1.000000000

We can see that the correlation between each variable and itself is 1, as expected. The correlations among the lag variables (Lag1 through Lag5) are all very close to 0. The only notably strong correlation is between Year and Volume.

Let’s plot those variables to see the correlation.

library(ggplot2)
ggplot(data=data, aes(x=Year, y=Volume)) + geom_point()

The plot confirms the strong positive correlation between Year and Volume: the average number of shares traded has increased steadily over time.

Modeling and Predicting

  1. First, let’s build a glm model to predict Direction based on the variables Lag1:Lag5 and Volume.
glm.fits <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, family = 'binomial', data=data)
summary(glm.fits)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = "binomial", data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6949  -1.2565   0.9913   1.0849   1.4579  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.26686    0.08593   3.106   0.0019 **
## Lag1        -0.04127    0.02641  -1.563   0.1181   
## Lag2         0.05844    0.02686   2.175   0.0296 * 
## Lag3        -0.01606    0.02666  -0.602   0.5469   
## Lag4        -0.02779    0.02646  -1.050   0.2937   
## Lag5        -0.01447    0.02638  -0.549   0.5833   
## Volume      -0.02274    0.03690  -0.616   0.5377   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1486.4  on 1082  degrees of freedom
## AIC: 1500.4
## 
## Number of Fisher Scoring iterations: 4

We can see that Lag2 (the percentage return two weeks previous) is the only statistically significant predictor, with a positive coefficient: a positive return two weeks ago is associated with the market moving up this week. That said, a p-value of 0.0296 is only moderate evidence, especially with six predictors being tested.

  2. Let’s evaluate the model on the training data to see how it performs.
glm.probs = predict(glm.fits, type="response")
head(glm.probs)
##         1         2         3         4         5         6 
## 0.6086249 0.6010314 0.5875699 0.4816416 0.6169013 0.5684190
Direction <- data$Direction
contrasts(Direction)
##      Up
## Down  0
## Up    1

Using the contrasts function, we can see that the model predicts the probability of Up: higher predicted values correspond to the market going up, and lower values to it going down.

We need to turn these predicted probabilities into the values “Up” and “Down” to obtain class predictions. Then we can build the confusion matrix.

glm.pred <- rep("Down", 1089 )
glm.pred[glm.probs > .5] <- "Up"
table(glm.pred, Direction)
##         Direction
## glm.pred Down  Up
##     Down   54  48
##     Up    430 557

We can see that our model correctly identified downtrends in the stock market 54 times and uptrends 557 times, for a total of 611/1089 correct predictions, or 56.1% accuracy. This means that our training error rate is about 43.9%. It is also notable that our model predicts uptrends far more often than downtrends.
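Rather than tallying the table by hand, the accuracy and error rate can be computed directly from the confusion matrix; a small self-contained sketch using the counts reported above:

```r
# Confusion matrix from the table above: rows = predicted, cols = actual.
conf <- matrix(c(54, 430, 48, 557), nrow = 2,
               dimnames = list(pred = c("Down", "Up"),
                               actual = c("Down", "Up")))
acc <- sum(diag(conf)) / sum(conf)   # correct predictions / total
round(c(accuracy = acc, error = 1 - acc), 3)  # 0.561, 0.439
```

The same `sum(diag(...)) / sum(...)` pattern works for any of the confusion matrices below.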

  3. Now, let’s split the data into training and testing sets, using the data from 1990 to 2008 as the training set and the remaining years (2009 and 2010) as the testing set.
train <- data$Year < 2009
Weekly.2008 <- data[!train,]          # testing set: 2009-2010
Direction.2008 <- Direction[!train]

Now, we can create our model. We will also use only Lag2 to predict Direction.

glm.fits2 <- glm(Direction ~ Lag2, family='binomial', data=data, subset=train)
summary(glm.fits2)
## 
## Call:
## glm(formula = Direction ~ Lag2, family = "binomial", data = data, 
##     subset = train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.536  -1.264   1.021   1.091   1.368  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.20326    0.06428   3.162  0.00157 **
## Lag2         0.05810    0.02870   2.024  0.04298 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1354.7  on 984  degrees of freedom
## Residual deviance: 1350.5  on 983  degrees of freedom
## AIC: 1354.5
## 
## Number of Fisher Scoring iterations: 4

In this model summary, Lag2 again has a positive, statistically significant coefficient, and the estimate (0.058) is nearly identical to the one in the full model.

Now let’s use this model to make predictions and compare it to our other model.

glm.probs2 <- predict(glm.fits2, Weekly.2008, type="response")
glm.pred <- rep("Down", 104)
glm.pred[glm.probs2 > .5] <- "Up"
table(glm.pred, Direction.2008)
##         Direction.2008
## glm.pred Down Up
##     Down    9  5
##     Up     34 56
glm.acc <- mean(glm.pred == Direction.2008)

When running this model on our test set, the confusion matrix shows that it correctly identifies downtrends in the stock market 9 times and uptrends 56 times, for a total of 65/104 correct, or a 62.5% accuracy rate. The corresponding test error rate of 37.5% is a genuine out-of-sample result, and it still compares favorably with the 56.1% in-sample accuracy of the full six-predictor model.

  4. Let’s try a new model: LDA.
library(MASS)
lda.fit <- lda(Direction~Lag2, data=Weekly, subset=train)
lda.fit
## Call:
## lda(Direction ~ Lag2, data = Weekly, subset = train)
## 
## Prior probabilities of groups:
##      Down        Up 
## 0.4477157 0.5522843 
## 
## Group means:
##             Lag2
## Down -0.03568254
## Up    0.26036581
## 
## Coefficients of linear discriminants:
##            LD1
## Lag2 0.4414162

The group means suggest that in weeks when the market went down, the average return two weeks previous was slightly negative, while in weeks when the market went up, it was positive.

lda.pred <- predict(lda.fit, Weekly.2008)
lda.class <- lda.pred$class
table(lda.class, Direction.2008)
##          Direction.2008
## lda.class Down Up
##      Down    9  5
##      Up     34 56
lda.acc <- mean(lda.class == Direction.2008)

We can see that this model correctly identifies 9 downtrends and 56 uptrends, for a total of 65/104 correct: a 62.5% accuracy rate and a 37.5% test error rate, matching the logistic regression model.

  5. Let’s try the QDA model.
qda.fit <- qda(Direction~Lag2, data=Weekly, subset=train)
qda.fit
## Call:
## qda(Direction ~ Lag2, data = Weekly, subset = train)
## 
## Prior probabilities of groups:
##      Down        Up 
## 0.4477157 0.5522843 
## 
## Group means:
##             Lag2
## Down -0.03568254
## Up    0.26036581

QDA estimates the same prior probabilities and group means as LDA. The difference is that it also fits a separate covariance (here, a separate variance of Lag2) for each class, which is why no linear discriminant coefficients are reported.

qda.class <- predict(qda.fit, Weekly.2008)$class
table(qda.class, Direction.2008)
##          Direction.2008
## qda.class Down Up
##      Down    0  0
##      Up     43 61
qda.acc <- mean(qda.class == Direction.2008)

This model has an obvious problem: it never predicts Down, classifying every week as Up. It is still correct 61/104 times (58.7%) simply because the market went up in most test weeks, giving a test error rate of 41.3%.

  6. Now we will build one last model: K-Nearest Neighbors.
library(class)
#Create training and testing sets for KNN
train.X = cbind(data$Lag2)[train,]
test.X = cbind(data$Lag2)[!train,]
train.Direction = Direction[train]

set.seed(5301)
knn.pred <- knn(data.frame(train.X),data.frame(test.X), train.Direction, k=1)
table(knn.pred, Direction.2008)
##         Direction.2008
## knn.pred Down Up
##     Down   21 30
##     Up     22 31
knn.acc <- mean(knn.pred == Direction.2008)

The K-Nearest Neighbors model correctly identified 21 downtrends and 31 uptrends, for a total of 52/104 correct, or 50% accuracy: no better than a coin flip.
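With k = 1, KNN is the most flexible (highest-variance) fit, so its 50% accuracy is not surprising; sweeping over several values of k usually helps. Below is a self-contained sketch on synthetic one-dimensional data standing in for Lag2 (the class means and sizes are illustrative); the document's train.X, test.X, and train.Direction could be substituted in directly.

```r
library(class)

set.seed(5301)
x <- c(rnorm(200, 0), rnorm(200, 1))           # two overlapping classes
y <- factor(rep(c("Down", "Up"), each = 200))
idx <- sample(400, 300)                        # 300 train / 100 test
ks <- c(1, 5, 15, 50)
accs <- sapply(ks, function(k) {
  pred <- knn(data.frame(x[idx]), data.frame(x[-idx]), y[idx], k = k)
  mean(pred == y[-idx])                        # test accuracy for this k
})
names(accs) <- paste0("k=", ks)
accs
```

Larger k averages over more neighbors, trading variance for bias; picking k on held-out data (or by cross-validation) is the usual remedy for the k = 1 result above.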

  7. Now, let’s choose our final model based on which provides the best performance.
library(pander)
library(dplyr)
results <- data.frame(Method = c("glm", "lda", "qda", "knn"),
                      Accuracy = c(glm.acc, lda.acc, qda.acc, knn.acc))
results <- mutate(results, Test_Error = 1 - Accuracy)
pander(results)
Method   Accuracy   Test_Error
------   --------   ----------
glm      0.625      0.375
lda      0.625      0.375
qda      0.5865     0.4135
knn      0.5        0.5

Of the four models we tried, glm and lda tie for the best accuracy (62.5%) and test error rate, while qda and knn perform worse. None of the models is especially good at predicting the weekly direction of the stock market from previous weeks' returns.