Machine Learning - Predictions with Generalized Linear Models

library(tidyverse)
library(caret)
library(glmnet)
library(ROCR)

Table of Contents

  • 1 Introduction
  • 2 Load and prepare the dataset
  • 3 Train the model
  • 4 Manual calculation of some key figures
      • 4.1 p-value of THAL7
      • 4.2 Model deviance
      • 4.3 Null deviance
      • 4.4 Pseudo R2
      • 4.5 Model deviance residuals
  • 5 Test set performance
  • 6 Regularization with lasso
      • 6.1 Classification accuracy for the train set
      • 6.2 Classification accuracy for the test set
  • 7 Classification metrics
  • 8 Conclusion

1 Introduction

In my last post (“Machine Learning - Predictions with linear regressions”) I covered the prediction of numerical values with regression models. However, regression models are also suitable for classification. This can be done by means of so-called logistic regression models. Logistic regression belongs to a class of models known as Generalized Linear Models (GLMs).
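
As a brief illustrative sketch (not part of the original post), the logistic link that gives these models their name maps any value of the linear predictor onto a probability between 0 and 1:

# Minimal sketch of the logistic (sigmoid) link used by binomial GLMs:
# it maps the linear predictor eta = b0 + b1*x1 + ... onto (0, 1)
sigmoid <- function(eta) 1 / (1 + exp(-eta))
sigmoid(c(-4, 0, 4))
## [1] 0.01798621 0.50000000 0.98201379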

In the following, logistic regression will be used to determine the presence of heart disease.

For this post, the Statlog (Heart) dataset from the UCI Machine Learning Repository was used.

2 Load and prepare the dataset

url = "http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/heart/heart.dat"
heart = read.table(url)
names(heart) <- c("AGE", "SEX", "CHESTPAIN", "RESTBP", "CHOL", 
                  "SUGAR", "ECG", "MAXHR", "ANGINA", "DEP", "EXERCISE", "FLUOR", 
                  "THAL", "OUTPUT")
glimpse(heart)
## Observations: 270
## Variables: 14
## $ AGE       <dbl> 70, 67, 57, 64, 74, 65, 56, 59, 60, 63, 59, 53, 44, ...
## $ SEX       <dbl> 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1...
## $ CHESTPAIN <dbl> 4, 3, 2, 4, 2, 4, 3, 4, 4, 4, 4, 4, 3, 1, 4, 4, 4, 4...
## $ RESTBP    <dbl> 130, 115, 124, 128, 120, 120, 130, 110, 140, 150, 13...
## $ CHOL      <dbl> 322, 564, 261, 263, 269, 177, 256, 239, 293, 407, 23...
## $ SUGAR     <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1...
## $ ECG       <dbl> 2, 2, 0, 0, 2, 0, 2, 2, 2, 2, 0, 2, 2, 0, 2, 0, 0, 2...
## $ MAXHR     <dbl> 109, 160, 141, 105, 121, 140, 142, 142, 170, 154, 16...
## $ ANGINA    <dbl> 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1...
## $ DEP       <dbl> 2.4, 1.6, 0.3, 0.2, 0.2, 0.4, 0.6, 1.2, 1.2, 4.0, 0....
## $ EXERCISE  <dbl> 2, 2, 1, 2, 1, 1, 2, 2, 2, 2, 2, 1, 1, 2, 1, 2, 2, 3...
## $ FLUOR     <dbl> 3, 0, 0, 1, 1, 0, 1, 1, 2, 3, 0, 0, 0, 2, 1, 0, 2, 0...
## $ THAL      <dbl> 3, 7, 7, 7, 3, 7, 6, 7, 7, 7, 7, 7, 3, 3, 3, 3, 7, 7...
## $ OUTPUT    <int> 2, 1, 2, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 2, 1, 1, 2, 2...

The CHESTPAIN, ECG, THAL and EXERCISE features are all categorical, so we have to convert them to factors. The dependent variable OUTPUT also needs to be recoded, because we need class labels of 0 and 1 rather than the original 1 and 2.

# Convert the categorical features to factors
heart$CHESTPAIN = factor(heart$CHESTPAIN)
heart$ECG = factor(heart$ECG)
heart$THAL = factor(heart$THAL)
heart$EXERCISE = factor(heart$EXERCISE)
# Recode the outcome from 1/2 to class labels 0/1
heart$OUTPUT = heart$OUTPUT - 1

3 Train the model

set.seed(987954)
heart_sampling_vector <- 
  createDataPartition(heart$OUTPUT, p = 0.85, list = FALSE)
heart_train <- heart[heart_sampling_vector,]
heart_train_labels <- heart$OUTPUT[heart_sampling_vector]
heart_test <- heart[-heart_sampling_vector,]
heart_test_labels <- heart$OUTPUT[-heart_sampling_vector]

heart_model <- glm(OUTPUT ~ ., data = heart_train, family = binomial("logit"))
summary(heart_model)
## 
## Call:
## glm(formula = OUTPUT ~ ., family = binomial("logit"), data = heart_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7137  -0.4421  -0.1382   0.3588   2.8118  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -7.946051   3.477686  -2.285 0.022321 *  
## AGE         -0.020538   0.029580  -0.694 0.487482    
## SEX          1.641327   0.656291   2.501 0.012387 *  
## CHESTPAIN2   1.308530   1.000913   1.307 0.191098    
## CHESTPAIN3   0.560233   0.865114   0.648 0.517255    
## CHESTPAIN4   2.356442   0.820521   2.872 0.004080 ** 
## RESTBP       0.026588   0.013357   1.991 0.046529 *  
## CHOL         0.008105   0.004790   1.692 0.090593 .  
## SUGAR       -1.263606   0.732414  -1.725 0.084480 .  
## ECG1         1.352751   3.287293   0.412 0.680699    
## ECG2         0.563430   0.461872   1.220 0.222509    
## MAXHR       -0.013585   0.012873  -1.055 0.291283    
## ANGINA       0.999906   0.525996   1.901 0.057305 .  
## DEP          0.196349   0.282891   0.694 0.487632    
## EXERCISE2    0.743530   0.560700   1.326 0.184815    
## EXERCISE3    0.946718   1.165567   0.812 0.416655    
## FLUOR        1.310240   0.308348   4.249 2.15e-05 ***
## THAL6        0.304117   0.995464   0.306 0.759983    
## THAL7        1.717886   0.510986   3.362 0.000774 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 315.90  on 229  degrees of freedom
## Residual deviance: 140.36  on 211  degrees of freedom
## AIC: 178.36
## 
## Number of Fisher Scoring iterations: 6

4 Manual calculation of some key figures

4.1 p-value of THAL7

The two-sided p-value that summary() reports for THAL7 can be reproduced from its z-value (3.362) as a two-sided tail probability of the standard normal distribution:

pnorm(3.362, lower.tail = F) * 2
## [1] 0.0007738012

4.2 Model deviance

The deviance is defined as minus twice the log-likelihood, so we first write a helper that returns the Bernoulli log-likelihood of each observation and then sum over the dataset:

log_likelihoods <- function(y_labels, y_probs) {
  y_a <- as.numeric(y_labels)
  y_p <- as.numeric(y_probs)
  # Bernoulli log-likelihood of each observation
  y_a * log(y_p) + (1 - y_a) * log(1 - y_p)
}

dataset_log_likelihood <- function(y_labels, y_probs) {
  sum(log_likelihoods(y_labels, y_probs))
}

# ---

deviances <- function(y_labels, y_probs) {
  # Per-observation deviance: minus twice the log-likelihood
  -2 * log_likelihoods(y_labels, y_probs)
}

dataset_deviance <- function(y_labels, y_probs) {
  sum(deviances(y_labels, y_probs))
}

model_deviance <- function(model, data, output_column) {
  # Deviance of a fitted model evaluated on a given dataset
  y_labels = data[[output_column]]
  y_probs = predict(model, newdata = data, type = "response")
  dataset_deviance(y_labels, y_probs)
}

model_deviance(heart_model, data = heart_train, output_column = "OUTPUT")
## [1] 140.3561
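
As a sanity check, this hand-computed value should agree with R's built-in deviance() accessor for glm objects:

deviance(heart_model)
## [1] 140.3561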

4.3 Null deviance

null_deviance <- function(data, output_column) {
  y_labels <- data[[output_column]]
  # The null model predicts the same probability, the mean outcome, for every observation
  y_probs <- mean(data[[output_column]])
  dataset_deviance(y_labels, y_probs)
}

null_deviance(data = heart_train, output_column = "OUTPUT")
## [1] 315.9023
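
Again, this should match the null deviance stored on the fitted model object:

heart_model$null.deviance
## [1] 315.9023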

4.4 Pseudo R2

model_pseudo_r_squared <- function(model, data, output_column) {
  1 - ( model_deviance(model, data, output_column) / 
          null_deviance(data, output_column) )
}

model_pseudo_r_squared(heart_model, data = heart_train, output_column = "OUTPUT")
## [1] 0.5556977

The overall significance of the model can be tested with a likelihood ratio test: the reduction in deviance relative to the null model approximately follows a chi-squared distribution, with degrees of freedom equal to the number of parameters added:

model_chi_squared_p_value <- function(model, data, output_column) {
  null_df <- nrow(data) - 1
  model_df <- nrow(data) - length(model$coefficients)
  difference_df <- null_df - model_df
  n_deviance <- null_deviance(data, output_column)
  m_deviance <- model_deviance(model, data, output_column)
  difference_deviance <- n_deviance - m_deviance
  pchisq(difference_deviance, difference_df, lower.tail = F)
}

model_chi_squared_p_value(heart_model, data = heart_train, output_column = "OUTPUT")
## [1] 7.294219e-28

Our logistic model is said to explain roughly 56% of the null deviance, with a p-value of < .001.
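
Equivalently, the pseudo R2 can be computed directly from the deviances stored on the fitted glm object:

1 - heart_model$deviance / heart_model$null.deviance
## [1] 0.5556977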

4.5 Model deviance residuals

model_deviance_residuals <- function(model, data, output_column) {
  y_labels = data[[output_column]]
  y_probs = predict(model, newdata = data, type = "response")
  # Each residual is the signed square root of that observation's deviance;
  # the sign indicates whether the model under- or over-predicts
  residual_sign = sign(y_labels - y_probs)
  residuals = sqrt(deviances(y_labels, y_probs))
  residual_sign * residuals
}

summary(model_deviance_residuals(heart_model, data = heart_train, output_column = "OUTPUT"))
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -2.71369 -0.44214 -0.13823 -0.02765  0.35875  2.81178
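
These hand-rolled values should agree with R's own residuals for the fitted model (deviance residuals are the default residual type for glm objects):

summary(residuals(heart_model, type = "deviance"))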

5 Test set performance

train_predictions <- predict(heart_model, newdata = heart_train, type = "response")
train_class_predictions <- as.numeric(train_predictions > 0.5)
mean(train_class_predictions == heart_train$OUTPUT)
## [1] 0.8869565

The classification accuracy on the training set is 89%.

test_predictions = predict(heart_model, newdata = heart_test, type = "response")
test_class_predictions = as.numeric(test_predictions > 0.5)
mean(test_class_predictions == heart_test$OUTPUT)
## [1] 0.775

The classification accuracy on the test set is 78%.

6 Regularization with lasso

heart_train_mat <- model.matrix(OUTPUT ~ ., heart_train)[,-1]
lambdas <- 10 ^ seq(8, -4, length = 250)
heart_models_lasso <- glmnet(heart_train_mat, heart_train$OUTPUT, alpha = 1, lambda = lambdas, family = "binomial")
lasso.cv <- cv.glmnet(heart_train_mat, heart_train$OUTPUT,
                      alpha = 1, lambda = lambdas, family = "binomial")
lambda_lasso <- lasso.cv$lambda.min
lambda_lasso
## [1] 0.01181107
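
As an optional diagnostic (not part of the original analysis), plotting the cv.glmnet object shows the cross-validated binomial deviance across the lambda grid, with the chosen lambda.min marked:

plot(lasso.cv)
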
predict(heart_models_lasso, type = "coefficients", s = lambda_lasso)
## 19 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept) -4.702476656
## AGE          .          
## SEX          0.982354604
## CHESTPAIN2   0.050583269
## CHESTPAIN3   .          
## CHESTPAIN4   1.481502102
## RESTBP       0.012612051
## CHOL         0.003912907
## SUGAR       -0.536400517
## ECG1         .          
## ECG2         0.321026773
## MAXHR       -0.010563953
## ANGINA       0.795431862
## DEP          0.213059195
## EXERCISE2    0.320076517
## EXERCISE3    0.012296570
## FLUOR        0.925047313
## THAL6        0.066580841
## THAL7        1.484011300
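
As a small illustrative sketch (the helper lines below are not from the original post), the predictors that survive the lasso can be listed directly from the sparse coefficient matrix:

lasso_coefs <- predict(heart_models_lasso, type = "coefficients", s = lambda_lasso)
# Keep only the rows with non-zero coefficients (drops AGE, CHESTPAIN3 and ECG1 here)
rownames(lasso_coefs)[lasso_coefs[, 1] != 0]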

We can see that a number of our features have effectively been removed from the model because their coefficients have been shrunk to zero. Let’s use this new model to measure the classification accuracy on our training and test sets. We will observe slightly better performance on the training set, while the accuracy on the test set remains unchanged.

6.1 Classification accuracy for the train set

lasso_train_predictions <- predict(heart_models_lasso, s = lambda_lasso, newx = heart_train_mat, type = "response")
lasso_train_class_predictions <- as.numeric(lasso_train_predictions > 0.5)
mean(lasso_train_class_predictions == heart_train$OUTPUT)
## [1] 0.8913043

6.2 Classification accuracy for the test set

heart_test_mat <- model.matrix(OUTPUT ~ ., heart_test)[,-1]
lasso_test_predictions <- predict(heart_models_lasso, s = lambda_lasso, newx = heart_test_mat, type = "response")
lasso_test_class_predictions <- as.numeric(lasso_test_predictions > 0.5)
mean(lasso_test_class_predictions == heart_test$OUTPUT)
## [1] 0.775

7 Classification metrics

In the binary case, a confusion matrix can be used to compute a number of other useful performance metrics for our data, such as precision, recall and specificity.

confusion_matrix <- table(predicted = train_class_predictions, actual = heart_train$OUTPUT)
confusion_matrix
##          actual
## predicted   0   1
##         0 118  16
##         1  10  86
# Precision: TP / (TP + FP), the proportion of predicted positives that are correct
precision <- confusion_matrix[2, 2] / sum(confusion_matrix[2, ])
precision
## [1] 0.8958333
# Recall: TP / (TP + FN), the proportion of actual positives we detect
recall <- confusion_matrix[2, 2] / sum(confusion_matrix[, 2])
recall
## [1] 0.8431373
# F1 score: the harmonic mean of precision and recall
(f = 2 * precision * recall / (precision + recall))
## [1] 0.8686869

Recall is also known as the true positive rate.

# Specificity: TN / (TN + FP), the proportion of actual negatives classified correctly
specificity <- confusion_matrix[1, 1] / sum(confusion_matrix[, 1])
specificity
## [1] 0.921875

Specificity is also known as the true negative rate.
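
As a cross-check, since caret is already loaded, its confusionMatrix() function computes these metrics in one call; note that it expects factors, and that we declare "1" as the positive class (a sketch, output omitted):

confusionMatrix(data = factor(train_class_predictions),
                reference = factor(heart_train$OUTPUT),
                positive = "1")

Its output includes sensitivity (recall), specificity and the positive predictive value (precision), which should match the values computed by hand above.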

Ideally, we would like a visual way to assess the effect of changing the threshold (we have used 0.5 so far) on our performance metrics, and the precision-recall curve is one such useful plot.

train_predictions <- predict(heart_model, newdata = heart_train, type = "response")
pred <- prediction(train_predictions, heart_train$OUTPUT)
perf <- performance(pred, measure = "prec", x.measure = "rec")
plot(perf)

The plot shows that, to obtain recall values above 0.8, we have to sacrifice precision quite abruptly. It is therefore useful to create a data frame of cutoff values (the threshold values at which precision and recall change in our data) along with their corresponding precision and recall values.

Suppose, for example, that we want to find a suitable threshold giving us at least 90% recall and 80% precision.

thresholds <- data.frame(cutoffs = perf@alpha.values[[1]], recall = perf@x.values[[1]], precision = perf@y.values[[1]])
subset(thresholds,(recall > 0.9) & (precision > 0.8))
##       cutoffs    recall precision
## 112 0.3491857 0.9019608 0.8288288
## 113 0.3472740 0.9019608 0.8214286
## 114 0.3428354 0.9019608 0.8141593
## 115 0.3421438 0.9019608 0.8070175

As we can see, any threshold between roughly 0.342 and 0.349 satisfies our requirements.
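
Beyond the precision-recall curve, the same ROCR prediction object also yields an ROC curve and the area under it (AUC), another common threshold-independent performance summary; a brief sketch:

# ROC curve: true positive rate against false positive rate across all thresholds
roc_perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(roc_perf)
# Area under the ROC curve
performance(pred, measure = "auc")@y.values[[1]]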

8 Conclusion

We have seen how classifications can be made with logistic regression, and how the predictive accuracy of such models can be improved through regularization.

Source

Miller, J. D., & Forte, R. M. (2017). Mastering Predictive Analytics with R. Packt Publishing Ltd.