Logistic regression

Inference

Prof. Maria Tackett

Risk of coronary heart disease

This dataset is from an ongoing cardiovascular study on residents of the town of Framingham, Massachusetts. We want to examine the relationship between various health characteristics and the risk of having heart disease in the next 10 years.

high_risk: 1 = High risk, 0 = Not high risk

age: Age at exam time (in years)

education: 1 = Some High School; 2 = High School or GED; 3 = Some College or Vocational School; 4 = College

currentSmoker: 0 = nonsmoker; 1 = smoker

Modeling risk of coronary heart disease

term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	-5.385	0.308	-17.507	0.000	-5.995	-4.788
age	0.073	0.005	13.385	0.000	0.063	0.084
education2	-0.242	0.112	-2.162	0.031	-0.463	-0.024
education3	-0.235	0.134	-1.761	0.078	-0.501	0.023
education4	-0.020	0.148	-0.136	0.892	-0.317	0.266
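Written out, the estimates in the table correspond to the fitted equation below. This is a sketch under two assumptions that the slides do not state explicitly: $\hat{\pi}$ denotes the predicted probability that `high_risk` = 1, and `education2` through `education4` are indicator variables with `education` = 1 as the baseline level.

$$\log\left(\frac{\hat{\pi}}{1 - \hat{\pi}}\right) = -5.385 + 0.073 \times age - 0.242 \times education2 - 0.235 \times education3 - 0.020 \times education4$$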

Hypothesis test for $\beta_j$

Hypotheses: $H_0: \beta_j = 0$ vs. $H_a: \beta_j \neq 0$

Test Statistic: $$z = \frac{\hat{\beta}_j - 0}{SE(\hat{\beta}_j)}$$

P-value: $P(|Z| > |z|)$, where $Z \sim N(0, 1)$, the Standard Normal distribution

Confidence interval for $\beta_j$

We can calculate the C% confidence interval for $\beta_j$ as the following:

$$\hat{\beta}_j \pm z^* SE(\hat{\beta}_j)$$

where $z^*$ is calculated from the $N(0, 1)$ distribution.

This is an interval for the change in the log-odds for every one unit increase in $x_j$.

Interpretation in terms of the odds

The change in odds for every one unit increase in $x_j$:

$$\exp\{\hat{\beta}_j \pm z^* SE(\hat{\beta}_j)\}$$

Interpretation: We are C% confident that for every one unit increase in $x_j$, the odds multiply by a factor of $\exp\{\hat{\beta}_j - z^* SE(\hat{\beta}_j)\}$ to $\exp\{\hat{\beta}_j + z^* SE(\hat{\beta}_j)\}$, holding all else constant.
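To make the interval concrete, here is a minimal sketch that applies the estimate plus or minus a Normal critical value times the standard error to the `age` row of the model table, then exponentiates the endpoints. The 95% level and the variable names are assumptions for illustration. The inputs are the rounded values printed in the table, so the result will not match the `conf.low` and `conf.high` columns exactly (those may also come from a different interval method).

```r
# Values copied from the `age` row of the model table (rounded to 3 decimals)
estimate  <- 0.073
std_error <- 0.005

conf_level <- 0.95                          # assumed confidence level C
z_star <- qnorm(1 - (1 - conf_level) / 2)   # critical value from N(0, 1)

# Interval for the change in log-odds per one-year increase in age
ci_log_odds <- estimate + c(-1, 1) * z_star * std_error

# Exponentiate the endpoints to get the multiplicative change in odds
ci_odds <- exp(ci_log_odds)

round(ci_log_odds, 3)
round(ci_odds, 3)
```

With these rounded inputs, the odds interval comes out at roughly 1.065 to 1.086: each additional year of age is associated with the odds of being high risk multiplying by a factor in that range, holding education constant.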

Let's look at the coefficient for `age`

term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	-5.385	0.308	-17.507	0.000	-5.995	-4.788
age	0.073	0.005	13.385	0.000	0.063	0.084
education2	-0.242	0.112	-2.162	0.031	-0.463	-0.024
education3	-0.235	0.134	-1.761	0.078	-0.501	0.023
education4	-0.020	0.148	-0.136	0.892	-0.317	0.266

Hypotheses: $H_0: \beta_{age} = 0$ vs. $H_a: \beta_{age} \neq 0$

Test statistic: $$z = \frac{\hat{\beta}_{age} - 0}{SE(\hat{\beta}_{age})} = 13.385$$

This is the `statistic` column in the table; it is computed from the unrounded estimate and standard error, so dividing the rounded values shown gives a slightly different number.

P-value: $P(|Z| > |13.385|) \approx 0$

2 * pnorm(13.4, lower.tail = FALSE)

## [1] 6.046315e-41

Conclusion: The p-value is very small, so we reject $H_0$. The data provide sufficient evidence that age is a statistically significant predictor of whether someone is at high risk of having heart disease, after accounting for education.
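The same two-sided Normal calculation can be applied to every row of the table at once. The sketch below uses only the z statistics printed in the `statistic` column; the vector names are illustrative and not part of the original slides.

```r
# z statistics copied from the `statistic` column of the model table
z <- c("(Intercept)" = -17.507, age = 13.385, education2 = -2.162,
       education3 = -1.761, education4 = -0.136)

# Two-sided p-value: P(|Z| > |z|) with Z ~ N(0, 1)
p_value <- 2 * pnorm(abs(z), lower.tail = FALSE)

round(p_value, 3)
```

Rounded to three decimals, these should agree with the `p.value` column of the table.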

Comparing models

Log likelihood

$$\log L = \sum_{i=1}^{n} \left[ y_i \log(\hat{\pi}_i) + (1 - y_i) \log(1 - \hat{\pi}_i) \right]$$

- Measure of how well the model fits the data
- Higher values of $\log L$ are better
- Deviance = $-2 \log L$
- $-2 \log L$ follows a $\chi^2$ distribution with $n - p - 1$ degrees of freedom, where $p$ is the number of predictor terms in the model
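As a quick check on the relationship between the log likelihood and the deviance, the sketch below fits a logistic regression to simulated data and computes both quantities by hand from the fitted probabilities. The data, seed, and coefficient values are made up for illustration and have nothing to do with the Framingham data. The equality deviance = -2 log L relies on the response being binary (one 0/1 outcome per row).

```r
set.seed(1)

# Simulated binary response with a single predictor (illustrative only)
n <- 200
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * x))

fit <- glm(y ~ x, family = "binomial")

# Log likelihood computed from the fitted probabilities
pi_hat  <- fitted(fit)
log_lik <- sum(y * log(pi_hat) + (1 - y) * log(1 - pi_hat))

# Deviance as -2 times the log likelihood
dev_by_hand <- -2 * log_lik

# Compare with the values R reports for the fitted model
all.equal(log_lik, as.numeric(logLik(fit)))
all.equal(dev_by_hand, deviance(fit))
```

Both comparisons should return `TRUE`, and the residual degrees of freedom reported by `df.residual(fit)` equal the number of observations minus the number of estimated coefficients.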

Comparing nested models

Suppose there are two models:
- Reduced Model includes predictors $x_1, \ldots, x_q$
- Full Model includes predictors $x_1, \ldots, x_q, x_{q+1}, \ldots, x_p$

We want to test the hypotheses

$$H_0: \beta_{q+1} = \dots = \beta_p = 0$$
$$H_a: \text{at least one } \beta_j \text{ is not } 0$$

To do so, we will use the Drop-in-deviance test (also known as the Nested Likelihood Ratio test).

Drop-in-deviance test

Hypotheses: $H_0: \beta_{q+1} = \dots = \beta_p = 0$ vs. $H_a$: at least one $\beta_j$ is not 0

Test Statistic: $$G = (-2 \log L_{reduced}) - (-2 \log L_{full})$$

P-value: $P(\chi^2 > G)$, calculated using a $\chi^2$ distribution with degrees of freedom equal to the difference in the number of parameters in the full and reduced models

term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	-5.385	0.308	-17.507	0.000	-5.995	-4.788
age	0.073	0.005	13.385	0.000	0.063	0.084
education2	-0.242	0.112	-2.162	0.031	-0.463	-0.024
education3	-0.235	0.134	-1.761	0.078	-0.501	0.023
education4	-0.020	0.148	-0.136	0.892	-0.317	0.266

Should we add currentSmoker to this model?

Should we add `currentSmoker` to the model?

library(tidyverse)  # provides %>%
library(broom)      # provides glance() and tidy()

model_reduced <- glm(high_risk ~ age + education,
                     data = heart, family = "binomial")

model_full <- glm(high_risk ~ age + education + currentSmoker,
                  data = heart, family = "binomial")

# Calculate deviance for each model
(dev_reduced <- glance(model_reduced)$deviance)

## [1] 3300.135

(dev_full <- glance(model_full)$deviance)

## [1] 3279.359

# Drop-in-deviance test statistic
(test_stat <- dev_reduced - dev_full)

## [1] 20.77589

# p-value
# df = 1: the full model adds one term (currentSmoker) to the reduced model
pchisq(test_stat, 1, lower.tail = FALSE)

## [1] 5.162887e-06

Conclusion: The p-value is very small, so we reject $H_0$. The data provide sufficient evidence that the coefficient of `currentSmoker` is not equal to 0. Therefore, we should add it to the model.

Drop-in-Deviance test in R

We can use the anova function to conduct this test

Add test = "Chisq" to conduct the drop-in-deviance test

anova(model_reduced, model_full, test = "Chisq") %>%
  tidy()

## # A tibble: 2 x 5
##   Resid..Df Resid..Dev    df Deviance     p.value
##       <dbl>      <dbl> <dbl>    <dbl>       <dbl>
## 1      4130      3300.    NA     NA   NA         
## 2      4129      3279.     1     20.8  0.00000516
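To see that the manual calculation and `anova` agree in general, here is a minimal sketch on simulated data. The helper `drop_in_deviance()` and the simulated variables are hypothetical names introduced only for this example; the sketch uses base R functions (`deviance`, `df.residual`, `pchisq`, `anova`) and does not depend on the `heart` data.

```r
set.seed(42)

# Simulated data: binary response, two predictors (illustrative only)
n  <- 500
x1 <- rnorm(n)
x2 <- rbinom(n, size = 1, prob = 0.5)
y  <- rbinom(n, size = 1, prob = plogis(-1 + 0.8 * x1 + 0.6 * x2))
sim <- data.frame(y, x1, x2)

reduced <- glm(y ~ x1,      data = sim, family = "binomial")
full    <- glm(y ~ x1 + x2, data = sim, family = "binomial")

# Hypothetical helper: drop-in-deviance test for two nested glm fits
drop_in_deviance <- function(reduced, full) {
  G  <- deviance(reduced) - deviance(full)        # test statistic
  df <- df.residual(reduced) - df.residual(full)  # number of added parameters
  c(statistic = G, df = df,
    p.value = pchisq(G, df, lower.tail = FALSE))
}

manual    <- drop_in_deviance(reduced, full)
via_anova <- anova(reduced, full, test = "Chisq")

manual
via_anova
```

The `Deviance` and `Pr(>Chi)` entries in the second row of the `anova` table should match the `statistic` and `p.value` returned by the helper.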

Model selection

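Use AIC or BIC for model selection

$$AIC = -2 \log L + 2(p + 1)$$
$$BIC = -2 \log L + \log(n)(p + 1)$$

where $p + 1$ is the number of estimated coefficients ($p$ predictor terms plus the intercept) and $n$ is the number of observations. Lower values indicate a better trade-off between fit and complexity.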

AIC from `glance` function

Let's look at the AIC for the model that includes `age`, `education`, and `currentSmoker`.

glance(model_full)$AIC

## [1] 3291.359

Calculating AIC

# 5 predictor terms (age, 3 education indicators, currentSmoker) + 1 intercept
-2 * glance(model_full)$logLik + 2 * (5 + 1)

## [1] 3291.359

Comparing the models using AIC

Let's compare the full and reduced models using AIC.

glance(model_reduced)$AIC

## [1] 3310.135

glance(model_full)$AIC

## [1] 3291.359

Based on AIC, which model would you choose?

Comparing the models using BIC

Let's compare the full and reduced models using BIC.

glance(model_reduced)$BIC

## [1] 3341.772

glance(model_full)$BIC

## [1] 3329.323
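The reported AIC and BIC values can be approximately reconstructed from the deviances shown earlier. The sketch below assumes that the deviance equals -2 log L (which holds for a binary response) and that n = 4135, inferred from the residual degrees of freedom in the `anova` output (4129 plus 6 coefficients in the full model). The variable names are illustrative.

```r
# Deviances (-2 log L) reported for each model
dev <- c(reduced = 3300.135, full = 3279.359)

# Number of estimated coefficients, including the intercept
k <- c(reduced = 5, full = 6)

# Assumed sample size: residual df of the full model plus its 6 coefficients
n <- 4129 + 6

aic <- dev + 2 * k        # -2 log L + 2 (p + 1)
bic <- dev + log(n) * k   # -2 log L + log(n) (p + 1)

round(aic, 3)
round(bic, 3)

# Name of the model with the smallest value of each criterion
names(which.min(aic))
names(which.min(bic))
```

Up to rounding of the printed deviances, these values should match the `glance` output, and both criteria are lower for the full model.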
