
Multinomial Logistic Regression

Introduction

Prof. Maria Tackett


Topics

  • Introduce multinomial logistic regression

  • Interpret model coefficients

  • Inference for a coefficient $\beta_{jk}$


Generalized Linear Models (GLM)

  • In practice, there are many different types of response variables including:

    • Binary: Win or Lose
    • Nominal: Democrat, Republican or Third Party candidate
    • Ordered: Movie rating (1 - 5 stars)
    • and others...
  • The models for these response types are all examples of generalized linear models (GLMs), a broader class of models that generalizes the multiple linear regression model

  • See Generalized Linear Models: A Unifying Theory for more details about GLMs


Binary Response (Logistic)

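  • Given $P(y_i = 1 \mid x_i) = \hat{\pi}_i$ and $P(y_i = 0 \mid x_i) = 1 - \hat{\pi}_i$,

$$\log\left(\frac{\hat{\pi}_i}{1 - \hat{\pi}_i}\right) = \hat{\beta}_0 + \hat{\beta}_1 x_i$$

  • We can calculate $\hat{\pi}_i$ by solving the logit equation:

$$\hat{\pi}_i = \frac{\exp\{\hat{\beta}_0 + \hat{\beta}_1 x_i\}}{1 + \exp\{\hat{\beta}_0 + \hat{\beta}_1 x_i\}}$$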
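As a minimal sketch of this step in R, the snippet below converts a log-odds value into a probability in two ways: with the formula above and with base R's plogis(), which computes the same inverse logit. The values of b0, b1, and x are made up for illustration and do not come from any model in these slides.

```r
# Hypothetical coefficients and predictor value (illustration only)
b0 <- -1.5
b1 <- 0.8
x  <- 2

log_odds  <- b0 + b1 * x                            # right-hand side of the logit equation
pi_manual <- exp(log_odds) / (1 + exp(log_odds))    # formula for the predicted probability
pi_plogis <- plogis(log_odds)                       # built-in inverse logit

c(manual = pi_manual, plogis = pi_plogis)
```

Both entries of the printed vector should agree, since plogis() is the inverse of the logit function.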

Binary Response (Logistic)

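Suppose we consider $y = 0$ the baseline category such that

$P(y_i = 1 \mid x_i) = \hat{\pi}_{i1}$ and $P(y_i = 0 \mid x_i) = \hat{\pi}_{i0}$

Then the logistic regression model is

$$\log\left(\frac{\hat{\pi}_{i1}}{1 - \hat{\pi}_{i1}}\right) = \log\left(\frac{\hat{\pi}_{i1}}{\hat{\pi}_{i0}}\right) = \hat{\beta}_0 + \hat{\beta}_1 x_i$$

Slope, $\hat{\beta}_1$: When $x$ increases by one unit, the odds of $y = 1$ versus the baseline $y = 0$ are expected to multiply by a factor of $\exp\{\hat{\beta}_1\}$

Intercept, $\hat{\beta}_0$: When $x = 0$, the predicted odds of $y = 1$ versus the baseline $y = 0$ are $\exp\{\hat{\beta}_0\}$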
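The slope and intercept interpretations can be checked numerically. In this sketch the coefficients b0 and b1 are made-up values, not estimates from these slides: the odds at x + 1 divided by the odds at x equal exp(b1) whatever the starting value of x, and the odds at x = 0 equal exp(b0).

```r
# Hypothetical coefficients (illustration only)
b0 <- -1.5
b1 <- 0.8

# Odds of y = 1 versus the baseline y = 0 at a given x
odds <- function(x) exp(b0 + b1 * x)

x <- c(0, 1, 5)
odds_ratio <- odds(x + 1) / odds(x)   # multiplicative change for a one-unit increase

odds_ratio   # the same value at every x
exp(b1)      # matches the ratio above
odds(0)      # equals exp(b0)
```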


Multinomial response variable

  • Suppose the response variable $y$ is categorical and can take values $1, 2, \ldots, K$, where $K > 2$

  • Multinomial Distribution:

$$P(y = 1) = \pi_1,\ P(y = 2) = \pi_2,\ \ldots,\ P(y = K) = \pi_K$$

such that $\sum_{k=1}^{K} \pi_k = 1$

Multinomial Logistic Regression

  • If we have an explanatory variable $x$, then we want to fit a model such that $P(y = k) = \pi_k$ is a function of $x$

  • Choose a baseline category. Let's choose $y = 1$. Then,

$$\log\left(\frac{\pi_{ik}}{\pi_{i1}}\right) = \beta_{0k} + \beta_{1k} x_i$$

  • In the multinomial logistic model, we have a separate equation for each category of the response relative to the baseline category
    • If the response has $K$ possible categories, there will be $K - 1$ equations as part of the multinomial logistic model
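One consequence worth spelling out, as a sketch in the notation of this slide: because the K probabilities sum to 1, the K - 1 baseline-category equations determine all of them. The shorthand $\eta_{ik} = \beta_{0k} + \beta_{1k} x_i$ for $k = 2, \ldots, K$ is introduced here for brevity and is not notation from the slides.

```latex
\begin{aligned}
\pi_{ik} &= \pi_{i1} \exp\{\eta_{ik}\}, \qquad k = 2, \ldots, K \\
1 = \sum_{k=1}^{K} \pi_{ik} &= \pi_{i1} \Big( 1 + \sum_{l=2}^{K} \exp\{\eta_{il}\} \Big) \\
\Rightarrow \quad \pi_{i1} &= \frac{1}{1 + \sum_{l=2}^{K} \exp\{\eta_{il}\}},
\qquad
\pi_{ik} = \frac{\exp\{\eta_{ik}\}}{1 + \sum_{l=2}^{K} \exp\{\eta_{il}\}}
\end{aligned}
```

With K = 2 this reduces to the binary logistic formula for the predicted probability.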

Multinomial Logistic Regression

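  • Suppose we have a response variable $y$ that can take three possible outcomes that are coded as "A", "B", "C"

  • Let "A" be the baseline category. Then

$$\log\left(\frac{\pi_{iB}}{\pi_{iA}}\right) = \beta_{0B} + \beta_{1B} x_i$$

$$\log\left(\frac{\pi_{iC}}{\pi_{iA}}\right) = \beta_{0C} + \beta_{1C} x_i$$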


NHANES Data

  • The National Health and Nutrition Examination Survey (NHANES) is conducted by the National Center for Health Statistics (NCHS)

  • The goal is to "assess the health and nutritional status of adults and children in the United States"

  • This survey includes an interview and a physical examination


NHANES Data

  • We will use the data from the NHANES R package

  • Contains 75 variables for the 2009 - 2010 and 2011 - 2012 sample years

  • The data in this package is modified for educational purposes and should not be used for research

  • Original data can be obtained from the NCHS website for research purposes

  • Type ?NHANES in the console to see the list of variables and their definitions


Health Rating vs. Age & Physical Activity

  • Question: Can we use a person's age and whether they do regular physical activity to predict their self-reported health rating?

  • We will analyze the following variables:

    • HealthGen: Self-reported rating of participant's health in general. Excellent, Vgood, Good, Fair, or Poor.

    • Age: Age at time of screening (in years). Participants 80 or older were recorded as 80.

    • PhysActive: Participant does moderate to vigorous-intensity sports, fitness or recreational activities


The data

## Rows: 6,710
## Columns: 4
## $ HealthGen <fct> Good, Good, Good, Good, Vgood, Vgood, Vgood, Vgood, Vgood, …
## $ Age <int> 34, 34, 34, 49, 45, 45, 45, 66, 58, 54, 50, 33, 60, 56, 56,…
## $ PhysActive <fct> No, No, No, No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, No, No, …
## $ obs_num <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …

Exploratory data analysis

[Figures: exploratory data analysis plots, not recoverable from the extracted text]

Model in R

  • Use the multinom() function in the nnet package
library(nnet)
# The first factor level of HealthGen is used as the baseline category
health_m <- multinom(HealthGen ~ Age + PhysActive,
                     data = nhanes_adult)
  • Put results = "hide" in the code chunk header to suppress convergence output
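multinom() takes the first level of the response factor as the baseline category, so it can help to check the level order before fitting. The sketch below uses a small made-up factor (not the NHANES data) to show how levels() reveals the baseline and how relevel() would change it. The choice of "Good" as an alternative baseline is purely illustrative.

```r
# Toy response with the same category labels as HealthGen (illustration only)
health <- factor(c("Good", "Vgood", "Excellent", "Fair", "Poor", "Good"),
                 levels = c("Excellent", "Vgood", "Good", "Fair", "Poor"))

levels(health)[1]                             # current baseline: the first level

health_alt <- relevel(health, ref = "Good")   # move "Good" to the front
levels(health_alt)                            # "Good" is now the baseline
```

Changing the baseline changes which comparisons the coefficients describe, but not the fitted probabilities.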

Output results

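library(broom)   # tidy()
library(knitr)   # kable(); %>% assumes dplyr or magrittr is already attached
tidy(health_m, conf.int = TRUE, exponentiate = FALSE) %>%
  kable(digits = 3, format = "markdown")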

Model output

| y.level | term          | estimate | std.error | statistic | p.value | conf.low | conf.high |
|:--------|:--------------|---------:|----------:|----------:|--------:|---------:|----------:|
| Vgood   | (Intercept)   |    1.205 |     0.145 |     8.325 |   0.000 |    0.922 |     1.489 |
| Vgood   | Age           |    0.001 |     0.002 |     0.369 |   0.712 |   -0.004 |     0.006 |
| Vgood   | PhysActiveYes |   -0.321 |     0.093 |    -3.454 |   0.001 |   -0.503 |    -0.139 |
| Good    | (Intercept)   |    1.948 |     0.141 |    13.844 |   0.000 |    1.672 |     2.223 |
| Good    | Age           |   -0.002 |     0.002 |    -0.977 |   0.329 |   -0.007 |     0.002 |
| Good    | PhysActiveYes |   -1.001 |     0.090 |   -11.120 |   0.000 |   -1.178 |    -0.825 |
| Fair    | (Intercept)   |    0.915 |     0.164 |     5.566 |   0.000 |    0.592 |     1.237 |
| Fair    | Age           |    0.003 |     0.003 |     1.058 |   0.290 |   -0.003 |     0.009 |
| Fair    | PhysActiveYes |   -1.645 |     0.107 |   -15.319 |   0.000 |   -1.856 |    -1.435 |
| Poor    | (Intercept)   |   -1.521 |     0.290 |    -5.238 |   0.000 |   -2.090 |    -0.952 |
| Poor    | Age           |    0.022 |     0.005 |     4.522 |   0.000 |    0.013 |     0.032 |
| Poor    | PhysActiveYes |   -2.656 |     0.236 |   -11.275 |   0.000 |   -3.117 |    -2.194 |
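To see how the fitted equations turn into predicted probabilities, the sketch below plugs the rounded coefficients from the model output into the baseline-category formulas for a hypothetical 40-year-old who is physically active. The profile is an assumption chosen for illustration, and the results are approximate because the coefficients are rounded. With the fitted object available, predict(health_m, type = "probs") would be the usual route.

```r
# Rounded coefficients copied from the model output (baseline: Excellent)
coefs <- rbind(
  Vgood = c( 1.205,  0.001, -0.321),
  Good  = c( 1.948, -0.002, -1.001),
  Fair  = c( 0.915,  0.003, -1.645),
  Poor  = c(-1.521,  0.022, -2.656)
)
colnames(coefs) <- c("(Intercept)", "Age", "PhysActiveYes")

# Hypothetical person: intercept term, Age = 40, PhysActive = Yes
x <- c(1, 40, 1)

log_odds <- drop(coefs %*% x)        # log-odds of each category vs. Excellent
denom    <- 1 + sum(exp(log_odds))   # the baseline contributes exp(0) = 1
probs    <- c(Excellent = 1 / denom, exp(log_odds) / denom)

round(probs, 3)
```

The five probabilities sum to 1 by construction, which is a quick sanity check on the arithmetic.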

Fair vs. Excellent Health

The baseline category for the model is Excellent.

The model equation for the log-odds a person rates themselves as having "Fair" health vs. "Excellent" is

$$\log\left(\frac{\hat{\pi}_{Fair}}{\hat{\pi}_{Excellent}}\right) = 0.915 + 0.003~\text{age} - 1.645~\text{PhysActive}$$

Interpretations

$$\log\left(\frac{\hat{\pi}_{Fair}}{\hat{\pi}_{Excellent}}\right) = 0.915 + 0.003~\text{age} - 1.645~\text{PhysActive}$$

For each additional year in age, the odds a person rates themselves as having fair health versus excellent health are expected to multiply by 1.003 (exp(0.003)), holding physical activity constant.

The odds a person who does physical activity will rate themselves as having fair health versus excellent health are expected to be 0.193 (exp(-1.645)) times the odds for a person who doesn't do physical activity, holding age constant.

Interpretations

$$\log\left(\frac{\hat{\pi}_{Fair}}{\hat{\pi}_{Excellent}}\right) = 0.915 + 0.003~\text{age} - 1.645~\text{PhysActive}$$

The odds that a 0-year-old person who doesn't do physical activity rates themselves as having fair health vs. excellent health are 2.497 (exp(0.915)).

⚠️ Need to mean-center age for the intercept to have a meaningful interpretation!
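A minimal sketch of mean-centering, using five made-up ages rather than the NHANES data. The name age_cent is hypothetical; in the model it would take the place of Age, so that the intercept describes a person of average age instead of a 0-year-old.

```r
age <- c(34, 49, 45, 66, 58)    # illustrative ages only
age_cent <- age - mean(age)     # centered: 0 now corresponds to the mean age

mean(age)
age_cent
```

With the real data, the same idea would be a new column such as Age - mean(Age) added to nhanes_adult before calling multinom(). The slope estimates are unaffected by centering; only the intercepts change.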


Hypothesis test for $\beta_{jk}$

The test of significance for the coefficient $\beta_{jk}$ is

Hypotheses: $H_0: \beta_{jk} = 0$ vs $H_a: \beta_{jk} \neq 0$

Test Statistic: $z = \dfrac{\hat{\beta}_{jk} - 0}{SE(\hat{\beta}_{jk})}$

P-value: $P(|Z| > |z|)$,

where $Z \sim N(0, 1)$, the Standard Normal distribution
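As a sketch, the test can be carried out by hand in R using the rounded estimate and standard error for PhysActiveYes in the Fair vs. Excellent equation of the model output. Because the inputs are rounded to three decimals, the resulting z differs slightly from the statistic reported there.

```r
est <- -1.645   # rounded estimate from the model output
se  <- 0.107    # rounded standard error from the model output

z <- (est - 0) / se              # test statistic under H0: beta_jk = 0
p_value <- 2 * pnorm(-abs(z))    # two-sided p-value, P(|Z| > |z|)

c(z = z, p_value = p_value)
```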


Confidence interval for $\beta_{jk}$

  • We can calculate the C% confidence interval for $\beta_{jk}$ using the following:

$$\hat{\beta}_{jk} \pm z^* SE(\hat{\beta}_{jk})$$

where $z^*$ is calculated from the $N(0, 1)$ distribution

We are C% confident that for every one unit change in $x_j$, the odds of $y = k$ versus the baseline will multiply by a factor of $\exp\{\hat{\beta}_{jk} - z^* SE(\hat{\beta}_{jk})\}$ to $\exp\{\hat{\beta}_{jk} + z^* SE(\hat{\beta}_{jk})\}$, holding all else constant.
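A minimal sketch of this calculation in R for a 95% interval, using the rounded PhysActiveYes estimate and standard error from the Fair vs. Excellent equation of the model output. The bounds may differ in the third decimal from those reported by tidy(), which works from unrounded values.

```r
est <- -1.645        # rounded estimate from the model output
se  <- 0.107         # rounded standard error from the model output
conf_level <- 0.95

z_star <- qnorm(1 - (1 - conf_level) / 2)      # critical value, about 1.96
ci_log_odds <- est + c(-1, 1) * z_star * se    # interval for beta_jk
ci_odds <- exp(ci_log_odds)                    # interval for the odds multiplier

round(ci_log_odds, 3)
round(ci_odds, 3)
```

Exponentiating the endpoints, rather than computing an interval around exp(est) directly, keeps the interval on the odds scale consistent with the one on the log-odds scale.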


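Interpreting confidence intervals for $\beta_{jk}$

| y.level | term          | estimate | std.error | statistic | p.value | conf.low | conf.high |
|:--------|:--------------|---------:|----------:|----------:|--------:|---------:|----------:|
| Fair    | (Intercept)   |    0.915 |     0.164 |     5.566 |    0.00 |    0.592 |     1.237 |
| Fair    | Age           |    0.003 |     0.003 |     1.058 |    0.29 |   -0.003 |     0.009 |
| Fair    | PhysActiveYes |   -1.645 |     0.107 |   -15.319 |    0.00 |   -1.856 |    -1.435 |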



We are 95% confident that, for each additional year in age, the odds a person rates themselves as having fair health versus excellent health will multiply by 0.997 (exp(-0.003)) to 1.009 (exp(0.009)), holding physical activity constant.


We are 95% confident that the odds a person who does physical activity will rate themselves as having fair health versus excellent health are 0.156 (exp(-1.856)) to 0.238 (exp(-1.435)) times the odds for a person who doesn't do physical activity, holding age constant.


Recap

  • Introduce multinomial logistic regression

  • Interpret model coefficients

  • Inference for a coefficient $\beta_{jk}$
