Multinomial Logistic Regression

Introduction

Prof. Maria Tackett

Topics

Introduce multinomial logistic regression
Interpret model coefficients
Inference for a coefficient

Generalized Linear Models (GLM)

In practice, there are many different types of response variables including:
- Binary: Win or Lose
- Nominal: Democrat, Republican or Third Party candidate
- Ordered: Movie rating (1 - 5 stars)
- and others...
Models for these response types are all examples of generalized linear models, a broader class of models that generalize the multiple linear regression model
See Generalized Linear Models: A Unifying Theory for more details about GLMs

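Binary Response (Logistic)

Given $P(y_i = 1 \mid x_i) = \hat{\pi}_i$ and $P(y_i = 0 \mid x_i) = 1 - \hat{\pi}_i$,

$$\log\left(\frac{\hat{\pi}_i}{1 - \hat{\pi}_i}\right) = \hat{\beta}_0 + \hat{\beta}_1 x_i$$

We can calculate $\hat{\pi}_i$ by solving the logit equation:

$$\hat{\pi}_i = \frac{\exp\{\hat{\beta}_0 + \hat{\beta}_1 x_i\}}{1 + \exp\{\hat{\beta}_0 + \hat{\beta}_1 x_i\}}$$

Binary Response (Logistic)

Suppose we consider $y = 0$ the baseline category such that $P(y_i = 1 \mid x_i) = \hat{\pi}_{i1}$ and $P(y_i = 0 \mid x_i) = \hat{\pi}_{i0}$

Then the logistic regression model is

$$\log\left(\frac{\hat{\pi}_{i1}}{1 - \hat{\pi}_{i1}}\right) = \log\left(\frac{\hat{\pi}_{i1}}{\hat{\pi}_{i0}}\right) = \hat{\beta}_0 + \hat{\beta}_1 x_i$$

Slope, $\hat{\beta}_1$: When $x$ increases by one unit, the odds of $y = 1$ versus the baseline $y = 0$ are expected to multiply by a factor of $\exp\{\hat{\beta}_1\}$

Intercept, $\hat{\beta}_0$: When $x = 0$, the predicted odds of $y = 1$ versus the baseline $y = 0$ are $\exp\{\hat{\beta}_0\}$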
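To make the logit equation concrete, here is a minimal R sketch of how log-odds, odds, and probabilities relate in binary logistic regression. The values of `b0` and `b1` are hypothetical and chosen only for illustration; they do not come from any model in these slides. `plogis()` is base R's inverse logit.

```r
# Hypothetical coefficients, for illustration only (not from a fitted model)
b0 <- -1.0   # intercept on the log-odds scale
b1 <-  0.5   # slope on the log-odds scale

x <- c(0, 1, 2)

log_odds <- b0 + b1 * x          # linear predictor
odds     <- exp(log_odds)        # odds of y = 1 versus the baseline y = 0
prob     <- odds / (1 + odds)    # solve the logit equation for the probability

# plogis() is the inverse logit, so it gives the same probabilities
all.equal(prob, plogis(log_odds))

# A one-unit increase in x multiplies the odds by exp(b1)
odds[2] / odds[1]
exp(b1)
```

The last two lines print the same number, which is the multiplicative interpretation of the slope; `odds[1]` equals `exp(b0)` because the first value of `x` is 0.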

Multinomial response variable

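Suppose the response variable $y$ is categorical and can take values $1, 2, \ldots, K$ such that $K > 2$
Multinomial Distribution:

$$P(y = 1) = \pi_1, \; P(y = 2) = \pi_2, \; \ldots, \; P(y = K) = \pi_K$$

such that $\sum_{k=1}^{K} \pi_k = 1$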

Multinomial Logistic Regression

If we have an explanatory variable $x$, then we want to fit a model such that $P(y = k) = \pi_k$ is a function of $x$
Choose a baseline category. Let's choose $y = 1$. Then,

$$\log\left(\frac{\pi_{ik}}{\pi_{i1}}\right) = \beta_{0k} + \beta_{1k} x_i$$

In the multinomial logistic model, we have a separate equation for each category of the response relative to the baseline category
- If the response has $K$ possible categories, there will be $K - 1$ equations as part of the multinomial logistic model
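Because each equation gives the log-odds of one category against the baseline, the K - 1 log-odds together determine all K probabilities. The sketch below shows one way to do that conversion in base R. The helper name `baseline_logit_probs()` and the log-odds values are hypothetical and used only for illustration.

```r
# Convert K - 1 baseline-category log-odds into K probabilities.
# `eta` holds log(pi_k / pi_baseline) for each non-baseline category
# (hypothetical helper, not part of any package).
baseline_logit_probs <- function(eta) {
  denom <- 1 + sum(exp(eta))        # the baseline contributes exp(0) = 1
  c(baseline = 1 / denom, exp(eta) / denom)
}

# Hypothetical log-odds for a response with K = 3 categories
eta <- c(cat2 = 0.4, cat3 = -1.2)
probs <- baseline_logit_probs(eta)

probs
sum(probs)                                    # probabilities sum to 1
log(probs[["cat2"]] / probs[["baseline"]])    # recovers the log-odds for cat2
```

Only K - 1 equations are needed because the baseline probability is fixed by the constraint that the probabilities sum to 1.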

Multinomial Logistic Regression

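Suppose we have a response variable $y$ that can take three possible outcomes that are coded as "A", "B", "C"
Let "A" be the baseline category. Then

$$\log\left(\frac{\pi_{iB}}{\pi_{iA}}\right) = \beta_{0B} + \beta_{1B} x_i$$

$$\log\left(\frac{\pi_{iC}}{\pi_{iA}}\right) = \beta_{0C} + \beta_{1C} x_i$$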

NHANES Data

The National Health and Nutrition Examination Survey (NHANES) is conducted by the National Center for Health Statistics (NCHS)
The goal is to "assess the health and nutritional status of adults and children in the United States"
This survey includes an interview and a physical examination

NHANES Data

We will use the data from the NHANES R package
Contains 75 variables for the 2009 - 2010 and 2011 - 2012 sample years
The data in this package is modified for educational purposes and should not be used for research
Original data can be obtained from the NCHS website for research purposes
Type ?NHANES in the console to see the list of variables and definitions

Health Rating vs. Age & Physical Activity

Question: Can we use a person's age and whether they do regular physical activity to predict their self-reported health rating?
We will analyze the following variables:
- HealthGen: Self-reported rating of participant's health in general. Excellent, Vgood, Good, Fair, or Poor.
- Age: Age at time of screening (in years). Participants 80 or older were recorded as 80.
- PhysActive: Participant does moderate to vigorous-intensity sports, fitness or recreational activities

The data

## Rows: 6,710
## Columns: 4
## $ HealthGen  <fct> Good, Good, Good, Good, Vgood, Vgood, Vgood, Vgood, Vgood, …
## $ Age        <int> 34, 34, 34, 49, 45, 45, 45, 66, 58, 54, 50, 33, 60, 56, 56,…
## $ PhysActive <fct> No, No, No, No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, No, No, …
## $ obs_num    <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …

Exploratory data analysis

Model in R

Use the multinom() function in the nnet package

library(nnet)
health_m <- multinom(HealthGen ~ Age + PhysActive, 
                     data = nhanes_adult)

Put results = "hide" in the code chunk header to suppress convergence output
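A small self-contained sketch of how `multinom()` chooses the baseline and how its console output can be silenced at the call itself. It uses R's built-in `iris` data purely as a stand-in (an assumption for illustration; it is not the NHANES data analyzed here). `trace = FALSE` is passed through to `nnet()` and turns off the iteration log.

```r
library(nnet)

# Stand-in data: iris ships with R; Species has three levels,
# and the first level ("setosa") is used as the baseline.
levels(iris$Species)

# trace = FALSE silences the iteration log printed during fitting
fit <- multinom(Species ~ Sepal.Length, data = iris, trace = FALSE)

# One row of coefficients per non-baseline category: K - 1 = 2 equations
coef(fit)

# relevel() changes the baseline; here "virginica" becomes the reference
iris2 <- transform(iris, Species = relevel(Species, ref = "virginica"))
fit2 <- multinom(Species ~ Sepal.Length, data = iris2, trace = FALSE)
rownames(coef(fit2))
```

The same idea applies to any factor response: the first level of the factor is the baseline, and `relevel()` is one way to pick a different one before fitting.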

Output results

library(broom)   # tidy()
library(knitr)   # kable()
tidy(health_m, conf.int = TRUE, exponentiate = FALSE) |>   # native pipe, R >= 4.1
  kable(digits = 3, format = "markdown")

Model output

y.level	term	estimate	std.error	statistic	p.value	conf.low	conf.high
Vgood	(Intercept)	1.205	0.145	8.325	0.000	0.922	1.489
Vgood	Age	0.001	0.002	0.369	0.712	-0.004	0.006
Vgood	PhysActiveYes	-0.321	0.093	-3.454	0.001	-0.503	-0.139
Good	(Intercept)	1.948	0.141	13.844	0.000	1.672	2.223
Good	Age	-0.002	0.002	-0.977	0.329	-0.007	0.002
Good	PhysActiveYes	-1.001	0.090	-11.120	0.000	-1.178	-0.825
Fair	(Intercept)	0.915	0.164	5.566	0.000	0.592	1.237
Fair	Age	0.003	0.003	1.058	0.290	-0.003	0.009
Fair	PhysActiveYes	-1.645	0.107	-15.319	0.000	-1.856	-1.435
Poor	(Intercept)	-1.521	0.290	-5.238	0.000	-2.090	-0.952
Poor	Age	0.022	0.005	4.522	0.000	0.013	0.032
Poor	PhysActiveYes	-2.656	0.236	-11.275	0.000	-3.117	-2.194

Fair vs. Excellent Health

The baseline category for the model is Excellent.

The model equation for the log-odds a person rates themselves as having "Fair" health vs. "Excellent" is

$$\log\left(\frac{\hat{\pi}_{Fair}}{\hat{\pi}_{Excellent}}\right) = 0.915 + 0.003 \times \text{Age} - 1.645 \times \text{PhysActive}$$
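As a worked example, the sketch below plugs the rounded "Fair" coefficients from the model output into that equation for one hypothetical person. The person (40 years old, physically active) and the variable names are assumptions made for illustration, and PhysActive = Yes is coded as 1.

```r
# Rounded "Fair" coefficients from the model output (baseline: Excellent)
b_fair <- c(intercept = 0.915, age = 0.003, phys_active = -1.645)

# Hypothetical person: 40 years old, physically active (PhysActive = Yes -> 1)
age <- 40
phys_active <- 1

log_odds <- b_fair[["intercept"]] + b_fair[["age"]] * age +
  b_fair[["phys_active"]] * phys_active
odds <- exp(log_odds)

log_odds   # log-odds of Fair vs. Excellent
odds       # odds of Fair vs. Excellent
```

A negative log-odds means odds below 1, so for this hypothetical person a "Fair" rating is predicted to be less likely than an "Excellent" one.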

Interpretations

For each additional year in age, the odds a person rates themselves as having fair health versus excellent health are expected to multiply by 1.003 (exp(0.003)), holding physical activity constant.

The odds a person who does physical activity will rate themselves as having fair health versus excellent health are expected to be 0.193 (exp(-1.645)) times the odds for a person who doesn't do physical activity, holding age constant.

Interpretations

The odds a 0-year-old person who doesn't do physical activity rates themselves as having fair health vs. excellent health are 2.497 (exp(0.915)).

⚠️ Need to mean-center age for the intercept to have a meaningful interpretation!
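The odds-scale numbers quoted in these interpretations come from exponentiating the log-odds coefficients. A minimal sketch, using the rounded "Fair" estimates copied from the model output (the vector name `fair` is just for illustration):

```r
# Rounded "Fair" coefficients copied from the model output
fair <- c("(Intercept)" = 0.915, Age = 0.003, PhysActiveYes = -1.645)

# Exponentiate to move from the log-odds scale to the odds scale
round(exp(fair), 3)

# Multiplicative effects compound: 10 additional years of age
# multiply the odds by exp(10 * 0.003), not by 10 * exp(0.003)
exp(10 * fair[["Age"]])
```

Setting `exponentiate = TRUE` in the `tidy()` call should report the estimates on this odds scale directly, though that is not how the output in these slides was produced.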

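Hypothesis test for $\beta_{jk}$

The test of significance for the coefficient $\beta_{jk}$ is

Hypotheses: $H_0: \beta_{jk} = 0$ vs. $H_a: \beta_{jk} \neq 0$

Test Statistic: $$z = \frac{\hat{\beta}_{jk} - 0}{SE(\hat{\beta}_{jk})}$$

P-value: $P(|Z| > |z|)$,

where $Z \sim N(0, 1)$, the Standard Normal distribution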
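The test can be reproduced by hand from a row of the model output. The sketch below uses the rounded estimate and standard error for Age in the "Poor" equation; because the inputs are rounded to three decimals, the statistic will not match the reported value exactly. The variable names are illustrative.

```r
# Estimate and standard error for Age in the "Poor" equation,
# copied (rounded) from the model output
estimate  <- 0.022
std_error <- 0.005

z <- (estimate - 0) / std_error        # test statistic under H0: beta = 0
p_value <- 2 * pnorm(-abs(z))          # two-sided p-value from N(0, 1)

z          # differs slightly from the reported 4.522 because inputs are rounded
p_value

# Using the statistic reported for Age in the "Fair" equation
2 * pnorm(-abs(1.058))
```

The final line starts from the reported statistic rather than the rounded estimate and standard error, so it should agree with the p-value shown in the table for that row up to rounding.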

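Confidence interval for $\beta_{jk}$

We can calculate the C% confidence interval for $\beta_{jk}$ using the following:

$$\hat{\beta}_{jk} \pm z^* SE(\hat{\beta}_{jk})$$

where $z^*$ is calculated from the $N(0, 1)$ distribution

We are C% confident that for every one unit change in $x_j$, the odds of $y = k$ versus the baseline will multiply by a factor of $\exp\{\hat{\beta}_{jk} - z^* SE(\hat{\beta}_{jk})\}$ to $\exp\{\hat{\beta}_{jk} + z^* SE(\hat{\beta}_{jk})\}$, holding all else constant.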
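A minimal sketch of this calculation in base R, applied to the rounded estimate and standard error for PhysActiveYes in the "Fair" equation. The helper `wald_ci()` is hypothetical (written here for illustration, not taken from a package), and the result should match the reported interval only up to rounding of the inputs.

```r
# Hypothetical helper: normal-based confidence interval on the log-odds scale
wald_ci <- function(estimate, std_error, level = 0.95) {
  z_star <- qnorm(1 - (1 - level) / 2)   # critical value from N(0, 1)
  c(lower = estimate - z_star * std_error,
    upper = estimate + z_star * std_error)
}

# Rounded estimate and standard error for PhysActiveYes, "Fair" equation
ci <- wald_ci(estimate = -1.645, std_error = 0.107)

ci        # interval for the coefficient (log-odds scale)
exp(ci)   # interval for the multiplicative change in odds
```

Exponentiating the endpoints, rather than exponentiating the estimate and adding a margin, keeps the interval valid on the odds scale because `exp()` is monotone.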

Interpreting confidence intervals for $\beta_{jk}$

y.level	term	estimate	std.error	statistic	p.value	conf.low	conf.high
Fair	(Intercept)	0.915	0.164	5.566	0.00	0.592	1.237
Fair	Age	0.003	0.003	1.058	0.29	-0.003	0.009
Fair	PhysActiveYes	-1.645	0.107	-15.319	0.00	-1.856	-1.435

We are 95% confident that for each additional year in age, the odds a person rates themselves as having fair health versus excellent health will multiply by 0.997 (exp(-0.003)) to 1.009 (exp(0.009)), holding physical activity constant.

We are 95% confident that the odds a person who does physical activity will rate themselves as having fair health versus excellent health are 0.156 (exp(-1.856)) to 0.238 (exp(-1.435)) times the odds for a person who doesn't do physical activity, holding age constant.

Recap

Introduce multinomial logistic regression
Interpret model coefficients
Inference for a coefficient
