Variable transformations

Prof. Maria Tackett

Topics

  • Log transformation on the response

  • Log transformation on the predictor

Respiratory Rate vs. Age

  • A high respiratory rate can potentially indicate a respiratory infection in children. In order to determine what indicates a "high" rate, we first want to understand the relationship between a child's age and their respiratory rate.

  • The data contain the respiratory rate for 618 children ages 15 days to 3 years.

  • Variables:

    • Age: age in months
    • Rate: respiratory rate (breaths per minute)

Rate vs. Age

term         estimate  std.error  statistic  p.value  conf.low  conf.high
(Intercept)    47.052      0.504     93.317        0    46.062     48.042
Age            -0.696      0.029    -23.684        0    -0.753     -0.638

Log transformation on the response

Need to transform Y

  • Typically, a "fan-shaped" residual plot indicates the need for a transformation of the response variable Y
    • log(Y) is the most straightforward to interpret
  • When building a model:
    • Choose a transformation and build the model on the transformed data
    • Reassess the residual plots
    • If the residual plots did not sufficiently improve, try a new transformation!
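
The model-building steps above can be sketched in R. This is a minimal illustration on synthetic data: the names `age` and `rate`, the sample size, the simulation settings, and the helper `spread_ratio()` are all assumptions made for the example, not the respiratory data or code behind these slides. It fits the model on the original and log scales and uses a crude numeric summary of how the residual spread changes across the predictor.

```r
# Synthetic data with multiplicative errors (an assumption for illustration only)
set.seed(210)
n    <- 2000
age  <- runif(n, min = 0.5, max = 36)
rate <- exp(3.85 - 0.02 * age + rnorm(n, sd = 0.2))

# Step 1: fit on the original scale, then on the log scale
fit_raw <- lm(rate ~ age)
fit_log <- lm(log(rate) ~ age)

# Step 2: reassess the residuals. A rough numeric check for "fanning":
# ratio of the residual SD in the lower half of x to that in the upper half
spread_ratio <- function(fit, x) {
  lower_half <- x < median(x)
  sd(resid(fit)[lower_half]) / sd(resid(fit)[!lower_half])
}

spread_ratio(fit_raw, age)  # noticeably above 1: spread changes with age
spread_ratio(fit_log, age)  # close to 1: spread is roughly constant
```

The ratio is only a summary; a residual plot such as `plot(fitted(fit_log), resid(fit_log))` remains the primary diagnostic.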

Log transformation on Y

  • If we apply a log transformation to the response variable, we want to estimate the parameters for the model

    $\widehat{\log(Y)} = \hat{\beta}_0 + \hat{\beta}_1 X$

  • We want to interpret the model in terms of Y, not log(Y), so we write all interpretations in terms of

    $\hat{Y} = \exp\{\hat{\beta}_0 + \hat{\beta}_1 X\} = \exp\{\hat{\beta}_0\}\exp\{\hat{\beta}_1 X\}$

Mean and logs

Suppose we have a set of values

x <- c(3, 5, 6, 8, 10, 14, 19)

Let's calculate $\overline{\log(x)}$, the mean of the logged values

log_x <- log(x)
mean(log_x)
## [1] 2.066476

Let's calculate $\log(\bar{x})$, the log of the mean

xbar <- mean(x)
log(xbar)
## [1] 2.228477

Median and logs

x <- c(3, 5, 6, 8, 10, 14, 19)

Let's calculate Median(log(x))

log_x <- log(x)
median(log_x)
## [1] 2.079442

Let's calculate log(Median(x))

median_x <- median(x)
log(median_x)
## [1] 2.079442

Mean, Median, and log

$\overline{\log(x)} \neq \log(\bar{x})$

mean(log_x) == log(xbar)
## [1] FALSE

Median(log(x))=log(Median(x))

median(log_x) == log(median_x)
## [1] TRUE
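
Exact `==` comparisons on doubles can be fragile, so a tolerance-based check with `all.equal()` is a common alternative. The sketch below repeats the two comparisons that way and adds one caveat: with an even number of values, R's `median()` averages the two middle values, so the median identity holds only approximately. The vector `x_even` is a hypothetical example added here.

```r
x <- c(3, 5, 6, 8, 10, 14, 19)

# Tolerance-based comparisons instead of exact `==`
isTRUE(all.equal(mean(log(x)), log(mean(x))))      # FALSE: the means differ
isTRUE(all.equal(median(log(x)), log(median(x))))  # TRUE: the medians agree

# Even number of values: median() averages the two middle values
x_even <- c(3, 5, 6, 8, 10, 14)
median(log(x_even))  # (log(6) + log(8)) / 2
log(median(x_even))  # log(7), slightly larger
```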

Mean and median of log(Y)

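  • Recall that in the untransformed model, $\beta_0 + \beta_1 x_i$ is the mean value of y at the given value $x_i$. This interpretation doesn't hold when we log-transform y

  • The mean of the logged values is not equal to the log of the mean value. Therefore, at a given value of x,

    $\exp\{\text{Mean}(\log(y))\} \neq \text{Mean}(y)$

    $\Rightarrow \exp\{\beta_0 + \beta_1 x\} \neq \text{Mean}(y)$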

Mean and median of log(y)

  • However, the median of the logged values is equal to the log of the median value. Therefore,

exp{Median(log(y))}=Median(y)

  • If the distribution of log(y) is symmetric about the regression line, for a given value xi,

    Median(log(y))=Mean(log(y))
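
A small simulation can make these facts concrete. The sketch below assumes the logged values are normally distributed (so symmetric); the mean, standard deviation, and sample size are arbitrary choices for illustration. Back-transforming the mean of the logs should land close to the median of y, not the mean of y.

```r
# Simulated values whose logs are symmetric (normal); settings are assumptions
set.seed(1)
log_y <- rnorm(100000, mean = 3.5, sd = 0.5)
y     <- exp(log_y)

exp_mean_log <- exp(mean(log_y))  # back-transformed mean of the logs
median_y     <- median(y)
mean_y       <- mean(y)

# exp_mean_log and median_y nearly agree; mean_y is noticeably larger
round(c(exp_mean_log = exp_mean_log, median_y = median_y, mean_y = mean_y), 2)
```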

Interpretation with log-transformed y

  • Given the previous facts, if $\widehat{\log(Y)} = \hat{\beta}_0 + \hat{\beta}_1 x$, then

    $\text{Median}(\hat{Y}) = \exp\{\hat{\beta}_0\}\exp\{\hat{\beta}_1 x\}$

  • Intercept: When X = 0, the median of Y is expected to be $\exp\{\hat{\beta}_0\}$
  • Slope: For every one unit increase in X, the median of Y is expected to multiply by a factor of $\exp\{\hat{\beta}_1\}$
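
The multiplicative reading of the slope follows from comparing the fitted medians at x + 1 and at x. A short derivation under the model above (the conditional notation is introduced here only for the comparison):

```latex
\frac{\text{Median}(\hat{Y} \mid x + 1)}{\text{Median}(\hat{Y} \mid x)}
  = \frac{\exp\{\hat{\beta}_0\}\exp\{\hat{\beta}_1 (x + 1)\}}
         {\exp\{\hat{\beta}_0\}\exp\{\hat{\beta}_1 x\}}
  = \exp\{\hat{\beta}_1\}
```

Equivalently, a one unit increase in X corresponds to a change of $100(\exp\{\hat{\beta}_1\} - 1)\%$ in the median of Y.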

log(Rate) vs. Age

term         estimate  std.error  statistic  p.value  conf.low  conf.high
(Intercept)     3.845      0.013    304.500        0     3.820      3.870
Age            -0.019      0.001    -25.839        0    -0.020     -0.018


Intercept: The median respiratory rate for a newborn child (Age = 0) is expected to be 46.759 (exp{3.845}) breaths per minute.

Slope: For each additional month in a child's age, the median respiratory rate is expected to multiply by a factor of 0.981 (exp{-0.019}), a decrease of about 1.9%.
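
The arithmetic behind these interpretations can be checked directly from the rounded coefficients in the table. The variable names below are hypothetical, and the 12-month prediction is an added illustration, not a value reported in the slides.

```r
# Coefficients as reported in the table (rounded to three decimals)
b0 <- 3.845
b1 <- -0.019

median_at_birth      <- exp(b0)               # about 46.76 breaths per minute
factor_per_month     <- exp(b1)               # about 0.981
pct_change_per_month <- 100 * (exp(b1) - 1)   # about -1.9 percent

# Predicted median rate at a hypothetical age of 12 months
median_at_12 <- exp(b0 + b1 * 12)

# Equivalent form: the intercept factor times the monthly factor applied 12 times
all.equal(median_at_12, median_at_birth * factor_per_month^12)
```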

Confidence interval for βj

  • The confidence interval for the coefficient of X describing its relationship with log(Y) is

    $\hat{\beta}_j \pm t^* SE(\hat{\beta}_j)$

  • The confidence interval for the coefficient of X describing its relationship with Y is

    $\exp\{\hat{\beta}_j \pm t^* SE(\hat{\beta}_j)\}$

Coefficient of Age

term         estimate  std.error  statistic  p.value  conf.low  conf.high
(Intercept)     3.845      0.013    304.500        0     3.820      3.870
Age            -0.019      0.001    -25.839        0    -0.020     -0.018

We are 95% confident that for each additional month in age, the median respiratory rate is expected to multiply by a factor of 0.980 to 0.982 (exp{-0.02} to exp{-0.018}).
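
In R, one way to obtain such an interval is to exponentiate the endpoints returned by `confint()`. The sketch below uses synthetic data (the names `age` and `rate` and all simulation settings are assumptions), so its numbers will not match the table.

```r
# Synthetic data for illustration only (not the data behind the table)
set.seed(618)
n    <- 618
age  <- runif(n, min = 0.5, max = 36)
rate <- exp(3.85 - 0.02 * age + rnorm(n, sd = 0.2))

fit_log <- lm(log(rate) ~ age)

# 95% CI for the slope on the log scale, then exponentiate the endpoints
ci_log    <- confint(fit_log, parm = "age", level = 0.95)
ci_factor <- exp(ci_log)

exp(coef(fit_log)["age"])  # estimated multiplicative change per month
ci_factor                  # interval for that multiplicative change
```

Because exp() is monotone, exponentiating the endpoints preserves the coverage of the interval. The back-transformed interval is no longer symmetric around the point estimate.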

Log transformation on the predictor

Log Transformation on X

Try a transformation on X if the scatterplot shows some curvature but the variance is constant for all values of X.

Model with Transformation on X

$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 \log(X)$

  • Intercept: When log(X) = 0 (i.e., X = 1), the mean of Y is expected to be $\hat{\beta}_0$

  • Slope: When X is multiplied by a factor of C, the mean of Y is expected to change by $\hat{\beta}_1 \log(C)$ units

    • Example: when X is multiplied by a factor of 2, the mean of Y is expected to change by $\hat{\beta}_1 \log(2)$ units
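
The slope interpretation follows from the property log(Cx) = log(C) + log(x). A short derivation, writing $\hat{Y}(x)$ for the fitted value at x (notation introduced here only for this step):

```latex
\hat{Y}(Cx) - \hat{Y}(x)
  = \left[\hat{\beta}_0 + \hat{\beta}_1 \log(Cx)\right]
    - \left[\hat{\beta}_0 + \hat{\beta}_1 \log(x)\right]
  = \hat{\beta}_1 \left[\log(C) + \log(x) - \log(x)\right]
  = \hat{\beta}_1 \log(C)
```

The change depends only on the factor C, not on the starting value x.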

Rate vs. log(Age)

term         estimate  std.error  statistic  p.value  conf.low  conf.high
(Intercept)    50.135      0.632     79.330        0    48.893     51.376
log_age        -5.982      0.263    -22.781        0    -6.498     -5.467


Intercept: The expected (mean) respiratory rate for children who are 1 month old (log(1) = 0) is 50.135 breaths per minute.

Slope: If a child's age doubles, we expect their mean respiratory rate to decrease by 4.146 breaths per minute (-5.982 * log(2) = -4.146).
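
The same calculation works for any multiplicative change in age. A minimal sketch using the rounded coefficient from the table; the helper `change_for_factor()` is a hypothetical name, and the 10% and halving cases are added illustrations.

```r
b1 <- -5.982  # coefficient of log(Age) as reported in the table

# Expected change in mean rate when age is multiplied by a factor C
# (log() is the natural log in R)
change_for_factor <- function(b1, C) b1 * log(C)

change_for_factor(b1, 2)    # age doubles: about -4.146
change_for_factor(b1, 1.1)  # age increases by 10%: about -0.570
change_for_factor(b1, 0.5)  # age halves: about +4.146
```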

See Log Transformations in Linear Regression for more details about interpreting regression models with log-transformed variables.

Recap

  • Log transformation on the response

  • Log transformation on the predictor
