Comparing means with ANOVA

Prof. Maria Tackett

Topics

Compare groups using analysis of variance

Aldrin in the Wolf River

The Wolf River in Tennessee flows past an abandoned site once used by the pesticide industry for dumping wastes, including chlordane (pesticide), aldrin, and dieldrin (both insecticides).
These highly toxic organic compounds can cause various cancers and birth defects.

Aldrin in the Wolf River

## # A tibble: 30 x 2
##    aldrin depth 
##     <dbl> <chr> 
##  1    3.8 bottom
##  2    4.8 bottom
##  3    4.9 bottom
##  4    5.3 bottom
##  5    5.4 bottom
##  6    5.7 bottom
##  7    6.3 bottom
##  8    7.3 bottom
##  9    8.1 bottom
## 10    8.8 bottom
## # … with 20 more rows

Aldrin in the Wolf River

The standard method to test whether these substances are present in a river is to take samples at six-tenths depth.

Because these compounds are denser than water and their molecules tend to stick to particles of sediment, they are more likely to be found in higher concentrations near the bottom than near mid-depth.

Is there a difference between the mean aldrin concentrations among the three depth levels?

Aldrin by depth

depth	n	mean	sd
bottom	10	6.04	1.579
middepth	10	5.05	1.104
surface	10	4.20	0.660
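
As a quick check on the first row of this summary table, the sketch below recomputes n, the mean, and the standard deviation for the bottom group from the ten bottom measurements printed in the tibble earlier. Python's standard library is used purely for illustration (an assumption, since the slides do not show their code), and the variable names are hypothetical.

```python
from statistics import mean, stdev

# The ten "bottom" aldrin measurements shown in the tibble printout.
bottom = [3.8, 4.8, 4.9, 5.3, 5.4, 5.7, 6.3, 7.3, 8.1, 8.8]

n_bottom = len(bottom)
mean_bottom = mean(bottom)   # sample mean
sd_bottom = stdev(bottom)    # sample standard deviation (n - 1 denominator)

print(n_bottom, round(mean_bottom, 2), round(sd_bottom, 3))
```

The rounded values should agree with the bottom row of the table (10, 6.04, 1.579). The middepth and surface rows cannot be checked the same way here because their raw values are not printed.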

So far, we have used a quantitative predictor variable to understand the variation in a quantitative response variable.

Now, we will use a categorical (qualitative) predictor variable to understand the variation in a quantitative response variable.

Notation

- $K$ is the number of mutually exclusive groups. We index the groups as $i = 1, \ldots, K$.
- $n_i$ is the number of observations in group $i$
- $n = n_1 + n_2 + \dots + n_K$ is the total number of observations in the data
- $y_{ij}$ is the $j^{th}$ observation in group $i$, for all $i, j$
- $\mu_i$ is the population mean for group $i$, for $i = 1, \ldots, K$

Using ANOVA to compare means

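Question of interest: Is the mean value of the response the same for all groups, or is there at least one group with a significantly different mean value?
To answer this question, we will test the following hypotheses:

$$H_0: \mu_1 = \mu_2 = \dots = \mu_K$$
$$H_a: \text{At least one } \mu_i \text{ is not equal to the others}$$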

What's happening...

If the sample means are "far apart", there is evidence against the null hypothesis $H_0$

We will calculate a test statistic to quantify "far apart" in the context of the data

Analysis of Variance (ANOVA)

Main Idea: Decompose the total variation in the data into the variation between groups (model) and the variation within each group (residuals)

If the variation between groups is significantly greater than the variation within each group, then there is evidence against the null hypothesis.

ANOVA table

term	df	sumsq	meansq	statistic	p.value
depth	2	16.961	8.480	6.134	0.006
Residuals	27	37.329	1.383

Total variation

term	df	sumsq	meansq	statistic	p.value
depth	2	16.961	8.480	6.134	0.006
Residuals	27	37.329	1.383

Total variation: variation between and within groups
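
For reference, the standard sum-of-squares decomposition behind this statement can be written as follows, where $y_{ij}$ is the $j^{th}$ observation in group $i$, $n_i$ is the size of group $i$, and $K$ is the number of groups. The symbols $\bar{y}_i$ (sample mean of group $i$) and $\bar{y}$ (overall sample mean) are assumptions introduced here for illustration.

```latex
\underbrace{\sum_{i=1}^{K}\sum_{j=1}^{n_i} (y_{ij} - \bar{y})^2}_{\text{total}}
= \underbrace{\sum_{i=1}^{K} n_i (\bar{y}_i - \bar{y})^2}_{\text{between}}
+ \underbrace{\sum_{i=1}^{K}\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2}_{\text{within}}
```

For the aldrin data, the sumsq column gives a total of 16.961 + 37.329 = 54.290.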

Between variation

term	df	sumsq	meansq	statistic	p.value
depth	2	16.961	8.480	6.134	0.006
Residuals	27	37.329	1.383

Between variation: variation in the group means
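
A minimal sketch of how the depth row's sumsq and meansq could be recovered from the group sizes and means in the "Aldrin by depth" summary table. It assumes the usual between-group sum of squares (each group's squared deviation from the overall mean, weighted by group size). The names are hypothetical, and because the inputs are rounded summary values the result only approximately matches the table.

```python
# Group sizes and sample means from the "Aldrin by depth" summary table.
n = {"bottom": 10, "middepth": 10, "surface": 10}
means = {"bottom": 6.04, "middepth": 5.05, "surface": 4.20}

total_n = sum(n.values())
# Overall mean as the size-weighted average of the group means.
grand_mean = sum(n[g] * means[g] for g in n) / total_n

# Between-group sum of squares: sum of n_i * (group mean - overall mean)^2.
ss_between = sum(n[g] * (means[g] - grand_mean) ** 2 for g in n)
df_between = len(n) - 1            # K - 1
ms_between = ss_between / df_between

print(round(ss_between, 3), df_between, round(ms_between, 3))
```

The printed values can be compared with the depth row of the ANOVA table (sumsq 16.961, df 2, meansq 8.480).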

Within variation

term	df	sumsq	meansq	statistic	p.value
depth	2	16.961	8.480	6.134	0.006
Residuals	27	37.329	1.383

Within variation: variation within each group
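
A similar sketch for the Residuals row, assuming the usual within-group sum of squares, which pools each group's sample variance weighted by its degrees of freedom. The inputs are the rounded n and sd values from the "Aldrin by depth" summary table, and the names are hypothetical.

```python
# Group sizes and sample standard deviations from the summary table.
n = {"bottom": 10, "middepth": 10, "surface": 10}
sds = {"bottom": 1.579, "middepth": 1.104, "surface": 0.660}

# Within-group sum of squares: sum of (n_i - 1) * s_i^2.
ss_within = sum((n[g] - 1) * sds[g] ** 2 for g in n)
df_within = sum(n.values()) - len(n)   # n - K
ms_within = ss_within / df_within      # pooled variance estimate

print(round(ss_within, 3), df_within, round(ms_within, 3))
```

The printed values can be compared with the Residuals row of the ANOVA table (sumsq 37.329, df 27, meansq 1.383).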

Using ANOVA table to test difference in means

term	df	sumsq	meansq	statistic	p.value
depth	2	16.961	8.480	6.134	0.006
Residuals	27	37.329	1.383

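Test statistic: Ratio of between-group to within-group variation, $F = \frac{MS_{Between}}{MS_{Within}}$, where the two mean squares are the meansq values for depth and Residuals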
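
The ratio can be checked directly from the sumsq and df columns of the table. This is an illustrative sketch with hypothetical variable names, not the code used to produce the slides.

```python
# Sums of squares and degrees of freedom from the ANOVA table.
ss_between, df_between = 16.961, 2
ss_within, df_within = 37.329, 27

# Mean squares are sums of squares divided by their degrees of freedom.
ms_between = ss_between / df_between
ms_within = ss_within / df_within

# F statistic: between-group mean square over within-group mean square.
f_stat = ms_between / ms_within

print(round(f_stat, 3))
```

Dividing the rounded meansq values (8.480 / 1.383) gives a slightly different third decimal, which is why the sketch starts from the sums of squares.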

Calculate p-value

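Calculate the p-value using an F distribution with $K - 1$ and $n - K$ degrees of freedom, where $K$ is the number of groups and $n$ is the total number of observations (here, 2 and 27)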
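
One way to sketch this calculation without a statistics library: when the numerator degrees of freedom equal 2, as they do for three groups, the upper-tail probability of the F distribution has the closed form $(1 + 2F/d_2)^{-d_2/2}$, where $d_2$ is the denominator degrees of freedom. The function name below is hypothetical, and the shortcut is an assumption that holds only for this special case. For other degrees of freedom, a library routine such as `scipy.stats.f.sf` would be the usual choice.

```python
def f_upper_tail_df1_is_2(f_stat: float, df2: int) -> float:
    """P(F > f_stat) for an F distribution with 2 and df2 degrees of freedom.

    Uses the closed form that is valid only when the numerator df is 2.
    """
    return (1 + 2 * f_stat / df2) ** (-df2 / 2)

# Values from the ANOVA table: F = 6.134 with 2 and 27 degrees of freedom.
p_value = f_upper_tail_df1_is_2(6.134, 27)

print(round(p_value, 3))
```

The result can be compared with the p.value column of the table (0.006).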

Using ANOVA table to test difference in means

term	df	sumsq	meansq	statistic	p.value
depth	2	16.961	8.480	6.134	0.006
Residuals	27	37.329	1.383

P-value: Probability of observing a test statistic at least as extreme as the F statistic, given that the population group means are all equal

The p-value is very small (0.006), so we reject $H_0$. The data provide sufficient evidence that at least one depth level has a mean aldrin concentration that differs from the others.

Assumptions for ANOVA

1️⃣ Normality: $y_{ij} \sim N(\mu_i, \sigma^2)$

2️⃣ Constant variance: The population distribution for each group has a common variance, $\sigma^2$

3️⃣ Independence: The observations are independent from each other

This applies to observations within and between groups

For ANOVA, we can typically check these assumptions in the exploratory data analysis

Checking Normality

✅ No major skewness or outliers.

Checking Normality

✅ Points fall relatively along the diagonal line.

Checking constant variance

## # A tibble: 3 x 4
##   depth        n  mean    sd
## * <chr>    <int> <dbl> <dbl>
## 1 bottom      10  6.04 1.58 
## 2 middepth    10  5.05 1.10 
## 3 surface     10  4.2  0.660

✅ The maximum standard deviation is about 2.4 times the smallest one. This is OK given the small sample size.
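
The ratio quoted above can be reproduced from the sd column. This is a short illustrative sketch with hypothetical names.

```python
# Sample standard deviations by depth, from the summary output above.
sds = {"bottom": 1.58, "middepth": 1.10, "surface": 0.660}

# Ratio of the largest to the smallest group standard deviation.
sd_ratio = max(sds.values()) / min(sds.values())

print(round(sd_ratio, 1))
```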

Checking independence

✅ Based on what we know about the study, we have no reason to believe that the aldrin concentrations are not independent of each other.

Robustness to Assumptions

Normality: $y_{ij} \sim N(\mu_i, \sigma^2)$
- ANOVA is relatively robust to departures from Normality.
- Concern when there are strongly skewed distributions with different sample sizes (especially if sample sizes are small, < 10 in each group)
Independence: There is independence within and across groups
- If this doesn't hold, use methods that account for correlated errors

Robustness to Assumptions

Constant variance: The population distribution for each group has a common variance, $\sigma^2$
- Critical assumption, since the pooled (combined) variance is important for ANOVA
- General rule: Satisfied if $SD_{max}/SD_{min} \leq 2$. OK if this is somewhat > 2 when sample sizes are small.

Recap

Used ANOVA to compare means across groups

Acknowledgements

Analysis example and map image from OpenIntro Statistics
