# Multiple linear regression
## Inference
### Prof. Maria Tackett

---

## [Click here for PDF of slides](11-mlr-inference.pdf)

---

## Topics

- Conduct a hypothesis test for `$\beta_j$`

- Calculate a confidence interval for `$\beta_j$`

- Quick overview of math details for MLR

---

## House prices in Levittown

The data set contains the sale price and characteristics of 85 homes in Levittown, NY, that sold between June 2010 and May 2011.

We would like to use the characteristics of a house to understand variability in the sales price.

---

## Variables

**Predictors**
- .vocab[`bedrooms`]: Number of bedrooms
- .vocab[`bathrooms`]: Number of bathrooms
- .vocab[`living_area`]: Total living area of the house (in square feet)
- .vocab[`lot_size`]: Total area of the lot (in square feet)
- .vocab[`year_built`]: Year the house was built
- .vocab[`property_tax`]: Annual property taxes (in U.S. dollars)

**Response**
- .vocab[`sale_price`]: Sales price (in U.S. dollars)

---

## EDA: Response variable

---

## EDA: Response vs. Predictors

---

## Home price model

|term         |     estimate|   std.error| statistic| p.value|
|:------------|------------:|-----------:|---------:|-------:|
|(Intercept)  | -7148818.957| 3820093.694|    -1.871|   0.065|
|bedrooms     |   -12291.011|    9346.727|    -1.315|   0.192|
|bathrooms    |    51699.236|   13094.170|     3.948|   0.000|
|living_area  |       65.903|      15.979|     4.124|   0.000|
|lot_size     |       -0.897|       4.194|    -0.214|   0.831|
|year_built   |     3760.898|    1962.504|     1.916|   0.059|
|property_tax |        1.476|       2.832|     0.521|   0.604|

---

## Hypothesis test for `$\beta_j$`

---

## Outline of a hypothesis test

1️⃣ State the hypotheses.

<br>

2️⃣ Calculate the test statistic.

<br>

3️⃣ Calculate the p-value.

<br>

4️⃣ State the conclusion.

---

## 1️⃣ State the hypotheses

.eq[
`$$\begin{align}
&H_0: \beta_{living\_area} = 0 \\
&H_a: \beta_{living\_area} \neq 0\end{align}$$`
]

---

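## 2️⃣ Calculate the test statistic

.eq[
`$$\text{test statistic} = \frac{\hat{\beta}_{living\_area} - 0}{SE(\hat{\beta}_{living\_area})} = \frac{65.903 - 0}{15.979} = 4.124$$`
]

The estimated slope, 65.903, is 4.124 standard errors above the hypothesized value, 0.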
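
As a quick arithmetic check, the statistic can be recomputed from the `living_area` row of the model output. This is a minimal Python sketch; the variable names are illustrative and not part of the original analysis.

```python
# Values copied from the living_area row of the model output
estimate = 65.903       # estimated slope
std_error = 15.979      # standard error of the slope
hypothesized = 0.0      # value of the slope under H_0

# Test statistic: distance between the estimate and the hypothesized
# value, measured in standard errors
test_statistic = (estimate - hypothesized) / std_error
print(round(test_statistic, 3))  # 4.124
```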

---

## 3️⃣ Calculate the p-value

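The p-value is calculated using a `$t$` distribution with `$\color{purple}{n-p-1}$` degrees of freedom, where `$n$` is the number of observations and `$p$` is the number of predictors in the model (so the model has `$p + 1$` coefficients, including the intercept).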

In this example, the p-value is calculated using a `$t$` distribution with `$\color{purple}{85-6-1 = 78}$` degrees of freedom.

.alert[ Given `$\beta_{living\_area} = 0$`, the probability of observing a slope at least as extreme as the one we observed, 65.903, is about 0.00009.
]
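
The p-value above can be approximated without a statistics library. The sketch below is one possible way to do it, assuming a two-sided test: it integrates the `$t$` density numerically with Simpson's rule. The function names `t_pdf` and `two_sided_p_value` are made up for this illustration; in practice a library routine such as `scipy.stats.t.sf` would usually be used instead.

```python
import math

def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))

def two_sided_p_value(t_stat, df, steps=2000):
    """Approximate P(|T| >= |t_stat|) with Simpson's rule (steps must be even)."""
    upper = abs(t_stat)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for k in range(1, steps):
        weight = 4 if k % 2 == 1 else 2
        total += weight * t_pdf(k * h, df)
    central_area = total * h / 3      # P(0 <= T <= |t_stat|)
    return 1.0 - 2.0 * central_area   # both tails, by symmetry

n, p = 85, 6           # observations and predictors in the example
df = n - p - 1         # 78 degrees of freedom
p_value = two_sided_p_value(4.124, df)
print(df, p_value)     # p_value should be close to the 0.00009 reported above
```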

---

## 4️⃣ State the conclusion

<font class = "vocab">The p-value is very small, so we reject `$H_0$`.</font> The data provide sufficient evidence that living area is a helpful predictor of sale price, even after accounting for the other predictors in the model.

---

## Confidence interval for `$\beta_j$`

---

## Confidence Interval for `$\beta_j$`

.eq[
The `$C\%$` confidence interval for `$\beta_j$` is
`$$\hat{\beta}_j \pm t^* SE(\hat{\beta}_j)$$`
where `$t^*$` is the critical value from a `$t$` distribution with `$n - p - 1$` degrees of freedom
]

**General Interpretation**: We are `$C\%$` confident that the interval LB to UB contains the population coefficient of `$x_j$`. Therefore, for every one-unit increase in `$x_j$`, we expect `$y$` to change by LB to UB units, on average, holding all other predictors constant.

---

## Confidence interval for `living_area`

.tiny[
<table>
 <thead>
  <tr>
   <th style="text-align:left;"> term </th>
   <th style="text-align:right;"> estimate </th>
   <th style="text-align:right;"> std.error </th>
   <th style="text-align:right;"> statistic </th>
   <th style="text-align:right;"> p.value </th>
   <th style="text-align:right;"> conf.low </th>
   <th style="text-align:right;"> conf.high </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> (Intercept) </td>
   <td style="text-align:right;"> -7148818.957 </td>
   <td style="text-align:right;"> 3820093.694 </td>
   <td style="text-align:right;"> -1.871 </td>
   <td style="text-align:right;"> 0.065 </td>
   <td style="text-align:right;"> -14754041.291 </td>
   <td style="text-align:right;"> 456403.376 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> bedrooms </td>
   <td style="text-align:right;"> -12291.011 </td>
   <td style="text-align:right;"> 9346.727 </td>
   <td style="text-align:right;"> -1.315 </td>
   <td style="text-align:right;"> 0.192 </td>
   <td style="text-align:right;"> -30898.915 </td>
   <td style="text-align:right;"> 6316.893 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> bathrooms </td>
   <td style="text-align:right;"> 51699.236 </td>
   <td style="text-align:right;"> 13094.170 </td>
   <td style="text-align:right;"> 3.948 </td>
   <td style="text-align:right;"> 0.000 </td>
   <td style="text-align:right;"> 25630.746 </td>
   <td style="text-align:right;"> 77767.726 </td>
  </tr>
  <tr>
   <td style="text-align:left;background-color: #dce5b2 !important;"> living_area </td>
   <td style="text-align:right;background-color: #dce5b2 !important;"> 65.903 </td>
   <td style="text-align:right;background-color: #dce5b2 !important;"> 15.979 </td>
   <td style="text-align:right;background-color: #dce5b2 !important;"> 4.124 </td>
   <td style="text-align:right;background-color: #dce5b2 !important;"> 0.000 </td>
   <td style="text-align:right;background-color: #dce5b2 !important;"> 34.091 </td>
   <td style="text-align:right;background-color: #dce5b2 !important;"> 97.715 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> lot_size </td>
   <td style="text-align:right;"> -0.897 </td>
   <td style="text-align:right;"> 4.194 </td>
   <td style="text-align:right;"> -0.214 </td>
   <td style="text-align:right;"> 0.831 </td>
   <td style="text-align:right;"> -9.247 </td>
   <td style="text-align:right;"> 7.453 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> year_built </td>
   <td style="text-align:right;"> 3760.898 </td>
   <td style="text-align:right;"> 1962.504 </td>
   <td style="text-align:right;"> 1.916 </td>
   <td style="text-align:right;"> 0.059 </td>
   <td style="text-align:right;"> -146.148 </td>
   <td style="text-align:right;"> 7667.944 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> property_tax </td>
   <td style="text-align:right;"> 1.476 </td>
   <td style="text-align:right;"> 2.832 </td>
   <td style="text-align:right;"> 0.521 </td>
   <td style="text-align:right;"> 0.604 </td>
   <td style="text-align:right;"> -4.163 </td>
   <td style="text-align:right;"> 7.115 </td>
  </tr>
</tbody>
</table>
]
.vocab[
We are 95% confident that, for each additional square foot of living area, the sale price is expected to increase by $34.09 to $97.71, on average, holding all other characteristics constant.
]
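
The interval in the highlighted row can be reproduced by hand from the estimate and its standard error. This is a minimal sketch, assuming a critical value of about 1.9908 (the approximate 0.975 quantile of a `$t$` distribution with 78 degrees of freedom); that value is supplied here as an assumption rather than computed.

```python
estimate = 65.903      # living_area slope from the model output
std_error = 15.979     # its standard error
t_star = 1.9908        # assumed: approximate 0.975 quantile of t with 78 df

# Confidence interval: estimate plus or minus t* standard errors
margin = t_star * std_error
lower, upper = estimate - margin, estimate + margin
print(round(lower, 2), round(upper, 2))  # 34.09 97.71
```

The small differences from the table's `conf.low` and `conf.high` in the third decimal place come from rounding the inputs.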

---

## 🛑 Caution: Large sample sizes

If the sample size is large enough, the test will likely result in rejecting `$H_0: \beta_j=0$` even if `$x_j$` has a very small effect on `$y$`.

- Consider the .vocab[practical significance] of the result, not just the statistical significance

- Use the confidence interval to draw conclusions instead of relying only on p-values

---

## 🛑 Caution: Small sample sizes

If the sample size is small, there may not be enough evidence to reject `$H_0: \beta_j=0$`

- When you fail to reject the null hypothesis, **DON'T** immediately conclude that the variable has no association with the response. 
  
- There may be a linear association that is just not strong enough to detect given your data, or there may be a non-linear association.

---

## Math details

---

## Regression Model

The multiple linear regression model assumes

.eq[
`$$Y|X_1, X_2,  \ldots, X_p \sim N(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p, \sigma_\epsilon^2)$$`
]

For a given observation `$(x_{i1}, x_{i2}, \ldots, x_{ip}, y_i)$`, we can rewrite the previous statement as

.eq[
`$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \epsilon_{i} \hspace{10mm} \epsilon_i \sim N(0,\sigma_\epsilon^2)$$`
]

---

## Estimating `$\sigma_\epsilon^2$`

For a given observation `$(x_{i1}, x_{i2}, \ldots,x_{ip}, y_i)$` the residual is

.eq[
`$$e_i = y_{i} - (\hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \hat{\beta}_{2} x_{i2} + \dots + \hat{\beta}_p x_{ip})$$`
]

The estimated value of the regression variance, `$\sigma_{\epsilon}^2$`, is

.eq[
`$$\hat{\sigma}^2_{\epsilon} = \frac{\sum_{i=1}^n e_i^2}{n-p-1}$$`
]
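
A small sketch of this estimate, assuming the usual estimator that divides the sum of squared residuals by `$n - p - 1$`. The function name and the residual values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def estimate_regression_variance(residuals, num_predictors):
    """Return sum(e_i^2) / (n - p - 1) for a model with an intercept and p predictors."""
    n = len(residuals)
    df = n - num_predictors - 1
    if df <= 0:
        raise ValueError("need more observations than coefficients")
    rss = sum(e * e for e in residuals)  # sum of squared residuals
    return rss / df

# Hypothetical residuals from a model with 2 predictors (made-up numbers)
toy_residuals = [1.0, -1.0, 2.0, -2.0, 0.0]
sigma2_hat = estimate_regression_variance(toy_residuals, num_predictors=2)
print(sigma2_hat)  # 10 / (5 - 2 - 1) = 5.0
```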

---

## Estimating Coefficients

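One way to estimate the coefficients is to minimize the sum of squared residuals: take the partial derivative of the formula below with respect to each `$\hat{\beta}_j$`, set it equal to 0, and solve.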

.eq[
`$$\sum_{i=1}^n e_i^2 = \sum_{i=1}^{n}[y_{i} - (\hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \hat{\beta}_{2} x_{i2} + \dots + \hat{\beta}_p x_{ip})]^2$$`
]

This produces messy formulas, so instead we can use matrix notation for multiple linear regression and estimate the coefficients using rules from linear algebra. For more details, see [A Matrix Formulation of the Multiple Regression Model](https://online.stat.psu.edu/stat462/node/132/).
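
To make the matrix approach concrete, here is a minimal sketch that solves the normal equations `$(X^TX)\hat{\beta} = X^Ty$` with plain Python lists. The function `fit_least_squares` and the toy data are assumptions made for illustration; a real analysis would typically rely on a numerical library or statistical software rather than hand-written elimination.

```python
def fit_least_squares(X, y):
    """Solve the normal equations (X^T X) b = X^T y by Gaussian elimination.

    X is a list of rows that already includes a leading 1 for the intercept.
    """
    k = len(X[0])
    # Build the augmented matrix [X^T X | X^T y]
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(k)]
    # Forward elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular; predictors are collinear")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, k):
            factor = A[r][col] / A[col][col]
            for c in range(col, k + 1):
                A[r][c] -= factor * A[col][c]
    # Back substitution
    beta = [0.0] * k
    for i in reversed(range(k)):
        known = sum(A[i][j] * beta[j] for j in range(i + 1, k))
        beta[i] = (A[i][k] - known) / A[i][i]
    return beta

# Toy data generated exactly from y = 1 + 2*x1 + 3*x2 (no noise)
X = [[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 1, 1], [1, 2, 1]]
y = [1, 3, 4, 6, 8]
beta_hat = fit_least_squares(X, y)
print([round(b, 6) for b in beta_hat])
```

Because the toy response has no noise, the estimates recover the intercept 1 and the slopes 2 and 3 up to floating-point error.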

---

## Recap

- Conduct a hypothesis test for `$\beta_j$`

- Calculate a confidence interval for `$\beta_j$`

- Quick overview of math details for MLR