Lab 06: Houses in King County

Due Sun, Mar 21, at 11:59 PM EDT

In this lab, you will use multiple linear regression to understand variability in the price of houses in King County, WA.

Learning goals

By the end of the lab you will be able to…

Getting started

Clone assignment repo + start new project

A repository has already been created for you and your teammates. Everyone in your team has access to the same repo.

Workflow: Using git and GitHub as a team

Now that you have had some experience collaborating as a team on GitHub, it will be up to your team to decide who types the responses to each question. Every team member should contribute to the discussion about each lab exercise even if they are not the one typing the team’s responses.

Every team member should have at least one meaningful commit to the repo on GitHub.

Password caching

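If you would like your git password cached for a week, type the following in the Terminal in RStudio. The timeout is given in seconds (604800 seconds is 7 days), and because of the --global flag the setting applies to all of your repositories, not just this project: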

git config --global credential.helper 'cache --timeout 604800'

Packages + data

Packages

We will use the following packages in today’s lab. Feel free to add any other packages as needed.

library(tidyverse)
library(broom)
library(knitr)
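
As a rough illustration of how broom and knitr fit together when an exercise asks you to display a model with a set number of digits, here is a minimal sketch. It uses R's built-in mtcars data as a stand-in, not the lab data, and the object names fit and model_table are made up for the example.

```r
library(broom)
library(knitr)

# mtcars ships with R; it stands in for the lab data here
fit <- lm(mpg ~ wt + hp, data = mtcars)

# tidy() returns a data frame with one row per model term
model_table <- tidy(fit)

# kable() formats the table for the knitted document, rounding to 3 digits
kable(model_table, digits = 3)
```

Changing the digits argument controls how many decimal places appear in the knitted table.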

Data

The data for today’s lab contains the price and other characteristics of over 20,000 houses sold in King County, Washington (the county that includes Seattle). The data set includes the following variables:

houses <- read_csv("data/KingCountyHouses.csv") %>%
  mutate(waterfront = as.factor(waterfront))
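
The sketch below shows what converting an indicator to a factor changes, assuming waterfront is coded 0/1 (an assumption; check the data). It uses a small made-up data frame and base R only, so it runs without the lab data file. The fitted estimates are the same either way; the factor version is labeled as a category, which helps in plots and in reading model output.

```r
# Hypothetical toy data; waterfront is assumed to be coded 0/1
toy <- data.frame(
  price = c(300, 450, 280, 900, 1200, 310),
  waterfront = c(0, 0, 0, 1, 1, 0)
)

# Numeric 0/1 predictor
fit_num <- lm(price ~ waterfront, data = toy)

# Same predictor stored as a factor with levels "0" and "1"
toy$waterfront_f <- as.factor(toy$waterfront)
fit_fct <- lm(price ~ waterfront_f, data = toy)

# Same estimates, different coefficient labels
coef(fit_num)
coef(fit_fct)
levels(toy$waterfront_f)
```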

Exercises

  1. Create a data visualization and calculate summary statistics to describe the distribution of bedrooms. Is this distribution what you expected? Briefly explain why or why not.

  2. We’ll limit the scope of the analysis to houses with 1 to 4 bedrooms. Filter the data set to only include these observations. Then, calculate the mean and standard deviation of bedrooms in the updated data. Display the results using 3 digits.

We will use this data set for the remainder of the lab.

  3. Visualize the distribution of price using a different type of visualization than used in Exercise 1. Then use the visualization and appropriate summary statistics to describe the distribution.

  4. We need to do a little more data wrangling before fitting the model.

You’ll use these variables instead of floors and yr_renovated in the analysis.

  5. Fit the full regression model using the characteristics in the data set to help explain variability in the price of houses in King County. Display the model using 3 digits for the numeric values.

  6. Now let’s check the model conditions. For each condition, state whether it is satisfied and briefly explain using any relevant plots or summary statistics.

  7. We want to refit the model using log(price), the log-transformed price, as the response variable. Briefly explain why we might want to refit the model with the new response variable.

  8. Refit the model with log(price) as the response variable. Display the model using 5 digits for the numeric values.

  9. Interpret the following effects in terms of (1) the change in the log(price) and (2) the change in the price of houses in King County:

    • when the house was built
    • whether the house is on the waterfront

  10. Based on the model, how do you expect the price to differ for a one-story house that is 1500 square feet compared to a two-story house that is 2000 square feet? Assume all other characteristics are the same for the two houses. The response should be in terms of price, not log(price).
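
For the exercises that ask for interpretations in terms of price rather than log(price), the key arithmetic is that an additive change in log(price) corresponds to a multiplicative change in price. The sketch below illustrates this on simulated toy data, not the King County data. The variable names, effect sizes, and the two hypothetical houses are all made up for the example, and it uses base R only.

```r
# Simulated toy data, used only to illustrate the arithmetic.
# The variable names and effect sizes are made up; this is not the lab data.
set.seed(1)
n <- 200
toy <- data.frame(
  sqft = runif(n, 800, 3000),
  waterfront = factor(rbinom(n, 1, 0.2))
)
toy$price <- exp(
  11 + 0.0004 * toy$sqft + 0.5 * (toy$waterfront == "1") + rnorm(n, sd = 0.1)
)

fit <- lm(log(price) ~ sqft + waterfront, data = toy)
b <- coef(fit)

# (1) Additive change in log(price): 500 extra square feet plus waterfront
log_change <- unname(500 * b["sqft"] + b["waterfront1"])

# (2) The same comparison on the price scale is a multiplicative factor
price_ratio <- exp(log_change)

# Check against back-transformed predictions for two hypothetical houses
house_a <- data.frame(sqft = 1200, waterfront = factor("0", levels = c("0", "1")))
house_b <- data.frame(sqft = 1700, waterfront = factor("1", levels = c("0", "1")))
ratio_from_predict <- unname(exp(predict(fit, house_b)) / exp(predict(fit, house_a)))

price_ratio
ratio_from_predict
```

The two ratios agree because every term shared by the two houses cancels when the predictions are subtracted on the log scale. A ratio of, say, 1.25 would be read as the second house being expected to cost about 25% more than the first, holding the other characteristics constant.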

Submission

One team member: Upload the team’s PDF to Gradescope. Be sure to include every team member’s name in the Gradescope submission. Associate the “Overall” graded section with the first page of your PDF, and mark the page where the answer to each exercise appears. If any answer spans multiple pages, mark all of them.

There should only be one submission per team on Gradescope.

Note that your submission on Gradescope should be identical to the PDF that is rendered when we knit the Rmd file in your repo.

Acknowledgements

The data used in this lab is from https://github.com/proback/BYSH.