Getting Started

You can find the hw-02 repo in the sta210-sp21 organization on GitHub. This repo contains the starter documents and data set needed to complete the assignment.

See the Lab 01 instructions for more details about cloning the repo, starting a new RStudio project, and configuring git.

Tips

Here are some tips as you complete HW 02:

  • Knit your document and commit your changes regularly; the more often the better (for example, once per task). You should have at least 3 commit messages by the end of the assignment.
  • Push all your changes back to your GitHub repo. The Git pane in RStudio should be empty after you push. You can also check your assignment repo on github.com to make sure it has updated.
  • Once you have completed your work, submit your assignment on Gradescope. You are welcome to resubmit as many times as you like until the assignment deadline. We will grade the most recent version of the .pdf file on Gradescope.

Packages

We will use the following packages in this assignment:

library(tidyverse)
library(broom)
library(knitr)
# add other packages as needed

Questions

Part 1: Conceptual questions

Instructions

This section of the homework contains short answer questions focused on the concepts discussed in class. Some of these questions may also include short computing tasks.

All narrative should be written in complete sentences, and all visualizations should have informative titles and axis labels.

Are high energy songs more positive? To answer this question, we’ll analyze data on some of the most popular songs on Spotify, i.e. those with track_popularity >= 80. We’ll use linear regression to fit a model to predict a song’s positiveness (valence) based on its energy level (energy).
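
As a minimal sketch of the filtering step described above: the data frame name `songs` and the values in it are hypothetical and made up for illustration; only the column names track_popularity, valence, and energy come from the assignment text. The starter documents may load and name the data differently.

```r
library(dplyr)

# Hypothetical toy data standing in for the assignment's data set.
# The values are invented; only the column names come from the text.
songs <- data.frame(
  track_popularity = c(85, 79, 80, 92),
  valence = c(0.61, 0.40, 0.75, 0.33),
  energy = c(0.80, 0.55, 0.67, 0.45)
)

# Keep only the most popular songs (track_popularity of 80 or higher)
popular_songs <- songs %>%
  filter(track_popularity >= 80)

popular_songs
```

Saving the filtered rows under a new name keeps the original data available in case you need to check how many songs were dropped.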

  1. Create a plot to visualize the relationship between a song’s positiveness and energy level. Interpret the plot in the context of the data.

  2. Write the equation of the statistical model using mathematical notation.

  3. Fit the regression model. Display the output using 3 digits and include a short and informative caption.

    • Interpret the intercept in the context of the data.
    • Interpret the slope in the context of the data.
  4. What percent of the variation in valence is explained by a song’s energy? Calculate this value using the ANOVA table. Show how you calculated this answer.

  5. Next, let’s check the model conditions, starting with linearity. Is the linearity condition satisfied? Briefly explain your reasoning, showing any output and/or plots used to check this condition.

  6. Is the constant variance condition satisfied? Briefly explain your reasoning showing any output and/or plots used to check this condition.

  7. Is the Normality condition satisfied? Briefly explain your reasoning showing any output and/or plots used to check this condition.

  8. Is the independence condition satisfied? Briefly explain your reasoning showing any output and/or plots used to check this condition.

  9. Let’s finish by using the model to make predictions:

    • Obtain the predicted value and appropriate 90% interval for the mean valence of the subset of songs with energy of 0.8. Interpret the interval in the context of the data.

    • Obtain the predicted value and appropriate 90% interval for the valence of the song “Put Your Records On” by Corinne Bailey Rae, which has an energy of 0.55. Interpret the interval in the context of the data.
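
The general workflow behind the questions above (fit a model, compute the percent of variation explained from the ANOVA table, plot residuals, and compare the two kinds of 90% interval) can be sketched as follows. This is an illustration only, using R's built-in mtcars data as a stand-in rather than the Spotify data, with mpg as the response and wt as the predictor. It uses base R functions; it is not the required approach for the assignment.

```r
# Stand-in data: the built-in mtcars data set, NOT the assignment's data.
# mpg plays the role of the response and wt the role of the predictor.
model <- lm(mpg ~ wt, data = mtcars)

# Coefficient table rounded to 3 digits
round(summary(model)$coefficients, 3)

# Percent of variation explained, computed from the ANOVA table:
# R-squared = SS_model / (SS_model + SS_residual)
anova_tbl <- anova(model)
ss_model <- anova_tbl["wt", "Sum Sq"]
ss_resid <- anova_tbl["Residuals", "Sum Sq"]
r_squared <- ss_model / (ss_model + ss_resid)
round(100 * r_squared, 1)

# Residuals vs. fitted values: used to assess linearity and constant variance
plot(fitted(model), resid(model),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs. fitted values")
abline(h = 0, lty = 2)

# Normal quantile-quantile plot of the residuals: used to assess normality
qqnorm(resid(model))
qqline(resid(model))

# New observation at which to predict (wt = 3 is an arbitrary example value)
new_obs <- data.frame(wt = 3)

# 90% confidence interval for the MEAN response at wt = 3
mean_interval <- predict(model, newdata = new_obs,
                         interval = "confidence", level = 0.90)

# 90% prediction interval for an INDIVIDUAL response at wt = 3
indiv_interval <- predict(model, newdata = new_obs,
                          interval = "prediction", level = 0.90)

mean_interval
indiv_interval
```

Both intervals share the same predicted value, but the prediction interval is wider because it also accounts for the variability of an individual observation around the mean. In an R Markdown report, the coefficient table could instead be displayed with `tidy()` from broom piped into `kable()` from knitr, which accepts `digits` and `caption` arguments.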

Part 2: Communicating results

Instructions

Part of a statistician’s / data scientist’s job is being able to communicate technical results to a general (often non-technical) audience. This section of the homework will focus specifically on communication and writing some of the results for a general audience.

  1. Suppose a radio station wants to use your model to help them determine which songs to play based on a song’s predicted positiveness. What are some potential limitations the radio station should consider before using your model? Limitations may be related to the data itself, the statistical methods, model conditions, etc.

    Write a short paragraph (about 3 to 5 sentences) for the radio station manager describing some potential limitations. You can assume the manager has taken an introductory statistics class and understands basic statistical terms. You may include plots or tables as needed.

Questions about R + simple linear regression (Optional)

Do you have any questions about anything covered in the class thus far? This includes

  • Using RStudio
  • Writing reproducible reports using R Markdown
  • Version control and GitHub
  • Exploratory data analysis
  • Simple linear regression

If so, please click here to submit your question(s). Frequently asked questions will be discussed in an upcoming lecture session.

Submission

Knit, commit, and push your final changes to your GitHub repo. Then, submit the PDF on Gradescope. See Lab 01 for more detailed submission instructions.

Grading

Component                                                Points
Total                                                    50
Part 1: Conceptual questions                             40
Part 2: Communicating results                            5
Document neatly formatted with clear question headers    3
At least 3 informative commit messages                   2