You can find the hw-02 repo in the sta210-sp21 organization on GitHub. This repo contains the starter documents and data set needed to complete the lab.
See the Lab 01 instructions for more details about cloning the repo, starting a new RStudio project, and configuring git.
Here are some tips as you complete HW 02:
We will use the following packages in this assignment:
```r
library(tidyverse)
library(broom)
library(knitr)
# add other packages as needed
```
The data set for this assignment is a subset from the Spotify Songs Tidy Tuesday data set. The data were originally obtained from Spotify using the spotifyr R package.
The data set contains numerous characteristics for each song. You can see the full list of variables and definitions here. This analysis will focus specifically on the following variables:
variable | class | description |
---|---|---|
track_id | character | Song unique ID |
track_name | character | Song Name |
track_artist | character | Song Artist |
track_popularity | double | Song Popularity (0-100) where higher is better |
energy | double | Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy. |
valence | double | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). |
You can find more information about these music characteristics in the Spotify documentation page.
The data set is located in `spotify-popular.csv` in the `data` folder.
This section of homework contains short answer questions focused on the concepts discussed in class. Some of these questions may also include short computing tasks.
All narrative should be written in complete sentences, and all visualizations should have informative titles and axis labels.
Are high-energy songs more positive? To answer this question, we’ll analyze data on some of the most popular songs on Spotify, i.e., those with `track_popularity >= 80`. We’ll use linear regression to fit a model to predict a song’s positiveness (`valence`) based on its energy level (`energy`).
Create a plot to visualize the relationship between a song’s positiveness and energy level. Interpret the plot in the context of the data.
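As a possible starting point, a scatterplot with the predictor on the x-axis and the response on the y-axis could be sketched as follows. This is only an illustration: `songs_sim` is a small simulated data frame standing in for the real data, and only the column names `energy` and `valence` come from the assignment.

```r
library(ggplot2)

# Simulated stand-in for the Spotify data (hypothetical values, not real songs)
set.seed(210)
songs_sim <- data.frame(energy = runif(100, min = 0.2, max = 1))
songs_sim$valence <- 0.2 + 0.5 * songs_sim$energy + rnorm(100, sd = 0.15)

# Scatterplot of the response (valence) against the predictor (energy)
valence_plot <- ggplot(songs_sim, aes(x = energy, y = valence)) +
  geom_point(alpha = 0.6) +
  labs(
    title = "Valence versus energy",
    subtitle = "Simulated data for illustration",
    x = "Energy",
    y = "Valence"
  )

valence_plot
```

With the real data, the interpretation would typically comment on the direction, form, and strength of the relationship and on any unusual observations.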
Write the equation of the statistical model using mathematical notation.
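For reference, one common way to write a simple linear regression model in this setting is shown below. The symbols $\beta_0$, $\beta_1$, $\epsilon_i$, and $\sigma^2_\epsilon$ are notational assumptions, so use whatever notation was introduced in class.

$$
\text{valence}_i = \beta_0 + \beta_1 \, \text{energy}_i + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma^2_\epsilon)
$$

Here $i$ indexes songs, $\beta_0$ is the intercept, $\beta_1$ is the slope for energy, and $\epsilon_i$ is the error term.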
Fit the regression model. Display the output using 3 digits and include a short and informative caption.
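One way this step might look uses `lm()` to fit the model, `tidy()` from broom to collect the coefficient table, and `kable()` from knitr to print it with 3 digits and a caption. This is a sketch only: `songs_sim` is simulated stand-in data, and the caption text is a placeholder.

```r
library(broom)
library(knitr)

# Simulated stand-in for the Spotify data (hypothetical values, not real songs)
set.seed(210)
songs_sim <- data.frame(energy = runif(100, min = 0.2, max = 1))
songs_sim$valence <- 0.2 + 0.5 * songs_sim$energy + rnorm(100, sd = 0.15)

# Fit the simple linear regression of valence on energy
valence_fit <- lm(valence ~ energy, data = songs_sim)

# Coefficient table rounded to 3 digits, with a caption
kable(
  tidy(valence_fit),
  digits = 3,
  caption = "Linear regression of valence on energy (simulated data)"
)
```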
What percent of the variation in valence is explained by a song’s energy? Calculate this value using the ANOVA table. Show how you calculated this answer.
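The calculation from the ANOVA table can be sketched as follows: the proportion of variation explained is the model sum of squares divided by the total sum of squares. The example uses simulated stand-in data (`songs_sim`), so the number it produces says nothing about the real songs.

```r
# Simulated stand-in for the Spotify data (hypothetical values, not real songs)
set.seed(210)
songs_sim <- data.frame(energy = runif(100, min = 0.2, max = 1))
songs_sim$valence <- 0.2 + 0.5 * songs_sim$energy + rnorm(100, sd = 0.15)

valence_fit <- lm(valence ~ energy, data = songs_sim)

# ANOVA table: one row for the predictor, one for the residuals
anova_tab <- anova(valence_fit)

ss_model <- anova_tab["energy", "Sum Sq"]     # variation explained by energy
ss_resid <- anova_tab["Residuals", "Sum Sq"]  # unexplained variation
ss_total <- ss_model + ss_resid               # total variation in valence

# Proportion (and percent) of variation in valence explained by energy
r_squared <- ss_model / ss_total
round(100 * r_squared, 1)
```

The result should agree with the R-squared reported by `summary(valence_fit)$r.squared`, which is a useful check on the arithmetic.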
Next, let’s check the model conditions. First, let’s start with the linearity condition. Is this condition satisfied? Briefly explain your reasoning showing any output and/or plots used to check this condition.
Is the constant variance condition satisfied? Briefly explain your reasoning showing any output and/or plots used to check this condition.
Is the Normality condition satisfied? Briefly explain your reasoning showing any output and/or plots used to check this condition.
Is the independence condition satisfied? Briefly explain your reasoning showing any output and/or plots used to check this condition.
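The plots commonly used for the first three conditions can be sketched as below: a residuals versus fitted values plot for linearity and constant variance, and a normal quantile-quantile plot of the residuals for Normality. The sketch assumes `augment()` from broom and uses simulated stand-in data (`songs_sim`). Independence usually depends on how the data were collected, so it may require reasoning about the data rather than a plot.

```r
library(broom)
library(ggplot2)

# Simulated stand-in for the Spotify data (hypothetical values, not real songs)
set.seed(210)
songs_sim <- data.frame(energy = runif(100, min = 0.2, max = 1))
songs_sim$valence <- 0.2 + 0.5 * songs_sim$energy + rnorm(100, sd = 0.15)

valence_fit <- lm(valence ~ energy, data = songs_sim)

# augment() adds fitted values (.fitted) and residuals (.resid) to the data
fit_aug <- augment(valence_fit)

# Residuals vs. fitted values: look for curvature or a changing spread
resid_plot <- ggplot(fit_aug, aes(x = .fitted, y = .resid)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Residuals vs. fitted values", x = "Fitted value", y = "Residual")

# Normal QQ plot of residuals: points near the line suggest approximate Normality
qq_plot <- ggplot(fit_aug, aes(sample = .resid)) +
  stat_qq() +
  stat_qq_line() +
  labs(title = "Normal QQ plot of residuals",
       x = "Theoretical quantile", y = "Observed residual")

resid_plot
qq_plot
```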
Let’s finish by using the model to make predictions:
Obtain the predicted value and appropriate 90% interval for the mean valence of the subset of songs with energy of 0.8. Interpret the interval in the context of the data.
Obtain the predicted value and appropriate 90% interval for the valence of the song “Put Your Records On” by Corinne Bailey Rae, which has an energy of 0.55. Interpret the interval in the context of the data.
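The two kinds of intervals can be sketched with `predict()`: `interval = "confidence"` gives an interval for a mean response, and `interval = "prediction"` gives an interval for an individual observation. The example below uses simulated stand-in data (`songs_sim`), so its numbers do not describe the real songs; only the energy values 0.8 and 0.55 and the 90% level come from the assignment.

```r
# Simulated stand-in for the Spotify data (hypothetical values, not real songs)
set.seed(210)
songs_sim <- data.frame(energy = runif(100, min = 0.2, max = 1))
songs_sim$valence <- 0.2 + 0.5 * songs_sim$energy + rnorm(100, sd = 0.15)

valence_fit <- lm(valence ~ energy, data = songs_sim)

# 90% confidence interval for the mean valence of songs with energy 0.8
mean_ci <- predict(valence_fit, newdata = data.frame(energy = 0.8),
                   interval = "confidence", level = 0.90)

# 90% prediction interval for the valence of one song with energy 0.55
song_pi <- predict(valence_fit, newdata = data.frame(energy = 0.55),
                   interval = "prediction", level = 0.90)

mean_ci  # columns: fit (predicted value), lwr, upr
song_pi

# For comparison at the same energy value: the prediction interval is wider,
# because it also accounts for song-to-song variability around the mean
song_ci <- predict(valence_fit, newdata = data.frame(energy = 0.55),
                   interval = "confidence", level = 0.90)
```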
Part of a statistician’s / data scientist’s job is being able to communicate technical results to a general (often non-technical) audience. This section of the homework will focus specifically on communication and writing some of the results for a general audience.
Suppose a radio station wants to use your model to help them determine which songs to play based on a song’s predicted positiveness. What are some potential limitations the radio station should consider before using your model? Limitations may be related to the data itself, the statistical methods, model conditions, etc.
Write a short paragraph for the radio station manager (about 3 - 5 sentences) describing some potential limitations. You can assume the manager has taken an introductory statistics class and understands basic statistical terms. You may include any plots or tables as needed.
Do you have any questions about anything covered in the class thus far?
If so, please click here to submit your question(s). Frequently asked questions will be discussed in an upcoming lecture session.
Knit, commit, and push your final changes to your GitHub repo. Then, submit the PDF on Gradescope. See Lab 01 for more detailed submission instructions.
Component | Points |
---|---|
Part 1: Conceptual questions | 40 |
Part 2: Communicating results | 5 |
Document neatly formatted with clear question headers | 3 |
At least 3 informative commit messages | 2 |
Total | 50 |