Lab 07: What makes a song more positive?

due Sun, Mar 28, at 11:59p EDT

In this lab, we will use backward and forward selection to choose a model that can be used to understand the variability in the positiveness of a song based on its other characteristics. We will also check the diagnostics for the selected model.

Learning goals

By the end of the lab you will be able to…

    • use AIC and BIC to perform backward and forward model selection
    • check model diagnostics, including standardized residuals, leverage, Cook’s distance, and multicollinearity

Getting started

Clone assignment repo + start new project

A repository has already been created for you and your teammates. Everyone in your team has access to the same repo.

Workflow: Using git and GitHub as a team

Now that you have had some experience collaborating as a team on GitHub, it will be up to your team to decide who types the responses to each question. Every team member should contribute to the discussion about each lab exercise even if they are not the one typing the team’s responses.

Every team member should have at least one meaningful commit to the repo on GitHub.

Password caching

If you would like your git password cached for a week for this project, type the following in the Terminal in RStudio:

git config --global credential.helper 'cache --timeout 604800'

Packages + data

Packages

We will use the following packages in today’s lab. Feel free to add any other packages as needed.

library(tidyverse)
library(broom)
library(knitr)
library(rms)

Data

In this lab, we will use the spotify data first seen in HW 02. This data set is a subset of the Spotify Songs Tidy Tuesday data set. The data were originally obtained from Spotify using the spotifyr R package.

Click here for a full list of variables and their definitions.

spotify <- read_csv("data/spotify-popular.csv") %>%
  mutate(key = factor(key), 
         mode = factor(mode))

The songs in this data set are the most popular songs on Spotify (track_popularity >= 80).

Exercises

Model selection

  1. The goal of this lab is to use forward and backward selection to choose a model to understand variability in the valence, i.e. positiveness, of popular Spotify songs based on other characteristics. To get started, we’ll make a full model (includes all potential main effects) and an intercept only model. These will be the starting point for backward and forward selection, respectively.

    • Fit a full model that includes the following predictors: danceability, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, tempo, duration_ms, playlist_genre.
    • Fit the intercept only model. See the lecture notes for example code to fit an intercept only model.

    Save both models to use in later exercises.
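
    As a starting point, here is a minimal sketch of both models. It assumes the response variable is valence and uses full_model and int_only_model as object names (rename them as you like):

# full model: all potential main effects
full_model <- lm(valence ~ danceability + energy + key + loudness + mode +
                   speechiness + acousticness + instrumentalness + liveness +
                   tempo + duration_ms + playlist_genre,
                 data = spotify)

# intercept only model: no predictors
int_only_model <- lm(valence ~ 1, data = spotify)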

  2. Use the step function to perform backward model selection using AIC. Save the selected model as backward_aic. Put results = "hide" in the code chunk for the step function to suppress the step-by-step model selection output. Then, in a separate code chunk, display the selected model using 3 digits.
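
    A minimal sketch, assuming the full model from Exercise 1 is named full_model:

# backward selection; AIC is step()'s default criterion (k = 2)
backward_aic <- step(full_model, direction = "backward")

# in a separate code chunk: display the selected model using 3 digits
tidy(backward_aic) %>%
  kable(digits = 3)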

  3. Now let’s use BIC as the selection criterion. To do so, we will use the step function, including the argument k = log(n) where \(n\) is the number of observations in the data set.

    Perform backward selection using BIC. Save the selected model as backward_bic. Use the same code chunk argument as in the previous exercise to suppress the step-by-step output. Display the selected model using 3 digits.
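
    A minimal sketch, again assuming the full model is named full_model:

# setting k = log(n) makes step() use BIC instead of the default AIC
n <- nrow(spotify)
backward_bic <- step(full_model, direction = "backward", k = log(n))

tidy(backward_bic) %>%
  kable(digits = 3)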

  4. Compare the models from exercises 2 and 3.

    • Do the models have the same predictors? Do the models have the same number of predictors? Briefly explain.
    • If not, which model has more predictors? Is this the model you’d expect to have more predictors? Briefly explain.
  5. Now let’s perform forward model selection using AIC. As before, we can use the step function to do so. For forward selection, however, we need to specify both the starting model (the intercept only model) and the largest model that could result if every potential predictor were selected (the full model).

    Use the code below to perform forward selection using AIC. Suppress the step-by-step model selection output, then display the selected model.

# replace int_only_model and full_model with the names you chose in Exercise 1
forward_aic <- step(int_only_model, formula(full_model), direction = "forward")
  6. Perform forward model selection using BIC. Save the model as forward_bic. As in Exercise 3, include the argument k = log(n), where \(n\) is the number of observations in the data set, to use BIC as the selection criterion. Suppress the step-by-step model selection output, then display the selected model.
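
    A minimal sketch, assuming the object names from the earlier exercises:

# forward selection with BIC: the call from Exercise 5 plus k = log(n)
n <- nrow(spotify)
forward_bic <- step(int_only_model, formula(full_model),
                    direction = "forward", k = log(n))

tidy(forward_bic) %>%
  kable(digits = 3)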

  7. The forward model selection in Exercises 5 and 6 resulted in different models.

    • Calculate Adjusted \(R^2\) for the models in Exercises 5 and 6. See the lecture notes for code to calculate Adjusted \(R^2\), or the sketch after this list.
    • Which model do you choose based on Adjusted \(R^2\)? Briefly explain.
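
    One way to compute Adjusted \(R^2\) is with glance() from the broom package. A minimal sketch, assuming the model names forward_aic and forward_bic from the earlier exercises:

# Adjusted R-squared for the two forward-selected models
glance(forward_aic)$adj.r.squared
glance(forward_bic)$adj.r.squared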

Model diagnostics

Use the model chosen in Exercise 7 for Exercises 8-10.

  8. Make a plot of the standardized residuals versus the predicted values. On the plot, include dotted lines at \(\pm 3\) to help identify observations with residuals that have large magnitude.

    Use the plot to comment on the linearity condition.
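
    A minimal sketch, assuming (for illustration only) that the model chosen in Exercise 7 is forward_aic:

# augment() from broom adds predicted values (.fitted) and
# standardized residuals (.std.resid) to the data
forward_aug <- augment(forward_aic)

ggplot(forward_aug, aes(x = .fitted, y = .std.resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_hline(yintercept = c(-3, 3), linetype = "dotted") +
  labs(x = "Predicted values", y = "Standardized residuals")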

  9. Now let’s check for high leverage and influential points. A sketch covering both checks follows this list.

    • What is the threshold for identifying points as having high leverage? Show the calculations used to obtain this value.
    • Do any points in the data set have high leverage? Show the code used to identify these points. Recall that leverage is stored as .hat in the data set produced by the augment function.
    • Are any points in the data set influential? Show the code used to identify these points based on Cook’s distance. Recall that Cook’s distance is stored as .cooksd in the data set produced by the augment function.
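
    A minimal sketch, reusing forward_aug from Exercise 8. The leverage threshold \(2(p+1)/n\) and the Cook’s distance cutoff of 1 are common rules of thumb; use the thresholds from the lecture notes if they differ:

# leverage threshold: 2(p + 1) / n, where p is the number of model terms
p <- length(coef(forward_aic)) - 1
leverage_threshold <- 2 * (p + 1) / nrow(spotify)

# observations with high leverage
forward_aug %>%
  filter(.hat > leverage_threshold)

# observations flagged as influential by Cook's distance
forward_aug %>%
  filter(.cooksd > 1)
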
  10. Let’s conclude by checking for potential multicollinearity. Fill in the name of the selected model in the vif function (from the rms R package) to get the variance inflation factor (VIF) for the predictors. Display VIF using 3 digits. A sketch follows this exercise.

    • Based on VIF, are there any potential concerns with multicollinearity? Briefly explain.
    • If so, which variables?
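
    A minimal sketch, again assuming the selected model is forward_aic:

# variance inflation factors from the rms package, displayed using 3 digits
vif(forward_aic) %>%
  kable(digits = 3)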

Submission

One team member: Upload the team’s PDF to Gradescope. Be sure to include every team member’s name in the Gradescope submission. Associate the “Overall” graded section with the first page of your PDF, and mark where each answer to the exercises is located. If any answer spans multiple pages, mark all pages.

There should only be one submission per team on Gradescope.

Note that your submission on Gradescope should be identical to the PDF that is rendered when we knit the Rmd file in your repo.