Getting Started

You can find the hw-05 repo in the sta210-sp21 organization on GitHub. This repo contains all materials needed to start the assignment.

Tips

Here are some tips as you complete the assignment:

  • Periodically knit your document and commit changes (the more often the better, for example, once per each new task). You should have at least 3 commit messages by the end of the assignment.
  • Push all your changes back to your GitHub repo. The Git pane in RStudio should be empty after you push. You can also check your assignment repo on github.com to make sure it has updated.
  • Once you have completed your work, you will submit your assignment in Gradescope. You are welcome to resubmit multiple times until the assignment deadline. We will grade the most recent version of the .pdf file in Gradescope.

Packages

We will use the following packages in this assignment:

library(tidyverse)
library(broom)
library(knitr)
library(yardstick)
#add other packages as needed

Data

The data for this assignment comes from an online Ipsos survey that was conducted for the FiveThirtyEight article “Why Many Americans Don’t Vote”. You can read more about the survey design and respondents in the README of the GitHub repo for the data.

Respondents were asked a variety of questions about their political beliefs, thoughts on multiple issues, and voting behavior. We will focus on using the demographic variables and someone’s party identification to understand whether a person is a probable voter.

The variables we’ll focus on are (definitions from the codebook in data set GitHub repo):

  • ppage: Age of respondent

  • educ: Highest educational attainment category.

  • race: Race of respondent, census categories. Note: all categories except Hispanic are non-Hispanic.

  • gender: Gender of respondent

  • income_cat: Household income category of respondent

  • Q30: Response to the question “Generally speaking, do you think of yourself as a…”

    • 1: Republican
    • 2: Democrat
    • 3: Independent
    • 4: Another party, please specify
    • 5: No preference
    • -1: No response
  • voter_category: past voting behavior:

    • always: respondent voted in all or all-but-one of the elections they were eligible in
    • sporadic: respondent voted in at least two, but fewer than all-but-one of the elections they were eligible in
    • rarely/never: respondent voted in 0 or 1 of the elections they were eligible in

You can read in the data directly from the GitHub repo:

voter_data <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/non-voters/nonvoters_data.csv")

Note that the authors use weighting to make the final sample more representative on the US population for their article. We will not use the weighting in this assignment, so we should treat the sample as a convenience sample rather than a random sample of the population.

Conceptual questions

  1. Why do you think the authors chose to only include data from people who were eligible to vote for at least four election cycles?

  2. Let’s prepare the data for analysis and modeling.

    • The variable Q30 contains the respondent’s political party identification. Make a new variable that simplifies Q30 into four categories: “Democrat”, “Republican”, “Independent”, “Other” (“Other” also includes respondents who did not answer the question).
    • We want to use logistic regression to understand the factors related to a person being a possible voter. The variable voter_category identifies the respondent’s past voter behavior. We’ll consider a person a “possible” voter if they describe their past voting behavior as “always” or “sporadic”. Create a new variable that takes value 1 if the voter_category is “always” or “sporadic” and 0 otherwise.
  3. Fit a model using mean-centered age, race, gender, income, and education to predict the odds that a person is a possible voter. Display the model.

  4. Should party identification be added to the model? Use the appropriate test to determine if party identification should be added to the model. Include the hypotheses in mathematical notation, the output from the test, and the conclusion in the context of the data.

Use the model you select for the remainder of the assignment.

  1. Let’s interpret the coefficients from the data in terms of the odds that a person is a possible voter.

    • Interpret the intercept in the context of the data. Use actual values in the interpretation.
    • Interpret the effect of age in the context of the data.
    • Interpret the effect of party ID in the context of the data. Include discussion about which level(s) differ from the baseline.
  2. In the article, the authors write

    “Nonvoters were more likely to have lower incomes; to be young; to have lower levels of education; and to say they don’t belong to either political party, which are all traits that square with what we know about people less likely to engage with the political system.”

Is your model consistent with this statement? Briefly explain why or why not.

  1. Let’s assess how well the model fits the data. Make a ROC curve and calculate AUC. Based on these, is the model a good fit for the data? Briefly explain why or why not.

  2. Lastly, let’s use the model to identify possible voters! Let’s suppose you want to use the model to identify possible voters to help a Get Out the Vote (GOTV) campaign target its efforts in the final weeks before an election. You decide to use a threshold of 0.58 to classify observations into possible voters vs. non-voters.

    • Show the confusion matrix at this threshold.
    • Calculate the true and false positive rates and explain what each value means in the context of the data.

Submission

Knit, commit, and push your final changes to your GitHub repo. Then, submit the PDF on Gradescope. See Lab 01 for more detailed submission instructions.

Grading

Total 50
Part 1: Conceptual questions 45
Document a PDF rendered from the Rmd file with clear question headers 3
At least 3 informative commit messages 2