Lab 03: Alumni jobs

due Sun, Feb 14 at 11:59p EST

In today’s lab, you’ll use simple linear regression to analyze the relationship between the median early career salary and percent of alumni who perceive their job as making the world a better place for colleges and universities in the United States. You will also start working with your lab teams and be introduced to using GitHub for collaboration.

Learning goals

By the end of the lab you will…

Meet your team!

See the STA 210 teams document to see your team assignment. This will be your team for labs and the final project.

Before you get started on the lab assignment, your TA will walk you through

By the end of the lab session, you should come up with a new team name. You can’t use the same team name as another team, so be creative! Your TA will come around to record your team name by the end of lab.

Getting Started

A repository has already been created for you and your teammates. Everyone in your team has access to the same repo.

Workflow: Using git and GitHub as a team

Assign each person on your team a number 1 through 4. For teams of three, Member 1 can take on the role of Member 4.

The following exercises must be done in order. Only one person should type in the .Rmd file and push updates at a time. When it is not your turn to type, you should still share ideas and contribute to the team’s discussion.

Update YAML

Team Member 1: Change the author to your team name and include each team member’s name in the author field of the YAML in the following format. Team Name: Member 1, Member 2, Member 3, Member 4.

Packages

We will use the following packages in today’s lab.

library(tidyverse)
library(knitr)
library(broom)
# add additional packages as needed

Data: Recent college graduates

Today’s data are two data sets that were part of TidyTuesday College tuition, diversity, and pay. All of the information is originally from the Department of Education but was collected from various websites.

We will use the following two data sets in this lab:

alumni data set

The information in this data set was collected from the PayScale College Salary Report.

variable class description
rank double Potential salary rank within state
name character Name of school
state_name character state name
early_career_pay double Median salary for alumni with 0 - 5 years experience (in US dollars)
mid_career_pay double Median salary for alumni with 0 - 5 years experience (in US dollars)
make_world_better_percent double Percent of alumni who think they are making the world a better place
stem_percent double Percent of degrees awarded in science, technology, engineering, or math subjects

tuition data set

The information in this data set was collected from Tuition Tracker.

variable class description
name character School name
state character State name
state_code character State Abbreviation
type character Type: Public, private, for-profit
degree_length character 4 year or 2 year degree
room_and_board double Room and board in USD
in_state_tuition double Tuition for in-state residents in USD
in_state_total double Total cost for in-state residents in USD (sum of room & board + in state tuition)
out_of_state_tuition double Tuition for out-of-state residents in USD
out_of_state_total double Total cost for in-state residents in USD (sum of room & board + out of state tuition)

Load data

The data set for this lab will focus on higher education institutions where the median early career pay is $70,000 or less and that are included in both data sets.

alumni <- read_csv("data/alumni-salaries.csv")
tuition <- read_csv("data/tuition.csv")

Exercises

Instructions

Is there a relationship between someone’s early career salary and whether they perceive their job as making the world a better place? To answer this question, we will use linear regression to fit a model using the percent of alumni who perceive their job is making the world a better place to predict the median early career salary for colleges and universities in the United States.

Team Member 1: Type the team’s responses to exercises 1 - 4.

Some observations have missing values for make_world_better_percent. Before you get started, filter the alumni data frame so you only have observations that have data for both make_world_better_percent and early_career_pay.

  1. Plot the distribution of the response variable and calculate the appropriate summary statistics. Describe the distribution including the shape, center, spread, and any outliers.

  2. Plot the distribution of explanatory variable and calculate the appropriate summary statistics. Describe the distribution including the shape, center, spread, and any outliers.

  3. What do you expect the relationship to be between the percent of alumni who perceive their jobs of making the world a better place and the typical career salary? Discuss as a group and summarize the group’s thoughts.

  4. Create a visualization to display the relationship between early_career_pay and make_world_better_percent. Use the scatter plot to describe the relationship between the two variables. Is the relationship what your group expected? Why or why not?

✅ ⬆️ Team Member 1: Knit, commit and push your changes to GitHub with an appropriate commit message again. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards.

All other team members: Pull to get the updated documents from GitHub. Click on the .Rmd file, and you should see the responses to exercises 1- 4.

Team Member 2: It’s your turn! Type the team’s response to exercises 5 - 7.

  1. Fit a simple linear regression model for the relationship between early career pay and perceiving their job as making the world a better place. Display the code and output, include a confidence interval for the coefficient, and use the kable function to neatly display the model output with 3 digits.

  2. Interpret the slope in the context of the problem.

  3. Does it make sense to interpret the intercept? If so, interpret the intercept. Otherwise, briefly explain why not.

✅ ⬆️ Team Member 2: Knit, commit and push your changes to GitHub with an appropriate commit message again. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards.

All other team members: Pull to get the updated documents from GitHub. Click on the .Rmd file, and you should see the responses to exercises 5 - 7.

Team Member 3: It’s your turn! Type the team’s response to exercises 8 - 10.

  1. Is the linear relationship between early_career_pay and make_world_better_percent statistically significant?

    • State the null and alternative hypotheses used to answer this question in words and in mathematical notation.
    • What is the test statistic? Write the equation used to calculate the test statistic.
    • What distribution was used to calculate the p-value?
    • State your conclusion for the test in the context of the data.
  2. Interpret the 95% confidence interval for the slope in the context of the data.

See Section 13.4 Mutating joins in R for Data Science for more details about joins.

  1. For the remainder of the lab, we will consider differences between private and public institutions. The institution type is in the tuition_cost data frame. Join tuition_cost and alumni to create a single data frame. Hint: You need to consider the name and state, since there are some states with colleges of the same name.

✅ ⬆️ Team Member 3: Knit, commit and push your changes to GitHub with an appropriate commit message again. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards.

All other team members: Pull to get the updated documents from GitHub. Click on the .Rmd file, and you should see the responses to exercises 8 - 10.

Team Member 4: It’s your turn! Type the team’s response to exercises 11 - 13.

  1. Fit a model for the relationship between early_career_pay and make_world_better_percent for Private institutions. Use the kable function to neatly display the model output with 3 digits and confidence intervals for the slope and intercept.

    • Interpret the interval for the slope in the context of the data.
  2. Fit a model for the relationship between early_career_pay and make_world_better_percent for Public institutions. Use the kable function to neatly display the model output with 3 digits and confidence intervals for the slope and intercept.

    • Interpret the interval for the slope in the context of the data.
  3. Based on the previous exercises, is there evidence that does the relationship between early_career_pay and make_world_better_percent differs significantly by institution type? Briefly explain your reasoning.

✅ ⬆️ Team Member 4: Knit, commit and push your changes to GitHub with an appropriate commit message again. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards.

All other team members: Pull to get the updated documents from GitHub. Click on the .Rmd file, and you should see the team’s completed lab!

Wrapping up

Go back through your write up to make sure you followed the coding style guidelines we discussed in class (e.g. no long lines of code)

Team Member 2: Make any edits as needed. Then knit, commit, and push the updated documents to GitHub if you made any changes.

All other team members can click to pull the finalized document.

Submission

Team Member 3: Upload the team’s PDF to Gradescope. Be sure to include every team member’s name in the Gradescope submission Associate the “Overall” graded section with the first page of your PDF, and mark where each answer is to the exercises. If any answer spans multiple pages, then mark all pages.

There should only be one submission per team on Gradescope.