In today’s lab, you’ll use simple linear regression to analyze the relationship between the median early career salary and percent of alumni who perceive their job as making the world a better place for colleges and universities in the United States. You will also start working with your lab teams and be introduced to using GitHub for collaboration.
By the end of the lab you will…
See the STA 210 teams document to find your team assignment. This will be your team for labs and the final project.
Before you get started on the lab assignment, your TA will walk you through
By the end of the lab session, you should come up with a new team name. You can’t use the same team name as another team, so be creative! Your TA will come around to record your team name by the end of lab.
A repository has already been created for you and your teammates. Everyone in your team has access to the same repo.
Go to the STA 210 course organization on GitHub.
In addition to your private individual repositories, you should now see a repo named lab-03. Go to that repository.
Each person on the team should clone the repository and open a new project in RStudio. Do not make any changes to the .Rmd file until the instructions tell you to do so.
Assign each person on your team a number 1 through 4. For teams of three, Member 1 can take on the role of Member 4.
The following exercises must be done in order. Only one person should type in the .Rmd file and push updates at a time. When it is not your turn to type, you should still share ideas and contribute to the team’s discussion.
Team Member 1: Change the author to your team name and include each team member’s name in the author field of the YAML in the following format: Team Name: Member 1, Member 2, Member 3, Member 4.
We will use the following packages in today’s lab.
library(tidyverse)
library(knitr)
library(broom)
# add additional packages as needed
Today’s data come from two data sets that were part of the TidyTuesday release “College tuition, diversity, and pay.” All of the information is originally from the Department of Education but was collected from various websites.
We will use the following two data sets in this lab:
alumni data set
The information in this data set was collected from the PayScale College Salary Report.
variable | class | description |
---|---|---|
rank | double | Potential salary rank within state |
name | character | Name of school |
state_name | character | state name |
early_career_pay | double | Median salary for alumni with 0 - 5 years experience (in US dollars) |
mid_career_pay | double | Median salary for alumni with 10+ years experience (in US dollars) |
make_world_better_percent | double | Percent of alumni who think they are making the world a better place |
stem_percent | double | Percent of degrees awarded in science, technology, engineering, or math subjects |
tuition data set
The information in this data set was collected from Tuition Tracker.
variable | class | description |
---|---|---|
name | character | School name |
state | character | State name |
state_code | character | State Abbreviation |
type | character | Type: Public, private, for-profit |
degree_length | character | 4 year or 2 year degree |
room_and_board | double | Room and board in USD |
in_state_tuition | double | Tuition for in-state residents in USD |
in_state_total | double | Total cost for in-state residents in USD (sum of room & board + in state tuition) |
out_of_state_tuition | double | Tuition for out-of-state residents in USD |
out_of_state_total | double | Total cost for out-of-state residents in USD (sum of room & board + out of state tuition) |
The data set for this lab will focus on higher education institutions where the median early career pay is $70,000 or less and that are included in both data sets.
alumni <- read_csv("data/alumni-salaries.csv")
tuition <- read_csv("data/tuition.csv")
Is there a relationship between someone’s early career salary and whether they perceive their job as making the world a better place? To answer this question, we will use linear regression to fit a model using the percent of alumni who perceive their job as making the world a better place to predict the median early career salary for colleges and universities in the United States.
Team Member 1: Type the team’s responses to exercises 1 - 4.
Some observations have missing values for make_world_better_percent. Before you get started, filter the alumni data frame so you only have observations that have data for both make_world_better_percent and early_career_pay.
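As a rough illustration of the filtering step, the sketch below drops rows with a missing value in either variable. The data frame `toy_alumni` and its values are invented for this example; only the two column names come from the alumni data set.

```r
library(dplyr)

# Hypothetical toy data: column names mirror the alumni data set,
# but the schools and values are made up for illustration.
toy_alumni <- tibble(
  name = c("School A", "School B", "School C", "School D"),
  early_career_pay = c(48000, 52000, NA, 61000),
  make_world_better_percent = c(55, NA, 47, 60)
)

# Keep only the rows where both variables are observed.
toy_complete <- toy_alumni %>%
  filter(
    !is.na(make_world_better_percent),
    !is.na(early_career_pay)
  )

# Number of rows remaining after filtering
nrow(toy_complete)
```

Counting the rows before and after filtering is a quick way to check how many observations were removed.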
Plot the distribution of the response variable and calculate the appropriate summary statistics. Describe the distribution including the shape, center, spread, and any outliers.
Plot the distribution of the explanatory variable and calculate the appropriate summary statistics. Describe the distribution including the shape, center, spread, and any outliers.
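One possible pattern for looking at a single quantitative variable is a histogram paired with a table of summary statistics. The sketch below uses a made-up data frame `toy` with a hypothetical column `pay`; the bin width and the choice of statistics are assumptions you should adapt to the real data.

```r
library(dplyr)
library(ggplot2)

# Hypothetical values, invented for illustration only.
toy <- tibble(
  pay = c(42000, 45000, 47000, 50000, 51000, 53000, 58000, 69000)
)

# Histogram of the variable; binwidth is an arbitrary choice here.
pay_hist <- ggplot(toy, aes(x = pay)) +
  geom_histogram(binwidth = 5000) +
  labs(x = "Median early career pay (USD)", y = "Count")

# Measures of center and spread in one summary table.
pay_summary <- toy %>%
  summarise(
    mean = mean(pay),
    median = median(pay),
    sd = sd(pay),
    iqr = IQR(pay),
    min = min(pay),
    max = max(pay)
  )

pay_hist
pay_summary
```

Which statistics are "appropriate" depends on the shape: the median and IQR are usually the better summary for a skewed distribution, and the mean and standard deviation for a roughly symmetric one.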
What do you expect the relationship to be between the percent of alumni who perceive their jobs as making the world a better place and the typical early career salary? Discuss as a group and summarize the group’s thoughts.
Create a visualization to display the relationship between early_career_pay and make_world_better_percent. Use the scatter plot to describe the relationship between the two variables. Is the relationship what your group expected? Why or why not?
✅ ⬆️ Team Member 1: Knit, commit and push your changes to GitHub with an appropriate commit message again. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards.
All other team members: Pull to get the updated documents from GitHub. Click on the .Rmd file, and you should see the responses to exercises 1 - 4.
Team Member 2: It’s your turn! Type the team’s response to exercises 5 - 7.
Fit a simple linear regression model for the relationship between early career pay and perceiving their job as making the world a better place. Display the code and output, include a confidence interval for the coefficient, and use the kable function to neatly display the model output with 3 digits.
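The general workflow of fitting a model, tidying it with confidence intervals, and printing it with `kable` might look like the sketch below. The data frame `toy` and its columns `x` and `y` are hypothetical and the values are invented; the 95% confidence level is an assumption.

```r
library(dplyr)
library(broom)
library(knitr)

# Hypothetical toy data, invented for illustration only.
toy <- tibble(
  x = c(40, 45, 50, 55, 60, 65),
  y = c(56000, 54500, 53000, 50500, 50000, 47000)
)

# Fit a simple linear regression of y on x.
toy_model <- lm(y ~ x, data = toy)

# tidy() returns one row per coefficient; conf.int = TRUE adds the
# conf.low and conf.high columns at the requested confidence level.
model_out <- tidy(toy_model, conf.int = TRUE, conf.level = 0.95)

# Display the coefficient table rounded to 3 digits.
kable(model_out, digits = 3)
```

The `p.value` column of the tidied output is the one to consult when judging whether a coefficient is statistically significant.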
Interpret the slope in the context of the problem.
Does it make sense to interpret the intercept? If so, interpret the intercept. Otherwise, briefly explain why not.
✅ ⬆️ Team Member 2: Knit, commit and push your changes to GitHub with an appropriate commit message again. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards.
All other team members: Pull to get the updated documents from GitHub. Click on the .Rmd file, and you should see the responses to exercises 5 - 7.
Team Member 3: It’s your turn! Type the team’s response to exercises 8 - 10.
Is the linear relationship between early_career_pay and make_world_better_percent statistically significant?
Interpret the 95% confidence interval for the slope in the context of the data.
See Section 13.4 Mutating joins in R for Data Science for more details about joins.
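As a minimal sketch of a join on two keys, the example below matches rows on both school name and state when the state column has a different name in each table. The data frames `toy_pay` and `toy_cost`, the school names, and all values are invented; the choice of `inner_join` is also an assumption, since it keeps only schools found in both tables.

```r
library(dplyr)

# Hypothetical toy tables: two different schools share a name.
toy_pay <- tibble(
  name = c("Example College", "Example College", "Hill College"),
  state_name = c("New York", "Kentucky", "Texas"),
  early_career_pay = c(60000, 40000, 45000)
)

toy_cost <- tibble(
  name = c("Example College", "Example College", "Lake College"),
  state = c("New York", "Kentucky", "Ohio"),
  type = c("Private", "Private", "Public")
)

# Join on name AND state; "state_name" = "state" maps the differently
# named key columns onto each other.
joined <- inner_join(
  toy_pay, toy_cost,
  by = c("name", "state_name" = "state")
)

# Joining on name alone pairs every "Example College" row with every
# other one, producing spurious matches (recent dplyr versions warn
# about this many-to-many relationship).
name_only <- inner_join(toy_pay, toy_cost, by = "name")

nrow(joined)
nrow(name_only)
```

Comparing the two row counts shows why the state is needed as a second key: the name-only join returns more rows than there are matching schools.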
The institution type is in the tuition_cost data frame. Join tuition_cost and alumni to create a single data frame. Hint: You need to consider the name and state, since some colleges in different states share the same name.
✅ ⬆️ Team Member 3: Knit, commit and push your changes to GitHub with an appropriate commit message again. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards.
All other team members: Pull to get the updated documents from GitHub. Click on the .Rmd file, and you should see the responses to exercises 8 - 10.
Team Member 4: It’s your turn! Type the team’s response to exercises 11 - 13.
Fit a model for the relationship between early_career_pay and make_world_better_percent for Private institutions. Use the kable function to neatly display the model output with 3 digits and confidence intervals for the slope and intercept.
Fit a model for the relationship between early_career_pay and make_world_better_percent for Public institutions. Use the kable function to neatly display the model output with 3 digits and confidence intervals for the slope and intercept.
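Fitting the same model on a subset of the data can be sketched as follows: filter to one group, then pass the filtered data to `lm`. The data frame `toy` and its columns `x` and `y` are hypothetical and the values are invented; only the `type` labels echo the tuition data set.

```r
library(dplyr)
library(broom)
library(knitr)

# Hypothetical toy data with a grouping column, invented for illustration.
toy <- tibble(
  type = rep(c("Private", "Public"), each = 4),
  x = rep(c(40, 50, 60, 70), times = 2),
  y = c(55000, 52000, 50000, 46000, 50000, 49500, 48000, 47500)
)

# Fit the model separately within each group.
private_model <- lm(y ~ x, data = filter(toy, type == "Private"))
public_model <- lm(y ~ x, data = filter(toy, type == "Public"))

# Tidy each fit with confidence intervals for slope and intercept.
private_out <- tidy(private_model, conf.int = TRUE)
public_out <- tidy(public_model, conf.int = TRUE)

kable(private_out, digits = 3)
kable(public_out, digits = 3)
```

Placing the two tables side by side makes it easier to compare the slope estimates and their confidence intervals across groups.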
Based on the previous exercises, is there evidence that the relationship between early_career_pay and make_world_better_percent differs significantly by institution type? Briefly explain your reasoning.
✅ ⬆️ Team Member 4: Knit, commit and push your changes to GitHub with an appropriate commit message again. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards.
All other team members: Pull to get the updated documents from GitHub. Click on the .Rmd file, and you should see the team’s completed lab!
Go back through your write-up to make sure you followed the coding style guidelines we discussed in class (e.g., no long lines of code).
Team Member 2: Make any edits as needed. Then knit, commit, and push the updated documents to GitHub if you made any changes.
All other team members can click to pull the finalized document.
Team Member 3: Upload the team’s PDF to Gradescope. Be sure to include every team member’s name in the Gradescope submission. Associate the “Overall” graded section with the first page of your PDF, and mark where the answer to each exercise is. If any answer spans multiple pages, then mark all pages.
There should only be one submission per team on Gradescope.