The primary objectives for this lab are to continue practicing with the computing workflow and exploring relationships between multiple variables. You will also use statistical inference to draw conclusions as you analyze data about trails in Durham, North Carolina.
By the end of the lab you will…
Each of your assignments will begin with the steps outlined in this section. For more detailed instructions about getting started, see the Lab 01 instructions.
Go to the sta210-sp21 organization on GitHub. Click on the repo with the prefix lab-02-. It contains the starter documents you need to complete the lab.
Click on the green Code button, select Use HTTPS. Click on the clipboard icon to copy the repo URL.
Go to https://vm-manage.oit.duke.edu/containers and log in with your Duke NetID and password.
Click to log into the Docker container STA 210 - Regression Analysis. You should now see the RStudio environment.
Go to File ➡️ New Project ➡️ Version Control ➡️ Git.
Copy and paste the URL of your assignment repo into the dialog box Repository URL.
Click Create Project, and the files from your GitHub repo will be displayed in the Files pane in RStudio.
One more thing before you get started. We need to configure your git so that RStudio can communicate with GitHub. This requires two pieces of information: your GitHub username and the email address associated with your GitHub account.
Type the following lines of code in the console in RStudio filling in your name and email address.
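The code for this step is not reproduced above. One possible way to set the two values from the RStudio console is sketched below. It assumes the usethis package is installed in the container, and both strings are placeholders to replace with your own details.

```r
# Assumption: the usethis package is available in the course container.
library(usethis)

# Replace the placeholder strings with your GitHub username and the
# email address associated with your GitHub account.
use_git_config(
  user.name = "your-github-username",
  user.email = "your-email@example.com"
)
```

This writes the values to your user-level git configuration, so it only needs to be run once per container.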
Now you’re ready to start! Open the R Markdown file (the one with extension .Rmd) to begin the assignment.
The top portion of your R Markdown file (between the three dashed lines) is called YAML. It stands for “YAML Ain’t Markup Language”. It is a human friendly data serialization standard for all programming languages. All you need to know is that this area is called the YAML (we will refer to it as such) and that it contains meta information about your document.
Before we introduce the data, let’s update the name and date in the YAML.
Open the R Markdown (Rmd) file in your project, input your name for the author name and today’s date for the date, and knit the document.
Make sure you push all the files from the Git pane to your assignment repo on GitHub. The Git pane should be empty after you push. If it’s not, click the box next to the remaining files, write an informative commit message, and push.
Today’s data comes from Durham Open Data, a program managed by the City of Durham’s Technology Solutions Department and the County of Durham’s Information Services & Technology Department to provide open data about our community. You can read more about their mission here.
The data for this lab contains characteristics of proposed and existing outdoor trails in Durham, NC. The data is in the durham_trails.csv file in the data folder. Use the code below to load the data set. Note: This data set only includes trails with a reported value for length.
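The loading code is not reproduced above. A minimal sketch of reading the file is shown here, assuming the readr and dplyr packages are installed; the object name durham_trails is an assumption, not something fixed by the lab.

```r
# Path to the CSV file inside the project's data folder (from the text above).
trails_path <- "data/durham_trails.csv"

# The guard lets this chunk run without error outside the lab project.
if (file.exists(trails_path)) {
  # Assumption: readr is installed; base read.csv() would also work.
  durham_trails <- readr::read_csv(trails_path)

  # Quick look at the column names and types.
  dplyr::glimpse(durham_trails)
}
```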
A full list of the variables in the dataset is available here. For today’s analysis, we will primarily focus on the following variables:
status | Whether the trail is proposed or existing
dogwalking | Whether dogs are allowed on the trail
length | Length of the trail in miles
trailtype | Description of trail (crosswalk, sidewalk, walking path, etc.)
Write all code and narrative in your R Markdown file. Throughout the assignment, you should periodically knit your R Markdown document to produce the updated PDF, commit the changes in the Git pane, and push the updated files to GitHub.
As you complete the lab…
✅ Write all narrative in complete sentences.
✅ Use an informative title and informative axis labels on all visualizations.
We want to explore whether there is a difference between the typical length of trails that allow dog walking versus those that don’t. To answer this question, we will use statistical inference to determine if the mean length differs between the two types of trails.
Let’s begin by examining the distribution of the trail status with a data visualization and summary statistics.
Make a graph to visualize the distribution of status, and calculate the proportion of observations in each category of status. You can use the code below to help you calculate the proportions.
Describe the distribution of status.
Since the primary goal is to analyze the length of trails in Durham, we will just use data for trails that existed as of November 18, 2019, the latest edit date in the data. Create a subset that only includes existing trails. You will use the subset of existing trails for exercises 3 - 7.
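The proportion calculation and the subsetting step can be sketched on a small toy data frame. The values below are made up for illustration and are not the Durham data; the category labels "Existing" and "Proposed" and all object names are assumptions.

```r
library(dplyr)

# Toy data with made-up values, NOT the Durham trails data.
toy_trails <- tibble(
  status = c("Existing", "Existing", "Proposed", "Existing"),
  length = c(0.5, 2.0, 1.2, 12.0)
)

# Count the rows in each category, then divide by the total count.
status_props <- toy_trails %>%
  count(status) %>%
  mutate(proportion = n / sum(n))

status_props

# Keep only the rows for existing trails.
toy_existing <- toy_trails %>%
  filter(status == "Existing")
```

Saving the subset under a new name keeps the full data available in case you need the proposed trails later.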
Next, let’s examine the distribution of length for existing trails. Make a graph to visualize the distribution of length.
See Section 7.3.1 “Visualizing Distributions” or the ggplot2 reference page for details and example code.
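As one possible starting point, the sketch below draws a histogram of a numeric variable with ggplot2. The data frame holds made-up lengths, and the bin width of 1 mile is an arbitrary choice for illustration; a different graph type or bin width may suit the real data better.

```r
library(ggplot2)

# Toy data with made-up lengths in miles, NOT the Durham trails data.
toy_existing <- data.frame(
  length = c(0.1, 0.3, 0.5, 0.8, 1.2, 2.0, 4.5, 12.0)
)

# Histogram with an informative title and axis labels.
length_hist <- ggplot(toy_existing, aes(x = length)) +
  geom_histogram(binwidth = 1) +
  labs(
    title = "Distribution of trail length (toy data)",
    x = "Length (miles)",
    y = "Number of trails"
  )

length_hist
```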
This is a good place to knit, commit, and push changes to your remote lab-02 repo on GitHub. Be sure to write an informative commit message (e.g. “Completed exercises 1 - 3”), and push every file to GitHub by clicking the checkbox next to each file in the Git pane. After you push the changes, the Git pane in RStudio should be empty.
Though we can approximate the center and spread of a distribution from visualizations, it is useful to calculate summary statistics to more precisely describe distributions. Calculate the following summary statistics for length:
Hint: You can use the summarise reference page for code and examples.
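The list of required statistics is not reproduced above, so the sketch below simply shows the general pattern with dplyr's summarise() on made-up data. The particular statistics chosen here (count, mean, median, standard deviation, IQR, minimum, maximum) are an assumption about what might be asked.

```r
library(dplyr)

# Toy data with made-up lengths in miles, NOT the Durham trails data.
toy_existing <- tibble(
  length = c(0.1, 0.3, 0.5, 0.8, 1.2, 2.0, 4.5, 12.0)
)

# One row of output, one column per summary statistic.
length_summary <- toy_existing %>%
  summarise(
    n_trails = n(),
    mean_length = mean(length),
    median_length = median(length),
    sd_length = sd(length),
    iqr_length = IQR(length),
    min_length = min(length),
    max_length = max(length)
  )

length_summary
```

In this toy example the single 12-mile value pulls the mean well above the median, which is the kind of contrast to look for when deciding which statistics best describe a skewed distribution.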
Which summary statistics are most useful for describing the distribution of length? In other words, which summary statistics most accurately reflect the center and spread of the distribution? Briefly explain.
Use the visualization and summary statistics to describe the distribution of length. Include a description of the shape, center, spread, and potential outliers.
People are more likely to take their dogs on trails that can be completed in a day, so let’s focus on trails that could feasibly be completed as a day hike. Create a subset that only includes existing trails that are 10 miles or less. You will use this subset for the remainder of the lab.
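Filtering on two conditions at once can be sketched as follows, again on made-up data. The status label "Existing" and the object names are assumptions.

```r
library(dplyr)

# Toy data with made-up values, NOT the Durham trails data.
toy_trails <- tibble(
  status = c("Existing", "Existing", "Proposed", "Existing"),
  length = c(0.5, 10.0, 1.2, 12.0)
)

# Conditions separated by a comma must all hold for a row to be kept.
# "10 miles or less" includes trails that are exactly 10 miles, hence <=.
toy_day_hikes <- toy_trails %>%
  filter(status == "Existing", length <= 10)

toy_day_hikes
```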
This is a good place to knit, commit, and push changes to your remote lab-02 repo on GitHub. Be sure to write an informative commit message (e.g. “Completed exercises 4 - 7”), and push every file to GitHub by clicking the checkbox next to each file in the Git pane. After you push the changes, the Git pane in RStudio should be empty.
Next, let’s examine dogwalking.
What are the values of dogwalking in the dataset? To answer this question, use code to list the values of dogwalking and how often each one occurs. Note: We use code to help us answer this question, so we can produce the result in a reproducible way.
There are missing values of dogwalking currently in the data. What do you think a missing value of dogwalking most likely indicates?
Impute (i.e., fill in) the missing values of dogwalking with the most reasonable value. Then, display the distribution of dogwalking to check that the missing values were correctly imputed.
Now that we’ve completed the univariate EDA (i.e. examining one variable at a time), let’s examine the relationship between the length of the trail and whether dog walking is allowed. Make a graph to visualize the relationship between length and dogwalking and calculate the appropriate summary statistics. Include informative axis labels and an informative title on your graph.
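The dogwalking steps above (listing values, filling in missing values, then comparing length across groups) can be sketched on made-up data. The labels "Yes" and "No", the replacement value stored in fill_value, and all object names are hypothetical; deciding what a missing value most likely means in the real data is part of the exercise.

```r
library(dplyr)
library(ggplot2)

# Toy data with made-up values, NOT the Durham trails data.
toy_day_hikes <- tibble(
  dogwalking = c("Yes", NA, "Yes", NA, "Yes", NA),
  length = c(0.4, 1.1, 2.5, 0.7, 3.2, 5.0)
)

# List each value and how often it occurs; NA gets its own row.
toy_day_hikes %>% count(dogwalking)

# Hypothetical replacement value, chosen only for illustration.
fill_value <- "No"

# Replace missing values and leave all other values unchanged.
toy_imputed <- toy_day_hikes %>%
  mutate(dogwalking = if_else(is.na(dogwalking), fill_value, dogwalking))

# Check the imputation: there should be no NA row any more.
dog_counts <- toy_imputed %>% count(dogwalking)

# Side-by-side boxplots compare a numeric variable across categories.
dog_boxplot <- ggplot(toy_imputed, aes(x = dogwalking, y = length)) +
  geom_boxplot() +
  labs(
    title = "Trail length by dog walking status (toy data)",
    x = "Dog walking allowed",
    y = "Length (miles)"
  )

# The same summary statistics, computed separately for each group.
dog_summary <- toy_imputed %>%
  group_by(dogwalking) %>%
  summarise(
    n_trails = n(),
    mean_length = mean(length),
    median_length = median(length),
    sd_length = sd(length)
  )
```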
Describe the relationship between length and dogwalking. In other words, describe how the distribution of length compares between trails that allow dog walking versus those that don’t. Include information from the graph and summary statistics from the previous exercise in your response.
This is a good place to knit, commit, and push changes to your remote lab-02 repo on GitHub. Be sure to write an informative commit message (e.g. “Completed exercises 8 - 11”), and push every file to GitHub by clicking the checkbox next to each file in the Git pane. After you push the changes, the Git pane in RStudio should be empty.
\[\begin{aligned} &H_0: \mu_{yes} = \mu_{no} \\ &H_a: \mu_{yes} \neq \mu_{no} \end{aligned}\]
State the null and alternative hypotheses in words in the context of the data.
Conduct the test in R. Show the code and resulting output.
What does the test statistic mean? Write your response in the context of the data.
State the conclusion of the test in the context of the data.
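A two-sided test of these hypotheses can be sketched with base R's t.test() on made-up data. Note that t.test() uses the Welch (unequal variances) version by default; whether the course expects that version or a pooled-variance test is an assumption to check against the lecture notes. The labels and object names are hypothetical.

```r
# Toy data with made-up values, NOT the Durham trails data.
toy_imputed <- data.frame(
  dogwalking = c("Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No"),
  length = c(0.4, 2.5, 3.2, 1.9, 1.1, 0.7, 5.0, 2.2)
)

# Two-sided, two-sample t-test of mean length by dogwalking group.
# The formula reads "length explained by dogwalking".
length_test <- t.test(
  length ~ dogwalking,
  data = toy_imputed,
  alternative = "two.sided"
)

length_test

# Individual pieces of the result.
length_test$statistic  # difference in sample means divided by its standard error
length_test$p.value    # two-sided p-value
length_test$conf.int   # confidence interval for the difference in means
```

With a formula, the groups are ordered alphabetically, so the difference here is the mean for "No" minus the mean for "Yes". The test statistic reports how many standard errors that observed difference lies from the value of zero stated in the null hypothesis.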
You’re done and ready to submit your work! Knit, commit, and push all remaining changes. You can use the commit message “Done with Lab 2!”, and make sure you have pushed all the files to GitHub (your Git pane in RStudio should be empty) and that all documents are updated in your repo on GitHub. Then submit the assignment on Gradescope following the instructions below.
In this class, we’ll be submitting PDF documents to Gradescope.
Before you wrap up the assignment, make sure all documents are updated on your GitHub repo. We will be checking these to make sure you have been practicing how to commit and push changes.
Remember – you must turn in a PDF file to the Gradescope page before the submission deadline for full credit.
To submit your assignment:
Go to http://www.gradescope.com and click Log in in the top right corner.
Click School Credentials ➡️ Duke NetID and log in using your NetID credentials.
Click on your STA 210 course.
Click on the assignment, and you’ll be prompted to submit it.
Mark the pages associated with each exercise. All of the pages of your lab should be associated with at least one question (i.e., should be “checked”).
Select the first page of your PDF submission to be associated with the “Overall” section.
Chapter 2: Get Started, in Data Visualization by Kieran Healy
Chapter 3: Data visualization, in R for Data Science by Hadley Wickham
RStudio Cloud Primers