Lab 02: Trails in Durham, NC

due Sun, Feb 07 at 11:59p

The primary objectives for this lab are to continue practicing with the computing workflow and exploring relationships between multiple variables. You will also use statistical inference to draw conclusions as you analyze data about trails in Durham, North Carolina.

Learning goals

By the end of the lab you will…

Getting Started

Each of your assignments will begin with the steps outlined in this section. For more detailed instructions about getting started, see the Lab 01 instructions.

Clone the repo & start new RStudio project

Configure git

One more thing before you get started. We need to configure your git so that RStudio can communicate with GitHub. This requires two pieces of information: your GitHub username and the email address associated with your GitHub account.

Type the following lines of code in the console in RStudio, filling in your GitHub username and the email address associated with your GitHub account.

library(usethis)
use_git_config(user.name = "GitHub username",
               user.email = "your email")

Now you’re ready to start! Open the R Markdown file (the one with extension .Rmd) to begin the assignment.

YAML

The top portion of your R Markdown file (between the two lines of three dashes) is called YAML. It stands for “YAML Ain’t Markup Language”. It is a human-friendly data serialization standard for all programming languages. All you need to know is that this area is called the YAML (we will refer to it as such) and that it contains meta information about your document.

Before we introduce the data, let’s update the name and date in the YAML.

Open the R Markdown (Rmd) file in your project, input your name for the author name and today’s date for the date, and knit the document.

Committing and pushing changes:

Make sure you push all the files from the Git pane to your assignment repo on GitHub. The Git pane should be empty after you push. If it’s not, click the box next to the remaining files, write an informative commit message, and push.

Packages

We will use the following packages in today’s lab.

library(tidyverse)
library(infer)

Data: Trails in Durham, NC

Today’s data comes from Durham Open Data, a program managed by the City of Durham’s Technology Solutions Department and the County of Durham’s Information Services & Technology Department to provide open data about our community. You can read more about their mission here.

The data for this lab contains characteristics of proposed and existing outdoor trails in Durham, NC. The data is in the durham_trails.csv file in the data folder. Use the code below to load the data set. Note: This data set only includes trails with a reported value for length.

trails <- read_csv("data/durham_trails.csv")

A full list of the variables in the dataset is available here. For today’s analysis, we will primarily focus on the following variables:

status: Whether the trail is proposed or existing
dogwalking: Whether dogs are allowed on the trail
length: Length of the trail in miles
trailtype: Description of trail (crosswalk, sidewalk, walking path, etc.)

Exercises

Instructions

Write all code and narrative in your R Markdown file. Throughout the assignment, you should periodically knit your R Markdown document to produce the updated PDF, commit the changes in the Git pane, and push the updated files to GitHub.

As you complete the lab…

✅ Write all narrative in complete sentences.

✅ Use an informative title and informative axis labels on all visualizations.

We want to explore whether there is a difference between the typical length of trails that allow dog walking versus those that don’t. To answer this question, we will use statistical inference to determine if the mean length differs between the two types of trails.

  1. Let’s begin by examining the distribution of the trail status with a data visualization and summary statistics.

    • Create a visualization for the distribution of status, and calculate the proportion of observations in each category of status. You can use the code below to help you calculate the proportions.
    • Use the plot and summary statistics to describe the distribution of status.
_____ %>%
  count(_____) %>%
  mutate(proportion = n / sum(n))
  2. Since the primary goal is to analyze the length of trails in Durham, we will use only the data for trails that existed as of November 18, 2019, the latest edit date in the data. Create a subset that only includes existing trails. You will use the subset of existing trails for exercises 3 - 7.

  3. Next let’s examine the distribution of length for existing trails. Make a graph to visualize the distribution of length.

See Section 7.3.1 “Visualizing Distributions” or the ggplot2 reference page for details and example code.
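The patterns used in exercises 1 - 3 can be sketched on a small made-up data set. This is only an illustration under stated assumptions: the tibble toy_paths and its columns surface and distance are hypothetical and are not part of the Durham trails data, and filter() and geom_histogram() are just one reasonable choice for subsetting and for visualizing a numeric variable.

```r
library(tidyverse)

# Hypothetical toy data, NOT the Durham trails data
toy_paths <- tibble(
  surface  = c("paved", "paved", "gravel", "paved", "gravel", "dirt"),
  distance = c(0.4, 1.2, 2.5, 0.8, 6.1, 3.3)
)

# Count the rows in each category, then convert the counts to proportions
surface_props <- toy_paths %>%
  count(surface) %>%
  mutate(proportion = n / sum(n))

# Keep only the rows that belong to one category
paved_only <- toy_paths %>%
  filter(surface == "paved")

# Histogram of a numeric variable with an informative title and axis labels
distance_hist <- ggplot(toy_paths, aes(x = distance)) +
  geom_histogram(binwidth = 1) +
  labs(
    title = "Distribution of path distance",
    x = "Distance (miles)",
    y = "Number of paths"
  )
```

Typing surface_props or distance_hist in the console displays the table or the plot. The binwidth of 1 is arbitrary here; for your own data, try a few values and keep the one that shows the shape of the distribution most clearly.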

This is a good place to knit, commit, and push changes to your remote lab-02 repo on GitHub. Be sure to write an informative commit message (e.g. “Completed exercises 1 - 3”), and push every file to GitHub by clicking the checkbox next to each file in the Git pane. After you push the changes, the Git pane in RStudio should be empty.

  4. Though we can approximate the center and spread of a distribution from visualizations, it is useful to calculate summary statistics to more precisely describe distributions. Calculate the following summary statistics for length:

    • min, max
    • mean, median
    • Q1, Q3
    • IQR, standard deviation

Hint: You can use the summarise reference page for code and examples.
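As a rough illustration of the summarise pattern, the sketch below computes the same set of statistics on a made-up numeric column. The tibble toy_paths and its column distance are hypothetical; the output column names are arbitrary labels chosen for this example.

```r
library(tidyverse)

# Hypothetical toy data, NOT the Durham trails data
toy_paths <- tibble(distance = c(0.4, 1.2, 2.5, 0.8, 6.1, 3.3))

# One row of summary statistics; each argument becomes a column
toy_summary <- toy_paths %>%
  summarise(
    min    = min(distance),
    q1     = quantile(distance, 0.25, names = FALSE),
    median = median(distance),
    q3     = quantile(distance, 0.75, names = FALSE),
    max    = max(distance),
    mean   = mean(distance),
    iqr    = IQR(distance),
    sd     = sd(distance)
  )

toy_summary
```

If the variable you are summarising contains missing values, each of these functions accepts na.rm = TRUE to drop them before calculating.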

  5. Which summary statistics are most useful for describing the distribution of length? In other words, which summary statistics most accurately reflect the center and spread of the distribution? Briefly explain.

  6. Use the visualization and summary statistics to describe the distribution of length. Include a description of the shape, center, spread, and potential outliers.

  7. People are more likely to take their dogs on trails that can be completed in a day, so let’s focus on trails that could feasibly be completed as a day hike. Create a subset that only includes existing trails that are 10 miles or less. You will use this subset for the remainder of the lab.

This is a good place to knit, commit, and push changes to your remote lab-02 repo on GitHub. Be sure to write an informative commit message (e.g. “Completed exercises 4 - 7”), and push every file to GitHub by clicking the checkbox next to each file in the Git pane. After you push the changes, the Git pane in RStudio should be empty.

  8. Now that we’ve explored length, let’s look at the distribution of dogwalking.
    • What are the values of dogwalking in the dataset? To answer this question, use code to list the values of dogwalking and how often each one occurs. Note: We use code to help us answer this question, so we can produce the result in a reproducible way.
    • Consider the values of dogwalking currently in the data. What do you think a missing value of dogwalking most likely indicates?
  9. Complete the code below to impute (i.e. fill in) the missing values of dogwalking with the most reasonable value. Then, display the distribution of dogwalking to check that the missing values were correctly imputed.
_____ <- _____ %>%
  mutate(dogwalking = if_else(is.na(dogwalking), ______, dogwalking))
  10. Now that we’ve completed the univariate EDA (i.e. examining one variable at a time), let’s examine the relationship between the length of the trail and whether dog walking is allowed. Make a graph to visualize the relationship between length and dogwalking and calculate the appropriate summary statistics. Include informative axis labels and a title on your graph.

  11. Describe the relationship between length and dogwalking. In other words, describe how the distribution of length compares between trails that allow dog walking versus those that don’t. Include information from the graph and summary statistics from the previous exercise in your response.
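The imputation and comparison steps in exercises 9 - 11 follow a common pattern, sketched here on made-up data. Everything in this example is an assumption for illustration: the tibble toy_paths, its columns lighting and distance, and the replacement label "unknown" are hypothetical. The right replacement value for your data depends on what you concluded in Exercise 8, and a boxplot is only one reasonable way to compare a numeric variable across groups.

```r
library(tidyverse)

# Hypothetical toy data, NOT the Durham trails data
toy_paths <- tibble(
  lighting = c("yes", NA, "yes", "no", NA, "no"),
  distance = c(0.4, 1.2, 2.5, 0.8, 6.1, 3.3)
)

# Replace missing values with a chosen label ("unknown" is a placeholder)
toy_paths <- toy_paths %>%
  mutate(lighting = if_else(is.na(lighting), "unknown", lighting))

# Check the result: there should be no NA category left
lighting_counts <- toy_paths %>%
  count(lighting)

# Summary statistics of the numeric variable within each group
group_summary <- toy_paths %>%
  group_by(lighting) %>%
  summarise(
    n      = n(),
    mean   = mean(distance),
    median = median(distance),
    sd     = sd(distance)
  )

# Side-by-side boxplots to compare the distributions across groups
distance_by_lighting <- ggplot(toy_paths, aes(x = lighting, y = distance)) +
  geom_boxplot() +
  labs(
    title = "Path distance by lighting",
    x = "Lighting available",
    y = "Distance (miles)"
  )
```

Note that if_else() requires the replacement and the original column to have the same type, so a character column needs a quoted character replacement.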

This is a good place to knit, commit, and push changes to your remote lab-02 repo on GitHub. Be sure to write an informative commit message (e.g. “Completed exercises 8 - 11”), and push every file to GitHub by clicking the checkbox next to each file in the Git pane. After you push the changes, the Git pane in RStudio should be empty.

  12. Let’s use statistical inference to determine if the data provide sufficient evidence of a difference in the mean length between trails that allow dog walking and those that don’t. We’ll begin with a hypothesis test. The null and alternative hypotheses are the following:

\[\begin{aligned} &H_0: \mu_{yes} = \mu_{no} \\ &H_a: \mu_{yes} \neq \mu_{no}\end{aligned}\]

  13. Let’s calculate a confidence interval to estimate the difference in the mean length between trails that allow dog walking and those that don’t. If it isn’t displayed in Exercise 12, show the code and output to calculate the confidence interval. Then, interpret your interval in the context of the data.
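One possible way to carry out a two-sided test of a difference in means and obtain the matching confidence interval is sketched below on simulated data. This is an assumption about the approach, not the required method: the lab does not say which function to use, and the sketch uses t_test() from the infer package. The tibble toy_paths, its columns lighting and distance, and the simulation settings are hypothetical.

```r
library(tidyverse)
library(infer)

set.seed(2021)

# Hypothetical simulated data, NOT the Durham trails data
toy_paths <- tibble(
  lighting = factor(rep(c("yes", "no"), each = 30)),
  distance = c(rnorm(30, mean = 2.0, sd = 0.8),
               rnorm(30, mean = 2.5, sd = 0.8))
)

# Two-sided t-test for a difference in means (yes minus no),
# with a 95% confidence interval for that difference
test_result <- toy_paths %>%
  t_test(
    formula     = distance ~ lighting,
    order       = c("yes", "no"),
    alternative = "two-sided",
    conf_int    = TRUE,
    conf_level  = 0.95
  )

test_result
```

The order argument sets the direction of the subtraction, so c("yes", "no") estimates the mean for "yes" minus the mean for "no". The output is a one-row tibble that includes the test statistic, the p-value, and the lower and upper bounds of the confidence interval.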

You’re done and ready to submit your work! Knit, commit, and push all remaining changes. You can use the commit message “Done with Lab 2!”. Make sure you have pushed all the files to GitHub (your Git pane in RStudio should be empty) and that all documents are updated in your repo on GitHub. Then submit the assignment on Gradescope following the instructions below.

Submission

In this class, we’ll be submitting PDF documents to Gradescope.

Before you wrap up the assignment, make sure all documents are updated on your GitHub repo. We will be checking these to make sure you have been practicing how to commit and push changes.

Remember – you must turn in a PDF file to the Gradescope page before the submission deadline for full credit.

To submit your assignment:

Resources for additional practice (optional)