Lab 5 - Probability and Simpson’s Paradox

Lab
Important

This lab is due October 24th at 11:59pm.

Learning Goals

– Calculate single event, conditional, and “and” probabilities

– Interpret probabilities in the context of the problem

– Display a fundamental understanding of Simpson’s Paradox

Exercise 1 - Probability and You

We use probabilities all the time when making decisions. As a group, provide two real world examples of when you’ve used probability to make decisions in your every day life. Think critically. Be creative.

Exercise 2 - Risk of coronary heart disease

This data set is from an ongoing cardiovascular study on residents of the town of Framingham, Massachusetts. We want to examine the relationship between various health characteristics and the risk of having heart disease. Load in the data set called education-disease and answer the following questions below.

# insert-code-here

Note: Show your work by writing your answer as a fraction to potentially receive partial credit. For example, if your numerator is the combination of two values, show the addition to obtain these values. Example: (1+4) / 10 = 0.5.

  1. How many levels of education are there in these data? How many levels of disease are there?

  2. Convert the data to a 4x3 data table where each cell is the number of people falling into each combination of Disease and Education. Hint: use count and pivot_wider.

# insert-code-here

Using your data table from part b, answer the remaining questions.

  1. What is the probability of a random individual being high risk for cardiovascular disease?

  2. What is the probability of a random individual having high school or GED education and not being high risk for cardiovascular disease?

  3. What is the probability that a random individual who is already high risk for cardiovascular disease has a college education?

If you haven’t yet done so, now is a good time to render, commit, and push. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.


Exercise 3 - Computer store

In a computer store, 30% of the computers in stock are laptops and 70% are desktops. Five percent of the laptops are on sale, while 10% of the desktops are on sale.

  1. Fill in the code below to create a hypothetical two-way table to represent this situation. Assume the total number of computers is 1000.
data <- tibble( 
  Type = c(),
  Sale = c(),
  values = ()
  )

data |>
  pivot_wider( 
    names_from = ...,
    values_from = ....)
  1. Calculate the probability of that a randomly selected computer will be a desktop, given that the computer is on sale.

  2. In your own words, explain what this probability means.

  • Problems adapted from Mind on Statistics, 5th Ed. By Utts and Heckard

If you haven’t yet done so, now is a good time to render, commit, and push. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.


Exercise 4 - Bike rentals

Bike sharing systems are new generation of traditional bike rentals where whole process from membership, rental and return back has become automatic. You are tasked to investigate the relationship between the temperature outside and the number of bikes rented in the Washington DC area between the years 2011 and 2022. You will be investigating data for the months June, July, September, and November.

Below is a list of variables and their definitions:

Variable Definition
season numerical representation of Summer, Fall, Winter, and Spring
year numerical representation of 2011 (0) or 2012 (1)
mnth month in which data were collected
holiday indicator variable if data were collected on a holiday
weekday numerical representation of day of week
temp temperature in Celsius
cnt count of bike rentals
  1. Read in the data. Then, create a scatter plot that investigates the relationship between the number of bikes rented and the temperature outside. Include a straight line of best fit to help discuss the discovered relationship. Summarize your findings in 2-3 sentences.
bike <- read_csv("data/bike.csv")
  1. Another researcher suggests to look at the relationship between bikes rented and temperature by each of the four months of interest. Recreate your plot in part a, and color the points by month. Include a straight line for each of the four months to help discuss each month’s relationship between bikes rented and temperature. In 3-4 sentences, summarize your findings.

Please watch the following video on Simpson’s Paradox here. After you do, please answer the following questions.

  1. In your own words, summarize Simpson’s Paradox in 2-3 sentences.

  2. Compare and contrast your findings from part (a) and part (b). What’s different?

  3. Think critically about your answer to part d. What other context from this study could be creating this paradox? That is, identify a potential confounding variable in this study. Be sure to justify how your example could be a potential confounding variable by relating it back to both bike rentals and temperature.

  • Data subset from Progress in Artificial Intelligence, 2013.

Submission

Once you are finished with the lab, you will your final PDF document to Gradescope.

Warning

Before you wrap up the assignment, make sure all documents are updated on your GitHub repo. We will be checking these to make sure you have been practicing how to render and push changes.

You must turn in a PDF file to the Gradescope page by the submission deadline to be considered “on time”.

Make sure your data are tidy! That is, your code should not be running off the pages and spaced properly. See: https://style.tidyverse.org/ggplot2.html

To submit your assignment:

  • Go to http://www.gradescope.com and click Log in in the top right corner.
  • Click School Credentials \(\rightarrow\) Duke NetID and log in using your NetID credentials.
  • Click on your STA 199 course.
  • Click on the assignment, and you’ll be prompted to submit it.
  • Mark all the pages associated with exercise. All the pages of your lab should be associated with at least one question (i.e., should be “checked”). If you do not do this, you will be subject to lose points on the assignment.
  • Select the first page of your .pdf submission to be associated with the “Workflow & formatting” question.

Grading

Component Points
Ex 1 4
Ex 2 14
Ex 3 11
Ex 4 16
Workflow & formatting 5
Total 50
Note

The “Workflow & formatting” component assesses the reproducible workflow. This includes having at least 3 informative render messages, labeling the code chunks, and having readable code that does not exceed 80 characters, i.e., we can read all your code in the rendered PDF.