Lab 5 - Probability and Simpson’s Paradox
Learning Goals
– Calculate single event, conditional, and “and” probabilities
– Interpret probabilities in the context of the problem
– Display a fundamental understanding of Simpson’s Paradox
Exercise 1 - Probability and You
We use probabilities all the time when making decisions. As a group, provide two real-world examples of when you’ve used probability to make decisions in your everyday life. Think critically. Be creative.
Exercise 2 - Risk of coronary heart disease
This data set is from an ongoing cardiovascular study on residents of the town of Framingham, Massachusetts. We want to examine the relationship between various health characteristics and the risk of having heart disease. Load in the data set called education-disease and answer the questions below.
# insert-code-here
Note: Show your work by writing your answer as a fraction to potentially receive partial credit. For example, if your numerator is the combination of two values, show the addition to obtain these values. Example: (1+4) / 10 = 0.5.
How many levels of education are there in these data? How many levels of disease are there?
Convert the data to a 4x3 data table where each cell is the number of people falling into each combination of Disease and Education. Hint: use count() and pivot_wider().
# insert-code-here
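As a rough illustration of the count-then-pivot pattern mentioned in the hint, here is a minimal sketch on made-up data. The tibble `toy` and its columns `group` and `outcome` are hypothetical and are not the lab data; only `count()` and `pivot_wider()` come from the hint.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data (NOT the lab data): one row per person.
toy <- tibble(
  group   = c("a", "a", "b", "b", "b", "c"),
  outcome = c("yes", "no", "yes", "yes", "no", "no")
)

toy_wide <- toy |>
  # One row per (group, outcome) combination, with the tally in column n.
  count(group, outcome) |>
  # Spread the outcome levels into columns; combinations that never occur become 0.
  pivot_wider(names_from = outcome, values_from = n, values_fill = 0)

toy_wide
```

Each cell of `toy_wide` is the number of rows in `toy` with that combination of `group` and `outcome`, which is the same shape of table the exercise asks for.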
Using your data table from part b, answer the remaining questions.
What is the probability of a random individual being high risk for cardiovascular disease?
What is the probability of a random individual having high school or GED education and not being high risk for cardiovascular disease?
What is the probability that a random individual who is already high risk for cardiovascular disease has a college education?
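As a reminder of how the three kinds of probability in the learning goals can be read off a table of counts, here is a general sketch. The notation is an assumption introduced for illustration: \(n\) is the table total, \(n_A\) is the total count for event \(A\), \(n_B\) is the total count for event \(B\), and \(n_{A \text{ and } B}\) is the count in the cell where both occur.

\[
P(A) = \frac{n_A}{n}, \qquad
P(A \text{ and } B) = \frac{n_{A \text{ and } B}}{n}, \qquad
P(A \mid B) = \frac{n_{A \text{ and } B}}{n_B}
\]

The key difference is the denominator: single event and “and” probabilities divide by the whole table, while a conditional probability divides only by the total for the event that is given.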
If you haven’t yet done so, now is a good time to render, commit, and push. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.
Exercise 3 - Computer store
In a computer store, 30% of the computers in stock are laptops and 70% are desktops. Five percent of the laptops are on sale, while 10% of the desktops are on sale.
- Fill in the code below to create a hypothetical two-way table to represent this situation. Assume the total number of computers is 1000.
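# Fill in each c() with the values for that column.
data <- tibble(
  Type = c(),
  Sale = c(),
  values = c()
)

# Replace each ... with the appropriate column name.
data |>
  pivot_wider(
    names_from = ...,
    values_from = ...
  )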
Calculate the probability that a randomly selected computer will be a desktop, given that the computer is on sale.
In your own words, explain what this probability means.
- Problems adapted from Mind on Statistics, 5th ed., by Utts and Heckard
If you haven’t yet done so, now is a good time to render, commit, and push. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.
Exercise 4 - Bike rentals
Bike sharing systems are a new generation of traditional bike rentals in which the whole process, from membership to rental and return, has become automatic. You are tasked with investigating the relationship between the temperature outside and the number of bikes rented in the Washington, DC area during 2011 and 2012. You will be investigating data for the months of June, July, September, and November.
Below is a list of variables and their definitions:
Variable | Definition |
---|---|
season | numerical representation of Summer, Fall, Winter, and Spring |
year | numerical representation of 2011 (0) or 2012 (1) |
mnth | month in which data were collected |
holiday | indicator variable if data were collected on a holiday |
weekday | numerical representation of day of week |
temp | temperature in Celsius |
cnt | count of bike rentals |
- Read in the data. Then, create a scatter plot that investigates the relationship between the number of bikes rented and the temperature outside. Include a straight line of best fit to help discuss the discovered relationship. Summarize your findings in 2-3 sentences.
bike <- read_csv("data/bike.csv")
- Another researcher suggests looking at the relationship between bikes rented and temperature for each of the four months of interest. Recreate your plot from part a, and color the points by month. Include a straight line for each of the four months to help discuss each month’s relationship between bikes rented and temperature. In 3-4 sentences, summarize your findings.
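For reference, the sketch below shows one way to draw a scatter plot with a single line of best fit and then with one line per group. It uses the built-in `mtcars` data as a stand-in, so the variables `wt`, `mpg`, and `cyl` are not part of the bike data, and the plot names are hypothetical.

```r
library(ggplot2)

# mtcars ships with R and stands in for the lab data here.
# One straight line of best fit through all points.
p_overall <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# Mapping color to a categorical variable colors the points
# and fits a separate straight line within each group.
p_by_group <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(color = "cyl")

p_overall
p_by_group
```

Wrapping the grouping variable in `factor()` matters when it is stored as a number: a numeric color mapping gives a continuous color scale and a single fitted line rather than one line per group.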
Please watch the following video on Simpson’s Paradox here. After you do, please answer the following questions.
In your own words, summarize Simpson’s Paradox in 2-3 sentences.
Compare and contrast your findings from part (a) and part (b). What’s different?
Think critically about your answer to part d. What other context from this study could be creating this paradox? That is, identify a potential confounding variable in this study. Be sure to justify how your example could be a potential confounding variable by relating it back to both bike rentals and temperature.
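To make the idea of a reversal concrete, here is a small simulation sketch that is unrelated to the bike data. All names and numbers (`sim`, `group`, `center`, the slopes used to generate `y`) are assumptions chosen for illustration: within each group `y` decreases as `x` increases, but groups with larger `x` also have larger `y` overall.

```r
set.seed(199)

n <- 100
group  <- rep(c("A", "B", "C"), each = n)
center <- rep(c(0, 5, 10), each = n)  # each group sits at a different x level

# Within a group, y falls as x rises; across groups, y rises with the group level.
x <- center + rnorm(3 * n)
y <- 2 * center - (x - center) + rnorm(3 * n)
sim <- data.frame(group, x, y)

# Slope of the single line fit to all points, ignoring group.
overall_slope <- coef(lm(y ~ x, data = sim))[["x"]]

# Slope of the line fit separately within each group.
within_slopes <- sapply(
  split(sim, sim$group),
  function(d) coef(lm(y ~ x, data = d))[["x"]]
)

overall_slope
within_slopes
```

The overall slope comes out positive while every within-group slope is negative, because `group` is related to both `x` and `y`. That is the role a confounding variable plays in Simpson’s Paradox.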
- Data subset from Progress in Artificial Intelligence, 2013.
Submission
Once you are finished with the lab, you will submit your final PDF document to Gradescope.
To submit your assignment:
- Go to http://www.gradescope.com and click Log in in the top right corner.
- Click School Credentials \(\rightarrow\) Duke NetID and log in using your NetID credentials.
- Click on your STA 199 course.
- Click on the assignment, and you’ll be prompted to submit it.
- Mark all the pages associated with each exercise. All the pages of your lab should be associated with at least one question (i.e., should be “checked”). If you do not do this, you may lose points on the assignment.
- Select the first page of your .pdf submission to be associated with the “Workflow & formatting” question.
Grading
Component | Points |
---|---|
Ex 1 | 4 |
Ex 2 | 14 |
Ex 3 | 11 |
Ex 4 | 16 |
Workflow & formatting | 5 |
Total | 50 |