Inference overview + Review
rc <- read_csv("data/roller_coasters.csv")
In this application exercise, we will be looking at a roller coaster and amusement park database by Duane Marden.1 This database records multiple features of roller coasters. For the purpose of this activity, we will work with a random sample of 157 roller coasters.
Variable | Description |
---|---|
age_group |
1: Older (Built between 1900-1979) 2: Recent (1980-1999) 3: Newest (2000-current) |
coaster |
Name of the roller coaster |
park |
Name of the park where the roller coaster is located |
city |
City where the roller coaster is located |
state |
State where the roller coaster is located |
type |
Material of track (Steel or Wooden) |
year_opened |
Year when roller coaster opened |
top_speed |
Maximum speed of roller coaster (mph) |
max_height |
Highest point of roller coaster (ft) |
drop |
Length of largest gap between high and low points of roller coaster (ft) |
length |
Length of roller coaster track (ft) |
duration |
Time length of roller coaster ride (sec) |
inversions |
Whether or not roller coaster flips passengers at any point (Yes or No) |
num_of_inversions |
Number of times roller coaster flips passengers |
Practice with data manipulation
-
Demo: Create a new variable called
opened_recently
that has a response ofyes
if the roller coaster was opened after 1970, andno
if the roller coaster was built during or before 1970. Save this new factor variable in therc
data set. Additionally, maketype
a factor variable.
We want to investigate the relationship between how fast a roller coaster goes, and how long a roller coaster lasts. Specifically, we are interested in how well the duration of a roller coaster explains how fast it is.
- Your turn: Based on this research question, which two variables should we use from our data set? Which is our response?
top_speed; duration
top_speed is our response
Your turn: Describe how we can construct a bootstrap distribution for the slope of the model predicting
top_speed
fromduration
.Demo: Now, construct this bootstrap distribution.
set.seed(12345)
boot_df_slope <- rc |>
specify(top_speed ~ duration) |>
generate(reps = 1000, type = "bootstrap") |>
calculate(stat = "slope")
Warning: Removed 34 rows containing missing values.
- Demo: Create a 90% confidence interval by filling in the code below.
# A tibble: 1 × 2
lower upper
<dbl> <dbl>
1 -0.0195 0.132
- Demo: Interpret the confidence interval in the context of the problem.
We are 90% confident that for a 1 min increase in duration, the true/population mean speed (y) is expected to be 0.0195 lower to 0.132 mph higher.
- Demo: Prediction: How is this different than what we’ve done before?
Now let’s practice predicting!
- Demo: Report the tidy output of your model below
speed_fit <- linear_reg() |>
set_engine("lm") |>
fit(top_speed ~ duration, data = rc)
tidy(speed_fit)
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 52.0 4.08 12.8 3.99e-24
2 duration 0.0560 0.0305 1.83 6.92e- 2
- Below is the estimated model in proper notation.
\(\widehat{top~speed} = 52 + 0.056 \times duration\)
- Demo: Use your model to estimate the top speed of a roller coaster if their duration is 155 minutes
Interpret
- Now, interpret the slope coefficient.
For a 1 unit minute increase in duration, we estimate on average a 0.0560 increase in mph.
- What if our response was logged? How does the interpretation change?
\(\widehat{log(TopSpeed)} = 52 + 0.056 \times Duration\)
Recall that to interpret the slope value we need to exponentiate it if we want to be back on the original scale!
Why?
\(log(y) = \beta_o + \beta_1x\)
y = \(exp(\beta_o + \beta_1x)\))
y = \(exp(\beta_o) * exp(\beta_1x)\))
This implies that our explanatory variable has a multiplicative relationship with our response variable
For a one minute increase in duration, we estimate on average, top speed to increase by a factor of 0.05, or by about 5%.
Research question 3
We are also interested in investigating if roller coasters after 1970 are faster than those opened before. For this question, we want to estimate the difference.
Should we make a confidence interval or hypothesis test?
Demo: Now, use bootstrapping to estimate the difference between the true speeds of roller coasters before and after 1970.
set.seed(12345)
boot_df <- rc |>
specify(top_speed ~ opened_recently) |>
generate(reps = 1000, type = "bootstrap") |>
calculate(stat = "diff in means", order = c("yes", "no"))
Warning: Removed 11 rows containing missing values.
- Your turn: Create a 99% confidence interval and interpret it in the context of the data and the research question.
# A tibble: 1 × 2
lower upper
<dbl> <dbl>
1 8.88 23.3
ORDER MATTERS (yes - no)
We are 99% confident that the true mean speed for roller coasters opened after 1970 (\(\mu_1\)) is (9.15, 23) mph LARGER than the true mean speed for roller coasters opened before 1970 (\(\mu_2\))
Discussion
- Brainstorm the different between bootstrap and permute. Write down the key differences below. Hint: Think back to the different sampling techniques between question 1 and question 2 / 3.
The main difference between permutation and bootstrapping is that bootstrapping resamples, with replacement from our data. Permutation can be thought of as shuffling / distributing our data / resampling without replacement.
Interpretation - Return of the Penguins
We will fit a SRL, MLR, and logistic regression model and compare how we interpret the slope coefficient.
Attaching package: 'palmerpenguins'
The following object is masked from 'package:modeldata':
penguins
glimpse(penguins)
Rows: 344
Columns: 8
$ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex <fct> male, female, female, NA, female, male, female, male…
$ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
Simple Linear Regression
modelSLR <- linear_reg() |>
set_engine("lm") |>
fit(flipper_length_mm ~ bill_length_mm, data = penguins)
tidy(modelSLR)
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 127. 4.67 27.2 3.65e-87
2 bill_length_mm 1.69 0.105 16.0 1.74e-43
- Interpret the slope coefficient assocciated with bill_length_mm in the context of the problem
For a 1 mm increase in bill length, we estimate on average, flipper length to increase by 1.69 mm.
Add response
Multiple Linear Regression
modelMLR <- linear_reg() |>
set_engine("lm") |>
fit(flipper_length_mm ~ bill_length_mm + island, data = penguins)
tidy(modelMLR)
# A tibble: 4 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 141. 4.03 35.1 9.60e-115
2 bill_length_mm 1.51 0.0879 17.2 4.74e- 48
3 islandDream -15.0 0.971 -15.4 4.51e- 41
4 islandTorgersen -8.98 1.42 -6.34 7.35e- 10
- Interpret the slope coefficient associated with bill_length_mm in the context of the problem
For a 1 mm increase in bill length, we estimate on average, flipper length to increase by 1.51 mm, holding island constant.
Logistic Regression
modellog <- logistic_reg() |>
set_engine("glm") |>
fit(sex ~ bill_length_mm, data = penguins)
tidy(modellog)
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -6.04 1.01 -5.96 0.00000000247
2 bill_length_mm 0.138 0.0229 6.02 0.00000000176
- Interpret the slope coefficient associated with bill_length_mm in the context of the problem and on the original scale
For a 1 unit increase in bill length, we expect the odds of being female to increase by a factor of .14, or by about 14%.