Inference overview + Review

library(tidyverse)
library(tidymodels)

rc <- read_csv("data/roller_coasters.csv")

In this application exercise, we will be looking at a roller coaster and amusement park database by Duane Marden.¹ This database records multiple features of roller coasters. For the purpose of this activity, we will work with a random sample of 157 roller coasters.

Variable	Description
`age_group`	1: Older (Built between 1900-1979) 2: Recent (1980-1999) 3: Newest (2000-current)
`coaster`	Name of the roller coaster
`park`	Name of the park where the roller coaster is located
`city`	City where the roller coaster is located
`state`	State where the roller coaster is located
`type`	Material of track (Steel or Wooden)
`year_opened`	Year when roller coaster opened
`top_speed`	Maximum speed of roller coaster (mph)
`max_height`	Highest point of roller coaster (ft)
`drop`	Length of largest gap between high and low points of roller coaster (ft)
`length`	Length of roller coaster track (ft)
`duration`	Time length of roller coaster ride (sec)
`inversions`	Whether or not roller coaster flips passengers at any point (Yes or No)
`num_of_inversions`	Number of times roller coaster flips passengers

Practice with data manipulation

Demo: Create a new variable called opened_recently that has a response of yes if the roller coaster was opened after 1970, and no if the roller coaster was built during or before 1970. Save this new factor variable in the rc data set. Additionally, make type a factor variable.

rc <- rc |>
  mutate(
    opened_recently = if_else(year_opened > 1970, "yes", "no"), 
    opened_recently = as.factor(opened_recently),
    type = as.factor(type)
  )

We want to investigate the relationship between how fast a roller coaster goes, and how long a roller coaster lasts. Specifically, we are interested in how well the duration of a roller coaster explains how fast it is.

Your turn: Based on this research question, which two variables should we use from our data set? Which is our response?

top_speed; duration

top_speed is our response

Your turn: Describe how we can construct a bootstrap distribution for the slope of the model predicting top_speed from duration.
Demo: Now, construct this bootstrap distribution.

set.seed(12345)

boot_df_slope <- rc |>
  specify(top_speed ~ duration) |>
  generate(reps = 1000, type = "bootstrap") |>
  calculate(stat = "slope")

Warning: Removed 34 rows containing missing values.

Demo: Create a 90% confidence interval by filling in the code below.

boot_df_slope |>
  summarize(
    lower = quantile(stat, 0.05),
    upper = quantile(stat, 0.95)
  )

# A tibble: 1 × 2
    lower upper
    <dbl> <dbl>
1 -0.0195 0.132

Demo: Interpret the confidence interval in the context of the problem.

We are 90% confident that for a 1 min increase in duration, the true/population mean speed (y) is expected to be 0.0195 lower to 0.132 mph higher.

Demo: Prediction: How is this different than what we’ve done before?

Now let’s practice predicting!

Demo: Report the tidy output of your model below

speed_fit <- linear_reg() |>
  set_engine("lm") |>
  fit(top_speed ~ duration, data = rc) 

tidy(speed_fit)

# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)  52.0       4.08       12.8  3.99e-24
2 duration      0.0560    0.0305      1.83 6.92e- 2

Below is the estimated model in proper notation.

\(\widehat{top~speed} = 52 + 0.056 \times duration\)

Demo: Use your model to estimate the top speed of a roller coaster if their duration is 155 minutes

duration_155 <- tibble(duration = 155)

predict(speed_fit, new_data = duration_155)

# A tibble: 1 × 1
  .pred
  <dbl>
1  60.7

Interpret

Now, interpret the slope coefficient.

For a 1 unit minute increase in duration, we estimate on average a 0.0560 increase in mph.

What if our response was logged? How does the interpretation change?

\(\widehat{log(TopSpeed)} = 52 + 0.056 \times Duration\)

Recall that to interpret the slope value we need to exponentiate it if we want to be back on the original scale!

Why?

\(log(y) = \beta_o + \beta_1x\)

y = \(exp(\beta_o + \beta_1x)\))

y = \(exp(\beta_o) * exp(\beta_1x)\))

This implies that our explanatory variable has a multiplicative relationship with our response variable

For a one minute increase in duration, we estimate on average, top speed to increase by a factor of 0.05, or by about 5%.

Research question 3

We are also interested in investigating if roller coasters after 1970 are faster than those opened before. For this question, we want to estimate the difference.

Should we make a confidence interval or hypothesis test?
Demo: Now, use bootstrapping to estimate the difference between the true speeds of roller coasters before and after 1970.

set.seed(12345)

boot_df <- rc |>
  specify(top_speed ~ opened_recently) |>
  generate(reps = 1000, type = "bootstrap") |>
  calculate(stat = "diff in means", order = c("yes", "no"))

Warning: Removed 11 rows containing missing values.

Your turn: Create a 99% confidence interval and interpret it in the context of the data and the research question.

boot_df |>
  summarize(
    lower = quantile(stat, 0.005),
    upper = quantile(stat, 0.995)
  )

# A tibble: 1 × 2
  lower upper
  <dbl> <dbl>
1  8.88  23.3

ORDER MATTERS (yes - no)

We are 99% confident that the true mean speed for roller coasters opened after 1970 (\(\mu_1\)) is (9.15, 23) mph LARGER than the true mean speed for roller coasters opened before 1970 (\(\mu_2\))

Discussion

Brainstorm the different between bootstrap and permute. Write down the key differences below. Hint: Think back to the different sampling techniques between question 1 and question 2 / 3.

The main difference between permutation and bootstrapping is that bootstrapping resamples, with replacement from our data. Permutation can be thought of as shuffling / distributing our data / resampling without replacement.

Interpretation - Return of the Penguins

We will fit a SRL, MLR, and logistic regression model and compare how we interpret the slope coefficient.

library(palmerpenguins)


Attaching package: 'palmerpenguins'

The following object is masked from 'package:modeldata':

    penguins

glimpse(penguins)

Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

Simple Linear Regression

modelSLR <- linear_reg() |>
  set_engine("lm") |>
  fit(flipper_length_mm ~ bill_length_mm, data = penguins)

tidy(modelSLR)

# A tibble: 2 × 5
  term           estimate std.error statistic  p.value
  <chr>             <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)      127.       4.67       27.2 3.65e-87
2 bill_length_mm     1.69     0.105      16.0 1.74e-43

Interpret the slope coefficient assocciated with bill_length_mm in the context of the problem

For a 1 mm increase in bill length, we estimate on average, flipper length to increase by 1.69 mm.

Add response

Multiple Linear Regression

modelMLR <- linear_reg() |>
  set_engine("lm") |>
  fit(flipper_length_mm ~ bill_length_mm + island, data = penguins)

tidy(modelMLR)

# A tibble: 4 × 5
  term            estimate std.error statistic   p.value
  <chr>              <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)       141.      4.03       35.1  9.60e-115
2 bill_length_mm      1.51    0.0879     17.2  4.74e- 48
3 islandDream       -15.0     0.971     -15.4  4.51e- 41
4 islandTorgersen    -8.98    1.42       -6.34 7.35e- 10

Interpret the slope coefficient associated with bill_length_mm in the context of the problem

For a 1 mm increase in bill length, we estimate on average, flipper length to increase by 1.51 mm, holding island constant.

Logistic Regression

modellog <- logistic_reg() |>
  set_engine("glm") |>
  fit(sex ~ bill_length_mm, data = penguins)

tidy(modellog)

# A tibble: 2 × 5
  term           estimate std.error statistic       p.value
  <chr>             <dbl>     <dbl>     <dbl>         <dbl>
1 (Intercept)      -6.04     1.01       -5.96 0.00000000247
2 bill_length_mm    0.138    0.0229      6.02 0.00000000176

Interpret the slope coefficient associated with bill_length_mm in the context of the problem and on the original scale

For a 1 unit increase in bill length, we expect the odds of being female to increase by a factor of .14, or by about 14%.

Footnotes

Roller Coaster DataBase, UltimateRollerCoaster.com, ESTEEM: Enhancing Statistics Teachers Education with E-Modules.↩︎