Inference overview + Review

rc <- read_csv("data/roller_coasters.csv")

In this application exercise, we will be looking at a roller coaster and amusement park database by Duane Marden.1 This database records multiple features of roller coasters. For the purpose of this activity, we will work with a random sample of 157 roller coasters.

Variable Description
age_group

1: Older (Built between 1900-1979)

2: Recent (1980-1999)

3: Newest (2000-current)

coaster Name of the roller coaster
park Name of the park where the roller coaster is located
city City where the roller coaster is located
state State where the roller coaster is located
type Material of track (Steel or Wooden)
year_opened Year when roller coaster opened
top_speed Maximum speed of roller coaster (mph)
max_height Highest point of roller coaster (ft)
drop Length of largest gap between high and low points of roller coaster (ft)
length Length of roller coaster track (ft)
duration Time length of roller coaster ride (sec)
inversions Whether or not roller coaster flips passengers at any point (Yes or No)
num_of_inversions Number of times roller coaster flips passengers

Practice with data manipulation

  • Demo: Create a new variable called opened_recently that has a response of yes if the roller coaster was opened after 1970, and no if the roller coaster was built during or before 1970. Save this new factor variable in the rc data set. Additionally, make type a factor variable.
rc <- rc |>
  mutate(
    opened_recently = if_else(year_opened > 1970, "yes", "no"), 
    opened_recently = as.factor(opened_recently),
    type = as.factor(type)
  ) 

We want to investigate the relationship between how fast a roller coaster goes, and how long a roller coaster lasts. Specifically, we are interested in how well the duration of a roller coaster explains how fast it is.

  • Your turn: Based on this research question, which two variables should we use from our data set? Which is our response?

top_speed; duration

top_speed is our response

  • Your turn: Describe how we can construct a bootstrap distribution for the slope of the model predicting top_speed from duration.

  • Demo: Now, construct this bootstrap distribution.

set.seed(12345)

boot_df_slope <- rc |>
  specify(top_speed ~ duration) |>
  generate(reps = 1000, type = "bootstrap") |>
  calculate(stat = "slope")
Warning: Removed 34 rows containing missing values.
  • Demo: Create a 90% confidence interval by filling in the code below.
boot_df_slope |>
  summarize(
    lower = quantile(stat, 0.05),
    upper = quantile(stat, 0.95)
  )
# A tibble: 1 × 2
    lower upper
    <dbl> <dbl>
1 -0.0195 0.132
  • Demo: Interpret the confidence interval in the context of the problem.

We are 90% confident that for a 1 min increase in duration, the true/population mean speed (y) is expected to be 0.0195 lower to 0.132 mph higher.

  • Demo: Prediction: How is this different than what we’ve done before?

Now let’s practice predicting!

  • Demo: Report the tidy output of your model below
speed_fit <- linear_reg() |>
  set_engine("lm") |>
  fit(top_speed ~ duration, data = rc) 

tidy(speed_fit)
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)  52.0       4.08       12.8  3.99e-24
2 duration      0.0560    0.0305      1.83 6.92e- 2
  • Below is the estimated model in proper notation.

\(\widehat{top~speed} = 52 + 0.056 \times duration\)

  • Demo: Use your model to estimate the top speed of a roller coaster if their duration is 155 minutes
duration_155 <- tibble(duration = 155)

predict(speed_fit, new_data = duration_155)
# A tibble: 1 × 1
  .pred
  <dbl>
1  60.7

Interpret

  • Now, interpret the slope coefficient.

For a 1 unit minute increase in duration, we estimate on average a 0.0560 increase in mph.

  • What if our response was logged? How does the interpretation change?

\(\widehat{log(TopSpeed)} = 52 + 0.056 \times Duration\)

Recall that to interpret the slope value we need to exponentiate it if we want to be back on the original scale!

Why?

\(log(y) = \beta_o + \beta_1x\)

y = \(exp(\beta_o + \beta_1x)\))

y = \(exp(\beta_o) * exp(\beta_1x)\))

This implies that our explanatory variable has a multiplicative relationship with our response variable

For a one minute increase in duration, we estimate on average, top speed to increase by a factor of 0.05, or by about 5%.

Research question 3

We are also interested in investigating if roller coasters after 1970 are faster than those opened before. For this question, we want to estimate the difference.

  • Should we make a confidence interval or hypothesis test?

  • Demo: Now, use bootstrapping to estimate the difference between the true speeds of roller coasters before and after 1970.

set.seed(12345)

boot_df <- rc |>
  specify(top_speed ~ opened_recently) |>
  generate(reps = 1000, type = "bootstrap") |>
  calculate(stat = "diff in means", order = c("yes", "no"))
Warning: Removed 11 rows containing missing values.
  • Your turn: Create a 99% confidence interval and interpret it in the context of the data and the research question.
boot_df |>
  summarize(
    lower = quantile(stat, 0.005),
    upper = quantile(stat, 0.995)
  )
# A tibble: 1 × 2
  lower upper
  <dbl> <dbl>
1  8.88  23.3

ORDER MATTERS (yes - no)

We are 99% confident that the true mean speed for roller coasters opened after 1970 (\(\mu_1\)) is (9.15, 23) mph LARGER than the true mean speed for roller coasters opened before 1970 (\(\mu_2\))

Discussion

  • Brainstorm the different between bootstrap and permute. Write down the key differences below. Hint: Think back to the different sampling techniques between question 1 and question 2 / 3.

The main difference between permutation and bootstrapping is that bootstrapping resamples, with replacement from our data. Permutation can be thought of as shuffling / distributing our data / resampling without replacement.

Interpretation - Return of the Penguins

We will fit a SRL, MLR, and logistic regression model and compare how we interpret the slope coefficient.


Attaching package: 'palmerpenguins'
The following object is masked from 'package:modeldata':

    penguins
glimpse(penguins)
Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

Simple Linear Regression

modelSLR <- linear_reg() |>
  set_engine("lm") |>
  fit(flipper_length_mm ~ bill_length_mm, data = penguins)

tidy(modelSLR)
# A tibble: 2 × 5
  term           estimate std.error statistic  p.value
  <chr>             <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)      127.       4.67       27.2 3.65e-87
2 bill_length_mm     1.69     0.105      16.0 1.74e-43
  • Interpret the slope coefficient assocciated with bill_length_mm in the context of the problem

For a 1 mm increase in bill length, we estimate on average, flipper length to increase by 1.69 mm.

Add response

Multiple Linear Regression

modelMLR <- linear_reg() |>
  set_engine("lm") |>
  fit(flipper_length_mm ~ bill_length_mm + island, data = penguins)

tidy(modelMLR)
# A tibble: 4 × 5
  term            estimate std.error statistic   p.value
  <chr>              <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)       141.      4.03       35.1  9.60e-115
2 bill_length_mm      1.51    0.0879     17.2  4.74e- 48
3 islandDream       -15.0     0.971     -15.4  4.51e- 41
4 islandTorgersen    -8.98    1.42       -6.34 7.35e- 10
  • Interpret the slope coefficient associated with bill_length_mm in the context of the problem

For a 1 mm increase in bill length, we estimate on average, flipper length to increase by 1.51 mm, holding island constant.

Logistic Regression

modellog <- logistic_reg() |>
  set_engine("glm") |>
  fit(sex ~ bill_length_mm, data = penguins)

tidy(modellog)
# A tibble: 2 × 5
  term           estimate std.error statistic       p.value
  <chr>             <dbl>     <dbl>     <dbl>         <dbl>
1 (Intercept)      -6.04     1.01       -5.96 0.00000000247
2 bill_length_mm    0.138    0.0229      6.02 0.00000000176
  • Interpret the slope coefficient associated with bill_length_mm in the context of the problem and on the original scale

For a 1 unit increase in bill length, we expect the odds of being female to increase by a factor of .14, or by about 14%.