Regression with a Single Predictor

Before getting started, make sure that these two packages are installed:

palmerpenguins and tidymodels
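
A minimal setup sketch, assuming both packages are installed from CRAN. Loading tidymodels also attaches dplyr and ggplot2, which supply the glimpse() and ggplot() calls used later on.

```r
# One-time installation (uncomment if the packages are missing)
# install.packages(c("palmerpenguins", "tidymodels"))

library(palmerpenguins)  # provides the penguins data set
library(tidymodels)      # attaches parsnip, broom, dplyr, ggplot2, and others

# Quick look at the columns and their types
glimpse(penguins)
```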

Today, we will be studying penguins. Please read the following context and take a glimpse at the data set before we get started.

This data set comprises various measurements of three different penguin species, namely Adelie, Gentoo, and Chinstrap. The rigorous study was conducted on the islands of the Palmer Archipelago, Antarctica. These data were collected from 2007 to 2009 by Dr. Kristen Gorman with the Palmer Station Long Term Ecological Research Program, part of the US Long Term Ecological Research Network. The data set is called penguins.

glimpse(penguins)
Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

We want to understand more about a penguin’s body mass. First, we are going to investigate the relationship between a penguin’s flipper length and their body mass.

body mass

penguins |> 
  ggplot(aes(x = flipper_length_mm, y = body_mass_g)) + 
  geom_point() + 
  # fullrange = TRUE extends the fitted line across the whole x-axis, down to 0
  geom_smooth(method = "lm", fullrange = TRUE) + 
  xlim(0, 240)
`geom_smooth()` using formula 'y ~ x'
Warning: Removed 2 rows containing non-finite values (stat_smooth).
Warning: Removed 2 rows containing missing values (geom_point).

Model these Data

  • Write the population model below that explains the relationship between body mass and flipper length. Hint: You can type equations within dollar sign code chunks seen below.

\hat{ } puts a hat on whatever is inside the braces, and \beta typesets the Greek letter beta. For example:

$$
\hat{\beta}
$$

  • Now, fit the linear regression model and display the results. Write the estimated model output below.
test <- linear_reg() |>
  set_engine("lm") |>
  fit(body_mass_g ~ flipper_length_mm, data = penguins) |>
  tidy()

# Print the stored coefficient table so the results are displayed
test

$$
\widehat{BodyMass} = -5781 + 49.7 \times flipper \\
BodyMass = \beta_0 + \beta_1 \times flipper + \epsilon_i
$$

  • Interpret the slope and the intercept in the context of the data.

Slope: “A 1 unit increase in x” “on average”

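For a 1 mm increase in flipper length, we expect body mass to increase by an estimated 49.7 grams, on average.

Intercept: When flipper length is 0 mm, we estimate a body mass of -5781 grams, on average. A negative mass is impossible, and a flipper length of 0 mm lies far outside the observed data, so the intercept has no meaningful interpretation here.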

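  • What is the estimated body mass for a penguin with a flipper length of 210?

Estimated body mass = -5781 + 49.7*(210) = 4656 g

  • What is the estimated body mass for a penguin with a flipper length of 100?

Estimated body mass = -5781 + 49.7*(100) = -811 g

A negative mass is impossible. A flipper length of 100 mm is far below any observed value, so this estimate is an extrapolation and should not be trusted.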
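
The same estimates can be obtained from the fitted model rather than by hand. This is a sketch, assuming the model is refit and stored before tidy() is applied; mass_fit and new_flippers are hypothetical names introduced here. Because predict() uses the unrounded coefficients, its results will differ slightly from the hand calculation with -5781 and 49.7.

```r
library(palmerpenguins)
library(tidymodels)

# Refit the single-predictor model, keeping the fitted object this time
mass_fit <- linear_reg() |>
  set_engine("lm") |>
  fit(body_mass_g ~ flipper_length_mm, data = penguins)

# Flipper lengths (in mm) to predict for; new_flippers is a hypothetical name
new_flippers <- tibble(flipper_length_mm = c(210, 100))

# predict() returns a tibble with a .pred column, one row per new value
preds <- predict(mass_fit, new_data = new_flippers)

# Show the inputs next to their predicted body mass (in grams)
bind_cols(new_flippers, preds)
```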

Next Question

  • Now, we will investigate another question. A different researcher wants to look at body weight of penguins based on the island they were recorded on. What’s different between this question and the last? Hint: Think about the variable type.

Our explanatory variable is now categorical (island) rather than quantitative (flipper length).

  • Make an appropriate visualization to investigate this relationship below. Additionally, calculate the mean body mass by island below.
penguins |> 
  ggplot(
    aes(y = body_mass_g, x = island)
  ) + 
  geom_point()
Warning: Removed 2 rows containing missing values (geom_point).
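
The mean body mass by island can be computed with a grouped summary. This is one possible sketch; island_means and mean_body_mass_g are hypothetical names, and na.rm = TRUE drops the rows with missing body mass flagged in the warning above.

```r
library(palmerpenguins)
library(dplyr)

# Mean body mass (in grams) for each island, ignoring missing values
island_means <- penguins |>
  group_by(island) |>
  summarize(mean_body_mass_g = mean(body_mass_g, na.rm = TRUE))

island_means
```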

  • Change the geom of your previous plot to geom_point. Use this plot to think about how R models these data.

To minimize the residuals within each group, our model should go through the mean of each group! Not only does this minimize the residual sum of squares, but it is also, logically, our best guess for the body mass of a random penguin on that island.
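
A quick way to check this claim is to compare the model's fitted value for each island with the island's sample mean. The sketch below uses base R's lm() directly instead of the tidymodels pipeline (the "lm" engine calls the same function underneath); the object names are hypothetical.

```r
library(palmerpenguins)

# Fit body mass on island with ordinary least squares
island_fit <- lm(body_mass_g ~ island, data = penguins)

# Sample mean of body mass on each island, ignoring missing values
group_means <- tapply(penguins$body_mass_g, penguins$island, mean, na.rm = TRUE)

# Fitted value for one penguin from each island, in factor-level order
new_islands <- data.frame(island = levels(penguins$island))
fitted_by_island <- predict(island_fit, newdata = new_islands)

# Compare the two sets of numbers, up to floating-point tolerance
means_match <- isTRUE(all.equal(unname(fitted_by_island), as.numeric(group_means)))
means_match
```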

  • Now, fit the linear regression model and display the results. Write the estimated model output below.
linear_reg() |>
  set_engine("lm") |>
  fit(body_mass_g ~ island, data = penguins) |>
  tidy()
# A tibble: 3 × 5
  term            estimate std.error statistic   p.value
  <chr>              <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)        4716.      48.5      97.3 8.93e-250
2 islandDream       -1003.      74.2     -13.5 1.42e- 33
3 islandTorgersen   -1010.     100.      -10.1 4.66e- 21
  • Interpret each coefficient below in the context of the problem.

  • What is the estimated body weight of a penguin on Dream island?

The estimated body mass of a penguin located on Dream island is, on average, 4716 - 1003 = 3713 g.

  • What is the estimated body weight of a penguin on Biscoe island?

The estimated body mass of a penguin located on Biscoe island is, on average, 4716 g. Biscoe is the baseline level, so its estimate is the intercept.
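
The estimates above come from one fitted equation. As a sketch, using notation assumed here rather than given in the output, let Dream and Torgersen be indicator variables equal to 1 when the penguin was recorded on that island and 0 otherwise. Biscoe is the baseline, with both indicators equal to 0.

$$
\widehat{BodyMass} = 4716 - 1003 \times Dream - 1010 \times Torgersen \\
Biscoe: 4716 - 1003(0) - 1010(0) = 4716 \\
Dream: 4716 - 1003(1) - 1010(0) = 3713 \\
Torgersen: 4716 - 1003(0) - 1010(1) = 3706
$$

Each slope is therefore the estimated difference in mean body mass (in grams) between that island and Biscoe.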

Optional

Ask your own question and explore it!