AE 07: Hotel bookings

Application exercise
Important

Go to the course GitHub organization and locate the repo titled ae-07-YOUR_GITHUB_USERNAME to get started.

This AE is due Saturday, Sep 25 at 11:59pm.

Packages

We will use the following three packages in this application exercise.

  • tidyverse: For data import, wrangling, and visualization.
  • skimr: For summarizing the entire data frame at once.
  • scales: For better axis labels.

To be productive in R, you need to be familiar with the major types and the operations on these types. Each R object has an underlying “type”, which determines the set of possible values for that object. You can find the type of an object using the typeof() function.

logical: a logical value.

integer: an integer (positive or negative). Many R programmers do not use this mode since every integer value can be represented as a double.

double: a real number stored in “double-precision floating point format.”

complex: a complex number

character: a sequence of characters, called a “string” in other programming languages

list: an ordered collection of values, optionally named, whose elements may have different types

NULL: a special type with only one possible value, known as NULL
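
As a quick check of the list above, the sketch below calls typeof() on one example value per type. The example values are illustrative choices and are not part of the exercise data.

```r
# One illustrative value for each type listed above
examples <- list(TRUE, 1L, 1, 1 + 2i, "a", list(a = 1), NULL)

# typeof() reports the underlying type of each value
types <- vapply(examples, typeof, character(1))
types
# "logical" "integer" "double" "complex" "character" "list" "NULL"

# An integer and a double can compare equal yet still differ in type
1L == 1           # TRUE
identical(1L, 1)  # FALSE
```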

More information can be found here: https://statsandr.com/blog/data-types-in-r/

Why This Matters

We are going to revisit the mtcars data set. Run ?mtcars to see the definition of each variable.

library(tidyverse)
library(scales)
library(skimr)

mtcars07 <- read_csv("data/mtcars07.csv", col_types = NULL)


glimpse(mtcars)
Rows: 32
Columns: 11
$ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,…
$ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,…
$ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16…
$ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180…
$ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,…
$ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.…
$ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18…
$ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,…
$ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,…
$ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,…
$ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,…

3-min: Run the code below to create side-by-side boxplots of fuel efficiency (mpg) by engine type (vs).

mtcars |>
  ggplot(
    aes(x = vs, y = mpg)
  ) +
  geom_boxplot()
Warning: Continuous x aesthetic -- did you forget aes(group=...)?

Why doesn’t this work?

Answer Here

Edit the code below to fix the issue.

mtcars |>
  mutate(vs = as.factor(vs)) |>
  ggplot(
    aes(x = vs, y = mpg)
  ) +
  geom_boxplot()
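
To see what as.factor() changes, it can help to inspect the variable before and after the conversion. The sketch below uses the built-in mtcars data; the name vs_factor is introduced here only for illustration.

```r
# vs is stored as a number, so ggplot2 treats it as a continuous x aesthetic
class(mtcars$vs)    # "numeric"
typeof(mtcars$vs)   # "double"

# as.factor() turns the distinct values into discrete levels,
# which gives geom_boxplot() one group per level
vs_factor <- as.factor(mtcars$vs)
class(vs_factor)    # "factor"
levels(vs_factor)   # "0" "1"
```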

Now, calculate the mean number of carburetors for the 32 cars in the data set.

Why doesn’t this work? Fix the code so you can answer the question.

Type coercion

  • Demo: Determine the type of the following vector, and then change the type to numeric.

    x <- c("1", "2", "3")
    typeof(x)
    [1] "character"
    as.numeric(x)
    [1] 1 2 3
  • Demo: Once again, determine the type of the following vector, and then change the type to numeric. What’s different from the previous exercise?

    y <- c("a", "b", "c")
    typeof(y)
    [1] "character"
    as.numeric(y)
    Warning: NAs introduced by coercion
    [1] NA NA NA
  • Demo: Once again, determine the type of the following vector, and then change the type to numeric. What’s different from the previous exercise?

    z <- c("1", "2", "three")
    typeof(z)
    [1] "character"
    as.numeric(z)
    Warning: NAs introduced by coercion
    [1]  1  2 NA
  • Demo: Suppose you conducted a survey where you asked people how many cars their household owns collectively. And the answers are as follows:

    survey_results <- tibble(cars = c(1, 2, "three"))
    survey_results
    # A tibble: 3 × 1
      cars 
      <chr>
    1 1    
    2 2    
    3 three

    This is annoying because of that third survey taker who just had to go and type out the number instead of providing it as a numeric value. So now you need to update the cars variable to be numeric. You do the following:

    survey_results |>
      mutate(cars = as.numeric(cars))
    Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
    # A tibble: 3 × 1
       cars
      <dbl>
    1     1
    2     2
    3    NA

    And now things are even more annoying: you get the warning “NAs introduced by coercion”, raised while computing cars = as.numeric(cars), and the response from the third survey taker is now an NA (you lost their data). Fix your mutate() call to avoid this warning.

    survey_results |>
      mutate(
        cars = if_else(cars == "three", "3", cars),
        cars = as.numeric(cars)
        )
    # A tibble: 3 × 1
       cars
      <dbl>
    1     1
    2     2
    3     3
  • Your turn: First, guess the type of the vector. Then, check if you guessed right. I’ve done the first one for you; you’ll see that it helps to check the type of each element of the vector first.

    • c(1, 1L, "C")
        v1 <- c(1, 1L, "C")

        # to help you guess
        typeof(1)
[1] "double"
        typeof(1L)
[1] "integer"
        typeof("C")
[1] "character"
        # to check after you guess
        typeof(v1)
[1] "character"
    • c(1L / 0, "A")
        v2 <- c(1L / 0, "A")

        # to help you guess
        typeof(1L)
[1] "integer"
        typeof(0)
[1] "double"
        typeof(1L / 0)
[1] "double"
        typeof("A")
[1] "character"
        # to check after you guess
        typeof(v2)
[1] "character"
    • c(1:3, 5)
        v3 <- c(1:3, 5)

        # to help you guess
        typeof(1:3)
[1] "integer"
        typeof(5)
[1] "double"
        # to check after you guess
        typeof(v3)
[1] "double"
    • c(3, "3+")
        v4 <- c(3, "3+")

        # to help you guess
        typeof(3)
[1] "double"
        typeof("3+")
[1] "character"
        # to check after you guess
        typeof(v4)
[1] "character"
    • c(NA, TRUE)
        v5 <- c(NA, TRUE)

        # to help you guess
        typeof(NA)
[1] "logical"
        typeof(TRUE)
[1] "logical"
        # to check after you guess
        typeof(v5)
[1] "logical"
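
The results above follow a consistent pattern: when c() combines elements of different types, it promotes them all to the most flexible type present. The sketch below illustrates that ordering, and then shows one possible way to find which entries would turn into NA before converting a character vector. The vector answers and the helper names converted and needs_fixing are hypothetical, chosen to mirror the survey demo.

```r
# c() promotes mixed elements to the most flexible type present:
# logical -> integer -> double -> character
typeof(c(TRUE, 1L))   # "integer"
typeof(c(1L, 1.5))    # "double"
typeof(c(1.5, "a"))   # "character"

# Hypothetical survey answers, mirroring the earlier demo
answers <- c("1", "2", "three")

# Attempt the conversion quietly, then flag entries that were not
# missing before the conversion but are NA afterwards
converted <- suppressWarnings(as.numeric(answers))
needs_fixing <- answers[is.na(converted) & !is.na(answers)]
needs_fixing          # "three"
```

Checking needs_fixing first tells you exactly which values to recode (for example with if_else() or case_when()) before calling as.numeric().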

Hotel bookings

hotels <- read_csv("data/hotels.csv", col_types = NULL)
Warning: One or more parsing issues, see `problems()` for details

After reading in the data set, you should see a Warning message. What does that message say? Explain the output of problems() in your own words.
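
To get a feel for what problems() reports, here is a minimal sketch on a tiny inline data set. The data values are hypothetical, and the column types are set explicitly (unlike col_types = NULL in the exercise) so that the parsing issue is reproducible. It assumes readr 2.0 or later, where I() marks a string as literal data.

```r
library(readr)

# Hypothetical inline data: the last value of adr is not a number
csv_text <- "hotel,adr\nCity Hotel,80.5\nResort Hotel,one hundred\n"

# I() tells read_csv() to treat the string as literal data, not a file path
bookings <- read_csv(
  I(csv_text),
  col_types = cols(hotel = col_character(), adr = col_double())
)

# One row per value that could not be parsed, including what type
# was expected and what was actually found
problems(bookings)

# The unparseable value is stored as NA in the data frame
bookings$adr
```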

Take a look at the following visualization. How are the months ordered? What would be a better order? Then, reorder the months on the x-axis (levels of arrival_date_month) in a way that makes more sense. You will want to use a function from the forcats package, see https://forcats.tidyverse.org/reference/index.html for inspiration and help.

Hints:

– use fct_relevel to order months

– use case_when to fix the input error

– calculate mean adr for each group for plot

– use theme_minimal
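
To see why the default order is unhelpful, note that a character vector converted to a factor gets its levels in alphabetical order. The sketch below uses base R only, with a hypothetical vector of month names; month.name is a built-in constant holding the twelve English month names in calendar order. It is an alternative illustration, not the fct_relevel() approach the hints suggest.

```r
# Hypothetical month values, in no particular order
months <- c("July", "August", "January", "December", "April")

# Default factor levels are alphabetical
levels(factor(months))
# "April" "August" "December" "January" "July"

# Supplying the levels explicitly puts them in calendar order
ordered_months <- factor(months, levels = month.name)
sort(ordered_months)
# January, April, July, August, December
```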

hotels |>
  mutate(
    arrival_date_month = fct_relevel(
      arrival_date_month,
      "January", "February", "March", "April", "May", "June",
      "July", "August", "September", "October", "November", "December"
    )
  ) |>
  mutate(adr = case_when(
    is.na(adr) ~ 124,
    TRUE ~ as.numeric(adr)
  )) |>
  group_by(hotel, arrival_date_month) |>
  summarise(mean_adr = mean(adr)) |>       # calculate mean adr for each group
  ggplot(aes(
    x = arrival_date_month,                 # x-axis = arrival_date_month
    y = mean_adr,                           # y-axis = mean_adr calculated above
    group = hotel,                          # group lines by hotel type
    color = hotel)                          # and color by hotel type
  ) +
  geom_line() +                             # use lines to represent data
  theme_minimal() +                         # use a minimal theme
  labs(
    x = "Arrival month",                 # customize labels
    y = "Mean ADR (average daily rate)",
    title = "Comparison of resort and city hotel prices across months",
    subtitle = "Resort hotel prices soar in the summer while city hotel prices remain relatively constant throughout the year",
    color = "Hotel type"
  ) +
  scale_y_continuous(labels = label_dollar())
`summarise()` has grouped output by 'hotel'. You can override using the
`.groups` argument.

Stretch goal: If you finish the above task before time is up, change the above code so that the y-axis labels are shown with dollar signs, e.g. $80 instead of 80. You will want to use a function from the scales package, see https://scales.r-lib.org/reference/index.html for inspiration and help.
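
As a small illustration of how the scales labellers work: label_dollar() does not format numbers itself but returns a function that does, which is why it can be passed to the labels argument of scale_y_continuous(). The sketch below calls that returned function directly on a few hypothetical whole-dollar values; the name dollar_labeller is introduced here only for illustration.

```r
library(scales)

# label_dollar() returns a labelling function
dollar_labeller <- label_dollar()

# Applying it to numbers gives the strings used as axis labels
dollar_labeller(c(80, 120))
# "$80" "$120"
```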