AE 04: Wrangling flights

Tibble vs. data frame

A tibble is an opinionated version of the R data frame. In other words, all tibbles are data frames, but not all data frames are tibbles!

There are two main differences between a tibble and a data frame:

When you print a tibble, the first ten rows and all of the columns that fit on the screen will display, along with the type of each column.

Let’s look at the differences in the output when we type flights (tibble) in the console versus typing cars (data frame) in the console.

Second, tibbles are somewhat more strict than data frames when it comes to subsetting data. You will get an error message if you try to access a variable that doesn’t exist in a tibble. You will get NULL if you try to access a variable that doesn’t exist in a data frame.

flights$apple

Warning: Unknown or uninitialised column: `apple`.

NULL

cars$apple

NULL

Data wrangling with `dplyr`

dplyr is the primary package in the tidyverse for data wrangling. Click here for the dplyr reference page. Click here ) for the dplyr cheatsheet.

Quick summary of key dplyr functions¹:

Rows:

filter():chooses rows based on column values.
slice(): chooses rows based on location.
arrange(): changes the order of the rows
sample_n(): take a random subset of the rows

Columns:

select(): changes whether or not a column is included.
rename(): changes the name of columns.
mutate(): changes the values of columns and creates new columns.

Groups of rows:

summarise(): collapses a group into a single row.
count(): count unique values of one or more variables.
group_by(): perform calculations separately for each value of a variable

`select()`

Demo: Make a data frame that only contains the variables dep_delay and arr_delay.

flights |>
  select(dep_delay, arr_delay)

# A tibble: 336,776 × 2
   dep_delay arr_delay
       <dbl>     <dbl>
 1         2        11
 2         4        20
 3         2        33
 4        -1       -18
 5        -6       -25
 6        -4        12
 7        -5        19
 8        -3       -14
 9        -3        -8
10        -2         8
# … with 336,766 more rows

Demo: Make a data frame that keeps every variable except dep_delay.

flights |>
  select(-dep_delay)

# A tibble: 336,776 × 18
    year month   day dep_time sched_dep…¹ arr_t…² sched…³ arr_d…⁴ carrier flight
   <int> <int> <int>    <int>       <int>   <int>   <int>   <dbl> <chr>    <int>
 1  2013     1     1      517         515     830     819      11 UA        1545
 2  2013     1     1      533         529     850     830      20 UA        1714
 3  2013     1     1      542         540     923     850      33 AA        1141
 4  2013     1     1      544         545    1004    1022     -18 B6         725
 5  2013     1     1      554         600     812     837     -25 DL         461
 6  2013     1     1      554         558     740     728      12 UA        1696
 7  2013     1     1      555         600     913     854      19 B6         507
 8  2013     1     1      557         600     709     723     -14 EV        5708
 9  2013     1     1      557         600     838     846      -8 B6          79
10  2013     1     1      558         600     753     745       8 AA         301
# … with 336,766 more rows, 8 more variables: tailnum <chr>, origin <chr>,
#   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>, and abbreviated variable names ¹sched_dep_time,
#   ²arr_time, ³sched_arr_time, ⁴arr_delay

Demo: Make a data frame that includes all variables between year through dep_delay (inclusive). These are all variables that provide information about the departure of each flight.

flights |>
  select(year:dep_delay)

# A tibble: 336,776 × 6
    year month   day dep_time sched_dep_time dep_delay
   <int> <int> <int>    <int>          <int>     <dbl>
 1  2013     1     1      517            515         2
 2  2013     1     1      533            529         4
 3  2013     1     1      542            540         2
 4  2013     1     1      544            545        -1
 5  2013     1     1      554            600        -6
 6  2013     1     1      554            558        -4
 7  2013     1     1      555            600        -5
 8  2013     1     1      557            600        -3
 9  2013     1     1      557            600        -3
10  2013     1     1      558            600        -2
# … with 336,766 more rows

Demo: Use the select helper contains() to make a data frame that includes the variables associated with the arrival, i.e., contains the string "arr\_" in the name.

flights |>
  select(contains("arr_"))

# A tibble: 336,776 × 3
   arr_time sched_arr_time arr_delay
      <int>          <int>     <dbl>
 1      830            819        11
 2      850            830        20
 3      923            850        33
 4     1004           1022       -18
 5      812            837       -25
 6      740            728        12
 7      913            854        19
 8      709            723       -14
 9      838            846        -8
10      753            745         8
# … with 336,766 more rows

The pipe

Before working with more data wrangling functions, let’s formally introduce the pipe. The pipe, |>, is an operator (a tool) for passing information from one process to another. We will use |> mainly in data pipelines to pass the output of the previous line of code as the first input of the next line of code.

When reading code “in English”, say “and then” whenever you see a pipe.

Your turn (4 minutes): Run the following chunk and observe its output. Then, come up with a different way of obtaining the same output.

flights |>
  select(dep_delay, arr_delay) |>
  head()

# A tibble: 6 × 2
  dep_delay arr_delay
      <dbl>     <dbl>
1         2        11
2         4        20
3         2        33
4        -1       -18
5        -6       -25
6        -4        12

`slice()`

Demo: Display the first five rows of the flights data frame.

flights |>
  slice(1:5)

# A tibble: 5 × 19
   year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
  <int> <int> <int>    <int>       <int>   <dbl>   <int>   <int>   <dbl> <chr>  
1  2013     1     1      517         515       2     830     819      11 UA     
2  2013     1     1      533         529       4     850     830      20 UA     
3  2013     1     1      542         540       2     923     850      33 AA     
4  2013     1     1      544         545      -1    1004    1022     -18 B6     
5  2013     1     1      554         600      -6     812     837     -25 DL     
# … with 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
#   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>, and abbreviated variable names ¹sched_dep_time,
#   ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay

Demo: Display the last two rows of the flights data frame.

flights |>
  slice((n()-1):n())

# A tibble: 2 × 19
   year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
  <int> <int> <int>    <int>       <int>   <dbl>   <int>   <int>   <dbl> <chr>  
1  2013     9    30       NA        1159      NA      NA    1344      NA MQ     
2  2013     9    30       NA         840      NA      NA    1020      NA MQ     
# … with 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
#   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>, and abbreviated variable names ¹sched_dep_time,
#   ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay

`arrange()`

Demo: Let’s arrange the data by departure delay, so the flights with the shortest departure delays will be at the top of the data frame.

Question: What does it mean for the dep_delay to have a negative value?

flights |>
  arrange(dep_delay)

# A tibble: 336,776 × 19
    year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
   <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
 1  2013    12     7     2040       2123     -43      40    2352      48 B6     
 2  2013     2     3     2022       2055     -33    2240    2338     -58 DL     
 3  2013    11    10     1408       1440     -32    1549    1559     -10 EV     
 4  2013     1    11     1900       1930     -30    2233    2243     -10 DL     
 5  2013     1    29     1703       1730     -27    1947    1957     -10 F9     
 6  2013     8     9      729        755     -26    1002     955       7 MQ     
 7  2013    10    23     1907       1932     -25    2143    2143       0 EV     
 8  2013     3    30     2030       2055     -25    2213    2250     -37 MQ     
 9  2013     3     2     1431       1455     -24    1601    1631     -30 9E     
10  2013     5     5      934        958     -24    1225    1309     -44 B6     
# … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#   ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay

Demo: Now let’s arrange the data by descending departure delay, so the flights with the longest departure delays will be at the top.

flights |>
  arrange(desc(dep_delay))

# A tibble: 336,776 × 19
    year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
   <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
 1  2013     1     9      641        900    1301    1242    1530    1272 HA     
 2  2013     6    15     1432       1935    1137    1607    2120    1127 MQ     
 3  2013     1    10     1121       1635    1126    1239    1810    1109 MQ     
 4  2013     9    20     1139       1845    1014    1457    2210    1007 AA     
 5  2013     7    22      845       1600    1005    1044    1815     989 MQ     
 6  2013     4    10     1100       1900     960    1342    2211     931 DL     
 7  2013     3    17     2321        810     911     135    1020     915 DL     
 8  2013     6    27      959       1900     899    1236    2226     850 DL     
 9  2013     7    22     2257        759     898     121    1026     895 DL     
10  2013    12     5      756       1700     896    1058    2020     878 AA     
# … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#   ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay

Your turn (5 minutes): Create a data frame that only includes the plane tail number (tailnum), carrier (carrier), and departure delay for the flight with the longest departure delay. What is the plane tail number (tailnum) for this flight?

flights |>
  select(tailnum, carrier, dep_delay) %>%
  arrange(dep_delay) |>
  slice(1)

# A tibble: 1 × 3
  tailnum carrier dep_delay
  <chr>   <chr>       <dbl>
1 N592JB  B6            -43

`filter()`

Demo: Filter the data frame by selecting the rows where the destination airport is RDU.

flights |>
  filter(dest == "RDU")

# A tibble: 8,163 × 19
    year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
   <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
 1  2013     1     1      800        810     -10     949     955      -6 MQ     
 2  2013     1     1      832        840      -8    1006    1030     -24 MQ     
 3  2013     1     1      851        851       0    1032    1036      -4 EV     
 4  2013     1     1      917        920      -3    1052    1108     -16 B6     
 5  2013     1     1     1024       1030      -6    1204    1215     -11 MQ     
 6  2013     1     1     1127       1129      -2    1303    1309      -6 EV     
 7  2013     1     1     1157       1205      -8    1342    1345      -3 MQ     
 8  2013     1     1     1240       1235       5    1415    1415       0 MQ     
 9  2013     1     1     1317       1325      -8    1454    1505     -11 MQ     
10  2013     1     1     1449       1450      -1    1651    1640      11 MQ     
# … with 8,153 more rows, 9 more variables: flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#   ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay

Demo: We can also filter using more than one condition. Here we select all rows where the destination airport is RDU and the arrival delay is less than 0.

flights |>
  filter(dest == "RDU", arr_delay < 0)

# A tibble: 4,232 × 19
    year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
   <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
 1  2013     1     1      800        810     -10     949     955      -6 MQ     
 2  2013     1     1      832        840      -8    1006    1030     -24 MQ     
 3  2013     1     1      851        851       0    1032    1036      -4 EV     
 4  2013     1     1      917        920      -3    1052    1108     -16 B6     
 5  2013     1     1     1024       1030      -6    1204    1215     -11 MQ     
 6  2013     1     1     1127       1129      -2    1303    1309      -6 EV     
 7  2013     1     1     1157       1205      -8    1342    1345      -3 MQ     
 8  2013     1     1     1317       1325      -8    1454    1505     -11 MQ     
 9  2013     1     1     1505       1510      -5    1654    1655      -1 MQ     
10  2013     1     1     1800       1800       0    1945    1951      -6 B6     
# … with 4,222 more rows, 9 more variables: flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#   ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay

We can do more complex tasks using logical operators:

operator	definition
`<`	is less than?
`<=`	is less than or equal to?
`>`	is greater than?
`>=`	is greater than or equal to?
`==`	is exactly equal to?
`!=`	is not equal to?
`x & y`	is x AND y?
`x \\| y`	is x OR y?
`is.na(x)`	is x NA?
`!is.na(x)`	is x not NA?
`x %in% y`	is x in y?
`!(x %in% y)`	is x not in y?
`!x`	is not x?

The final operator only makes sense if x is logical (TRUE / FALSE).

Your turn (4 minutes): Describe what the code is doing in words.

flights |>
  filter(
    dest %in% c("RDU", "GSO"),
    arr_delay < 0 | dep_delay < 0
    )

# A tibble: 6,203 × 19
    year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
   <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
 1  2013     1     1      800        810     -10     949     955      -6 MQ     
 2  2013     1     1      832        840      -8    1006    1030     -24 MQ     
 3  2013     1     1      851        851       0    1032    1036      -4 EV     
 4  2013     1     1      917        920      -3    1052    1108     -16 B6     
 5  2013     1     1     1024       1030      -6    1204    1215     -11 MQ     
 6  2013     1     1     1127       1129      -2    1303    1309      -6 EV     
 7  2013     1     1     1157       1205      -8    1342    1345      -3 MQ     
 8  2013     1     1     1317       1325      -8    1454    1505     -11 MQ     
 9  2013     1     1     1449       1450      -1    1651    1640      11 MQ     
10  2013     1     1     1505       1510      -5    1654    1655      -1 MQ     
# … with 6,193 more rows, 9 more variables: flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#   ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay

`count()`

Demo: Create a frequency table of the destination locations for flights from New York.

flights |>
  count(dest)

# A tibble: 105 × 2
   dest      n
   <chr> <int>
 1 ABQ     254
 2 ACK     265
 3 ALB     439
 4 ANC       8
 5 ATL   17215
 6 AUS    2439
 7 AVL     275
 8 BDL     443
 9 BGR     375
10 BHM     297
# … with 95 more rows

Demo: In which month was there the fewest number of flights? How many flights were there in that month?

flights |>
  count(month) |>
  filter(n == min(n))

# A tibble: 1 × 2
  month     n
  <int> <int>
1     2 24951

Your turn (5 minutes): On which date (month + day) was there the largest number of flights? How many flights were there on that day?

flights |>
  count(month, day) |>
  filter(n == max(n))

# A tibble: 1 × 3
  month   day     n
  <int> <int> <int>
1    11    27  1014

`mutate()`

Use mutate() to create a new variable.

Demo: In the code chunk below, air_time (minutes in the air) is converted to hours, and then new variable mph is created, corresponding to the miles per hour of the flight.

flights |>
  mutate(
    hours = air_time / 60,
    mph = distance / hours
    ) |>
  select(air_time, distance, hours, mph)

# A tibble: 336,776 × 4
   air_time distance hours   mph
      <dbl>    <dbl> <dbl> <dbl>
 1      227     1400 3.78   370.
 2      227     1416 3.78   374.
 3      160     1089 2.67   408.
 4      183     1576 3.05   517.
 5      116      762 1.93   394.
 6      150      719 2.5    288.
 7      158     1065 2.63   404.
 8       53      229 0.883  259.
 9      140      944 2.33   405.
10      138      733 2.3    319.
# … with 336,766 more rows

Your turn (4 minutes): Create a new variable to calculate the percentage of flights in each month. What percentage of flights take place in July?

flights |>
  count(month) |>
  mutate(perc = n / sum(n) * 100)

# A tibble: 12 × 3
   month     n  perc
   <int> <int> <dbl>
 1     1 27004  8.02
 2     2 24951  7.41
 3     3 28834  8.56
 4     4 28330  8.41
 5     5 28796  8.55
 6     6 28243  8.39
 7     7 29425  8.74
 8     8 29327  8.71
 9     9 27574  8.19
10    10 28889  8.58
11    11 27268  8.10
12    12 28135  8.35

`summarize()`

summarize() collapses the rows into summary statistics and removes columns irrelevant to the calculation.

Be sure to name your columns!

flights |>
  summarize(mean_dep_delay = mean(dep_delay))

# A tibble: 1 × 1
  mean_dep_delay
           <dbl>
1             NA

Question: Why did this code return NA?

Let’s fix it!

flights |>
  summarize(mean_dep_delay = mean(dep_delay, na.rm = TRUE))

# A tibble: 1 × 1
  mean_dep_delay
           <dbl>
1           12.6

`group_by()`

group_by() is used for grouped operations. It’s very powerful when paired with summarise() to calculate summary statistics by group.

Here we find the mean and standard deviation of departure delay for each month.

flights |>
  group_by(month) |>
  summarize(
    mean_dep_delay = mean(dep_delay, na.rm = TRUE), 
    sd_dep_delay = sd(dep_delay, na.rm = TRUE)
    )

# A tibble: 12 × 3
   month mean_dep_delay sd_dep_delay
   <int>          <dbl>        <dbl>
 1     1          10.0          36.4
 2     2          10.8          36.3
 3     3          13.2          40.1
 4     4          13.9          43.0
 5     5          13.0          39.4
 6     6          20.8          51.5
 7     7          21.7          51.6
 8     8          12.6          37.7
 9     9           6.72         35.6
10    10           6.24         29.7
11    11           5.44         27.6
12    12          16.6          41.9

Your turn (4 minutes): What is the median departure delay for each airports around NYC (origin)?

flights |>
  group_by(origin) |>
  summarize(
    med_dep_delay = median(dep_delay, na.rm = TRUE)
    )

# A tibble: 3 × 2
  origin med_dep_delay
  <chr>          <dbl>
1 EWR               -1
2 JFK               -1
3 LGA               -3

Additional Practice | Optional

Create a new dataset that only contains flights that do not have a missing departure time. Include the columns year, month, day, dep_time, dep_delay, and dep_delay_hours (the departure delay in hours). Hint: Note you may need to use mutate() to make one or more of these variables.

new.data <- flights |> 
  filter(!is.na(dep_time)) |>
  select(year, month, day, dep_time, dep_delay) |> 
  mutate(dep_delay_hours = dep_delay / 60) #assuming dep_delay is in minutes

For each airplane (uniquely identified by tailnum), use a group_by() paired with summarize() to find the sample size, mean, and standard deviation of flight distances. Then include only the top 5 and bottom 5 airplanes in terms of mean distance traveled per flight in the final data frame.

flights |>
  group_by(tailnum) |>
  summarize(n = n(), 
            mean.flight = mean(distance),
            sd.flight = sd(distance)) |>
  slice(1:5, (n() - 4):n())

# A tibble: 10 × 4
   tailnum     n mean.flight sd.flight
   <chr>   <int>       <dbl>     <dbl>
 1 D942DN      4        854.    107.  
 2 N0EGMQ    371        676.    200.  
 3 N10156    153        758.    332.  
 4 N102UW     48        536.      6.75
 5 N103US     46        535.      6.62
 6 N998AT     26        594.    186.  
 7 N998DL     77        858.    234.  
 8 N999DN     61        895.    243.  
 9 N9EAMQ    248        675.    184.  
10 <NA>     2512        710.    634.

Footnotes

From dplyr vignette ↩︎

Tibble vs. data frame

Data wrangling with dplyr

select()

The pipe

slice()

arrange()

filter()

count()

mutate()

summarize()

group_by()

Additional Practice | Optional

Footnotes

Data wrangling with `dplyr`

`select()`

`slice()`

`arrange()`

`filter()`

`count()`

`mutate()`

`summarize()`

`group_by()`