AE 04: Wrangling flights
Suggested answers
To demonstrate data wrangling we will use flights
, a tibble in the nycflights13 R package. It includes characteristics of all flights departing from New York City (JFK, LGA, EWR) in 2013.
The data frame has over 336,000 observations (rows), 336776 observations to be exact, so we will not view the entire data frame. Instead we’ll use the commands below to help us explore the data.
glimpse(flights)
Rows: 336,776
Columns: 19
$ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
$ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
$ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
$ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
$ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
$ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
$ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
$ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
$ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
$ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
$ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
$ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
$ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
$ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
$ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
$ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
$ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…
names(flights)
[1] "year" "month" "day" "dep_time"
[5] "sched_dep_time" "dep_delay" "arr_time" "sched_arr_time"
[9] "arr_delay" "carrier" "flight" "tailnum"
[13] "origin" "dest" "air_time" "distance"
[17] "hour" "minute" "time_hour"
head(flights)
# A tibble: 6 × 19
year month day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
1 2013 1 1 517 515 2 830 819 11 UA
2 2013 1 1 533 529 4 850 830 20 UA
3 2013 1 1 542 540 2 923 850 33 AA
4 2013 1 1 544 545 -1 1004 1022 -18 B6
5 2013 1 1 554 600 -6 812 837 -25 DL
6 2013 1 1 554 558 -4 740 728 12 UA
# … with 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
# dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>, and abbreviated variable names ¹sched_dep_time,
# ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
The head()
function returns “A tibble: 6 x 19” and then the first six rows of the flights
data.
Tibble vs. data frame
A tibble is an opinionated version of the R
data frame. In other words, all tibbles are data frames, but not all data frames are tibbles!
There are two main differences between a tibble and a data frame:
- When you print a tibble, the first ten rows and all of the columns that fit on the screen will display, along with the type of each column.
Let’s look at the differences in the output when we type flights
(tibble) in the console versus typing cars
(data frame) in the console.
- Second, tibbles are somewhat more strict than data frames when it comes to subsetting data. You will get an error message if you try to access a variable that doesn’t exist in a tibble. You will get
NULL
if you try to access a variable that doesn’t exist in a data frame.
flights$apple
Warning: Unknown or uninitialised column: `apple`.
NULL
cars$apple
NULL
Data wrangling with dplyr
dplyr is the primary package in the tidyverse for data wrangling. Click here for the dplyr reference page. Click here ) for the dplyr cheatsheet.
Quick summary of key dplyr functions1:
Rows:
-
filter()
:chooses rows based on column values. -
slice()
: chooses rows based on location. -
arrange()
: changes the order of the rows -
sample_n()
: take a random subset of the rows
Columns:
-
select()
: changes whether or not a column is included. -
rename()
: changes the name of columns. -
mutate()
: changes the values of columns and creates new columns.
Groups of rows:
-
summarise()
: collapses a group into a single row. -
count()
: count unique values of one or more variables. -
group_by()
: perform calculations separately for each value of a variable
select()
- Demo: Make a data frame that only contains the variables
dep_delay
andarr_delay
.
flights |>
select(dep_delay, arr_delay)
# A tibble: 336,776 × 2
dep_delay arr_delay
<dbl> <dbl>
1 2 11
2 4 20
3 2 33
4 -1 -18
5 -6 -25
6 -4 12
7 -5 19
8 -3 -14
9 -3 -8
10 -2 8
# … with 336,766 more rows
- Demo: Make a data frame that keeps every variable except
dep_delay
.
flights |>
select(-dep_delay)
# A tibble: 336,776 × 18
year month day dep_time sched_dep…¹ arr_t…² sched…³ arr_d…⁴ carrier flight
<int> <int> <int> <int> <int> <int> <int> <dbl> <chr> <int>
1 2013 1 1 517 515 830 819 11 UA 1545
2 2013 1 1 533 529 850 830 20 UA 1714
3 2013 1 1 542 540 923 850 33 AA 1141
4 2013 1 1 544 545 1004 1022 -18 B6 725
5 2013 1 1 554 600 812 837 -25 DL 461
6 2013 1 1 554 558 740 728 12 UA 1696
7 2013 1 1 555 600 913 854 19 B6 507
8 2013 1 1 557 600 709 723 -14 EV 5708
9 2013 1 1 557 600 838 846 -8 B6 79
10 2013 1 1 558 600 753 745 8 AA 301
# … with 336,766 more rows, 8 more variables: tailnum <chr>, origin <chr>,
# dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>, and abbreviated variable names ¹sched_dep_time,
# ²arr_time, ³sched_arr_time, ⁴arr_delay
- Demo: Make a data frame that includes all variables between
year
throughdep_delay
(inclusive). These are all variables that provide information about the departure of each flight.
flights |>
select(year:dep_delay)
# A tibble: 336,776 × 6
year month day dep_time sched_dep_time dep_delay
<int> <int> <int> <int> <int> <dbl>
1 2013 1 1 517 515 2
2 2013 1 1 533 529 4
3 2013 1 1 542 540 2
4 2013 1 1 544 545 -1
5 2013 1 1 554 600 -6
6 2013 1 1 554 558 -4
7 2013 1 1 555 600 -5
8 2013 1 1 557 600 -3
9 2013 1 1 557 600 -3
10 2013 1 1 558 600 -2
# … with 336,766 more rows
- Demo: Use the
select
helpercontains()
to make a data frame that includes the variables associated with the arrival, i.e., contains the string"arr\_"
in the name.
The pipe
Before working with more data wrangling functions, let’s formally introduce the pipe. The pipe, |>
, is an operator (a tool) for passing information from one process to another. We will use |>
mainly in data pipelines to pass the output of the previous line of code as the first input of the next line of code.
When reading code “in English”, say “and then” whenever you see a pipe.
- Your turn (4 minutes): Run the following chunk and observe its output. Then, come up with a different way of obtaining the same output.
slice()
- Demo: Display the first five rows of the
flights
data frame.
flights |>
slice(1:5)
# A tibble: 5 × 19
year month day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
1 2013 1 1 517 515 2 830 819 11 UA
2 2013 1 1 533 529 4 850 830 20 UA
3 2013 1 1 542 540 2 923 850 33 AA
4 2013 1 1 544 545 -1 1004 1022 -18 B6
5 2013 1 1 554 600 -6 812 837 -25 DL
# … with 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
# dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>, and abbreviated variable names ¹sched_dep_time,
# ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
- Demo: Display the last two rows of the
flights
data frame.
# A tibble: 2 × 19
year month day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
1 2013 9 30 NA 1159 NA NA 1344 NA MQ
2 2013 9 30 NA 840 NA NA 1020 NA MQ
# … with 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
# dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>, and abbreviated variable names ¹sched_dep_time,
# ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
arrange()
- Demo: Let’s arrange the data by departure delay, so the flights with the shortest departure delays will be at the top of the data frame.
Question: What does it mean for the dep_delay
to have a negative value?
flights |>
arrange(dep_delay)
# A tibble: 336,776 × 19
year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
1 2013 12 7 2040 2123 -43 40 2352 48 B6
2 2013 2 3 2022 2055 -33 2240 2338 -58 DL
3 2013 11 10 1408 1440 -32 1549 1559 -10 EV
4 2013 1 11 1900 1930 -30 2233 2243 -10 DL
5 2013 1 29 1703 1730 -27 1947 1957 -10 F9
6 2013 8 9 729 755 -26 1002 955 7 MQ
7 2013 10 23 1907 1932 -25 2143 2143 0 EV
8 2013 3 30 2030 2055 -25 2213 2250 -37 MQ
9 2013 3 2 1431 1455 -24 1601 1631 -30 9E
10 2013 5 5 934 958 -24 1225 1309 -44 B6
# … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>, and abbreviated variable names
# ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
- Demo: Now let’s arrange the data by descending departure delay, so the flights with the longest departure delays will be at the top.
# A tibble: 336,776 × 19
year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
1 2013 1 9 641 900 1301 1242 1530 1272 HA
2 2013 6 15 1432 1935 1137 1607 2120 1127 MQ
3 2013 1 10 1121 1635 1126 1239 1810 1109 MQ
4 2013 9 20 1139 1845 1014 1457 2210 1007 AA
5 2013 7 22 845 1600 1005 1044 1815 989 MQ
6 2013 4 10 1100 1900 960 1342 2211 931 DL
7 2013 3 17 2321 810 911 135 1020 915 DL
8 2013 6 27 959 1900 899 1236 2226 850 DL
9 2013 7 22 2257 759 898 121 1026 895 DL
10 2013 12 5 756 1700 896 1058 2020 878 AA
# … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>, and abbreviated variable names
# ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
-
Your turn (5 minutes): Create a data frame that only includes the plane tail number (
tailnum
), carrier (carrier
), and departure delay for the flight with the longest departure delay. What is the plane tail number (tailnum
) for this flight?
filter()
- Demo: Filter the data frame by selecting the rows where the destination airport is RDU.
flights |>
filter(dest == "RDU")
# A tibble: 8,163 × 19
year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
1 2013 1 1 800 810 -10 949 955 -6 MQ
2 2013 1 1 832 840 -8 1006 1030 -24 MQ
3 2013 1 1 851 851 0 1032 1036 -4 EV
4 2013 1 1 917 920 -3 1052 1108 -16 B6
5 2013 1 1 1024 1030 -6 1204 1215 -11 MQ
6 2013 1 1 1127 1129 -2 1303 1309 -6 EV
7 2013 1 1 1157 1205 -8 1342 1345 -3 MQ
8 2013 1 1 1240 1235 5 1415 1415 0 MQ
9 2013 1 1 1317 1325 -8 1454 1505 -11 MQ
10 2013 1 1 1449 1450 -1 1651 1640 11 MQ
# … with 8,153 more rows, 9 more variables: flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>, and abbreviated variable names
# ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
- Demo: We can also filter using more than one condition. Here we select all rows where the destination airport is RDU and the arrival delay is less than 0.
flights |>
filter(dest == "RDU", arr_delay < 0)
# A tibble: 4,232 × 19
year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
1 2013 1 1 800 810 -10 949 955 -6 MQ
2 2013 1 1 832 840 -8 1006 1030 -24 MQ
3 2013 1 1 851 851 0 1032 1036 -4 EV
4 2013 1 1 917 920 -3 1052 1108 -16 B6
5 2013 1 1 1024 1030 -6 1204 1215 -11 MQ
6 2013 1 1 1127 1129 -2 1303 1309 -6 EV
7 2013 1 1 1157 1205 -8 1342 1345 -3 MQ
8 2013 1 1 1317 1325 -8 1454 1505 -11 MQ
9 2013 1 1 1505 1510 -5 1654 1655 -1 MQ
10 2013 1 1 1800 1800 0 1945 1951 -6 B6
# … with 4,222 more rows, 9 more variables: flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>, and abbreviated variable names
# ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
We can do more complex tasks using logical operators:
operator | definition |
---|---|
< |
is less than? |
<= |
is less than or equal to? |
> |
is greater than? |
>= |
is greater than or equal to? |
== |
is exactly equal to? |
!= |
is not equal to? |
x & y |
is x AND y? |
x \| y |
is x OR y? |
is.na(x) |
is x NA? |
!is.na(x) |
is x not NA? |
x %in% y |
is x in y? |
!(x %in% y) |
is x not in y? |
!x |
is not x? |
The final operator only makes sense if x
is logical (TRUE / FALSE).
- Your turn (4 minutes): Describe what the code is doing in words.
# A tibble: 6,203 × 19
year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
1 2013 1 1 800 810 -10 949 955 -6 MQ
2 2013 1 1 832 840 -8 1006 1030 -24 MQ
3 2013 1 1 851 851 0 1032 1036 -4 EV
4 2013 1 1 917 920 -3 1052 1108 -16 B6
5 2013 1 1 1024 1030 -6 1204 1215 -11 MQ
6 2013 1 1 1127 1129 -2 1303 1309 -6 EV
7 2013 1 1 1157 1205 -8 1342 1345 -3 MQ
8 2013 1 1 1317 1325 -8 1454 1505 -11 MQ
9 2013 1 1 1449 1450 -1 1651 1640 11 MQ
10 2013 1 1 1505 1510 -5 1654 1655 -1 MQ
# … with 6,193 more rows, 9 more variables: flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>, and abbreviated variable names
# ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
count()
- Demo: Create a frequency table of the destination locations for flights from New York.
flights |>
count(dest)
# A tibble: 105 × 2
dest n
<chr> <int>
1 ABQ 254
2 ACK 265
3 ALB 439
4 ANC 8
5 ATL 17215
6 AUS 2439
7 AVL 275
8 BDL 443
9 BGR 375
10 BHM 297
# … with 95 more rows
- Demo: In which month was there the fewest number of flights? How many flights were there in that month?
- Your turn (5 minutes): On which date (month + day) was there the largest number of flights? How many flights were there on that day?
mutate()
Use mutate()
to create a new variable.
- Demo: In the code chunk below,
air_time
(minutes in the air) is converted to hours, and then new variablemph
is created, corresponding to the miles per hour of the flight.
flights |>
mutate(
hours = air_time / 60,
mph = distance / hours
) |>
select(air_time, distance, hours, mph)
# A tibble: 336,776 × 4
air_time distance hours mph
<dbl> <dbl> <dbl> <dbl>
1 227 1400 3.78 370.
2 227 1416 3.78 374.
3 160 1089 2.67 408.
4 183 1576 3.05 517.
5 116 762 1.93 394.
6 150 719 2.5 288.
7 158 1065 2.63 404.
8 53 229 0.883 259.
9 140 944 2.33 405.
10 138 733 2.3 319.
# … with 336,766 more rows
- Your turn (4 minutes): Create a new variable to calculate the percentage of flights in each month. What percentage of flights take place in July?
summarize()
summarize()
collapses the rows into summary statistics and removes columns irrelevant to the calculation.
Be sure to name your columns!
Question: Why did this code return NA
?
Let’s fix it!
group_by()
group_by()
is used for grouped operations. It’s very powerful when paired with summarise()
to calculate summary statistics by group.
Here we find the mean and standard deviation of departure delay for each month.
flights |>
group_by(month) |>
summarize(
mean_dep_delay = mean(dep_delay, na.rm = TRUE),
sd_dep_delay = sd(dep_delay, na.rm = TRUE)
)
# A tibble: 12 × 3
month mean_dep_delay sd_dep_delay
<int> <dbl> <dbl>
1 1 10.0 36.4
2 2 10.8 36.3
3 3 13.2 40.1
4 4 13.9 43.0
5 5 13.0 39.4
6 6 20.8 51.5
7 7 21.7 51.6
8 8 12.6 37.7
9 9 6.72 35.6
10 10 6.24 29.7
11 11 5.44 27.6
12 12 16.6 41.9
-
Your turn (4 minutes): What is the median departure delay for each airports around NYC (
origin
)?
Additional Practice | Optional
- Create a new dataset that only contains flights that do not have a missing departure time. Include the columns
year
,month
,day
,dep_time
,dep_delay
, anddep_delay_hours
(the departure delay in hours). Hint: Note you may need to usemutate()
to make one or more of these variables.
- For each airplane (uniquely identified by
tailnum
), use agroup_by()
paired withsummarize()
to find the sample size, mean, and standard deviation of flight distances. Then include only the top 5 and bottom 5 airplanes in terms of mean distance traveled per flight in the final data frame.
flights |>
group_by(tailnum) |>
summarize(n = n(),
mean.flight = mean(distance),
sd.flight = sd(distance)) |>
slice(1:5, (n() - 4):n())
# A tibble: 10 × 4
tailnum n mean.flight sd.flight
<chr> <int> <dbl> <dbl>
1 D942DN 4 854. 107.
2 N0EGMQ 371 676. 200.
3 N10156 153 758. 332.
4 N102UW 48 536. 6.75
5 N103US 46 535. 6.62
6 N998AT 26 594. 186.
7 N998DL 77 858. 234.
8 N999DN 61 895. 243.
9 N9EAMQ 248 675. 184.
10 <NA> 2512 710. 634.
Footnotes
From dplyr vignette↩︎