AE 06: Pivoting StatSci Majors
Suggested answers
Goal
Our ultimate goal in this application exercise is to make the following data visualization.
- Your turn (3 minutes): Take a close look at the plot and describe what it shows in 2-3 sentences.
This plot describes the relationship between a graduation year and the number of majors graduating. This relationsip is broken down by degree type. It appears that the number of majors graduating’s relationship with graduation year changes by degree type, with BS and BS2 majors having more graduating in later years than the others.
Data
The data come from the Office of the University Registrar. They make the data available as a table that you can download as a PDF, but I’ve put the data exported in a CSV file for you. Let’s load that in.
And let’s take a look at the data.
statsci
# A tibble: 4 × 14
...1 ...2 degree `2011` `2012` `2013` `2014` `2015` `2016` `2017` `2018`
<dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 Statistic… NA 1 NA NA 4 4 1 NA
2 2 2 Statistic… 2 2 4 1 3 6 3 4
3 3 3 Statistic… 2 6 1 NA 5 6 6 8
4 4 4 Statistic… 5 9 4 13 10 17 24 21
# … with 3 more variables: `2019` <dbl>, `2020` <dbl>, `2021` <dbl>
The dataset has 4 rows and 14 columns. The first column (variable) is the degree
, and there are 4 possible degrees: BS (Bachelor of Science), BS2 (Bachelor of Science, 2nd major), AB (Bachelor of Arts), AB2 (Bachelor of Arts, 2nd major). The remaining columns show the number of students graduating with that major in a given academic year from 2011 to 2021.
-
Your turn (4 minutes): Take a look at the plot we aim to make and sketch the data frame we need to make the plot. Determine what each row and each column of the data frame should be. Hint: We need data to be in columns to map to
aes
thetic elements of the plot.Columns:
year
,n
,degree_type
Rows: Combination of year and degree type
Pivoting
-
Demo: Pivot the
statsci
data frame longer such that each row represents a degree type / year combination andyear
andn
umber of graduates for that year are columns in the data frame.
statsci |>
pivot_longer(
cols = -degree,
names_to = "year",
values_to = "n"
)
# A tibble: 52 × 3
degree year n
<chr> <chr> <dbl>
1 Statistical Science (AB2) ...1 1
2 Statistical Science (AB2) ...2 1
3 Statistical Science (AB2) 2011 NA
4 Statistical Science (AB2) 2012 1
5 Statistical Science (AB2) 2013 NA
6 Statistical Science (AB2) 2014 NA
7 Statistical Science (AB2) 2015 4
8 Statistical Science (AB2) 2016 4
9 Statistical Science (AB2) 2017 1
10 Statistical Science (AB2) 2018 NA
# … with 42 more rows
-
Question: What is the type of the
year
variable? Why? What should it be?
It’s a character (chr
) variable since the information came from the columns of the original data frame and R cannot know that these character strings represent years. The variable type should be numeric.
- Demo: Start over with pivoting, and this time also make sure
year
is a numerical variable in the resulting data frame.
statsci |>
pivot_longer(
cols = -degree,
names_to = "year",
names_transform = as.numeric,
values_to = "n"
)
Warning in f(names[[col]]): NAs introduced by coercion
# A tibble: 52 × 3
degree year n
<chr> <dbl> <dbl>
1 Statistical Science (AB2) NA 1
2 Statistical Science (AB2) NA 1
3 Statistical Science (AB2) 2011 NA
4 Statistical Science (AB2) 2012 1
5 Statistical Science (AB2) 2013 NA
6 Statistical Science (AB2) 2014 NA
7 Statistical Science (AB2) 2015 4
8 Statistical Science (AB2) 2016 4
9 Statistical Science (AB2) 2017 1
10 Statistical Science (AB2) 2018 NA
# … with 42 more rows
-
Question: What does an
NA
mean in this context? Hint: The data come from the university registrar, and they have records on every single graduates, there shouldn’t be anything “unknown” to them about who graduated when.
NA
s should actually be 0s.
-
Demo: Add on to your pipeline that you started with pivoting and convert
NA
s inn
to0
s.
statsci |>
pivot_longer(
cols = -degree,
names_to = "year",
names_transform = as.numeric,
values_to = "n"
) |>
mutate(n = if_else(is.na(n), 0, n))
Warning in f(names[[col]]): NAs introduced by coercion
# A tibble: 52 × 3
degree year n
<chr> <dbl> <dbl>
1 Statistical Science (AB2) NA 1
2 Statistical Science (AB2) NA 1
3 Statistical Science (AB2) 2011 0
4 Statistical Science (AB2) 2012 1
5 Statistical Science (AB2) 2013 0
6 Statistical Science (AB2) 2014 0
7 Statistical Science (AB2) 2015 4
8 Statistical Science (AB2) 2016 4
9 Statistical Science (AB2) 2017 1
10 Statistical Science (AB2) 2018 0
# … with 42 more rows
-
Demo: In our plot the degree types are BS, BS2, AB, and AB2. This information is in our dataset, in the
degree
column, but this column also has additional characters we don’t need. Create a new column calleddegree_type
with levels BS, BS2, AB, and AB2 (in this order) based ondegree
. Do this by adding on to your pipeline from earlier.
statsci |>
pivot_longer(
cols = -degree,
names_to = "year",
names_transform = as.numeric,
values_to = "n"
) |>
mutate(n = if_else(is.na(n), 0, n)) |>
separate(degree, sep = " \\(", into = c("major", "degree_type")) |>
mutate(
degree_type = str_remove(degree_type, "\\)"),
degree_type = fct_relevel(degree_type, "BS", "BS2", "AB", "AB2")
)
Warning in f(names[[col]]): NAs introduced by coercion
# A tibble: 52 × 4
major degree_type year n
<chr> <fct> <dbl> <dbl>
1 Statistical Science AB2 NA 1
2 Statistical Science AB2 NA 1
3 Statistical Science AB2 2011 0
4 Statistical Science AB2 2012 1
5 Statistical Science AB2 2013 0
6 Statistical Science AB2 2014 0
7 Statistical Science AB2 2015 4
8 Statistical Science AB2 2016 4
9 Statistical Science AB2 2017 1
10 Statistical Science AB2 2018 0
# … with 42 more rows
- Your turn (5 minutes): Now we start making our plot, but let’s not get too fancy right away. Create the following plot, which will serve as the “first draft” on the way to our Goal. Do this by adding on to your pipeline from earlier.
statsci |>
pivot_longer(
cols = -degree,
names_to = "year",
names_transform = as.numeric,
values_to = "n"
) |>
mutate(n = if_else(is.na(n), 0, n)) |>
separate(degree, sep = " \\(", into = c("major", "degree_type")) |>
mutate(
degree_type = str_remove(degree_type, "\\)"),
degree_type = fct_relevel(degree_type, "BS", "BS2", "AB", "AB2")
) |>
ggplot(aes(x = year, y = n, color = degree_type)) +
geom_point() +
geom_line()
Warning in f(names[[col]]): NAs introduced by coercion
Warning: Removed 8 rows containing missing values (geom_point).
Warning: Removed 8 row(s) containing missing values (geom_path).
-
Your turn (4 minutes): What aspects of the plot need to be updated to go from the draft you created above to the Goal plot at the beginning of this application exercise.
x-axis scale: need to go from 2011 to 2021 in increments of 2 years
line colors
axis labels: title, subtitle, x, y, caption
theme
legend position and border
- Demo: Update x-axis scale such that the years displayed go from 2011 to 2021 in increments of 2 years. Do this by adding on to your pipeline from earlier.
statsci |>
pivot_longer(
cols = -degree,
names_to = "year",
names_transform = as.numeric,
values_to = "n"
) |>
mutate(n = if_else(is.na(n), 0, n)) |>
separate(degree, sep = " \\(", into = c("major", "degree_type")) |>
mutate(
degree_type = str_remove(degree_type, "\\)"),
degree_type = fct_relevel(degree_type, "BS", "BS2", "AB", "AB2")
) |>
ggplot(aes(x = year, y = n, color = degree_type)) +
geom_point() +
geom_line() +
scale_x_continuous(breaks = seq(2011, 2021, 2))
Warning in f(names[[col]]): NAs introduced by coercion
Warning: Removed 8 rows containing missing values (geom_point).
Warning: Removed 8 row(s) containing missing values (geom_path).
-
Demo: Update line colors using the following level / color assignments. Once again, do this by adding on to your pipeline from earlier.
“BS” = “cadetblue4”
“BS2” = “cadetblue3”
“AB” = “lightgoldenrod4”
“AB2” = “lightgoldenrod3”
statsci |>
pivot_longer(
cols = -degree,
names_to = "year",
names_transform = as.numeric,
values_to = "n"
) |>
mutate(n = if_else(is.na(n), 0, n)) |>
separate(degree, sep = " \\(", into = c("major", "degree_type")) |>
mutate(
degree_type = str_remove(degree_type, "\\)"),
degree_type = fct_relevel(degree_type, "BS", "BS2", "AB", "AB2")
) |>
ggplot(aes(x = year, y = n, color = degree_type)) +
geom_point() +
geom_line() +
scale_x_continuous(breaks = seq(2011, 2021, 2)) +
scale_color_manual(
values = c("BS" = "cadetblue4",
"BS2" = "cadetblue3",
"AB" = "lightgoldenrod4",
"AB2" = "lightgoldenrod3"))
Warning in f(names[[col]]): NAs introduced by coercion
Warning: Removed 8 rows containing missing values (geom_point).
Warning: Removed 8 row(s) containing missing values (geom_path).
-
Your turn (4 minutes): Update the plot labels (
title
,subtitle
,x
,y
, andcaption
) and usetheme_minimal()
. Once again, do this by adding on to your pipeline from earlier.
statsci |>
pivot_longer(
cols = -degree,
names_to = "year",
names_transform = as.numeric,
values_to = "n"
) |>
mutate(n = if_else(is.na(n), 0, n)) |>
separate(degree, sep = " \\(", into = c("major", "degree_type")) |>
mutate(
degree_type = str_remove(degree_type, "\\)"),
degree_type = fct_relevel(degree_type, "BS", "BS2", "AB", "AB2")
) |>
ggplot(aes(x = year, y = n, color = degree_type)) +
geom_point() +
geom_line() +
scale_x_continuous(breaks = seq(2011, 2021, 2)) +
scale_color_manual(
values = c("BS" = "cadetblue4",
"BS2" = "cadetblue3",
"AB" = "lightgoldenrod4",
"AB2" = "lightgoldenrod3")) +
labs(
x = "Graduation year",
y = "Number of majors graduating",
color = "Degree type",
title = "Statistical Science majors over the years",
subtitle = "Academic years 2011 - 2021",
caption = "Source: Office of the University Registrar\nhttps://registrar.duke.edu/registration/enrollment-statistics"
) +
theme_minimal()
Warning in f(names[[col]]): NAs introduced by coercion
Warning: Removed 8 rows containing missing values (geom_point).
Warning: Removed 8 row(s) containing missing values (geom_path).
-
Demo: Finally, adding to your pipeline you’ve developed so far, move the legend into the plot, make its background white, and its border gray. Set
fig-width: 7
andfig-height: 5
for your plot in the chunk options.
statsci |>
pivot_longer(
cols = -degree,
names_to = "year",
names_transform = as.numeric,
values_to = "n"
) |>
mutate(n = if_else(is.na(n), 0, n)) |>
separate(degree, sep = " \\(", into = c("major", "degree_type")) |>
mutate(
degree_type = str_remove(degree_type, "\\)"),
degree_type = fct_relevel(degree_type, "BS", "BS2", "AB", "AB2")
) |>
ggplot(aes(x = year, y = n, color = degree_type)) +
geom_point() +
geom_line() +
scale_x_continuous(breaks = seq(2011, 2021, 2)) +
scale_color_manual(
values = c("BS" = "cadetblue4",
"BS2" = "cadetblue3",
"AB" = "lightgoldenrod4",
"AB2" = "lightgoldenrod3")) +
labs(
x = "Graduation year",
y = "Number of majors graduating",
color = "Degree type",
title = "Statistical Science majors over the years",
subtitle = "Academic years 2011 - 2021",
caption = "Source: Office of the University Registrar\nhttps://registrar.duke.edu/registration/enrollment-statistics"
) +
theme_minimal() +
theme(
legend.position = c(0.2, 0.8),
legend.background = element_rect(fill = "white", color = "gray")
)
Warning in f(names[[col]]): NAs introduced by coercion
Warning: Removed 8 rows containing missing values (geom_point).
Warning: Removed 8 row(s) containing missing values (geom_path).