Exam Review
Packages
We will use the following package in this application exercise.
- tidyverse: For data import, wrangling, and visualization.
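A minimal setup chunk, assuming the tidyverse is already installed, might look like the following. Loading it makes read_csv(), glimpse(), and the dplyr verbs used in this exercise available.

```r
# Load the tidyverse (assumes it was installed earlier with
# install.packages("tidyverse")).
library(tidyverse)
```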
For the remaining time, we will practice data wrangling with dplyr using the Student Exams data set. These data are fictional: they do not come from a real context and do not represent real people. The purpose of this data set is to teach data science and to practice using R functions.
studentexams <- read_csv("data/StudentsPerformance.csv")
Rows: 1000 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): gender, race/ethnicity, parental level of education, lunch, test pr...
dbl (3): math score, reading score, writing score
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
First, let’s take a glimpse at our data.
glimpse(studentexams)
Rows: 1,000
Columns: 8
$ gender <chr> "female", "female", "female", "male", "m…
$ `race/ethnicity` <chr> "group B", "group C", "group B", "group …
$ `parental level of education` <chr> "bachelor's degree", "some college", "ma…
$ lunch <chr> "standard", "standard", "standard", "fre…
$ `test preparation course` <chr> "none", "completed", "none", "none", "no…
$ `math score` <dbl> 72, 69, 90, 47, 76, 71, 88, 40, 64, 38, …
$ `reading score` <dbl> 72, 90, 95, 57, 78, 83, 95, 43, 64, 60, …
$ `writing score` <dbl> 74, 88, 93, 44, 75, 78, 92, 39, 67, 50, …
Identify the variable names and their types.
Inline code example: There are 1000 rows and 8 columns in the data set.
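The counts in the sentence above can be computed rather than typed by hand. Here is a small sketch that uses a toy tibble in place of the full data. The name exams_small is hypothetical, and its values are the first few shown by glimpse().

```r
library(dplyr)

# Toy stand-in for studentexams: the first three rows of two columns,
# copied from the glimpse() output.
exams_small <- tibble(
  math_score    = c(72, 69, 90),
  reading_score = c(72, 90, 95)
)

n_rows <- nrow(exams_small)  # number of rows
n_cols <- ncol(exams_small)  # number of columns

n_rows
n_cols
```

With the full studentexams data, nrow() and ncol() give the 1000 and 8 reported above. If this file is rendered with Quarto or R Markdown (an assumption), an inline expression such as `r nrow(studentexams)` places the value directly in a sentence.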
Some variable names have spaces. Names like these must be wrapped in backticks every time we use them, which is tedious and error-prone. Let’s clean these up using rename().
rename(): changes the names of columns.
studentexams <- rename(
  studentexams,
  math_score = `math score`,
  reading_score = `reading score`,
  writing_score = `writing score`,
  parental_level_of_education = `parental level of education`
)
filter(): chooses rows based on column values.
Filter these data so that they only contain rows where math scores are greater than or equal to 70.
Then, filter the data to only include students who received a standard lunch.
studentexams |>
  filter(math_score >= 70)
# A tibble: 409 × 8
gender `race/ethnicity` parental_leve…¹ lunch test …² math_…³ readi…⁴ writi…⁵
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 female group B bachelor's deg… stan… none 72 72 74
2 female group B master's degree stan… none 90 95 93
3 male group C some college stan… none 76 78 75
4 female group B associate's de… stan… none 71 83 78
5 female group B some college stan… comple… 88 95 92
6 male group A some college stan… comple… 78 72 70
7 male group C high school stan… none 88 89 86
8 male group D bachelor's deg… free… comple… 74 71 80
9 male group A master's degree free… none 73 74 72
10 male group C high school stan… none 70 70 65
# … with 399 more rows, and abbreviated variable names
# ¹parental_level_of_education, ²`test preparation course`, ³math_score,
# ⁴reading_score, ⁵writing_score
studentexams |>
  filter(lunch == "standard")
# A tibble: 645 × 8
gender `race/ethnicity` parental_leve…¹ lunch test …² math_…³ readi…⁴ writi…⁵
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 female group B bachelor's deg… stan… none 72 72 74
2 female group C some college stan… comple… 69 90 88
3 female group B master's degree stan… none 90 95 93
4 male group C some college stan… none 76 78 75
5 female group B associate's de… stan… none 71 83 78
6 female group B some college stan… comple… 88 95 92
7 male group C associate's de… stan… none 58 54 52
8 male group D associate's de… stan… none 40 52 43
9 female group B high school stan… none 65 81 73
10 male group A some college stan… comple… 78 72 70
# … with 635 more rows, and abbreviated variable names
# ¹parental_level_of_education, ²`test preparation course`, ³math_score,
# ⁴reading_score, ⁵writing_score
Why does == work here but not in the homework? Check Sakai for a more detailed explanation!
mutate(): changes the values of columns and creates new columns. Let’s use this with if_else() to create a new variable called math_pass. Have it display Yes if the student received a 70 or higher on their math exam, and No if they did not.
Hint: Think of if_else() as:
If this / Then this / Else this
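One possible solution is sketched below on a toy tibble. The name exams_small is hypothetical, and its values are the first five math scores shown by glimpse(). The same mutate() call should work on studentexams once the columns have been renamed.

```r
library(dplyr)

# Toy stand-in for studentexams: the first five math scores from glimpse().
exams_small <- tibble(math_score = c(72, 69, 90, 47, 76))

# if_else(condition, value if TRUE, value if FALSE)
exams_small <- exams_small |>
  mutate(math_pass = if_else(math_score >= 70, "Yes", "No"))

exams_small
```

Because the result is assigned back to exams_small, the new math_pass column is kept. Without the assignment, the new column is only printed.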
# A tibble: 1,000 × 9
gender `race/ethnicity` paren…¹ lunch test …² math_…³ readi…⁴ writi…⁵ math_…⁶
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr>
1 female group B bachel… stan… none 72 72 74 Yes
2 female group C some c… stan… comple… 69 90 88 No
3 female group B master… stan… none 90 95 93 Yes
4 male group A associ… free… none 47 57 44 No
5 male group C some c… stan… none 76 78 75 Yes
6 female group B associ… stan… none 71 83 78 Yes
7 female group B some c… stan… comple… 88 95 92 Yes
8 male group B some c… free… none 40 43 39 No
9 male group D high s… free… comple… 64 64 67 No
10 female group B high s… free… none 38 60 50 No
# … with 990 more rows, and abbreviated variable names
# ¹parental_level_of_education, ²`test preparation course`, ³math_score,
# ⁴reading_score, ⁵writing_score, ⁶math_pass
Now, use mutate() to make gender a factor.
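A sketch of one way to do this uses as.factor() inside mutate(). It is shown on a toy tibble with the hypothetical name exams_small, whose values are the first five gender entries from glimpse().

```r
library(dplyr)

# Toy stand-in for studentexams: the first five gender values from glimpse().
exams_small <- tibble(
  gender = c("female", "female", "female", "male", "male")
)

# Overwrite the character column with a factor version of itself.
exams_small <- exams_small |>
  mutate(gender = as.factor(gender))

levels(exams_small$gender)
```

When printed, the column type under gender changes from <chr> to <fct>.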
# A tibble: 1,000 × 8
gender `race/ethnicity` parental_leve…¹ lunch test …² math_…³ readi…⁴ writi…⁵
<fct> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 female group B bachelor's deg… stan… none 72 72 74
2 female group C some college stan… comple… 69 90 88
3 female group B master's degree stan… none 90 95 93
4 male group A associate's de… free… none 47 57 44
5 male group C some college stan… none 76 78 75
6 female group B associate's de… stan… none 71 83 78
7 female group B some college stan… comple… 88 95 92
8 male group B some college free… none 40 43 39
9 male group D high school free… comple… 64 64 67
10 female group B high school free… none 38 60 50
# … with 990 more rows, and abbreviated variable names
# ¹parental_level_of_education, ²`test preparation course`, ³math_score,
# ⁴reading_score, ⁵writing_score
select(): changes whether or not a column is included.
slice(): chooses rows based on location.
Now, only display the first 5 rows of the three exam score columns.
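One way to combine the two verbs is sketched below on a toy tibble. The name exams_small is hypothetical, and its six rows are copied from the output shown earlier. Here select() keeps the three score columns and slice() keeps rows 1 through 5.

```r
library(dplyr)

# Toy stand-in for studentexams: the first six rows of four columns.
exams_small <- tibble(
  gender        = c("female", "female", "female", "male", "male", "female"),
  math_score    = c(72, 69, 90, 47, 76, 71),
  reading_score = c(72, 90, 95, 57, 78, 83),
  writing_score = c(74, 88, 93, 44, 75, 78)
)

first_five <- exams_small |>
  select(math_score, reading_score, writing_score) |>  # keep the score columns
  slice(1:5)                                           # keep rows 1 to 5

first_five
```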
# A tibble: 5 × 3
math_score reading_score writing_score
<dbl> <dbl> <dbl>
1 72 72 74
2 69 90 88
3 90 95 93
4 47 57 44
5 76 78 75
Note: You can also use functions like head() and tail() to look at the first or last rows of the data!
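As a brief illustration, head() and tail() take the number of rows to keep. The tibble name exams_small is hypothetical, and its values are the first six math scores from glimpse().

```r
library(dplyr)

# Toy stand-in for studentexams: the first six math scores from glimpse().
exams_small <- tibble(math_score = c(72, 69, 90, 47, 76, 71))

first_two <- exams_small |> head(2)  # first two rows
last_two  <- exams_small |> tail(2)  # last two rows

first_two
last_two
```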
group_by(): performs calculations separately for each value of a variable.
summarise(): collapses each group into a single row.
Now, group students by their parental level of education and calculate the mean math score for each group. Arrange the results in descending order.
studentexams |>
  group_by(parental_level_of_education) |>
  summarise(mean_math = mean(math_score)) |>
  arrange(desc(mean_math))
# A tibble: 6 × 2
parental_level_of_education mean_math
<chr> <dbl>
1 master's degree 69.7
2 bachelor's degree 69.4
3 associate's degree 67.9
4 some college 67.1
5 some high school 63.5
6 high school 62.1
Your turn!
Ask a question about these data and answer it. Create appropriate plots to help answer your question.