Exam Review

Application exercise
Answers

Packages

We will use the following package in this application exercise.

  • tidyverse: For data import, wrangling, and visualization.

For the remaining time, we will practice data wrangling with dplyr using the Student Exams data set. The data are fictional: they do not come from a real context and do not represent real people. The data set exists to teach data science and to practice using R functions.

studentexams <- read_csv("data/StudentsPerformance.csv")
Rows: 1000 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): gender, race/ethnicity, parental level of education, lunch, test pr...
dbl (3): math score, reading score, writing score

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

First, let’s take a glimpse at our data.

glimpse(studentexams)
Rows: 1,000
Columns: 8
$ gender                        <chr> "female", "female", "female", "male", "m…
$ `race/ethnicity`              <chr> "group B", "group C", "group B", "group …
$ `parental level of education` <chr> "bachelor's degree", "some college", "ma…
$ lunch                         <chr> "standard", "standard", "standard", "fre…
$ `test preparation course`     <chr> "none", "completed", "none", "none", "no…
$ `math score`                  <dbl> 72, 69, 90, 47, 76, 71, 88, 40, 64, 38, …
$ `reading score`               <dbl> 72, 90, 95, 57, 78, 83, 95, 43, 64, 60, …
$ `writing score`               <dbl> 74, 88, 93, 44, 75, 78, 92, 39, 67, 50, …

Identify the variable names. Identify their type.

Inline code example: There are 1000 rows and 8 columns in the data set.
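Inline code lets the sentence compute the row and column counts from the data instead of typing them by hand. The sketch below is only an illustration: `toy_exams` is a hypothetical stand-in with made-up values, not the course data.

```r
# Hypothetical stand-in for studentexams; the values are made up for illustration.
toy_exams <- data.frame(
  gender = c("female", "male", "female"),
  math_score = c(72, 47, 90)
)

# nrow() and ncol() return the two numbers used in the sentence.
n_rows <- nrow(toy_exams)
n_cols <- ncol(toy_exams)

# In a Quarto or R Markdown document, the same values could be written inline
# as `r nrow(toy_exams)` and `r ncol(toy_exams)`.
sentence <- paste0(
  "There are ", n_rows, " rows and ", n_cols, " columns in the data set."
)
print(sentence)
```

With the real data, replacing `toy_exams` with `studentexams` should reproduce the counts reported by glimpse().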

Some variable names contain spaces, so they must be wrapped in backticks every time we refer to them. Let’s clean these up using rename().

rename() changes the name of columns.

studentexams <- rename(
  studentexams,
  math_score = `math score`,
  reading_score = `reading score`,
  writing_score = `writing score`,
  parental_level_of_education = `parental level of education`
)
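rename() handles one column at a time. As an alternative sketch (not the approach used in this exercise), dplyr's rename_with() can apply one rule to every column name at once. The tibble `toy_exams` below is hypothetical and only mimics the style of the real column names.

```r
library(dplyr)

# Hypothetical toy data with the same kind of awkward column names.
toy_exams <- tibble(
  `math score` = c(72, 69),
  `test preparation course` = c("none", "completed")
)

# rename_with() applies a function to every column name;
# here gsub() replaces each space with an underscore.
toy_clean <- rename_with(toy_exams, ~ gsub(" ", "_", .x, fixed = TRUE))

names(toy_clean)
```

This rule only replaces spaces, so a name containing another special character, such as a slash, would still need backticks or its own rename() call.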

filter() chooses rows based on column values.

Filter these data so that they contain only rows where math scores are greater than or equal to 70.

Then, filter the data to keep only students who received a standard lunch.

studentexams |>
  filter(math_score >= 70)
# A tibble: 409 × 8
   gender `race/ethnicity` parental_leve…¹ lunch test …² math_…³ readi…⁴ writi…⁵
   <chr>  <chr>            <chr>           <chr> <chr>     <dbl>   <dbl>   <dbl>
 1 female group B          bachelor's deg… stan… none         72      72      74
 2 female group B          master's degree stan… none         90      95      93
 3 male   group C          some college    stan… none         76      78      75
 4 female group B          associate's de… stan… none         71      83      78
 5 female group B          some college    stan… comple…      88      95      92
 6 male   group A          some college    stan… comple…      78      72      70
 7 male   group C          high school     stan… none         88      89      86
 8 male   group D          bachelor's deg… free… comple…      74      71      80
 9 male   group A          master's degree free… none         73      74      72
10 male   group C          high school     stan… none         70      70      65
# … with 399 more rows, and abbreviated variable names
#   ¹​parental_level_of_education, ²​`test preparation course`, ³​math_score,
#   ⁴​reading_score, ⁵​writing_score
studentexams |> 
  filter(lunch == "standard")
# A tibble: 645 × 8
   gender `race/ethnicity` parental_leve…¹ lunch test …² math_…³ readi…⁴ writi…⁵
   <chr>  <chr>            <chr>           <chr> <chr>     <dbl>   <dbl>   <dbl>
 1 female group B          bachelor's deg… stan… none         72      72      74
 2 female group C          some college    stan… comple…      69      90      88
 3 female group B          master's degree stan… none         90      95      93
 4 male   group C          some college    stan… none         76      78      75
 5 female group B          associate's de… stan… none         71      83      78
 6 female group B          some college    stan… comple…      88      95      92
 7 male   group C          associate's de… stan… none         58      54      52
 8 male   group D          associate's de… stan… none         40      52      43
 9 female group B          high school     stan… none         65      81      73
10 male   group A          some college    stan… comple…      78      72      70
# … with 635 more rows, and abbreviated variable names
#   ¹​parental_level_of_education, ²​`test preparation course`, ³​math_score,
#   ⁴​reading_score, ⁵​writing_score
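The two filters above are run separately. If the goal were to keep only rows that meet both conditions at once, a single filter() call can take several conditions separated by commas. A minimal sketch on a hypothetical `toy_exams` tibble with made-up values:

```r
library(dplyr)

# Hypothetical toy data; values are made up for illustration.
toy_exams <- tibble(
  lunch = c("standard", "free/reduced", "standard", "standard"),
  math_score = c(72, 74, 58, 90)
)

# Conditions separated by commas are combined with AND:
# a row is kept only if every condition is TRUE.
both <- toy_exams |>
  filter(math_score >= 70, lunch == "standard")

both
```

Only the rows with scores 72 and 90 remain, because the 74 belongs to a free/reduced lunch student and the 58 is below 70.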

Why does == work here but not in the homework? Check Sakai for a more detailed explanation!

mutate() changes the values of columns and creates new columns. Let’s use it with if_else() to create a new variable called math_pass. Have it display Yes if the student scored 70 or higher on their math exam and No otherwise.

Hint: Think of if_else() as:

If this / Then this / Else this

studentexams |> 
  mutate(
    math_pass = if_else(math_score >= 70, "Yes", "No")
  )
# A tibble: 1,000 × 9
   gender `race/ethnicity` paren…¹ lunch test …² math_…³ readi…⁴ writi…⁵ math_…⁶
   <chr>  <chr>            <chr>   <chr> <chr>     <dbl>   <dbl>   <dbl> <chr>  
 1 female group B          bachel… stan… none         72      72      74 Yes    
 2 female group C          some c… stan… comple…      69      90      88 No     
 3 female group B          master… stan… none         90      95      93 Yes    
 4 male   group A          associ… free… none         47      57      44 No     
 5 male   group C          some c… stan… none         76      78      75 Yes    
 6 female group B          associ… stan… none         71      83      78 Yes    
 7 female group B          some c… stan… comple…      88      95      92 Yes    
 8 male   group B          some c… free… none         40      43      39 No     
 9 male   group D          high s… free… comple…      64      64      67 No     
10 female group B          high s… free… none         38      60      50 No     
# … with 990 more rows, and abbreviated variable names
#   ¹​parental_level_of_education, ²​`test preparation course`, ³​math_score,
#   ⁴​reading_score, ⁵​writing_score, ⁶​math_pass
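To make the three parts of the hint concrete, here is a small sketch on a hypothetical vector `scores` (made-up values, including one missing score) that labels each argument of if_else().

```r
library(dplyr)

# Hypothetical scores; the NA stands for a student with no recorded score.
scores <- c(72, 69, NA)

math_pass <- if_else(
  scores >= 70,  # If this: the condition to check
  "Yes",         # Then this: value used when the condition is TRUE
  "No"           # Else this: value used when the condition is FALSE
)

math_pass
```

A missing score stays missing in the result rather than being labelled "No", which is usually what we want.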

Now, use mutate to make gender a factor.

studentexams |>
  mutate(gender = as.factor(gender))
# A tibble: 1,000 × 8
   gender `race/ethnicity` parental_leve…¹ lunch test …² math_…³ readi…⁴ writi…⁵
   <fct>  <chr>            <chr>           <chr> <chr>     <dbl>   <dbl>   <dbl>
 1 female group B          bachelor's deg… stan… none         72      72      74
 2 female group C          some college    stan… comple…      69      90      88
 3 female group B          master's degree stan… none         90      95      93
 4 male   group A          associate's de… free… none         47      57      44
 5 male   group C          some college    stan… none         76      78      75
 6 female group B          associate's de… stan… none         71      83      78
 7 female group B          some college    stan… comple…      88      95      92
 8 male   group B          some college    free… none         40      43      39
 9 male   group D          high school     free… comple…      64      64      67
10 female group B          high school     free… none         38      60      50
# … with 990 more rows, and abbreviated variable names
#   ¹​parental_level_of_education, ²​`test preparation course`, ³​math_score,
#   ⁴​reading_score, ⁵​writing_score
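Two details are easy to miss here: as.factor() orders the levels alphabetically, and the pipeline does not change the original data unless the result is assigned. A short sketch on a hypothetical `toy_exams` tibble:

```r
library(dplyr)

# Hypothetical toy data; values are made up for illustration.
toy_exams <- tibble(gender = c("male", "female", "female"))

# Assign the result to keep the converted column.
toy_factor <- toy_exams |>
  mutate(gender = as.factor(gender))

# Levels are sorted alphabetically, not by order of appearance.
levels(toy_factor$gender)

# The original tibble still holds a character column.
class(toy_exams$gender)
```

To keep the factor version in the exercise data, the result would need to be assigned back, for example with `studentexams <- studentexams |> mutate(...)`.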

select() changes whether or not a column is included.

slice() chooses rows based on location.

Now, only display the first 5 rows of the three exam score columns.

studentexams |>
  select(math_score, reading_score, writing_score) |>
  slice(1:5)
# A tibble: 5 × 3
  math_score reading_score writing_score
       <dbl>         <dbl>         <dbl>
1         72            72            74
2         69            90            88
3         90            95            93
4         47            57            44
5         76            78            75

Note: You can also use functions like head() and tail() to look at the first or last rows of the data!
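As a sketch of that note, head() and tail() can sit at the end of a pipeline in the same place as slice(). The tibble `toy_exams` is hypothetical, with made-up values.

```r
library(dplyr)

# Hypothetical toy data; values are made up for illustration.
toy_exams <- tibble(
  student = 1:6,
  math_score = c(72, 69, 90, 47, 76, 71)
)

# head(n) keeps the first n rows.
first_two <- toy_exams |>
  select(math_score) |>
  head(2)

# tail(n) keeps the last n rows.
last_two <- toy_exams |>
  select(math_score) |>
  tail(2)

first_two
last_two
```

slice(1:2) would return the same rows as head(2) here; which one to use is mostly a matter of taste.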

group_by() groups the data so that calculations are performed separately for each value of a variable.

summarise() collapses each group into a single row.

Now, group students by their parental level of education and calculate the mean math score for each group. Arrange the results in descending order.

studentexams |>
  group_by(parental_level_of_education) |>
  summarise(mean_math = mean(math_score)) |>
  arrange(desc(mean_math))
# A tibble: 6 × 2
  parental_level_of_education mean_math
  <chr>                           <dbl>
1 master's degree                  69.7
2 bachelor's degree                69.4
3 associate's degree               67.9
4 some college                     67.1
5 some high school                 63.5
6 high school                      62.1
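A group mean is easier to interpret alongside the group size. The sketch below extends the same pattern with n(), which counts the rows in each group. It uses a hypothetical `toy_exams` tibble with made-up values, so the numbers do not correspond to the table above.

```r
library(dplyr)

# Hypothetical toy data; values are made up for illustration.
toy_exams <- tibble(
  parental_level_of_education = c(
    "high school", "high school",
    "some college", "some college", "some college"
  ),
  math_score = c(60, 70, 80, 90, 70)
)

toy_summary <- toy_exams |>
  group_by(parental_level_of_education) |>
  summarise(
    n_students = n(),              # number of rows in each group
    mean_math = mean(math_score)   # mean math score in each group
  ) |>
  arrange(desc(mean_math))         # highest mean first

toy_summary
```

In this toy data, "some college" comes first with a mean of 80 across 3 students, followed by "high school" with a mean of 65 across 2 students.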

Your turn!

Ask a question about these data and answer it. Create appropriate plots to help answer your question.
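One possible starting point, offered only as a sketch: ask whether math scores differ between students who completed the test preparation course and those who did not, then compare the two groups with a boxplot. The data frame `toy_exams` and its column `test_prep` are hypothetical with made-up values. With the real data, the column `test preparation course` would need backticks or a rename first.

```r
library(ggplot2)

# Hypothetical toy data; values are made up for illustration.
toy_exams <- data.frame(
  test_prep = c("none", "completed", "none", "completed", "none", "completed"),
  math_score = c(62, 75, 58, 81, 70, 68)
)

# One box per group makes the two score distributions easy to compare.
p <- ggplot(toy_exams, aes(x = test_prep, y = math_score)) +
  geom_boxplot() +
  labs(
    x = "Test preparation course",
    y = "Math score",
    title = "Math scores by test preparation status"
  )

p
```

A plot like this suggests an answer but does not establish one, so it is worth pairing it with group summaries from group_by() and summarise().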