Tidying data

Lecture 7

Dr. Elijah Meyer + Konnie Huang

Duke University
STA 199 - Fall 2022

September 19, 2022

Checklist

  • Open your ae-06 project in RStudio (that you already started on Tuesday), render your document, and commit and push.

  • Any questions from prepare materials? Go to slido. You can also upvote others’ questions.

  • Lab 02 Due Tonight.

  • Start Early. Render Often. Ask Questions.

  • Groups are coming after Exam 1.

Goals:

– Understand pivot_longer

– ggplot practice

– Practice re-creating graphs

– “New” functions: if_else / scale_x_continuous

Warm Up: Tidying datasets

What makes a dataset “tidy”?

Tidy Data

There are three interrelated rules that make a dataset tidy:

  • Each variable is a column; each column is a variable.

  • Each observation is a row; each row is an observation.

  • Each value is a cell; each cell is a single value.
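
As a small illustration of these rules, the sketch below builds a made-up wide table in which the year columns hold values of a variable rather than being variables themselves, and then reshapes it with pivot_longer(). The data and the names `wide`, `tidy`, `country`, `year`, and `cases` are hypothetical and chosen only for this example; they are not taken from ae-06.

```r
library(tidyr)
library(tibble)

# Hypothetical data: one row per country, one column per year.
# The column names "2019" and "2020" are really values of a "year"
# variable, so this table breaks "each variable is a column".
wide <- tibble(
  country = c("A", "B"),
  `2019`  = c(10, 20),
  `2020`  = c(11, 21)
)

# Tidy version: one row per country-year observation,
# with year and cases each in their own column.
tidy <- wide |>
  pivot_longer(
    cols      = c(`2019`, `2020`),
    names_to  = "year",
    values_to = "cases"
  )

tidy
```

After pivoting, every row is a single country-year observation and every column is a single variable.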

Application exercise

ae-06

  • Go to the course GitHub org and find your ae-06 (repo name will be suffixed with your GitHub name).
  • Clone the repo in your container, open the Quarto document in the repo, and follow along and complete the exercises.
  • Render, commit, and push your edits by the AE deadline – 3 days from today.

Recap of AE

  • When pivoting longer, variable names that turn into values are characters by default. If you need them in another format, you must make that transformation explicitly, which you can do within the pivot_longer() function (for example, with its names_transform argument).

  • You can tweak a plot forever, but at some point the tweaks are likely not very productive. However, you should always be critical of defaults (however pretty they might be) and see if you can improve the plot to better portray your data / results / what you want to communicate.

  • pivot_wider() makes datasets wider by increasing the number of columns and reducing the number of rows. pivot_wider() has the opposite interface to pivot_longer(): we need to provide the existing column that holds the values (values_from) and the existing column that holds the new column names (names_from).
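
The two pivoting points in this recap can be sketched with a made-up example. The data and the names `scores_wide`, `scores_long`, `scores_back`, `student`, `week`, and `score` are hypothetical, and this is only one way to do the type conversion; it assumes a tidyr version that supports the names_transform argument.

```r
library(tidyr)
library(tibble)

# Hypothetical data: one column per week of quiz scores.
scores_wide <- tibble(
  student = c("s1", "s2"),
  `1`     = c(80, 90),
  `2`     = c(85, 95)
)

# Pivot longer. Without names_transform, `week` would be character
# ("1", "2"); here it is converted to integer inside pivot_longer().
scores_long <- scores_wide |>
  pivot_longer(
    cols            = -student,
    names_to        = "week",
    values_to       = "score",
    names_transform = list(week = as.integer)
  )

# Pivot wider is the reverse: names_from gives the column whose values
# become the new column names, values_from gives the column whose values
# fill the cells.
scores_back <- scores_long |>
  pivot_wider(names_from = week, values_from = score)

scores_long
scores_back
```

Here `scores_back` has the same shape as `scores_wide`, which shows that the two functions undo each other for this simple table.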