Best practices for rectangular data

Starts at: Thursday, 13:30

Objectives

💡 You can apply the 12 rules of rectangular data to organize research datasets effectively.
💡 You understand the principles of tidy data and can identify when data meets tidy data criteria.
💡 You can convert between wide and long data formats.
💡 You can implement data validation techniques to detect and prevent common data entry errors.
💡 You can apply best practices for file naming and data organization in research projects.
💡 You can identify and fix problems such as empty cells, inconsistent formatting, and mixed data types.
💡 You understand the importance of data dictionaries and can create them for your datasets.

Exercises

Exercise 1: Identifying rectangular data rules

  1. Look at the following dataset examples and identify which rules of rectangular data are being violated.
  2. For each violation, explain how you would fix it.

Dataset A

Download the data

  ID Age Gender Measurement1           Measurement2
1  1  25   Male      15.2 cm                 Normal
2  2  30      F         18.7 High (above threshold)
3  3  NA female         12.1                       
4  4  28   Male         14.9                    Low
5  5  32      M      16.8 cm                 Normal
6  6  29 Female         13.4          Low (recheck)

Dataset B

Download the data

  Participant Day1_Sleep Day1_Exercise Day2_Sleep Day2_Exercise              Notes
1        P001        7.5           Yes        6.0            No      Sick on day 2
2        P002        8.0            No        8.5           Yes
3        P003        6.5           Yes        7.0           Yes  Traveled on day 1
4        P004        7.2            No         NA            No Missing sleep data
5        P005        6.8           Yes        7.8           Yes
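
As one possible starting point for Dataset A, the sketch below shows how three of its problems could be repaired in base R: units mixed into values, inconsistent category codes, and comments mixed into a categorical column. The data frame is retyped by hand from the table above, the new column names are hypothetical, and the code assumes that every value in Measurement1 is in centimetres, including the rows that carry no unit.

```r
# Dataset A, retyped from the table above (the empty cell becomes NA).
dataset_a <- data.frame(
  ID = 1:6,
  Age = c(25, 30, NA, 28, 32, 29),
  Gender = c("Male", "F", "female", "Male", "M", "Female"),
  Measurement1 = c("15.2 cm", "18.7", "12.1", "14.9", "16.8 cm", "13.4"),
  Measurement2 = c("Normal", "High (above threshold)", NA, "Low",
                   "Normal", "Low (recheck)"),
  stringsAsFactors = FALSE
)

# Keep only the number in the cell and move the unit into the column name.
# Assumption: rows without a unit were also measured in cm.
dataset_a$measurement1_cm <- as.numeric(sub(" *cm$", "", dataset_a$Measurement1))

# Map every spelling of a category onto one consistent code.
gender_key <- c(m = "male", male = "male", f = "female", female = "female")
dataset_a$gender <- unname(gender_key[tolower(dataset_a$Gender)])

# Separate the category from the free-text comment in parentheses.
dataset_a$measurement2 <- tolower(sub(" *\\(.*\\)$", "", dataset_a$Measurement2))
dataset_a$measurement2_note <- ifelse(
  grepl("\\(", dataset_a$Measurement2),
  sub("^.*\\((.*)\\)$", "\\1", dataset_a$Measurement2),
  NA_character_
)
```

Whether a comment such as "recheck" belongs in its own column or in a separate notes file is a design decision that the exercise leaves open.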

Exercise 2: Wide vs. long data conversion

Download the data

Given the following wide-format dataset about students’ test scores:

  student_id math_score science_score english_score
1       S001         85            90            82
2       S002         92            88            90
3       S003         78            95            80
4       S004         88            85            87
5       S005         79            92            85
  1. Explain what this dataset would look like in long format. What are the names of the columns?
  2. Convert this dataset to long format using pivot_longer() in R.
  3. Convert it back to wide format using pivot_wider() in R.
  4. Explain when you would prefer wide format vs. long format for this type of data.
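
The round trip in steps 2 and 3 could look roughly like the following sketch. The data frame is retyped from the table above, and the names scores_wide, scores_long, subject, and score are illustrative choices rather than names prescribed by the exercise.

```r
library(tidyr)

# Wide format: one row per student, one column per test.
scores_wide <- data.frame(
  student_id = c("S001", "S002", "S003", "S004", "S005"),
  math_score = c(85, 92, 78, 88, 79),
  science_score = c(90, 88, 95, 85, 92),
  english_score = c(82, 90, 80, 87, 85)
)

# Long format: one row per student and test.
scores_long <- pivot_longer(
  scores_wide,
  cols = c(math_score, science_score, english_score),
  names_to = "subject",   # former column names go here
  values_to = "score"     # former cell values go here
)

# Back to wide format: the subject values become column names again.
scores_wide_again <- pivot_wider(
  scores_long,
  names_from = subject,
  values_from = score
)
```

In this sketch the subject column keeps the original column names (for example "math_score"). Stripping the "_score" suffix during the pivot is possible but is left out here to keep the round trip exact.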

Exercise 3: Data validation with assertr in R

Download the data

You have collected the reaction time data shown below from a psychological experiment. Write validation rules using the assertr package in R to check the conditions listed after the data:

  participant condition reaction_time accuracy
1        P001   control           450     0.95
2        P002 treatment           380     0.88
3        P003   control           520     0.92
4        P004 treatment            25     0.85
5        P005   control           490     1.20
6        P006 treatment          2500     0.90
7        P007   invalid           420     0.89

In R, write verify() statements to check that:

  1. All reaction times are between 200 and 2000 milliseconds
  2. All accuracy values are between 0 and 1
  3. All participant IDs start with “P”
  4. The condition column only contains “control” or “treatment”

Test your validation rules with the sample dataset. What errors do you find?
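
A minimal sketch of the four rules is shown below, assuming the data frame is retyped from the table above. Because verify() stops with an error by default when a rule fails, each rule is wrapped in a small hypothetical helper, passes(), that records the outcome instead of halting the script. The helper is not part of assertr.

```r
library(assertr)

rt_data <- data.frame(
  participant = c("P001", "P002", "P003", "P004", "P005", "P006", "P007"),
  condition = c("control", "treatment", "control", "treatment",
                "control", "treatment", "invalid"),
  reaction_time = c(450, 380, 520, 25, 490, 2500, 420),
  accuracy = c(0.95, 0.88, 0.92, 0.85, 1.20, 0.90, 0.89)
)

# Hypothetical helper: TRUE if the rule passes, FALSE if verify() raises an error.
passes <- function(check) {
  tryCatch({
    check
    TRUE
  }, error = function(e) FALSE)
}

results <- c(
  reaction_time = passes(verify(rt_data, reaction_time >= 200 & reaction_time <= 2000)),
  accuracy      = passes(verify(rt_data, accuracy >= 0 & accuracy <= 1)),
  participant   = passes(verify(rt_data, grepl("^P", participant))),
  condition     = passes(verify(rt_data, condition %in% c("control", "treatment")))
)

print(results)
```

In a real pipeline the rules would usually be chained directly on the data without the helper, so that a failing rule stops the analysis before invalid values are used.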

Exercise 4: Tidy data principles

Download the data

Examine the following dataset and determine whether it follows tidy data principles:

  Subject Time_Point Heart_Rate_Baseline Heart_Rate_Stress Blood_Pressure_Sys Blood_Pressure_Dia
1       1      Week1                  72                85                120                 80
2       1      Week2                  70                88                118                 78
3       2      Week1                  68                82                125                 82
4       2      Week2                  69                84                122                 79
5       3      Week1                  74                90                115                 75
6       3      Week2                  73                87                117                 77

  1. Is this dataset tidy? If not, what principles are violated?
  2. Reorganize this dataset to make it tidy. You may need to create multiple datasets if necessary.
  3. Explain your reasoning for the restructuring you chose.
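
One possible restructuring, offered as a sketch and not as the only valid answer, treats the measurement condition (baseline vs. stress) as a variable that is currently hidden in the heart rate column names. The code below moves it into its own column and keeps blood pressure in a separate table. The object names vitals, heart_rate, and blood_pressure and the new column names are illustrative assumptions.

```r
library(tidyr)

vitals <- data.frame(
  Subject = c(1, 1, 2, 2, 3, 3),
  Time_Point = c("Week1", "Week2", "Week1", "Week2", "Week1", "Week2"),
  Heart_Rate_Baseline = c(72, 70, 68, 69, 74, 73),
  Heart_Rate_Stress = c(85, 88, 82, 84, 90, 87),
  Blood_Pressure_Sys = c(120, 118, 125, 122, 115, 117),
  Blood_Pressure_Dia = c(80, 78, 82, 79, 75, 77)
)

# Heart rate: one row per subject, time point, and condition.
heart_rate <- pivot_longer(
  vitals[, c("Subject", "Time_Point", "Heart_Rate_Baseline", "Heart_Rate_Stress")],
  cols = c(Heart_Rate_Baseline, Heart_Rate_Stress),
  names_to = "condition",
  names_prefix = "Heart_Rate_",  # drop the shared prefix from the condition labels
  values_to = "heart_rate"
)

# Blood pressure: one row per subject and time point, with two distinct variables.
blood_pressure <- vitals[, c("Subject", "Time_Point", "Blood_Pressure_Sys", "Blood_Pressure_Dia")]
```

Systolic and diastolic pressure are kept as two columns here on the view that they are different variables measured on the same occasion. Treating them as one variable with a "type" column would also be defensible, which is why step 3 asks for the reasoning.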

Exercise 5: Data organization best practices

You are starting a new research project studying the effects of different teaching methods on student performance. You plan to collect data from 3 schools, with 2 different teaching methods, over 4 time points, measuring math and reading scores.

  1. Design the structure for your main dataset following rectangular data principles. Include:

    • Appropriate variable names
    • A clear layout (wide or long format - justify your choice)
    • How you would handle missing data
    • What data validation rules you would implement
  2. Create a small example dataset (5-6 rows) that demonstrates your design.

  3. Write a brief data dictionary for your variables.
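
To make the three steps concrete, here is a small sketch of what a long-format design with a matching data dictionary might look like in R. Everything in it is hypothetical: the variable names, the identifier schemes, and the score values are made up for illustration and do not come from a real study.

```r
# Hypothetical long-format layout: one row per student, time point, and subject.
# All values are invented for illustration. Missing scores are stored as NA.
scores <- data.frame(
  student_id = c("ST001", "ST001", "ST001", "ST001", "ST002", "ST002"),
  school = c("school_a", "school_a", "school_a", "school_a", "school_b", "school_b"),
  teaching_method = c("method_a", "method_a", "method_a", "method_a", "method_b", "method_b"),
  time_point = c(1, 1, 2, 2, 1, 1),
  subject = c("math", "reading", "math", "reading", "math", "reading"),
  score = c(71, 64, 75, NA, 58, 80)
)

# Data dictionary: one row per variable in the dataset.
dictionary <- data.frame(
  variable = c("student_id", "school", "teaching_method", "time_point", "subject", "score"),
  type = c("character", "character", "character", "integer", "character", "numeric"),
  description = c(
    "Unique student identifier",
    "School the student attends (3 schools)",
    "Teaching method the student received (2 methods)",
    "Measurement occasion, 1 to 4",
    "Tested subject: math or reading",
    "Test score; NA if not recorded"
  )
)

# Simple consistency checks between data, dictionary, and design.
stopifnot(setequal(names(scores), dictionary$variable))
stopifnot(all(scores$time_point %in% 1:4))
stopifnot(all(scores$subject %in% c("math", "reading")))
```

Keeping the dictionary as a data frame means it can be saved next to the data file and checked against the data automatically, as the last three lines do.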

Exercise 6: File naming and organization

You have the following research files that need to be renamed according to best practices:

final_results.xlsx
data backup.csv
Results-UPDATED-2024.csv
Math scores (corrected).xlsx
science_scores_final_FINAL.csv
reading data - Jan 15.txt
  1. Propose new file names that follow consistent naming conventions.
  2. Explain the naming convention you chose and why.
  3. How would you organize these files in a folder structure?
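
The mechanical part of a naming convention can be automated. The sketch below defines a hypothetical helper, clean_name(), that lower-cases a file name and replaces spaces and punctuation with underscores. It only normalizes characters; deciding what the name should say (for example a date in ISO 8601 format or a version label in place of "final") remains a manual choice.

```r
# Hypothetical helper: normalize a file name to lower case with underscores.
clean_name <- function(path) {
  ext <- tolower(tools::file_ext(path))
  stem <- tolower(tools::file_path_sans_ext(path))
  stem <- gsub("[^a-z0-9]+", "_", stem)  # runs of other characters become "_"
  stem <- gsub("^_+|_+$", "", stem)      # no leading or trailing underscores
  paste0(stem, ".", ext)
}

old_names <- c(
  "final_results.xlsx",
  "data backup.csv",
  "Results-UPDATED-2024.csv",
  "Math scores (corrected).xlsx",
  "science_scores_final_FINAL.csv",
  "reading data - Jan 15.txt"
)

new_names <- clean_name(old_names)
print(data.frame(old = old_names, new = new_names))
```

The result could be applied with file.rename(old_names, new_names), ideally after reviewing the printed mapping and working on a copy of the files.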

Exercise 7: Fixing common data problems

Download the data

The following dataset contains several common data entry errors. Identify and fix them:

  ID         Age Gender Score        Date                 Notes
1  1          25   Male  85.5  2024-01-15                Normal
2  2          30      F  92.0  15/01/2024
3  3 Twenty-five female  78.2 Jan 15 2024  Outlier - check data
4  4          28      M  88.7  2024-01-17                Normal
5  5          35 Female 95.5*  2024/01/18 Equipment malfunction
6  6               male        2024-01-20          Missing data
7  7          29 FEMALE 102.3  2024-01-21     Score above range

  1. List all the problems you can identify in this dataset.
  2. Clean the dataset by fixing these issues.
  3. Write validation rules that would catch these types of errors in the future.
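
As a partial illustration of step 2, the sketch below cleans the Age, Score, and Date columns in base R. The data frame is retyped from the table above with every value stored as text, which is how such a column would arrive after import. The helper parse_mixed_date() is hypothetical, and the code assumes that slash-separated dates starting with two digits are day-first and that month abbreviations are in English.

```r
raw <- data.frame(
  ID = 1:7,
  Age = c("25", "30", "Twenty-five", "28", "35", "", "29"),
  Score = c("85.5", "92.0", "78.2", "88.7", "95.5*", "", "102.3"),
  Date = c("2024-01-15", "15/01/2024", "Jan 15 2024", "2024-01-17",
           "2024/01/18", "2024-01-20", "2024-01-21"),
  stringsAsFactors = FALSE
)

# Use the C locale for dates so that "%b" matches English month abbreviations.
invisible(Sys.setlocale("LC_TIME", "C"))

# Hypothetical helper: try each format in turn on the values still unparsed.
parse_mixed_date <- function(x, formats = c("%Y-%m-%d", "%d/%m/%Y", "%b %d %Y", "%Y/%m/%d")) {
  out <- rep(as.Date(NA), length(x))
  for (fmt in formats) {
    todo <- is.na(out)
    out[todo] <- as.Date(x[todo], format = fmt)
  }
  out
}

clean <- data.frame(
  ID = raw$ID,
  # Non-numeric ages (e.g. "Twenty-five") become NA and need manual review.
  age = suppressWarnings(as.integer(raw$Age)),
  # Strip the trailing "*" and keep the flag in a separate column.
  score = as.numeric(sub("\\*$", "", raw$Score)),
  score_flagged = grepl("\\*$", raw$Score),
  date = parse_mixed_date(raw$Date)
)
```

Whether "Twenty-five" should be recoded to 25 or checked against the original records is a judgment call, so the sketch only marks it as missing. The Gender column can be harmonized with a lookup table of accepted spellings.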

How to use these exercises

  • Most exercises are self-contained and can be completed independently
  • A few build on earlier ones, so working through them in order is recommended
  • Make sure you have the required R packages installed before starting
  • Sample data files are located in the data/ subdirectory

Required packages

install.packages(c("tidyr", "dplyr", "assertr", "readr"))
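
To check which of these packages are already available before installing anything, a short sketch like the following could be used. The variable names are illustrative.

```r
required <- c("tidyr", "dplyr", "assertr", "readr")

# requireNamespace() returns FALSE for a package that is not installed.
installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
missing_packages <- required[!installed]

if (length(missing_packages) > 0) {
  message("Not yet installed: ", paste(missing_packages, collapse = ", "))
} else {
  message("All required packages are installed.")
}
```

Any packages reported as missing can then be passed to install.packages().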

Slides

How can I download the slides as a PDF file?

To export the slides to PDF, do the following:

  1. Toggle into Print View using the E key (or using the Navigation Menu).
  2. Open the in-browser print dialog (CTRL/CMD+P).
  3. Change the Destination setting to Save as PDF.
  4. Change the Layout to Landscape.
  5. Change the Margins to None.
  6. Enable the Background graphics option.
  7. Click Save.

Note: This feature has been confirmed to work in Google Chrome, Chromium, and Firefox.

These instructions were copied from the Quarto documentation (MIT License) and slightly modified.

Resources

  • dplyr documentation
  • tidyr documentation
  • assertr package vignette
  • Broman & Woo (2018) “Data Organization in Spreadsheets”

References

Broman, K. W., & Woo, K. H. (2018). Data organization in spreadsheets. The American Statistician, 72(1), 2–10. https://doi.org/10.1080/00031305.2017.1375989.

© 2026 Dr. Lennart Wittkuhn

 

License: CC BY 4.0