Best practices for rectangular data

Starts at:

Thursday, 13:30

Objectives

💡 You can apply the 12 rules of rectangular data to organize research datasets effectively.
💡 You understand the principles of tidy data and can identify when data meets tidy data criteria.
💡 You can convert between wide and long data formats.
💡 You can implement data validation techniques to detect and prevent common data entry errors.
💡 You can apply best practices for file naming and data organization in research projects.
💡 You can identify and fix problems such as empty cells, inconsistent formatting, and mixed data types.
💡 You understand the importance of data dictionaries and can create them for your datasets.

Exercises

Exercise 1: Identifying rectangular data rules

Look at the following dataset examples and identify which rules of rectangular data are being violated.
For each violation, explain how you would fix it.

Dataset A

Download the data

  ID Age Gender Measurement1           Measurement2
1  1  25   Male      15.2 cm                 Normal
2  2  30      F         18.7 High (above threshold)
3  3  NA female         12.1                       
4  4  28   Male         14.9                    Low
5  5  32      M      16.8 cm                 Normal
6  6  29 Female         13.4          Low (recheck)
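
As a rough illustration of how two of the problems in Dataset A could be repaired, the base R sketch below strips the unit from Measurement1 and maps the Gender spellings onto one code. It assumes that every measurement is in centimetres whether or not the "cm" suffix is present; the names `Measurement1_cm` and `gender_lookup` are illustrative choices, not part of the exercise material.

```r
# Dataset A as printed above (only the columns used in this sketch).
dataset_a <- data.frame(
  ID = 1:6,
  Gender = c("Male", "F", "female", "Male", "M", "Female"),
  Measurement1 = c("15.2 cm", "18.7", "12.1", "14.9", "16.8 cm", "13.4"),
  stringsAsFactors = FALSE
)

# Put the unit in the column name and keep only the number in the cell.
# Assumption: all values are centimetres, with or without the "cm" suffix.
dataset_a$Measurement1_cm <- as.numeric(sub("\\s*cm$", "", dataset_a$Measurement1))
dataset_a$Measurement1 <- NULL

# Map every spelling of a category onto a single code.
gender_lookup <- c(m = "male", male = "male", f = "female", female = "female")
dataset_a$Gender <- unname(gender_lookup[tolower(dataset_a$Gender)])

dataset_a
```

A spelling that is missing from the lookup would come back as NA, which makes unexpected codes easy to spot.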

Dataset B

Download the data

  Participant Day1_Sleep Day1_Exercise Day2_Sleep Day2_Exercise              Notes
1        P001        7.5           Yes        6.0            No      Sick on day 2
2        P002        8.0            No        8.5           Yes
3        P003        6.5           Yes        7.0           Yes  Traveled on day 1
4        P004        7.2            No         NA            No Missing sleep data
5        P005        6.8           Yes        7.8           Yes

Exercise 2: Wide vs. long data conversion

Download the data

Given the following wide-format dataset about students’ test scores:

  student_id math_score science_score english_score
1       S001         85            90            82
2       S002         92            88            90
3       S003         78            95            80
4       S004         88            85            87
5       S005         79            92            85

Explain what this dataset would look like in long format. What are the names of the columns?
Convert this dataset to long format using pivot_longer() in R.
Convert it back to wide format using pivot_wider() in R.
Explain when you would prefer wide format vs. long format for this type of data.
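
One possible starting point for the two conversions is sketched below with tidyr. The object names (`scores_wide`, `scores_long`, `scores_back`) and the new column names `subject` and `score` are assumptions made for this sketch; other names work just as well.

```r
library(tidyr)

# The wide dataset as printed above.
scores_wide <- data.frame(
  student_id = c("S001", "S002", "S003", "S004", "S005"),
  math_score = c(85, 92, 78, 88, 79),
  science_score = c(90, 88, 95, 85, 92),
  english_score = c(82, 90, 80, 87, 85),
  stringsAsFactors = FALSE
)

# Wide to long: one row per student and subject.
scores_long <- pivot_longer(
  scores_wide,
  cols = -student_id,     # every column except the identifier
  names_to = "subject",   # old column names go here
  values_to = "score"     # cell values go here
)

# Long to wide: one column per subject again.
scores_back <- pivot_wider(
  scores_long,
  names_from = subject,
  values_from = score
)
```

In this sketch the `subject` column keeps the full old column names (for example `math_score`); removing the `_score` suffix is left as a further step.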

Exercise 3: Data validation with assertr in R

Download the data

You have collected reaction time data from a psychological experiment. Write validation rules using the assertr package in R to check the conditions listed below the data:

  participant condition reaction_time accuracy
1        P001   control           450     0.95
2        P002 treatment           380     0.88
3        P003   control           520     0.92
4        P004 treatment            25     0.85
5        P005   control           490     1.20
6        P006 treatment          2500     0.90
7        P007   invalid           420     0.89

In R, write verify() statements to check that:

All reaction times are between 200 and 2000 milliseconds
All accuracy values are between 0 and 1
All participant IDs start with “P”
The condition column only contains “control” or “treatment”

Test your validation rules with the sample dataset. What errors do you find?
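
The four checks could be written along the following lines. This is only a sketch: by default verify() stops with an error at the first failing rule, so the code wraps each rule in a small helper, `rule_passes()`, which is a hypothetical convenience function and not part of assertr. The object name `reaction_data` is likewise an assumption.

```r
library(assertr)

# The sample dataset as printed above.
reaction_data <- data.frame(
  participant = c("P001", "P002", "P003", "P004", "P005", "P006", "P007"),
  condition = c("control", "treatment", "control", "treatment",
                "control", "treatment", "invalid"),
  reaction_time = c(450, 380, 520, 25, 490, 2500, 420),
  accuracy = c(0.95, 0.88, 0.92, 0.85, 1.20, 0.90, 0.89),
  stringsAsFactors = FALSE
)

# Hypothetical helper: run one rule and return TRUE if it passed,
# FALSE if verify() raised an error.
rule_passes <- function(rule) {
  tryCatch({
    rule(reaction_data)
    TRUE
  }, error = function(e) FALSE)
}

results <- c(
  reaction_time_range = rule_passes(function(d)
    verify(d, reaction_time >= 200 & reaction_time <= 2000)),
  accuracy_range = rule_passes(function(d)
    verify(d, accuracy >= 0 & accuracy <= 1)),
  participant_prefix = rule_passes(function(d)
    verify(d, startsWith(participant, "P"))),
  condition_levels = rule_passes(function(d)
    verify(d, condition %in% c("control", "treatment")))
)

print(results)
```

Calling verify() directly, without the helper, prints which rows violate a rule before stopping, which is more informative while you are exploring the data.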

Exercise 4: Tidy data principles

Download the data

Examine the following dataset and determine whether it follows tidy data principles:

  Subject Time_Point Heart_Rate_Baseline Heart_Rate_Stress Blood_Pressure_Sys Blood_Pressure_Dia
1       1      Week1                  72                85                120                 80
2       1      Week2                  70                88                118                 78
3       2      Week1                  68                82                125                 82
4       2      Week2                  69                84                122                 79
5       3      Week1                  74                90                115                 75
6       3      Week2                  73                87                117                 77

Is this dataset tidy? If not, what principles are violated?
Reorganize this dataset to make it tidy. You may need to create multiple datasets if necessary.
Explain your reasoning for the restructuring you chose.

Exercise 5: Data organization best practices

You are starting a new research project studying the effects of different teaching methods on student performance. You plan to collect data from 3 schools, with 2 different teaching methods, over 4 time points, measuring math and reading scores.

Design the structure for your main dataset following rectangular data principles. Include:
- Appropriate variable names
- A clear layout (wide or long format - justify your choice)
- How you would handle missing data
- What data validation rules you would implement
Create a small example dataset (5-6 rows) that demonstrates your design.
Write a brief data dictionary for your variables.

Exercise 6: File naming and organization

You have the following research files that need to be renamed according to best practices:

final_results.xlsx
data backup.csv
Results-UPDATED-2024.csv
Math scores (corrected).xlsx
science_scores_final_FINAL.csv
reading data - Jan 15.txt

Propose new file names that follow consistent naming conventions.
Explain the naming convention you chose and why.
How would you organize these files in a folder structure?
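
The mechanical part of a renaming scheme can be scripted. The sketch below assumes one common convention (lowercase letters, digits, and underscores only); `to_safe_name()` is a hypothetical helper written for this illustration. It does not decide what a file should be called, for example how to replace labels such as "final" with dates or version numbers.

```r
# Hypothetical helper: lowercase the name and replace every run of
# characters other than letters and digits with a single underscore.
to_safe_name <- function(path) {
  ext <- tolower(tools::file_ext(path))
  stem <- tolower(tools::file_path_sans_ext(path))
  stem <- gsub("[^a-z0-9]+", "_", stem)  # spaces, brackets, dashes, ...
  stem <- gsub("^_+|_+$", "", stem)      # no leading or trailing underscore
  paste0(stem, ".", ext)
}

old_names <- c(
  "data backup.csv",
  "Math scores (corrected).xlsx",
  "reading data - Jan 15.txt"
)

data.frame(old = old_names, new = to_safe_name(old_names))
```

To apply such a mapping to real files, the result could be passed to file.rename(), ideally after checking that no two files end up with the same new name.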

Exercise 7: Fixing common data problems

Download the data

The following dataset contains several common data entry errors. Identify and fix them:

##   ID         Age Gender Score        Date                 Notes
## 1  1          25   Male  85.5  2024-01-15                Normal
## 2  2          30      F  92.0  15/01/2024                      
## 3  3 Twenty-five female  78.2 Jan 15 2024  Outlier - check data
## 4  4          28      M  88.7  2024-01-17                Normal
## 5  5          35 Female 95.5*  2024/01/18 Equipment malfunction
## 6  6               male        2024-01-20          Missing data
## 7  7          29 FEMALE 102.3  2024-01-21     Score above range

List all the problems you can identify in this dataset.
Clean the dataset by fixing these issues.
Write validation rules that would catch these types of errors in the future.
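
Two of the problems in this dataset, mixed date formats and non-numeric entries in a numeric column, could be approached as in the base R sketch below. It assumes that "15/01/2024" is day-first and that month abbreviations such as "Jan" are parsed in an English locale; `parse_mixed_dates()` is a hypothetical helper, not an existing function.

```r
# Date and Score columns as printed above.
raw_dates <- c("2024-01-15", "15/01/2024", "Jan 15 2024", "2024-01-17",
               "2024/01/18", "2024-01-20", "2024-01-21")
raw_score <- c("85.5", "92.0", "78.2", "88.7", "95.5*", "", "102.3")

# Hypothetical helper: try each format in turn, but only on the entries
# that no earlier format could parse.
parse_mixed_dates <- function(x, formats) {
  parsed <- rep(as.Date(NA), length(x))
  for (fmt in formats) {
    todo <- is.na(parsed)
    parsed[todo] <- as.Date(x[todo], format = fmt)
  }
  parsed
}

# Assumption: slashes with the year last mean day/month/year.
clean_dates <- parse_mixed_dates(
  raw_dates,
  formats = c("%Y-%m-%d", "%d/%m/%Y", "%b %d %Y", "%Y/%m/%d")
)

# Anything that is not a plain number becomes NA instead of being guessed.
score <- suppressWarnings(as.numeric(raw_score))

# Entries that contained text but could not be converted need a manual look.
needs_review <- is.na(score) & raw_score != ""
```

Converting unparseable entries to NA and flagging them, rather than silently correcting them, keeps the decision about values such as "95.5*" with the person who knows the data.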

How to use these exercises

Each exercise can be completed on its own
Later exercises revisit ideas from earlier ones, so working through them in order is recommended
Make sure you have the required R packages installed before starting
Sample data files are located in the data/ subdirectory

Required packages

install.packages(c("tidyr", "dplyr", "assertr", "readr"))
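
To see which of these packages are still missing before installing anything, a check along the following lines may help. It is a minimal sketch; the variable names are arbitrary.

```r
required <- c("tidyr", "dplyr", "assertr", "readr")

# requireNamespace() returns TRUE if a package can be loaded, FALSE otherwise.
is_installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)

missing_packages <- required[!is_installed]
if (length(missing_packages) > 0) {
  message("Not installed yet: ", paste(missing_packages, collapse = ", "))
} else {
  message("All required packages are installed.")
}
```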

Slides

How can I download the slides as a PDF file?

To export the slides to PDF, do the following:

Toggle into Print View using the E key (or using the Navigation Menu).
Open the in-browser print dialog (CTRL/CMD+P).
Change the Destination setting to Save as PDF.
Change the Layout to Landscape.
Change the Margins to None.
Enable the Background graphics option.
Click Save.

Note: This feature has been confirmed to work in Google Chrome, Chromium, and Firefox.

These instructions were copied from the Quarto documentation (MIT License) and slightly modified.

Resources

dplyr documentation
tidyr documentation
assertr package vignette
Broman & Woo (2018) “Data Organization in Spreadsheets”

References

Broman, K. W., & Woo, K. H. (2018). Data organization in spreadsheets. The American Statistician, 72(1), 2–10. https://doi.org/10.1080/00031305.2017.1375989