Objectives
💡 You can apply the 12 rules of rectangular data to organize research datasets effectively.
💡 You understand the principles of tidy data and can identify when data meets tidy data criteria.
💡 You can convert between wide and long data formats.
💡 You can implement data validation techniques to detect and prevent common data entry errors.
💡 You can apply best practices for file naming and data organization in research projects.
💡 You can identify and fix problems such as empty cells, inconsistent formatting, and mixed data types.
💡 You understand the importance of data dictionaries and can create them for your datasets.
Exercises
Exercise 1: Identifying rectangular data rules
- Look at the following dataset examples and identify which rules of rectangular data are being violated.
- For each violation, explain how you would fix it.
Dataset A
  ID Age Gender Measurement1 Measurement2
1 1  25  Male   15.2 cm      Normal
2 2  30  F      18.7         High (above threshold)
3 3  NA  female 12.1
4 4  28  Male   14.9         Low
5 5  32  M      16.8 cm      Normal
6 6  29  Female 13.4         Low (recheck)
Dataset B
  Participant Day1_Sleep Day1_Exercise Day2_Sleep Day2_Exercise Notes
1 P001        7.5        Yes           6.0        No            Sick on day 2
2 P002        8.0        No            8.5        Yes
3 P003        6.5        Yes           7.0        Yes           Traveled on day 1
4 P004        7.2        No            NA         No            Missing sleep data
5 P005        6.8        Yes           7.8        Yes
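One possible way to restructure Dataset B is sketched below. It assumes the intended fix is one row per participant per day, and it leaves out the free-text Notes column, which would need its own handling. The object names `dataset_b` and `dataset_b_long` are illustrative and not part of the exercise.

```r
library(tidyr)

# Dataset B re-entered by hand (Notes column omitted for brevity)
dataset_b <- data.frame(
  Participant   = c("P001", "P002", "P003", "P004", "P005"),
  Day1_Sleep    = c(7.5, 8.0, 6.5, 7.2, 6.8),
  Day1_Exercise = c("Yes", "No", "Yes", "No", "Yes"),
  Day2_Sleep    = c(6.0, 8.5, 7.0, NA, 7.8),
  Day2_Exercise = c("No", "Yes", "Yes", "No", "Yes")
)

# Split each column name at "_": the first part becomes the `day` column,
# the second part (".value") becomes the name of a value column.
dataset_b_long <- pivot_longer(
  dataset_b,
  cols      = -Participant,
  names_to  = c("day", ".value"),
  names_sep = "_"
)

print(dataset_b_long)
```

The result has one `Sleep` and one `Exercise` column, so the day is stored as data rather than inside the column names.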
Exercise 2: Wide vs. long data conversion
Given the following wide-format dataset about students’ test scores:
  student_id math_score science_score english_score
1 S001       85         90            82
2 S002       92         88            90
3 S003       78         95            80
4 S004       88         85            87
5 S005       79         92            85
- Explain what this dataset would look like in long format. What are the names of the columns?
- Convert this dataset to long format using pivot_longer() in R.
- Convert it back to wide format using pivot_wider() in R.
- Explain when you would prefer wide format vs. long format for this type of data.
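The round trip between the two formats could be sketched as follows. The column names `subject` and `score` and the object names are choices made for this sketch, not names required by the exercise.

```r
library(tidyr)

# The wide dataset from the exercise, re-entered by hand
scores_wide <- data.frame(
  student_id    = c("S001", "S002", "S003", "S004", "S005"),
  math_score    = c(85, 92, 78, 88, 79),
  science_score = c(90, 88, 95, 85, 92),
  english_score = c(82, 90, 80, 87, 85)
)

# Wide to long: one row per student and subject
scores_long <- pivot_longer(
  scores_wide,
  cols      = -student_id,
  names_to  = "subject",
  values_to = "score"
)

# Long back to wide: one column per subject again
scores_wide_again <- pivot_wider(
  scores_long,
  names_from  = subject,
  values_from = score
)

print(scores_long)
print(scores_wide_again)
```

In this sketch the long table has three columns (`student_id`, `subject`, `score`) and one row for each student and subject combination.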
Exercise 3: Data validation with assertr in R
You have collected reaction time data from a psychological experiment. Write validation rules using the assertr package in R to check the following dataset:
  participant condition reaction_time accuracy
1 P001        control   450           0.95
2 P002        treatment 380           0.88
3 P003        control   520           0.92
4 P004        treatment 25            0.85
5 P005        control   490           1.20
6 P006        treatment 2500          0.90
7 P007        invalid   420           0.89
In R, write verify() statements to check that:
- All reaction times are between 200 and 2000 milliseconds
- All accuracy values are between 0 and 1
- All participant IDs start with “P”
- The condition column only contains “control” or “treatment”
Test your validation rules with the sample dataset. What errors do you find?
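A minimal sketch of the four checks is shown below. By default `verify()` stops at the first failed check, so this sketch passes assertr's `success_logical` and `error_logical` functions to get a TRUE or FALSE back for each rule instead. The object names `reaction_data` and `results` are illustrative.

```r
library(assertr)

# The sample dataset from the exercise, re-entered by hand
reaction_data <- data.frame(
  participant   = c("P001", "P002", "P003", "P004", "P005", "P006", "P007"),
  condition     = c("control", "treatment", "control", "treatment",
                    "control", "treatment", "invalid"),
  reaction_time = c(450, 380, 520, 25, 490, 2500, 420),
  accuracy      = c(0.95, 0.88, 0.92, 0.85, 1.20, 0.90, 0.89)
)

# Each verify() call returns TRUE if every row passes and FALSE otherwise
results <- c(
  reaction_time = verify(reaction_data,
                         reaction_time >= 200 & reaction_time <= 2000,
                         success_fun = success_logical, error_fun = error_logical),
  accuracy      = verify(reaction_data,
                         accuracy >= 0 & accuracy <= 1,
                         success_fun = success_logical, error_fun = error_logical),
  participant   = verify(reaction_data,
                         grepl("^P", participant),
                         success_fun = success_logical, error_fun = error_logical),
  condition     = verify(reaction_data,
                         condition %in% c("control", "treatment"),
                         success_fun = success_logical, error_fun = error_logical)
)

print(results)
```

Dropping the `success_fun` and `error_fun` arguments restores the default behaviour, where a failed check prints the offending rows and halts the pipeline. That is usually what you want in an analysis script.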
Exercise 4: Tidy data principles
Examine the following dataset and determine whether it follows tidy data principles:
  Subject Time_Point Heart_Rate_Baseline Heart_Rate_Stress Blood_Pressure_Sys Blood_Pressure_Dia
1 1       Week1      72                  85                120                80
2 1       Week2      70                  88                118                78
3 2       Week1      68                  82                125                82
4 2       Week2      69                  84                122                79
5 3       Week1      74                  90                115                75
6 3       Week2      73                  87                117                77
- Is this dataset tidy? If not, what principles are violated?
- Reorganize this dataset to make it tidy. You may need to create multiple datasets if necessary.
- Explain your reasoning for the restructuring you chose.
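One possible restructuring is sketched below; other answers are defensible. It assumes that Baseline and Stress are two levels of a measurement condition that currently sit in the column names, and that blood pressure is recorded once per subject and time point. The names `vitals`, `heart_rate`, `blood_pressure`, and `condition` are chosen for this sketch.

```r
library(tidyr)

# The dataset from the exercise, re-entered by hand
vitals <- data.frame(
  Subject             = c(1, 1, 2, 2, 3, 3),
  Time_Point          = c("Week1", "Week2", "Week1", "Week2", "Week1", "Week2"),
  Heart_Rate_Baseline = c(72, 70, 68, 69, 74, 73),
  Heart_Rate_Stress   = c(85, 88, 82, 84, 90, 87),
  Blood_Pressure_Sys  = c(120, 118, 125, 122, 115, 117),
  Blood_Pressure_Dia  = c(80, 78, 82, 79, 75, 77)
)

# Table 1: one row per subject, time point, and condition
heart_rate <- pivot_longer(
  vitals[, c("Subject", "Time_Point", "Heart_Rate_Baseline", "Heart_Rate_Stress")],
  cols         = c(Heart_Rate_Baseline, Heart_Rate_Stress),
  names_to     = "condition",
  names_prefix = "Heart_Rate_",   # strip the prefix so only Baseline/Stress remains
  values_to    = "heart_rate"
)

# Table 2: one row per subject and time point
blood_pressure <- vitals[, c("Subject", "Time_Point",
                             "Blood_Pressure_Sys", "Blood_Pressure_Dia")]

print(heart_rate)
print(blood_pressure)
```

The two tables share the keys `Subject` and `Time_Point`, so they can be joined again when an analysis needs both measurements.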
Exercise 5: Data organization best practices
You are starting a new research project studying the effects of different teaching methods on student performance. You plan to collect data from 3 schools, with 2 different teaching methods, over 4 time points, measuring math and reading scores.
Design the structure for your main dataset following rectangular data principles. Include:
- Appropriate variable names
- A clear layout (wide or long format - justify your choice)
- How you would handle missing data
- What data validation rules you would implement
Create a small example dataset (5-6 rows) that demonstrates your design.
Write a brief data dictionary for your variables.
Exercise 6: File naming and organization
You have the following research files that need to be renamed according to best practices:
final_results.xlsx
data backup.csv
Results-UPDATED-2024.csv
Math scores (corrected).xlsx
science_scores_final_FINAL.csv
reading data - Jan 15.txt
- Propose new file names that follow consistent naming conventions.
- Explain the naming convention you chose and why.
- How would you organize these files in a folder structure?
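As a starting point, the purely mechanical part of the renaming could be sketched as below. The helper `clean_file_name()` is hypothetical and only enforces lowercase letters, digits, and underscores. It does not resolve ambiguous labels such as "final" or add dates and version numbers, which still need a decision by you.

```r
old_names <- c(
  "final_results.xlsx",
  "data backup.csv",
  "Results-UPDATED-2024.csv",
  "Math scores (corrected).xlsx",
  "science_scores_final_FINAL.csv",
  "reading data - Jan 15.txt"
)

# Hypothetical helper: normalise the part before the extension
clean_file_name <- function(x) {
  ext  <- tools::file_ext(x)
  stem <- tolower(tools::file_path_sans_ext(x))
  stem <- gsub("[^a-z0-9]+", "_", stem)  # runs of other characters become one underscore
  stem <- gsub("^_+|_+$", "", stem)      # drop leading and trailing underscores
  paste0(stem, ".", ext)
}

new_names <- clean_file_name(old_names)
print(data.frame(old = old_names, new = new_names))

# To rename files on disk you could then call, for example:
# file.rename(old_names, new_names)
```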
Exercise 7: Fixing common data problems
The following dataset contains several common data entry errors. Identify and fix them:
##   ID Age         Gender Score Date        Notes
## 1 1  25          Male   85.5  2024-01-15  Normal
## 2 2  30          F      92.0  15/01/2024
## 3 3  Twenty-five female 78.2  Jan 15 2024 Outlier - check data
## 4 4  28          M      88.7  2024-01-17  Normal
## 5 5  35          Female 95.5* 2024/01/18  Equipment malfunction
## 6 6              male         2024-01-20  Missing data
## 7 7  29          FEMALE 102.3 2024-01-21  Score above range
- List all the problems you can identify in this dataset.
- Clean the dataset by fixing these issues.
- Write validation rules that would catch these types of errors in the future.
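A base R sketch of the cleaning step is shown below. It rests on several assumptions that you should check against your own reading of the data: scores are valid between 0 and 100, "15/01/2024" is day/month/year, and the asterisk in "95.5*" is a marker rather than part of the value. The Notes column is left out, and the helper `parse_mixed_dates()` is hypothetical. Parsing "Jan 15 2024" depends on an English locale.

```r
# The dataset from the exercise, re-entered as text (Notes omitted)
messy <- data.frame(
  ID     = 1:7,
  Age    = c("25", "30", "Twenty-five", "28", "35", "", "29"),
  Gender = c("Male", "F", "female", "M", "Female", "male", "FEMALE"),
  Score  = c("85.5", "92.0", "78.2", "88.7", "95.5*", "", "102.3"),
  Date   = c("2024-01-15", "15/01/2024", "Jan 15 2024", "2024-01-17",
             "2024/01/18", "2024-01-20", "2024-01-21"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: try each date format in turn on the still-unparsed values
parse_mixed_dates <- function(x, formats = c("%Y-%m-%d", "%d/%m/%Y",
                                             "%b %d %Y", "%Y/%m/%d")) {
  out <- rep(as.Date(NA), length(x))
  for (fmt in formats) {
    todo <- is.na(out)
    out[todo] <- as.Date(x[todo], format = fmt)
  }
  out
}

clean <- data.frame(id = messy$ID)

# Age: recode the written-out number by hand, treat empty cells as missing
age_text <- messy$Age
age_text[age_text == "Twenty-five"] <- "25"
age_text[age_text == ""] <- NA
clean$age <- as.numeric(age_text)

# Gender: map every spelling to one of two lowercase labels via its first letter
first_letter <- substr(tolower(messy$Gender), 1, 1)
clean$gender <- ifelse(first_letter == "m", "male",
                       ifelse(first_letter == "f", "female", NA))

# Score: strip non-numeric markers, then flag (not change) out-of-range values
score_text <- gsub("[^0-9.]", "", messy$Score)
score_text[score_text == ""] <- NA
clean$score <- as.numeric(score_text)
clean$score_in_range <- clean$score >= 0 & clean$score <= 100  # assumed range

# Date: convert all formats to Date (printed as ISO 8601)
clean$date <- parse_mixed_dates(messy$Date)

print(clean)
```

The out-of-range score is flagged rather than corrected, because the table alone does not say what the true value should be.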
How to use these exercises
- Each exercise can be completed independently, but some build on concepts from earlier ones, so working through them in order is recommended
- Make sure you have the required R packages installed before starting
- Sample data files are located in the data/ subdirectory
Required packages
install.packages(c("tidyr", "dplyr", "assertr", "readr"))
Slides
To export the slides to PDF, do the following:
- Toggle into Print View using the E key (or using the Navigation Menu).
- Open the in-browser print dialog (CTRL/CMD+P).
- Change the Destination setting to Save as PDF.
- Change the Layout to Landscape.
- Change the Margins to None.
- Enable the Background graphics option.
- Click Save.
Note: This feature has been confirmed to work in Google Chrome, Chromium as well as in Firefox.
These instructions were copied from the Quarto documentation (MIT License) and slightly modified.
Resources
- dplyr documentation
- tidyr documentation
- assertr package vignette
- Broman & Woo (2018) “Data Organization in Spreadsheets”