Best practices for rectangular data

Research Data Management for Psychology and Neuroscience
Course at University of Hamburg, RTG 2753: Emotional Learning and Memory
Slides | Source
License: CC BY 4.0


Schedule

Day Date Time Title
1 2026-02-05 09:30 - 10:00 Welcome and Introduction to Research Data Management
1 2026-02-05 10:00 - 11:00 Project & Data Organization
1 2026-02-05 11:00 - 12:00 Data Management Plans (DMPs)
1 2026-02-05 12:00 - 13:00 Lunch Break
1 2026-02-05 13:00 - 14:00 Command Line
1 2026-02-05 14:00 - 15:00 Best practices for rectangular data
1 2026-02-05 15:00 - 16:30 Brain Imaging Data Structure (BIDS)

This session: Best practices for rectangular data

Illustration from the Openscapes blog “Tidy Data for reproducibility, efficiency, and collaboration” by Julia Lowndes and Allison Horst (CC-BY 4.0)

Objectives

💡 You can apply the 12 rules of rectangular data to organize research datasets effectively.
💡 You understand the principles of tidy data and can identify when data meets tidy data criteria.
💡 You can convert between wide and long data formats.
💡 You can implement data validation techniques to detect and prevent common data entry errors.
💡 You can apply best practices for file naming and data organization in research projects.
💡 You can identify and fix data problems such as empty cells, inconsistent formatting, and mixed data types.
💡 You understand the importance of data dictionaries and can create them for your datasets.

1 Rules of rectangular data

Rectangular data

If you do not have a community standard specifying the data organization (and you have tabular data), we highly recommend using the rules of rectangular data as proposed by Broman & Woo (2018).

Rules of rectangular data (according to Broman & Woo, 2018)

Mini exercises: What are the rules?

  1. Be consistent.
  2. Choose good names for things (see “Project & Data Organization”).
  3. Write dates in the format YYYY-MM-DD (see “Project & Data Organization”).
  4. Do not leave empty cells.
  5. Put just one thing in a cell.
  6. Make it a rectangle.
  7. Create a data dictionary.
  8. Do not perform calculations in the raw data files.
  9. Do not use font color or highlighting as data.
  10. Make backups.
  11. Use data validation to avoid errors.
  12. Save the data in plain text files.

Be consistent

It sounds easier than it is. But if you organize your data consistently from the start, you will not have to spend additional time later “harmonizing” the data.

Use consistent codes for:

  • Categorical variables (decide between male, Male, m, M)
  • Missing values (prefer NA or similar, avoid e.g., -999)
  • Variable names (decide between saliva_10wk, Saliva_10wk, sal_week10)
  • Subject identifiers (decide between 003, pcp003, person-003)

Additional consistency rules:

  • Data layout across multiple files
  • File names (decide between TSST_VR_2024-11-19.csv, 2024-11-19_TSST_virtual-reality.csv)
  • Date format for all dates
  • Extra spaces within cells (“male” vs “ male ”)
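
Harmonizing inconsistent codes after the fact is tedious; a minimal sketch in R of what that clean-up looks like (the column name and codes are hypothetical):

library(dplyr)

# Hypothetical column with inconsistent sex codes and stray spaces
raw <- data.frame(sex = c("male", "Male", " m ", "F", "female"))

raw |>
  mutate(
    sex = trimws(tolower(sex)),        # strip extra spaces, unify case
    sex = case_match(
      sex,
      c("m", "male")   ~ "male",       # case_match() needs dplyr >= 1.1.0
      c("f", "female") ~ "female"
    )
  )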

No empty cells

❌ Don’t do this:

✅ Do this instead:

Do fill out every cell. When information is missing, use a common code to indicate that it is missing (preferably NA).
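
If a file you receive already uses ad-hoc missing-value codes, you can declare them when reading the data into R so they become NA; a minimal sketch (the file name and codes are hypothetical):

# Treat empty strings and the legacy code -999 as missing on import
data <- read.csv("questionnaire.csv", na.strings = c("", "NA", "-999"))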

One thing per cell

❌ Don’t do this:

✅ Do this instead:

  • Each cell of a spreadsheet should contain exactly one piece of information.
  • Do not include units in your cells. It is better to put units in a data dictionary.
  • The same applies to notes. Instead of writing 0 (below threshold) in one cell, create a new column called note and put 0 in the value column and below threshold in the note column (see the sketch below).
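
A minimal sketch of the note-column idea in R, splitting a hypothetical mixed column with tidyr:

library(tidyr)

# Hypothetical column mixing a value and a note
messy <- data.frame(cortisol = c("12.4", "0 (below threshold)"))

messy |>
  separate(cortisol, into = c("cortisol", "note"),
           sep = " ", extra = "merge", fill = "right")
#   cortisol              note
# 1     12.4              <NA>
# 2        0 (below threshold)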

Make it a rectangle

  • Use columns for variables and rows for subjects or observations
  • The first row should contain variable names
  • If some data do not fit into one dataset, create a set of rectangular datasets and save them in separate files
  • Do not use multiple header rows

Example of multiple header rows

A    B      C      D      E
     day_1         day_2
ID   sleep  sport  sleep  sport
34   7.5    3      6      0.5
35   8      0      8.5    0.5
36   6      2.5    7.5    3

Note: This table deviates from the rectangular form (the header spans two rows) and leaves multiple cells empty.
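
One way to rebuild the example as a proper rectangle is a long layout with a single header row; a sketch in R (the column names are our choice):

# Same data as a rectangular, single-header table:
# one row per participant-day observation
sleep_sport <- data.frame(
  id    = c(34, 34, 35, 35, 36, 36),
  day   = c(1, 2, 1, 2, 1, 2),
  sleep = c(7.5, 6, 8, 8.5, 6, 7.5),
  sport = c(3, 0.5, 0, 0.5, 2.5, 3)
)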

More rules

No calculations in raw data

  • Primary data should just be data. Only data.
  • There should be no means and standard deviations calculated in that primary data.
  • Use scripts to calculate whatever you want, but do not make changes in the primary dataset.

No font color or highlighting as data

  • If you identify outliers or other information you want to flag, do not mark them with font color or highlighting.
  • Instead, create a new column called outlier and mark the identified outliers as TRUE and the others as FALSE (see the sketch below).
  • Highlighting is convenient in the short term, but it makes the information difficult to extract for later analysis.
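
A minimal sketch of the outlier-column approach in R (the data frame and the 2000 ms cutoff are hypothetical):

library(dplyr)

rt_data <- data.frame(reaction_time = c(480, 650, 2400))

# Record outlier status as data, not as highlighting
rt_data |>
  mutate(outlier = reaction_time > 2000)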

Make backups

Back up your data regularly in multiple locations.

  • Consider using a version control system like Git
  • When you have finished entering data, write-protect your data file
  • This ensures that you do not accidentally make changes to your dataset
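
Write protection can be applied from the command line (chmod a-w) or from within R; a sketch with a hypothetical path (on Windows, Sys.chmod() can only toggle the read-only flag):

# Make a finished raw-data file read-only for everyone
Sys.chmod("data/raw/participants.csv", mode = "0444")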

Save data in plain text files

  • Save your data files in a plain text format, e.g., .csv (“comma-separated values”)
  • Maximum compatibility
  • Reproducibility across platforms
  • Future-proof
  • In countries where commas are used as decimal separators, tab-delimited text files (.tsv) might be an appropriate alternative to .csv
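
For example, with the readr package (file names are hypothetical):

library(readr)

write_csv(rt_data, "rt_data.csv")  # comma-separated, "." as decimal mark
write_tsv(rt_data, "rt_data.tsv")  # tab-delimited alternative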

2 Tidy data

What is tidy data?

Illustration from the Openscapes blog “Tidy Data for reproducibility, efficiency, and collaboration” by Julia Lowndes and Allison Horst (CC-BY 4.0)

Tidy data rules

  1. Each variable is a column, and each column is a variable.
  2. Each observation is a row, and each row is an observation.
  3. Each value is a cell, and each cell contains a value.

Data structures: wide vs long format

Data can be structured in different ways. When your data is tidy, you may still encounter wide or long formats.

Wide format

  • Each participant is one row
  • Variables spread across columns
  • Good for summary tables

Long format

  • Each observation is one row
  • Variables stacked vertically
  • Good for analysis

Transforming data
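
The snippets below assume a small wide-format data frame along these lines (hypothetical Stroop reaction times; one row per participant):

# Hypothetical wide-format data: one row per participant,
# one column per congruency condition (mean reaction time in ms)
data_wide <- data.frame(
  participant = 1:3,
  congruent   = c(510, 480, 545),
  incongruent = c(620, 590, 650)
)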

pivot_longer() in R: wide to long

library(tidyr)
data_wide |>
  pivot_longer(
    cols = c("congruent", "incongruent"),
    names_to = "congruency",
    values_to = "reaction_time"
  )

pivot_wider() in R: long to wide

data_long |>
  pivot_wider(
    names_from = "congruency",
    values_from = "reaction_time"
  )

Validate your data

Ensure that your entered data is error-free by applying data validation techniques. Use packages like assertr in R (Fischetti, 2023) to validate your data:

✅ Valid data

library(assertr)
data_long |>
  verify(reaction_time > 200) |>
  verify(participant %in% 1:100) |>
  assert(in_set("congruent", "incongruent"), congruency)

❌ Invalid data (demonstrates failure)

library(assertr)
# Add a problematic row to demonstrate an assertr failure
data_long <- rbind(data_long, data.frame(
  participant = 4,
  congruency = "neutral",  # not one of the allowed levels
  reaction_time = 150      # will make verify(reaction_time > 200) fail
))
data_long
try({
  data_long |>
    verify(reaction_time > 200) |>
    verify(participant %in% 1:100) |>
    assert(in_set("congruent", "incongruent"), congruency)
})
verification [reaction_time > 200] failed! (1 failure)

    verb redux_fn           predicate column index value
1 verify       NA reaction_time > 200     NA     7    NA

Error : assertr stopped execution

3 Data dictionaries

What is a data dictionary?

  • In addition to rectangular data, it is also valuable to have a data dictionary describing how your data is structured.
  • The data dictionary is also sometimes referred to as a codebook.

What should a data dictionary contain?

According to Broman & Woo (2018), your dictionary should contain:

  1. The exact variable name as in your data file
  2. A version of the variable name that might be used in data visualizations
  3. A longer explanation of what the variable means
  4. The measurement units
  5. Expected minimum and maximum values

Additional information for survey data

When analyzing data collected from a survey, a variable in your dataset will likely represent an item from that survey. For each such item, it is useful to also document:

  1. The item in the survey
  2. The original wording of the item
  3. The subscale the item belongs to
  4. The author responsible for that item/subscale
  5. The response format for the item
  6. Special considerations regarding the item

Data dictionary format

  • A recommended option for a data dictionary format is to use a .json file.
  • JSON stands for JavaScript Object Notation.
[
  {
    "name": "agr1",
    "item_wording": "I make people feel at ease.",
    "type": "numeric",
    "scale": "agreeableness",
    "min_value": 1,
    "max_value": 5
  },
  {
    "name": "agr2",
    "item_wording": "I love children.",
    "type": "numeric",
    "scale": "agreeableness",
    "min_value": 1,
    "max_value": 5
  }
]

Benefits of JSON files

Providing metadata in JSON format has some useful advantages:

  • It is easy for humans and especially for machines to read
  • You can read the JSON file into R and use its information in your data analyses
  • In contrast to spreadsheets, JSON files are not limited to two-dimensional inputs and outputs

Read a JSON file into R

library(jsonlite)
jsonlite::fromJSON(here::here("sessions", "rectangular-data", "codebook.json"))
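
The returned data frame can then drive simple range checks; a sketch assuming a hypothetical survey data frame with the variables documented above:

codebook <- jsonlite::fromJSON("codebook.json")

# Hypothetical responses to check against the documented ranges
survey <- data.frame(agr1 = c(3, 5, 6), agr2 = c(2, 4, 1))

for (v in seq_len(nrow(codebook))) {
  x  <- survey[[codebook$name[v]]]
  ok <- all(x >= codebook$min_value[v] & x <= codebook$max_value[v],
            na.rm = TRUE)
  message(codebook$name[v], ": ", if (ok) "within range" else "out of range")
}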

4 Exercises

Exercise 1

Exercise 1: Identifying rectangular data rules

  1. Look at the following dataset examples and identify which rules of rectangular data are being violated.
  2. For each violation, explain how you would fix it.

Dataset A

Download the data

Dataset B

Download the data

Exercise 2

Exercise 2: Wide vs. long data conversion

Download the data

Given the following wide-format dataset about students’ test scores:

  1. Explain what this dataset would look like in long format. What are the names of the columns?
  2. Convert this dataset to long format using pivot_longer() in R.
  3. Convert it back to wide format using pivot_wider() in R.
  4. Explain when you would prefer wide format vs. long format for this type of data.

Exercise 3

Exercise 3: Data validation with assertr in R

Download the data

You have collected reaction time data from a psychological experiment. Write validation rules using the assertr package in R to check for the following conditions:

In R, write verify() statements to check that:

  1. All reaction times are between 200 and 2000 milliseconds
  2. All accuracy values are between 0 and 1
  3. All participant IDs start with “P”
  4. The condition column only contains “control” or “treatment”

Test your validation rules with the sample dataset. What errors do you find?

Exercise 4

Exercise 4: Tidy data principles

Download the data

Examine the following dataset and determine whether it follows tidy data principles:

  1. Is this dataset tidy? If not, what principles are violated?
  2. Reorganize this dataset to make it tidy. You may need to create multiple datasets if necessary.
  3. Explain your reasoning for the restructuring you chose.

Exercise 5

Exercise 5: Data organization best practices

You are starting a new research project studying the effects of different teaching methods on student performance. You plan to collect data from 3 schools, with 2 different teaching methods, over 4 time points, measuring math and reading scores.

  1. Design the structure for your main dataset following rectangular data principles. Include:

    • Appropriate variable names
    • A clear layout (wide or long format - justify your choice)
    • How you would handle missing data
    • What data validation rules you would implement
  2. Create a small example dataset (5-6 rows) that demonstrates your design.

  3. Write a brief data dictionary for your variables.

Exercise 6

Exercise 6: File naming and organization

You have the following research files that need to be renamed according to best practices:

final_results.xlsx
data backup.csv
Results-UPDATED-2024.csv
Math scores (corrected).xlsx
science_scores_final_FINAL.csv
reading data - Jan 15.txt
  1. Propose new file names that follow consistent naming conventions.
  2. Explain the naming convention you chose and why.
  3. How would you organize these files in a folder structure?

Exercise 7

Exercise 7: Fixing common data problems

Download the data

The following dataset contains several common data entry errors. Identify and fix them:

  1. List all the problems you can identify in this dataset.
  2. Clean the dataset by fixing these issues.
  3. Write validation rules that would catch these types of errors in the future.

Resources

References

Broman, K. W., & Woo, K. H. (2018). Data organization in spreadsheets. The American Statistician, 72(1), 2–10. https://doi.org/10.1080/00031305.2017.1375989.
Fischetti, T. (2023). assertr: Assertive programming for R analysis pipelines. R package version 3.0.1. https://doi.org/10.32614/CRAN.package.assertr. https://docs.ropensci.org/assertr/ (website); https://github.com/ropensci/assertr (source).