4  Project Structure

Summary

In this chapter, you will explore various types of project structures. You will learn what elements are essential for a project structure and what key factors to consider. Additionally, you will gain insight into how to create a project structure for your own work and understand what should be included in a published reproducibility archive.

Learning Objectives

💡 You have explored various types of project structures.
💡 You know the key components of a project structure.
💡 You can evaluate the advantages and disadvantages of different project structures.
💡 You are aware of the challenges and opportunities in creating an effective project structure.
💡 You understand the perspective on research as either a sequential or circular process.

A good folder structure saves you and other researchers a lot of time when revisiting or trying to understand your data. However, there is no universal best practice standard for project structures. Below are several examples of different project structures:

4.1 Examples

├── LICENSE
├── README.md          <- The top-level README for users of this project.
├── CODE_OF_CONDUCT.md <- Guidelines for users and contributors of the project.
├── CONTRIBUTING.md    <- Information on how to contribute to the project.
├── data
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. The naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── project_management <- Meeting notes and other project planning resources
│
├── src                <- Source code for use in this project.
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualisation  <- Scripts to create exploratory and results-oriented visualisations
│       └── visualise.py
└──

Repository Structure Template by The Turing Way. Used under the CC-BY 4.0 license. Reused without any modifications.

.
├── README.md
├── analysis            <- all things data analysis
│   └── src             <- functions and other source files
├── comm
│   ├── internal_comm   <- internal communication such as meeting notes
│   └── journal_comm    <- communication with the journal, e.g. peer review
├── data
│   ├── data_clean      <- clean version of the data
│   └── data_raw        <- raw data (don't touch)
├── dissemination
│   ├── manuscripts
│   ├── posters
│   └── presentations
├── documentation       <- documentation, e.g. data management plan
└── misc                <- miscellaneous files that don't fit elsewhere

Research Project Template by Heidi Seibold. No license specified. For source code, see here. Reused without modifications.

|-- 01_Data
|   |-- 01_Raw
|   `-- 02_Clean
|-- 02_Analysis
|   |-- 01_Scripts
|   |-- 02_Results
|   |-- 03_Figures
|   `-- 04_Tables
|-- 03_Manuscript
|   |-- 01_Text
|   `-- 02_Final_figures
|-- 04_Presentation
|-- 05_Misc
|-- 06_Analysis_for_publication <-- optional
|   |-- 01_Scripts              <-- optional
|   |-- 02_Results              <-- optional
|   |-- 03_Figures              <-- optional
|   `-- 04_Tables               <-- optional
|-- README.md
|-- .gitignore                  <-- optional
|-- renv                        <-- optional

Analysis template packages by Jonas Hagenbeck. Used under the MIT License. Reused without any modifications.

| File | Description | Usage |
|---|---|---|
| _pkgdown.yml | YAML for package website | do not edit |
| DESCRIPTION | R-package DESCRIPTION | do not edit |
| LICENSE.md | Project license | do not edit |
| NAMESPACE | R-package namespace | machine-written |
| README.md | Read this file to get started! | do not edit |
| README.Rmd | R-markdown source for readme.md | human editable |
| worcs.Rproj | RStudio project file | do not edit |
| docs/ | Package website | machine-written |
| inst/ | RStudio project template files | human editable |
| man/ | R-package documentation | do not edit |
| paper/ | WORCS paper source files | human editable |
| R/ | R-package source code | human editable |
| vignettes/ | R-package vignettes | human editable |

WORCS project structure by Van Lissa et al. (2021). Used under the GNU General Public License. No changes were made.

Exercise: Take on different perspectives

Pause for a moment after investigating the different project structures. Take the perspective of a researcher who is currently working in one of these project structures. How would you feel working with this structure? What helps you, and what hinders you, in working productively and reproducibly? Then change your perspective. Imagine you are a researcher aiming to reproduce the results of a different research project that uses such a project structure. What would you need in order to reproduce the analysis and results?

When looking at the different project structures, it becomes clear that there is no perfect or correct project structure for a research project. Projects differ too much in kind and scale, and the parties involved have different demands. Nevertheless, in this chapter we try to summarise the most important aspects of a project structure in the context of research. We differentiate between basic standards that help in every project and optional standards that help in specific project contexts. Afterwards, we present reproducibility archives and the different requirements journals set for a reproducibility archive.

4.2 Basic standards for research projects

First of all, these basic standards are highly subjective. As researchers in the discipline of psychology, our experience comes from psychological research projects. Researchers from other disciplines might need to adapt the elements of the project structure. In psychological research, we consider these standards highly useful:

  1. Human and machine-readable files
  2. README.md-file
  3. Separation of data and code
  4. renv-folder
  5. misc-folder

Apart from these standards, it is always useful to look for existing standards in your research discipline. Researchers working with fMRI data will find the Brain Imaging Data Structure (BIDS) convention. BIDS is a standard that aims to foster a simple and easy way of organizing neuroimaging and behavioural data. Inspired by BIDS, psych-DS started in 2018 with the aim of providing a systematic way of formatting and documenting scientific datasets. However, content creation by psych-DS has been low. If you want to contribute to the project, take a look at their GitHub repository.

4.2.1 Human and machine-readable files

As shown in the chapter Naming things, human and machine-readable names for files and folders are relatively easy to set up when starting a project. As your project grows, you will be thankful for good file and folder names. Since we dedicated a whole chapter to this topic, we move on to the next basic standard of a project structure.

4.2.2 README.md-file

A README is a text file that provides basic information about a project. It is typically a Markdown (.md) file, which means you can use Markdown syntax in it. Markdown allows you to easily format text, create lists, include links, and embed images. The exact content depends on your repository, but some general things that you might want to include are:

  • Project description: What function does this repository serve and what are its key features?

  • Installation instructions (if applicable): Explain how to install and set up your project, including any dependencies or prerequisites. This is particularly relevant for repositories that contain programming code. Provide clear instructions to help users or contributors get started with your project quickly.

  • Project structure: Explain your project structure. Use one sentence to describe each file and folder. Examples are displayed above.

  • Usage (if applicable): Provide examples or code snippets demonstrating how to use your project.

  • Contributing: If you welcome contributions, specify how others can contribute to your project. Here, you can also include guidelines for submitting bug reports, feature requests, or pull requests.

For larger or more complex projects where contributions may involve setting up a specific development environment or adhering to specific workflows, it is standard practice to create a file called CONTRIBUTING.md. GitHub recognizes the presence of a CONTRIBUTING.md file in a repository and, for example, automatically includes a link to the CONTRIBUTING.md file when users open a new issue or pull request.

  • Acknowledgments: Give credit to any third-party libraries, tools, or individuals that contributed to your project.

  • License: Choose a license that aligns with your project's goals. You can use choosealicense.com for guidance. The chosen license influences contributions to your project. Researchers are often unsure about which license to use. We will therefore cover the topic of licenses in a separate chapter.

5 The Repro Book


5.1 Description

Welcome to the Repro Book, an open-source online learning resource on reproducible research. The main goal of the Repro Book is to provide a companion online course text book for courses on reproducible research.

5.2 Repository Structure

5.2.1 Folders

.github/workflows is a folder that contains GitHub Actions workflows. Currently, two workflows run automatically. This folder can be ignored by users who want to contribute to the book.

_extensions is a folder that contains Quarto extensions. Currently, this book only uses the fontawesome-extension to create icons. This folder can be ignored by users who want to contribute to the book.

chapters is the heart of the book. The chapters folder contains all main chapters of the book. The content of each chapter is stored in its own .qmd file. If you want to contribute content by modifying chapters or creating new ones, this is the main place where your adjustments take place.

images is a folder that does not contain images (yet). The images displayed in this book are stored in a NextCloud folder. If you want to contribute images to this book, please upload them to the NextCloud folder. For an explanation of how to use the images from NextCloud, see contributing.qmd.

renv is a folder that contains information about the R packages used in this book. If your contribution requires an additional R package, see contributing.qmd for how to modify the relevant files.

5.2.2 Files

.Renviron is a file used in R that allows users to define environment variables.

.Rprofile is a file used in R that allows users to customize their R sessions by specifying code that should be executed every time R starts.

.all-contributorsrc is a file where the contributors of this project are listed. When you want to start working on this book, you will be included in this document. Content contribution is not affected by this file.

.codespellrc is a file connected to an automated codespell check on GitHub. Content contribution is not affected by this file.

.gitignore is a file that specifies files and folders that should be ignored and not tracked by Git. It prevents certain files or directories that are not essential for the project or generated during the development process from being included in the version history. For more information about .gitignore, check out this link.

LICENSE is a file where you can see what others are and are not allowed to do with this project.

Makefile is a special text file that contains rules and instructions for automating the execution of software programs.

README.md is the file you are currently reading. It provides basic information about a project and a description of a repository.

_affiliations.yml is a metadata-file that contains information about the institutions the contributors are affiliated with. If you contribute to this project and are affiliated with an institution other than the University of Hamburg, you can extend the file for your institution.

_authors.yml is a metadata-file that contains information about the authors of this project. If you contribute to this project as an author, you can extend the file for your information.

_custom.scss is a file used to customize the styling of Quarto documents by writing our own CSS (Cascading Style Sheets) rules.

_quarto.yml is a metadata-file that configures the project settings. For example, it configures the book title and subtitle and how the book chapters are arranged. If you contribute to this project with a new chapter, you have to add the location of the chapter in a meaningful way to the chapters-section in the _quarto.yml-file.

_quarto-pdf.yml is also a metadata-file that configures the project settings but only when we export the book to PDF.

_variables.yml is a metadata-file that specifies variables in a project. It is possible to use the variables in the files that constitute the chapters or similar.

acknowledgements.qmd is a closing book chapter where we describe which different tools we use for what.

contents.qmd is an opening book chapter where we describe what you can expect in each chapter.

index.qmd is a file that contains the content of the first page of the Repro Book. It serves as the cover of the book.

plausible.html is used to integrate Plausible Analytics, a privacy-focused web analytics tool. This file serves as a way to include the necessary HTML code to track website or document usage without compromising user privacy.

references.bib is a file that contains information about all the references we use in our book. It is written using BibTeX. If you contribute to this book and add new references, please make sure that these references are stored in this file.

references.qmd is a closing chapter that lists all the mentioned references. It is automatically generated from the references.bib-file.

renv.lock is a file that saves R packages and their versions used in this book. If you contribute to this book and need additional R packages, please make sure that these packages are stored in this file.

repro-book.Rproj is a project file used by RStudio, an integrated development environment for R. It serves as a container for all files and settings related to a specific R project.

summary.qmd is a closing chapter of the book. It summarizes the most important insights.

5.3 License

Creative Commons Attribution 4.0 International (CC BY 4.0)

For details, see the LICENSE file.

5.4 Contributors

Lennart Wittkuhn

🐛 💻 🖋 📖 🎨 💵 🔍 🤔 🚇 🚧 🧑‍🏫 📆 📣 💬 🔬 👀 🛡️ 📢

Justus Johannes Reihs

🐛 💻 🤔 🚧 🔬 👀 🖋 📖

A README.md-file can include a lot of content. When writing your README, consider both your own perspective as the researcher working on the project and the perspective of other researchers who might want to reproduce parts of your results.
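The "Project structure" part of a README can be drafted with a small helper. The following is a minimal sketch, not part of any of the templates shown above: the function name `describe_structure` and the `descriptions` mapping are hypothetical, and the output format (an indented list) is only one of many possible choices. It lists every file and folder below a project root and attaches the one-sentence description you provide, so that undescribed entries stand out.

```python
from pathlib import Path


def describe_structure(root: Path, descriptions: dict[str, str]) -> str:
    """Return one line per file or folder below root, with its description.

    `descriptions` maps relative POSIX paths (e.g. "data/raw") to one sentence.
    Entries without a description are marked with a TODO note.
    """
    lines = []
    # Sort by path parts so that a folder is always listed before its contents.
    for path in sorted(root.rglob("*"), key=lambda p: p.relative_to(root).parts):
        rel = path.relative_to(root)
        # Skip hidden files and folders such as .git or .Rproj.user.
        if any(part.startswith(".") for part in rel.parts):
            continue
        indent = "  " * (len(rel.parts) - 1)
        name = rel.name + ("/" if path.is_dir() else "")
        note = descriptions.get(rel.as_posix(), "TODO: describe")
        lines.append(f"{indent}- {name}: {note}")
    return "\n".join(lines)
```

Calling `print(describe_structure(Path("."), {"data": "all data files"}))` from the project root prints a draft that you can paste into the README and complete by hand.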

5.4.1 Separation of data and code

As seen in the Examples, many researchers separate data and code into different folders. It is also recommended to separate raw and clean (or tidy) data. When conducting experiments in psychology, raw data usually do not come in a form that lets you start analysing them right away. Most (if not all) of the time, you need to prepare the data first. These preparation steps should be done in code (rather than by clicking in a graphical user interface). The code that turns the raw data into clean data should be included in the code folder of your project. In R, a popular method for data cleaning is the use of the tidyverse, which we will cover in the chapter about good coding practices.
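The idea of keeping raw data untouched and producing clean data through code can be sketched as follows. This is an illustration under assumptions, not the workflow used in this book: it uses Python's standard library instead of R and the tidyverse, the function name `clean_raw_file` is hypothetical, the cleaning rule (drop rows with an empty cell, strip whitespace) is invented for the example, and the input is assumed to be a non-empty, rectangular CSV file with a header row.

```python
import csv
from pathlib import Path


def clean_raw_file(raw_file: Path, clean_dir: Path) -> Path:
    """Write a cleaned copy of raw_file into clean_dir and return its path.

    The raw file is only opened for reading, so it is never modified.
    """
    clean_dir.mkdir(parents=True, exist_ok=True)
    clean_file = clean_dir / raw_file.name
    with raw_file.open(newline="") as src, clean_file.open("w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            # Hypothetical cleaning rule: skip rows that have a missing or empty cell.
            if any(value is None or value.strip() == "" for value in row.values()):
                continue
            # Strip surrounding whitespace from every remaining cell.
            writer.writerow({key: value.strip() for key, value in row.items()})
    return clean_file
```

Because the script lives in the code folder and writes only to the clean data folder, anyone can delete the clean data and regenerate it from the raw data at any time.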

5.4.2 renv folder

A renv folder helps you to create reproducible environments for your R projects. renv records the R packages used in your project, so that each project has its own project library in R. When you are working in R to run your data analyses, you can create the renv-folder directly. However, the use of renv is not intuitive, so we will cover the topic in a separate chapter.

5.4.3 misc-folder

The abbreviation misc stands for miscellaneous. This folder is particularly useful for files that do not fit into one of your other folders. It saves you from having a chaotic project folder when entering your project.
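The basic standards above can be turned into a starting skeleton with a few lines of code. The sketch below is only one possible layout: the folder names in `FOLDERS` and the function name `create_project` are assumptions chosen for illustration, and the renv folder is deliberately left out because renv creates it itself from within R.

```python
from pathlib import Path

# Illustrative folder names; adapt them to your own naming conventions.
FOLDERS = ["data/raw", "data/clean", "code", "misc"]


def create_project(root: Path) -> None:
    """Create a minimal project skeleton below root without overwriting anything."""
    for folder in FOLDERS:
        (root / folder).mkdir(parents=True, exist_ok=True)
    readme = root / "README.md"
    # Only write a placeholder README if none exists yet.
    if not readme.exists():
        readme.write_text(
            f"# {root.name}\n\nProject description: TODO\n", encoding="utf-8"
        )
```

Running `create_project(Path("my_study"))` twice is safe: existing folders and an existing README are left as they are.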

5.5 YODA principles

5.6 Optional standards for research projects

5.7 Structure of reproducibility archives

These are only comments. There is no aim of sufficiency and precision in the ideas below.

  • Different for each project and journal, when it comes to paper publishing
  • Show example of requirements of journal
  • Critically discuss if journal requirements are sufficient for computational reproducibility
  • Something's missing?

Below the line, the actual content starts


In the first subsections of the chapter, you have learned how to set up your own project structure. The structure can be adapted to different situations, e.g. working collaboratively with other team members or working on your own research project during your studies at university. However, you have not yet learned what journals actually require when you need or want to make your data and code publicly available. To cover this topic, we will look in detail at the open science requirements of a couple of journals.

5.7.1 Requirements of Frontiers in Psychology

The TOP (Transparency and Openness Promotion) Guidelines by the Center for Open Science are best-practice standards that journals, funders, and societies can adopt to make research better and more transparent. The TOP Factor is a metric that reports the steps that a journal is taking to implement open science practices. Below you can see the TOP Factor evaluation of two psychological journals.

The TOP Factor assesses the open science policies of journals. It considers different dimensions, such as data transparency and materials transparency. In each dimension, a journal can achieve zero to three points. A journal receives zero points in a dimension when that dimension is not implemented in its open science policy. It receives three points in a dimension when, for example, sharing data and code is not only required but the reported analyses are also reproduced independently before publication. The sum score over all dimensions constitutes the TOP Factor. The criteria for the scoring system are listed below:

|  | NOT IMPLEMENTED | LEVEL I | LEVEL II | LEVEL III |
|---|---|---|---|---|
| Data Citation | No mention of data citation. | Journal describes citation of data in guidelines to authors with clear rules and examples. | Article provides appropriate citation for data and materials used consistent with journal's author guidelines. | Article is not published until providing appropriate citation for data and materials following journal's author guidelines. |
| Data Transparency | Journal encourages data sharing, or says nothing. | Article states whether data are available, and, if so, where to access them. | Data must be posted to a trusted repository. Exceptions must be identified at article submission. | Data must be posted to a trusted repository, and reported analyses will be reproduced independently prior to publication. |
| Analysis Code Transparency | Journal encourages code sharing, or says nothing. | Article states whether code is available, and, if so, where to access it. | Code must be posted to a trusted repository. Exceptions must be identified at article submission. | Code must be posted to a trusted repository, and reported analyses will be reproduced independently prior to publication. |
| Materials Transparency | Journal encourages materials sharing, or says nothing. | Article states whether materials are available, and, if so, where to access them. | Materials must be posted to a trusted repository. Exceptions must be identified at article submission. | Materials must be posted to a trusted repository, and reported analyses will be reproduced independently prior to publication. |
| Design & Analysis Reporting Guidelines | Journal encourages design and analysis transparency, or says nothing. | Journal articulates design transparency standards. | Journal requires adherence to design transparency standards for review and publication. | Journal requires and enforces adherence to design transparency standards for review and publication. |
| Study Preregistration | Journal says nothing. | Article states whether preregistration of study exists, and, if so, where to access it. | Article states whether preregistration of study exists, and, if so, allows journal access during peer review for verification. | Journal requires preregistration of studies and provides link and badge in article to meeting requirements. |
| Analysis Plan Preregistration | Journal says nothing. | Article states whether preregistration of study exists, and, if so, where to access it. | Article states whether preregistration with analysis plan exists, and, if so, allows journal access during peer review for verification. | Journal requires preregistration of studies with analysis plans and provides link and badge in article to meeting requirements. |
| Replication | Journal discourages submission of replication studies, or says nothing. | Journal encourages submission of replication studies. | Journal encourages submission of replication studies and conducts results blind review. | Journal uses Registered Reports as a submission option for replication studies with peer review prior to observing the study outcomes. |

TOP Guidelines Summary Table. License: CC BY 4.0. Used without any modifications.

TOP-Factor assessment of Frontiers in Psychology

| GUIDELINE | LEVEL | SUMMARY | JUSTIFICATION |
|---|---|---|---|
| Total | 13 |  |  |
| Data Citation | 1 | Journal describes citation of data in guidelines to authors with clear rules and examples. | “Authors are encouraged to cite all datasets generated or analyzed in the study.” Includes example. |
| Data Transparency | 2 | Data must be posted to a trusted repository. Exceptions must be identified at article submission. | “Frontiers is committed to open science and open data; we require that authors make available all data relevant to the conclusions of the manuscript.” |
| Analysis Code Transparency | 2 | Code must be posted to a trusted repository. Exceptions must be identified at article submission. | “Frontiers is committed to open science and open data; we require that authors make available all code used to conduct their research available to other researchers.” |
| Materials Transparency | 2 | Materials must be posted to a trusted repository. Exceptions must be identified at article submission. | “Authors are required to make all materials used to conduct their research available to other researchers.” |
| Design & Analysis Reporting Guidelines | 0 | Journal encourages design and analysis transparency, or says nothing. | No mention. |
| Study Preregistration | 0 | Journal says nothing. | No mention. |
| Analysis Plan Preregistration | 0 | Journal says nothing. | No mention. |
| Replication | 3 | Journal uses Registered Reports as a submission option for replication studies with peer review prior to observing the study outcomes. | Journal accepts Registered Reports. |
| Registered Reports & Publication Bias | 3 |  | Journal accepts Registered Reports. |
| Open Science Badges | 0 |  | No mention. |

Open access policy of Frontiers in Psychology summarised by TOP Factor. License: CC BY 4.0. Used without any modifications.

TOP-Factor assessment of Advances in Methods and Practices in Psychological Science

| GUIDELINE | LEVEL | SUMMARY | JUSTIFICATION |
|---|---|---|---|
| Total | 25 |  |  |
| Data Citation | 2 | Article provides appropriate citation for data and materials used consistent with journal's author guidelines. | Journal requires data citation. |
| Data Transparency | 2 | Data must be posted to a trusted repository. Exceptions must be identified at article submission. | Journal requires data sharing. |
| Analysis Code Transparency | 2 | Code must be posted to a trusted repository. Exceptions must be identified at article submission. | Journal requires code sharing. |
| Materials Transparency | 2 | Materials must be posted to a trusted repository. Exceptions must be identified at article submission. | Journal requires materials sharing. |
| Design & Analysis Reporting Guidelines | 3 | Journal requires and enforces adherence to design transparency standards for review and publication. | Adherence to reporting guidelines is enforced. |
| Study Preregistration | 3 | Journal requires preregistration of studies and provides link and badge in article to meeting requirements. | Preregistration is required. |
| Analysis Plan Preregistration | 3 | Journal requires preregistration of studies with analysis plans and provides link and badge in article to meeting requirements. | Analysis plan preregistration is required. |
| Replication | 3 | Journal uses Registered Reports as a submission option for replication studies with peer review prior to observing the study outcomes. | Journal accepts Registered Reports for replication studies. |
| Registered Reports & Publication Bias | 3 |  | Journal accepts Registered Reports for novel studies. |
| Open Science Badges | 2 |  | Journal offers all Open Science Badges. |
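The totals in the two assessments can be checked by summing the levels of the individual dimensions. The short sketch below does exactly that; the levels are copied from the two assessment tables above, while the function name `top_factor` and the range check (zero to three points per dimension, as described in this chapter) are assumptions made for this illustration.

```python
def top_factor(levels: dict[str, int]) -> int:
    """Sum the levels over all dimensions, assuming 0 to 3 points per dimension."""
    for guideline, level in levels.items():
        if level not in (0, 1, 2, 3):
            raise ValueError(f"{guideline}: level must be 0, 1, 2, or 3, got {level}")
    return sum(levels.values())


# Levels copied from the assessment of Frontiers in Psychology.
frontiers_in_psychology = {
    "Data Citation": 1,
    "Data Transparency": 2,
    "Analysis Code Transparency": 2,
    "Materials Transparency": 2,
    "Design & Analysis Reporting Guidelines": 0,
    "Study Preregistration": 0,
    "Analysis Plan Preregistration": 0,
    "Replication": 3,
    "Registered Reports & Publication Bias": 3,
    "Open Science Badges": 0,
}

# Levels copied from the assessment of Advances in Methods and Practices
# in Psychological Science.
ampps = {
    "Data Citation": 2,
    "Data Transparency": 2,
    "Analysis Code Transparency": 2,
    "Materials Transparency": 2,
    "Design & Analysis Reporting Guidelines": 3,
    "Study Preregistration": 3,
    "Analysis Plan Preregistration": 3,
    "Replication": 3,
    "Registered Reports & Publication Bias": 3,
    "Open Science Badges": 2,
}

print(top_factor(frontiers_in_psychology))  # 13
print(top_factor(ampps))  # 25
```

Both sums match the "Total" rows of the tables, which confirms that the TOP Factor is simply the sum score over all dimensions.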

As you can see, journals do require public data and code as a condition for publishing articles. However, they rarely specify how to share them, and standards for research projects like those described above are missing. Next, we look at some examples of how open data and code policies are applied in published articles.

5.7.2 Current application of open science practices

Click on the following links to explore the reproducibility archives:

Exercise: Study reproducibility archives

Investigate the links above and think about how these archives could help you to reproduce the analysis conducted in their papers. What is helpful and what is not?

Again, it should be noted that there is no perfectly organized reproducibility archive. However, we hope that the displayed examples show that open science, open data and reproducibility can be applied in many different (and sadly sometimes insufficient) ways.