class: center, middle, inverse, title-slide .title[ # Research Workflow with DataLad - A discussion ] .subtitle[ ## Lab meeting | Lifespan Neural Dynamics Group (MPIB) ] .author[ ### Lennart Wittkuhn |
wittkuhn@mpib-berlin.mpg.de
] .institute[ ### Max Planck Research Group NeuroCode<br>Max Planck Institute for Human Development<br>Max Planck UCL Centre for Computational Psychiatry and Ageing Research<br>Berlin, Germany ] .date[ ### Wednesday, 20th
of October 2021 ] --- # About #### About me - PhD student at the [Max Planck Research Group "NeuroCode"](https://www.mpib-berlin.mpg.de/research/research-groups/mprg-neurocode) at the [Max Planck Institute for Human Development](https://www.mpib-berlin.mpg.de/en) in Berlin - Research: I study the role of fast neural memory reactivation ([*replay*](https://en.wikipedia.org/wiki/Hippocampal_replay)) in decision-making in humans using fMRI - Member of the MPIB's working group on research data management - You can contact me via [email](mailto:wittkuhn@mpib-berlin.mpg.de), [Twitter](https://twitter.com/lnnrtwttkhn), [GitHub](https://github.com/lnnrtwttkhn) or [LinkedIn](https://www.linkedin.com/in/lennart-wittkuhn-6a079a1a8/) - Find out more about my work on [my website](https://lennartwittkuhn.com/), [Google Scholar](https://scholar.google.de/) and [ORCiD](https://orcid.org/0000-0003-2966-6888) -- #### About this presentation - **Slides:** Reproducible slides are publicly available via https://lennartwittkuhn.com/talk-rdm/ - **Software:** Written in [RMarkdown](https://bookdown.org/yihui/rmarkdown/) using the [xaringan](https://github.com/yihui/xaringan) package, run in [Docker](https://www.docker.com/), deployed to [GitHub Pages](https://pages.github.com/) using [Travis CI](https://travis-ci.org/) - **DOI:** [10.5281/zenodo.5012477](http://doi.org/10.5281/zenodo.5012477) (generated using GitHub releases + Zenodo, see details [here](https://guides.github.com/activities/citable-code/)) - **Source:** Source code is publicly available on GitHub: https://github.com/lnnrtwttkhn/talk-rdm/ - **Links:** This presentation contains links to external resources. I do not take responsibility for the accuracy, legality or content of the external sites or for that of subsequent links. If you notice an issue with a link, please contact me! - **Contact**: I am happy to receive any feedback or suggestions via [email](mailto:wittkuhn@mpib-berlin.mpg.de) or [GitHub issues](https://github.com/lnnrtwttkhn/talk-rdm/issues). Thank you! 🙏 --- # Agenda 1. **Introduction** - Version control 2. **Research Workflow with DataLad** - Intro to DataLad - Reproducibility - Code and Data Organization - DataLad on Tardis - Data Sharing with DataLad 3. **Discussion** --- class: title-slide, center, middle name: introduction # Introduction <!--the next --- is a horizontal line--> --- --- # The need for *proper* version-control in a nutshell <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#http://phdcomics.com/comics/archive/phd101212s.gif" alt="<a href="http://phdcomics.com/comics/archive/phd101212s.gif" target="_blank"><sup>&copy; Jorge Cham (phdcomics.com)</sup></a>" width="33%" /> <p class="caption"><a href="http://phdcomics.com/comics/archive/phd101212s.gif" target="_blank"><sup>© Jorge Cham (phdcomics.com)</sup></a></p> </div> --- # The need for *proper* version-control in a nutshell <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#http://phdcomics.com/comics/archive/phd052810s.gif" alt="<a href="http://phdcomics.com/comics/archive/phd052810s.gif" target="_blank"><sup>&copy; Jorge Cham (phdcomics.com)</sup></a>" width="55%" /> <p class="caption"><a href="http://phdcomics.com/comics/archive/phd052810s.gif" target="_blank"><sup>© Jorge Cham (phdcomics.com)</sup></a></p> </div> --- # What is version control?
.pull-left[ <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#https://zenodo.org/record/3695300/files/VersionControl.jpg?download=1" alt="<a href="https://zenodo.org/record/3695300" target="_blank"><sup>by Scriberia for The Turing Way community (CC-BY 4.0)</sup></a>" width="100%" /> <p class="caption"><a href="https://zenodo.org/record/3695300" target="_blank"><sup>by Scriberia for The Turing Way community (CC-BY 4.0)</sup></a></p> </div> ] .pull-right[ <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#https://zenodo.org/record/3695300/files/ProjectHistory.jpg?download=1" alt="<a href="https://zenodo.org/record/3695300" target="_blank"><sup>by Scriberia for The Turing Way community (CC-BY 4.0)</sup></a>" width="100%" /> <p class="caption"><a href="https://zenodo.org/record/3695300" target="_blank"><sup>by Scriberia for The Turing Way community (CC-BY 4.0)</sup></a></p> </div> ] -- .center[ - keep files organized - keep track of changes - revert changes or go back to previous versions ] --- class: title-slide, center, middle name: workflow-data # Workflow: Data Management using DataLad <!--the next --- is a horizontal line--> --- --- # What is DataLad? #### What is DataLad? (see the [10,000 feet](http://handbook.datalad.org/en/latest/intro/executive_summary.html) and [brief](http://handbook.datalad.org/en/latest/intro/philosophy.html) overview in the DataLad Handbook by [Wagner et al., 2020, *Zenodo*](https://doi.org/10.5281/ZENODO.3905791)) > *"DataLad is a software tool developed to aid with everything related to the evolution of digital objects"* .footnote[ *Disclaimer:* I'm "only" an enthusiastic user of DataLad! 😊 ] -- - **"Git for (large) data"** - free, [open-source](https://github.com/datalad/datalad) **command-line tool** - building on top of **Git** and **git-annex**, DataLad allows you to **version control arbitrarily large files** in datasets. - *"Arbitrarily large?"* - yes, see the DataLad dataset of 80TB / 15 million files from the Human Connectome Project (see [details](https://handbook.datalad.org/en/latest/usecases/HCP_dataset.html#usecase-hcp-dataset)) -- #### DataLad philosophy - DataLad knows only two things: Datasets and files - DataLad datasets are Git repositories - DataLad can version control arbitrarily large data - DataLad minimizes custom procedures and data structures - DataLad is developed for complete decentralization - DataLad aims to maximize the (re-)use of existing 3rd-party data resources and infrastructure --- # DataLad: What is a dataset? > *"A dataset is a directory on a computer that DataLad manages."* <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#http://handbook.datalad.org/en/latest/_images/dataset.svg" alt="see <a href="http://handbook.datalad.org/en/latest/basics/101-101-create.html" target="_blank">DataLad Handbook: Create a new dataset</a>" width="40%" /> <p class="caption">see <a href="http://handbook.datalad.org/en/latest/basics/101-101-create.html" target="_blank">DataLad Handbook: Create a new dataset</a></p> </div> > "*You can create new, empty datasets [...]
and populate them, or transform existing directories into datasets.*" --- # DataLad: Version-control arbitrarily large files > *"Building on top of Git and git-annex, DataLad allows you to version control arbitrarily large files in datasets."* <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#http://handbook.datalad.org/en/latest/_images/local_wf.svg" alt="see <a href="http://handbook.datalad.org/en/latest/basics/101-102-populate.html" target="_blank">DataLad Handbook: How to populate a dataset</a>" width="40%" /> <p class="caption">see <a href="http://handbook.datalad.org/en/latest/basics/101-102-populate.html" target="_blank">DataLad Handbook: How to populate a dataset</a></p> </div> > *"[...] keep track of revisions of data of any size, and view, interact with or restore any version of your dataset [...]."* --- # DataLad: Dataset consumption and collaboration > *"DataLad lets you consume datasets provided by others, and collaborate with them."* > *"You can **install existing datasets** and update them from their sources, or create sibling datasets that you can **publish updates** to and **pull updates** from for collaboration and data sharing."* <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#http://handbook.datalad.org/en/latest/_images/collaboration.svg" alt="see <a href="https://handbook.datalad.org/en/latest/basics/101-105-install.html" target="_blank">DataLad Handbook: Install an existing dataset</a>" width="70%" /> <p class="caption">see <a href="https://handbook.datalad.org/en/latest/basics/101-105-install.html" target="_blank">DataLad Handbook: Install an existing dataset</a></p> </div> --- # DataLad: Dataset linkage > *"Datasets can contain other datasets (subdatasets), **nested arbitrarily deep.**"* <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#http://handbook.datalad.org/en/latest/_images/linkage_subds.svg" alt="see <a href="http://handbook.datalad.org/en/latest/basics/101-106-nesting.html" target="_blank">DataLad Handbook: Nesting datasets</a>" width="70%" /> <p class="caption">see <a href="http://handbook.datalad.org/en/latest/basics/101-106-nesting.html" target="_blank">DataLad Handbook: Nesting datasets</a></p> </div> > *"Each dataset has an independent [...] history, but can be registered at a precise version in higher-level datasets.
This allows to **combine datasets** and to perform commands recursively across a hierarchy of datasets, and it is the basis for advanced provenance capture abilities."* --- # DataLad: Full provenance capture and reproducibility > *"DataLad allows to **capture full provenance**: The origin of datasets, the origin of files obtained from web sources, complete machine-readable and automatically reproducible records of how files were created (including software environments)."* <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#http://handbook.datalad.org/en/latest/_images/reproducible_execution.svg" alt="see <a href="http://handbook.datalad.org/en/latest/usecases/provenance_tracking.html" target="_blank">DataLad Handbook: Provenance tracking</a> and <a href="http://handbook.datalad.org/en/latest/basics/basics-run.html" target="_blank">run commands</a>" width="50%" /> <p class="caption">see <a href="http://handbook.datalad.org/en/latest/usecases/provenance_tracking.html" target="_blank">DataLad Handbook: Provenance tracking</a> and <a href="http://handbook.datalad.org/en/latest/basics/basics-run.html" target="_blank">run commands</a></p> </div> > *"You or your collaborators can thus re-obtain or reproducibly **recompute content with a single command**, and make use of extensive provenance of dataset content **(who created it, when, and how?)**."* --- # DataLad: Third party service integration > *"**Export datasets to third party services** such as GitHub, GitLab, or Figshare with built-in commands."* <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#http://handbook.datalad.org/en/latest/_images/thirdparty.svg" alt="see <a href="http://handbook.datalad.org/en/latest/basics/basics-thirdparty.html" target="_blank">DataLad Handbook: Third-party infrastructure</a>" width="60%" /> <p class="caption">see <a href="http://handbook.datalad.org/en/latest/basics/basics-thirdparty.html" target="_blank">DataLad Handbook: Third-party infrastructure</a></p> </div> > *"Alternatively, you can use a **multitude of other available third party services** such as Dropbox, Google Drive, Amazon S3, owncloud, or many more that DataLad datasets are compatible with."* --- exclude: true # DataLad: Metadata handling > *"**Extract, aggregate, and query dataset metadata.** This allows to automatically obtain metadata according to different metadata standards (EXIF, XMP, ID3, BIDS, DICOM, NIfTI1, ...), store this metadata in a portable format, share it, and search dataset contents."* <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#http://handbook.datalad.org/en/latest/_images/metadata_prov_imaging.svg" alt="see <a href="http://docs.datalad.org/en/stable/metadata.html" target="_blank">DataLad Handbook: Metadata</a>" width="100%" /> <p class="caption">see <a href="http://docs.datalad.org/en/stable/metadata.html" target="_blank">DataLad Handbook: Metadata</a></p> </div> --- # Random useful info about DataLad #### Use DataLad within Python 🐍 and R 🏴‍☠️ - DataLad Python API: Use DataLad commands directly in your Python scripts (see the sketch below) - Install DataLad with `pip install datalad` and import it in your Python script with `import datalad.api as dl` - Use system commands in other languages, e.g., in R: `system2("datalad status")`
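A minimal sketch of the Python API, assuming DataLad is installed (the dataset path and commit message are placeholders):

```python
import datalad.api as dl

# create a new dataset (equivalent to `datalad create my_dataset`)
ds = dl.create(path="my_dataset")

# ... add or change files inside my_dataset/ ...

# record all changes with a commit message (equivalent to `datalad save`)
ds.save(message="add raw data")

# report the current state of the dataset (equivalent to `datalad status`)
ds.status()
```

-- #### Keep only what you need (a.k.a. "How to work on two fMRI studies with a 250GB laptop")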
"How to work on two fMRI studies with a 250GB laptop") - `datalad drop` removes the file contents completely from your dataset - only keep whatever you like or re-obtain with `datalad get` -- #### git-annex takes the safety of your files seriously - Files saved under git-annex are locked against modifications - `datalad run` automatically unlocks specified inputs / outputs - `datalad unlock` can be used to unlock annexed content manually - Everything that is stored under git-annex is content-locked and everything that is stored under Git is not --- # Git vs. git-annex .pull-left[ <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#http://handbook.datalad.org/en/latest/_images/git_vs_gitannex.svg" alt="<a href="http://handbook.datalad.org/en/latest/basics/101-114-txt2git.html" target="_blank"><sup>DataLad Handbook: Data Safety</sup></a>" width="100%" /> <p class="caption"><a href="http://handbook.datalad.org/en/latest/basics/101-114-txt2git.html" target="_blank"><sup>DataLad Handbook: Data Safety</sup></a></p> </div> - Example: `datalad create -c text2git my_dataset`<br>→ all text files are saved under Git - the `.gitattributes` file handles, which files are stored under Git vs. git-annex (can modify manually) - see [this chapter](http://handbook.datalad.og/en/latest/basics/basics-configuration.html#chapter-config) in the DataLad handbook ] .pull-right[ <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#http://handbook.datalad.org/en/latest/_images/publishing_network_publishparts2.svg" alt="<a href="http://handbook.datalad.org/en/latest/basics/101-138-sharethirdparty.html" target="_blank"><sup>DataLad Handbook: Beyond shared infrastructure</sup></a>" width="100%" /> <p class="caption"><a href="http://handbook.datalad.org/en/latest/basics/101-138-sharethirdparty.html" target="_blank"><sup>DataLad Handbook: Beyond shared infrastructure</sup></a></p> </div> ] --- class: title-slide, center, middle name: paper # Workflow: Our paper <!--the next --- is a horizontal line--> --- ### ❓ "*How close are you to full reproducibility?*" --- # Reproducible research > *"[...] when the same analysis steps performed on the same dataset consistently produces the same answer."* <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#https://the-turing-way.netlify.app/_images/reproducible-matrix.jpg" alt="<a href="https://the-turing-way.netlify.app/reproducible-research/overview/overview-definitions.html" target="_blank">Table of Definitions for Reproducibility</a> by <i>The Turing Way</i> (CC-BY 4.0)" width="70%" /> <p class="caption"><a href="https://the-turing-way.netlify.app/reproducible-research/overview/overview-definitions.html" target="_blank">Table of Definitions for Reproducibility</a> by <i>The Turing Way</i> (CC-BY 4.0)</p> </div> ??? - **Reproducible:** A result is reproducible when the same analysis steps performed on the same dataset consistently produces the same answer. - **Replicable:** A result is replicable when the same analysis performed on different datasets produces qualitatively similar answers. - **Robust:** A result is robust when the same dataset is subjected to different analysis workflows to answer the same research question (for example one pipeline written in R and another written in Python) and a qualitatively similar or identical answer is produced. Robust results show that the work is not dependent on the specificities of the programming language chosen to perform the analysis. 
- **Generalisable:** Combining replicable and robust findings allow us to form generalisable results. Note that running an analysis on a different software implementation and with a different dataset does not provide generalised results. There will be many more steps to know how well the work applies to all the different aspects of the research question. Generalisation is an important step towards understanding that the result is not dependent on a particular dataset nor a particular version of the analysis pipeline. --- # Our paper <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#https://keeper.mpdl.mpg.de/f/ea0795d894e44fd3ad18/?dl=1" alt="<a href="https://doi.org/10.1038/s41467-021-21970-2" target="_blank">doi: 10.1038/s41467-021-21970-2</a> (accessed 17/06/21)" width="75%" /> <p class="caption"><a href="https://doi.org/10.1038/s41467-021-21970-2" target="_blank">doi: 10.1038/s41467-021-21970-2</a> (accessed 17/06/21)</p> </div> -- #### Two-sentence summary: > Non-invasive measurement of fast neural activity with spatial precision in humans is difficult. Here, the authors show how fMRI can be used to detect sub-second neural sequences in a localized fashion and report fast replay of images in visual cortex that occurred independently of the hippocampus. --- # Example: Data management using DataLad #### From Wittkuhn & Schuck, 2021, *Nature Communications* (see [Data Availability statement](https://www.nature.com/articles/s41467-021-21970-2#data-availability)): > *"We publicly share all data used in this study. Data and code management was realized using DataLad.*" -- - All individual datasets can be found at: https://gin.g-node.org/lnnrtwttkhn - Each dataset is associated with a unique URL and a Digital Object Identifier (DOI) - Dataset structure shared to GitHub and dataset contents shared to GIN -- #### All data? -- - `highspeed`: superdataset of all subdatasets, incl. 
project documentation ([GitLab](https://git.mpib-berlin.mpg.de/wittkuhn/highspeed)) - `highspeed-bids`: MRI and behavioral data adhering to the [BIDS standard](https://bids.neuroimaging.io/) ([GitHub](https://github.com/lnnrtwttkhn/highspeed-bids), [GIN](https://gin.g-node.org/lnnrtwttkhn/highspeed-bids), [DOI](https://doi.org/10.12751/g-node.4ivuv8)) - `highspeed-mriqc`: MRI quality metrics and reports based on [MRIQC](https://mriqc.readthedocs.io/en/stable/) ([GitHub](https://github.com/lnnrtwttkhn/highspeed-mriqc), [GIN](https://gin.g-node.org/lnnrtwttkhn/highspeed-mriqc), [DOI](https://doi.org/10.12751/g-node.0vmyuh)) - `highspeed-fmriprep`: preprocessed MRI data using [fMRIPrep](https://fmriprep.org/en/stable/), ([GitHub](https://github.com/lnnrtwttkhn/highspeed-fmriprep), [GIN](https://gin.g-node.org/lnnrtwttkhn/highspeed-fmriprep), [DOI](https://doi.org/10.12751/g-node.0ft06t)) - `highspeed-masks`: binarized anatomical masks used for feature selection ([GitHub](https://github.com/lnnrtwttkhn/highspeed-masks), [GIN](https://gin.g-node.org/lnnrtwttkhn/highspeed-masks), [DOI](https://doi.org/10.12751/g-node.omirok)) - `highspeed-glm`: first-level GLM results used for feature selection ([GitHub](https://github.com/lnnrtwttkhn/highspeed-glm), [GIN](https://gin.g-node.org/lnnrtwttkhn/highspeed-glm), [DOI](https://doi.org/10.12751/g-node.d21zpv)) - `highspeed-decoding`: results of the multivariate decoding approach ([GitHub](https://github.com/lnnrtwttkhn/highspeed-decoding), [GIN](https://gin.g-node.org/lnnrtwttkhn/highspeed-decoding), [DOI](https://doi.org/10.12751/g-node.9zft1r)) - `highspeed-data`: unprocessed data of the behavioral task acquired during MRI acquisition ([GitHub](https://github.com/lnnrtwttkhn/highspeed-data-behavior), [GIN](https://gin.g-node.org/lnnrtwttkhn/highspeed-data-behavior), [DOI](https://doi.org/10.12751/g-node.p7dabb)) \> 1.5 TB in total, version-controlled using DataLad --- # Superdataset to collect all resources of the project <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#https://keeper.mpdl.mpg.de/f/40e43c7e029a4f4696b8/?dl=1" alt="see <a href="https://git.mpib-berlin.mpg.de/wittkuhn/highspeed" target="_blank">main project repo on GitLab</a> (accessed 21/06/21)" width="85%" /> <p class="caption">see <a href="https://git.mpib-berlin.mpg.de/wittkuhn/highspeed" target="_blank">main project repo on GitLab</a> (accessed 21/06/21)</p> </div> --- # ❓ *"How close are you to full reproducibility?"* > **[...] *full* reproducibility**? 
#### Reproducibility of statistical results and figures in our [recent paper](https://www.nature.com/articles/s41467-021-21970-2#code-availability): - Our [project website](https://wittkuhn.mpib.berlin/highspeed/) shows all figures and statistical results next to the corresponding R code - The analyses are written in [RMarkdown](https://bookdown.org/yihui/rmarkdown/) notebooks which are run and rendered into the project website using [bookdown](https://bookdown.org/yihui/bookdown/) and deployed to [GitLab pages](https://docs.gitlab.com/ee/user/project/pages/) using [continuous integration (CI)](https://docs.gitlab.com/ee/ci/) (for details, see [here](https://git.mpib-berlin.mpg.de/wittkuhn/highspeed/-/blob/master/.gitlab-ci.yml)) - The input data are retrieved from DataLad datasets in the CI (see [here](https://git.mpib-berlin.mpg.de/wittkuhn/highspeed/-/blob/master/.gitlab-ci.yml#L5-93)) - R and DataLad are run in dedicated Docker containers (see [here](https://git.mpib-berlin.mpg.de/wittkuhn/highspeed/-/blob/master/.docker/bookdown/Dockerfile) and [here](https://git.mpib-berlin.mpg.de/wittkuhn/highspeed/-/blob/master/.docker/datalad/Dockerfile) for the Docker recipes) -- #### Reproducibility *beyond* statistical results and figures reported in the paper: - Pre-processing (HeuDiConv, fMRIPrep, MRIQC) containerized using Singularity - `requirements.txt` files for Python code as part of the repo - Most analyses run on a cluster - tricky to reproduce? 🤷‍♂️ --- # Software containers and virtual environments #### Software containers > *"Containers allow a researcher to package up a project with all of the parts it needs - such as libraries, dependencies, and system settings - and ship it all out as one package."* (see [The Turing Way](https://the-turing-way.netlify.app/reproducible-research/renv/renv-containers.html#what-are-containers)) - `highspeed-bids`: containerized conversion of MRI data to BIDS using [HeuDiConv](https://hub.docker.com/r/nipy/heudiconv) - `highspeed-fmriprep`: containerized execution of the pre-processing pipeline [fMRIPrep](https://fmriprep.org/en/stable/singularity.html) - `highspeed-mriqc`: containerized creation of MRI quality reports using [MRIQC](https://mriqc.readthedocs.io/en/stable/docker.html) - `highspeed-analysis`: containerized execution of statistical analyses in a [custom R container](https://git.mpib-berlin.mpg.de/wittkuhn/highspeed/-/blob/master/.docker/bookdown/Dockerfile) - `tools`: a personal collection of commonly used containers in a DataLad dataset (see [details](https://github.com/lnnrtwttkhn/tools)) -- #### Virtual environments (e.g., [in Python](https://docs.python.org/3/tutorial/venv.html)) > *"[...] it may not be possible for one Python installation to meet the requirements of every application. The solution for this problem is to create a virtual environment, a self-contained directory tree that contains a Python installation for a particular version of Python, plus a number of additional packages."*

```bash
pip freeze > requirements.txt    # export pinned dependencies
pip install -r requirements.txt  # restore them on another machine
```

--- class: title-slide, center, middle name: workflow-data-organization # Workflow: Code and Data Organization <!--the next --- is a horizontal line--> --- ### ❓ "*What does your project structure look like?*" ### ❓ "*How do you connect different analyses, e.g., pre-processing and analysis, using DataLad?*" --- # Summary #### ❓ "*What does your project structure look like?*" / "*What should a project structure look like?*" 1. Do what works for you! 1.
Rely on community standards (e.g., [BIDS](https://bids.neuroimaging.io/) or [Psych-DS](https://docs.google.com/document/d/1u8o5jnWk0Iqp_J06PTu5NjBfVsdoPbBhstht6W0fFp0)) and code style guides 1. Keep it simple and modular (see e.g., [YODA principles](https://handbook.datalad.org/en/latest/basics/101-127-yoda.html)): `input` → `code` → `output` 1. Document as much as possible (`README`s etc.) -- #### ❓ "*How do you connect different analyses, e.g., pre-processing and analysis, using DataLad?*" - [Nesting](https://handbook.datalad.org/en/latest/basics/101-106-nesting.html) of modular DataLad datasets - Install input subdatasets in the `inputs` directory --- # Challenge: Standardizing data and code organization <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#https://imgs.xkcd.com/comics/standards.png" alt="<a href="https://xkcd.com/927/" target="_blank">xkcd cartoon "Standards"</a>" width="65%" /> <p class="caption"><a href="https://xkcd.com/927/" target="_blank">xkcd cartoon "Standards"</a></p> </div> <sup>→ also see [slides](https://www.nipreps.org/assets/ORN-Workshop/) by Oscar Esteban on "Building communities around reproducible workflows"</sup> --- # Example: Brain Imaging Data Structure (BIDS) #### Organization of neuroimaging data according to the [Brain Imaging Data Structure (BIDS)](https://bids.neuroimaging.io/) > *"A simple and intuitive way to organize and describe your neuroimaging and behavioral data."* <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#https://media.springernature.com/full/springer-static/image/art%3A10.1038%2Fsdata.2016.44/MediaObjects/41597_2016_Article_BFsdata201644_Fig1_HTML.jpg?as=webp" alt="see Gorgolewski et al., 2016, <i>Nature Scientific Data</i></br><a href="https://doi.org/10.1038/sdata.2016.44" target="_blank">doi: 10.1038/sdata.2016.44</a>" width="60%" /> <p class="caption">see Gorgolewski et al., 2016, <i>Nature Scientific Data</i></br><a href="https://doi.org/10.1038/sdata.2016.44" target="_blank">doi: 10.1038/sdata.2016.44</a></p> </div> <sup>for those interested: fully automated transformation of newly acquired data using [ReproIn](https://github.com/ReproNim/reproin) / [HeuDiConv](https://heudiconv.readthedocs.io/en/latest/)</sup> --- # Code sharing using Git and DataLad #### From Wittkuhn & Schuck, 2021, *Nature Communications* (see [Code Availability statement](https://www.nature.com/articles/s41467-021-21970-2#code-availability)): > "*We share all code used in this study. An overview of all the resources is publicly available on our project website: https://wittkuhn.mpib.berlin/highspeed/.*" -- - `highspeed-analysis`: code for the main statistical analyses ([GitHub](https://github.com/lnnrtwttkhn/highspeed-analysis), [GIN](https://gin.g-node.org/lnnrtwttkhn/highspeed-analysis), [DOI](https://doi.org/10.12751/g-node.eqqdtg)) - `highspeed-task`: code for the behavioral task ([GitHub](https://github.com/lnnrtwttkhn/highspeed-task), [Zenodo](https://doi.org/10.5281/zenodo.4305888))
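Since the dataset structure is on GitHub and the annexed contents are on GIN, anyone can obtain these resources with DataLad. A minimal sketch using the Python API (the `decoding` output directory is assumed here for illustration):

```python
import datalad.api as dl

# clone the lightweight dataset structure from GitHub
ds = dl.clone(source="https://github.com/lnnrtwttkhn/highspeed-decoding.git", path="highspeed-decoding")

# retrieve the annexed file contents on demand (fetched from the GIN sibling)
ds.get("decoding")
```

-- #### ... and the rest?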
> *"We [...] share all data listed in the Data availability section in modularized units alongside the code that created the data, usually in a dedicated `code` directory in each dataset, instead of separate data and code repositories."* > *"This approach allows to better establish the provenance of data (i.e., a better understanding which code and input data produced which output data), loosely following the **DataLad YODA principles** [...]*" --- # **Y**ODA's **O**rganigram on **D**ata **A**nalysis #### P1: *"One thing, one dataset"* (**Modularity**) #### P2: *"Record where you got it from, and where it is now"* (**Provenance**) #### P3: *"Record what you did to it, and with what"* (**Reproducibility**) --

```bash
.
├── CHANGELOG.md
├── README.md
├── code
├── input
└── output

3 directories, 2 files
```

-- #### Learn about YODA, you must: - DataLad Handbook: "YODA: Best practices for data analyses in a dataset" (see [details](https://handbook.datalad.org/en/latest/basics/101-127-yoda.html)) - "YODA: YODA's Organigram on Data Analysis" - Poster by Hanke et al., 2018, presented at the 24th Annual Meeting of the Organization for Human Brain Mapping (OHBM) 2018 | CC-BY 4.0, [doi: 10.7490/f1000research.1116363.1](https://doi.org/10.7490/f1000research.1116363.1) → Details on YODA principles can also be found in the Appendix --- # P1: *"One thing, one dataset"* - Structure study elements (data, code, results) in dedicated directories - Input data in `/inputs`, code in `/code`, results in `/outputs`, execution environments in `/envs` - Use dedicated projects for multiple different analyses <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#https://handbook.datalad.org/en/latest/_images/dataset_modules.svg" alt="see <a href="https://handbook.datalad.org/en/latest/basics/101-127-yoda.html" target="_blank">DataLad Handbook: YODA: Best practices for data analyses in a dataset</a>" width="60%" /> <p class="caption">see <a href="https://handbook.datalad.org/en/latest/basics/101-127-yoda.html" target="_blank">DataLad Handbook: YODA: Best practices for data analyses in a dataset</a></p> </div> --- # P2: *"Record where you got it from, and where it is now"* - Record where the data came from, or how it is dependent on or linked to other data - Link re-usable data resource units as DataLad *subdatasets* - `datalad clone`, `datalad download-url`, `datalad save` .pull-left[ <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#https://handbook.datalad.org/en/latest/_images/data_origin.svg" alt="see <a href="https://handbook.datalad.org/en/latest/basics/101-127-yoda.html" target="_blank">DataLad Handbook: YODA: Best practices for data analyses in a dataset</a>" width="70%" /> <p class="caption">see <a href="https://handbook.datalad.org/en/latest/basics/101-127-yoda.html" target="_blank">DataLad Handbook: YODA: Best practices for data analyses in a dataset</a></p> </div> ] .pull-right[ <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#https://handbook.datalad.org/en/latest/_images/decentralized_publishing.svg" alt="see <a href="https://handbook.datalad.org/en/latest/basics/101-127-yoda.html" target="_blank">DataLad Handbook: YODA: Best practices for data analyses in a dataset</a>" width="120%" /> <p class="caption">see <a href="https://handbook.datalad.org/en/latest/basics/101-127-yoda.html" target="_blank">DataLad Handbook: YODA: Best practices for data analyses in a dataset</a></p> </div> ]
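As a sketch of P2 in practice, once more using the Python API (the URLs, paths, and commit message are hypothetical):

```python
import datalad.api as dl

# register an input dataset as a subdataset, recording its origin and exact version
dl.clone(source="https://gin.g-node.org/someuser/input-data", path="inputs/data", dataset=".")

# download a single file from the web and record where it came from
dl.download_url(urls="https://example.org/stimuli.csv", dataset=".", message="add stimulus list")
```

--- # P3: *"Record what you did to it, and with what"*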
what"* - Know how exactly the content of every file came to be that was not obtained from elsewhere - `datalad run` links input data with code execution to output data - `datalad containers-run` allows to do the same *within* software containers (e.g., Docker or Singularity) <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#https://handbook.datalad.org/en/latest/_images/decentralized_publishing.svg" alt="see <a href="https://handbook.datalad.org/en/latest/basics/101-127-yoda.html" target="_blank">DataLad Handbook: YODA: Best practices for data analyses in a dataset</a>" width="50%" /> <p class="caption">see <a href="https://handbook.datalad.org/en/latest/basics/101-127-yoda.html" target="_blank">DataLad Handbook: YODA: Best practices for data analyses in a dataset</a></p> </div> --- # Dataset nesting .pull-left[ - One can *nest* other DataLad datasets arbitrarily deep - Nested datasets are called "subdatasets" - Nested subdatasets look and feel just like a normal (sub-)directories in your project directory {{content}} ] .pull-right[ <div class="figure" style="text-align: center"> <img src="https://handbook.datalad.org/en/latest/_images/virtual_dstree_dl101.svg " alt="see <a href="https://handbook.datalad.org/en/latest/basics/101-106-nesting.html" target="_blank">DataLad Handbook: Dataset nesting</a>" width="100%" /> <p class="caption">see <a href="https://handbook.datalad.org/en/latest/basics/101-106-nesting.html" target="_blank">DataLad Handbook: Dataset nesting</a></p> </div> ] -- #### Advantages - Lower-level datasets ("subdatasets") have an independent stand-alone history (**modularity** ✨) - The top-level "superdataset" only stores *which version* of the subdataset is currently used - Subdatsets need to be updated explictly {{content}} -- #### Git users - A subdataset is essentially a [Git submodule](https://git-scm.com/book/de/v2/Git-Tools-Submodule) - The version is registered using the [shasum](https://handbook.datalad.org/en/latest/glossary.html#term-shasum) of the latest commit of the cloned subdataset --- class: title-slide, center, middle name: workflow-tardis # Workflow: DataLad on Tardis <!--the next --- is a horizontal line--> --- ### ❓ "*Do you primarily work on Tardis? What is there to consider?*" ### ❓ "*Do you keep data on Tardis temporally until all analyses are completed?*" ### ❓ "*How do you manage input / output links within DataLad datasets?*" --- # Summary #### ❓ "*Do you primarily work on Tardis? What is there to consider?*" - DataLad works on Tardis as it works on your computer (it's not really different) - With the dataset installed on both locations, you can flexibly update data back-and-forth -- #### ❓ "*Do you keep data on Tardis temporally until all analyses are completed?*" - Yes, just because it's convenient 😇 - You can always `datalad drop` contents from Tardis at any time (if you can retrieve them from elsewhere) -- #### ❓ "*How do you manage input / output links within DataLad datasets?*" - Ideally, datasets are self-contained (cf. [nesting](https://handbook.datalad.org/en/latest/basics/101-106-nesting.html) and [YODA](https://handbook.datalad.org/en/latest/basics/101-127-yoda.html)) - Depends on your coding (see e.g., `here` in R, [here](https://github.com/jennybc/here_here) and [here](https://here.r-lib.org/)) --- # A basic workflow for DataLad on Tardis 1\. Create a new dataset directly on Tardis: `datalad create my_dataset` -- 2\. 
Start on your computer and move to Tardis - Create a dataset on your computer: `datalad create my_dataset` - Push the dataset to your hosting service (e.g., [GIN](https://gin.g-node.org/)): `datalad push --to gin` - Clone the dataset to Tardis (using SSH): `datalad clone git@gin.g-node.org:/my_username/my_dataset` -- Moving back-and-forth between your computer and Tardis: 1. Run analysis on Tardis 1. Save changes (either on your computer or on Tardis): `datalad save -m "superduper changes"` 1. Push changes to your hosting service (e.g., [GIN](https://gin.g-node.org/)): `datalad push --to gin` 1. Update the clone (either on your computer or on Tardis): `datalad update --merge -s gin` 1. (Optional: Check if your repo is at the correct commit: `git log` / `git log --oneline -n 1`) 1. (Optional: Drop contents of the previous commit: `datalad drop .`) 1. Get the updated contents of the new commit: `datalad get .` --- # DataLad on an HPC: Further reading - see the preprint ["FAIRly big: A framework for computationally reproducible processing of large-scale data"](https://www.biorxiv.org/content/10.1101/2021.10.12.464122v1.full.pdf) by Wagner et al. - see [DataLad on High Throughput or High Performance Compute Clusters](http://handbook.datalad.org/en/latest/beyond_basics/101-169-cluster.html) in the DataLad Handbook .pull-left[ <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#http://handbook.datalad.org/en/latest/_images/clone_local.svg" alt="<a href="http://handbook.datalad.org/en/latest/basics/101-138-sharethirdparty.html" target="_blank"><sup>Clone from local</sup></a>" width="95%" /> <p class="caption"><a href="http://handbook.datalad.org/en/latest/basics/101-138-sharethirdparty.html" target="_blank"><sup>Clone from local</sup></a></p> </div> ] .pull-right[ <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#http://handbook.datalad.org/en/latest/_images/clone_server.svg" alt="<a href="http://handbook.datalad.org/en/latest/basics/101-138-sharethirdparty.html" target="_blank"><sup>Clone from server / cluster</sup></a>" width="95%" /> <p class="caption"><a href="http://handbook.datalad.org/en/latest/basics/101-138-sharethirdparty.html" target="_blank"><sup>Clone from server / cluster</sup></a></p> </div> ] --- # 💡 Idea: Clone datasets from local

```bash
├── zoo-bids
├── zoo-fmriprep
│   └── inputs
│       └── bids
└── zoo-mriqc
    └── inputs
        └── bids
```

#### Add `zoo-bids` as an input to `zoo-fmriprep` and `zoo-mriqc` 1\. Clone from hosting service (e.g., GIN), add a `local` sibling: - `datalad clone --dataset . git@gin.g-node.org:/lnnrtwttkhn/zoo-bids inputs/bids` - `datalad siblings add --name local --url ../../../zoo-bids` 2\. Clone from local (and add GIN remote later): - `datalad clone --dataset . ../zoo-bids inputs/bids` - `datalad siblings add --name gin --url git@gin.g-node.org:/lnnrtwttkhn/zoo-bids` Getting data from local will be *much* faster! 🔥 --- class: title-slide, center, middle name: workflow-data-sharing # Workflow: Data Sharing with DataLad <!--the next --- is a horizontal line--> --- ### ❓ "*Did you notice any institute-specific peculiarities regarding infrastructure?*" --- # Summary #### ❓ "*Did you notice any institute-specific peculiarities regarding infrastructure?*" 1\.
We are still lacking *MPIB-hosted* infrastructure to host DataLad datasets - We have a [GitLab](https://git.mpib-berlin.mpg.de/explore/projects) instance to host dataset *structure* - We have an *experimental* [in-house GIN instance](http://gin.mpib-berlin.mpg.de/) with 5TB that can also host annexed data - We have [KEEPER](https://keeper.mpdl.mpg.de/accounts/login/?next=/) (a cloud-sharing service with 1TB by the [MPDL](https://www.mpdl.mpg.de/en/)) which can be configured as a [DataLad special remote](http://handbook.datalad.org/en/latest/basics/101-138-sharethirdparty.html#the-common-case-repository-hosting-without-annex-support-and-special-remotes) → Flexible in-house storage space is needed! --- # Share version-controlled datasets with DataLad - With DataLad, you can share data like you share code - DataLad datasets can be cloned, pushed and updated from and to remote hosting services <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#http://handbook.datalad.org/en/latest/_images/collaboration.svg" alt="see <a href="https://handbook.datalad.org/en/latest/basics/101-105-install.html" target="_blank">DataLad Handbook: Install an existing dataset</a>" width="70%" /> <p class="caption">see <a href="https://handbook.datalad.org/en/latest/basics/101-105-install.html" target="_blank">DataLad Handbook: Install an existing dataset</a></p> </div> --- # Interoperability with a range of hosting services DataLad is built to maximize interoperability with a wide range of hosting services and storage technologies .center[ <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#http://handbook.datalad.org/en/latest/_images/publishing_network_publishparts2.svg" alt="see <a href="http://handbook.datalad.org/en/latest/basics/101-138-sharethirdparty.html" target="_blank">DataLad Handbook: Beyond shared infrastructure</a>" width="55%" /> <p class="caption">see <a href="http://handbook.datalad.org/en/latest/basics/101-138-sharethirdparty.html" target="_blank">DataLad Handbook: Beyond shared infrastructure</a></p> </div> ] --- # Data sharing via GIN <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#https://gin.g-node.org/img/favicon.png" alt="<a href="https://gin.g-node.org/" target="_blank">https://gin.g-node.org/</a>" width="10%" /> <p class="caption"><a href="https://gin.g-node.org/" target="_blank">https://gin.g-node.org/</a></p> </div> > "*GIN is [...] a web-accessible repository store of your data based on git and git-annex that you can access securely anywhere you desire while keeping your data in sync, backed up and easily accessible [...]"* -- #### Advantages of GIN (non-exhaustive list) - free and open-source (could be hosted within MPIs / MPG) - supports private and public repositories - publicly funded by the Federal Ministry of Education and Research (BMBF) - servers are located in Germany (near Munich) - provides Digital Object Identifiers (DOIs) (details [here](https://gin.g-node.org/G-Node/Info/wiki/DOI)) - allows you to set your own license (details [here](https://gin.g-node.org/G-Node/Info/wiki/Licensing)) - DataLad plays perfectly with GIN, since both use git + git-annex (details [here](https://handbook.datalad.org/en/latest/basics/101-139-gin.html)) --- # Publishing a DataLad dataset to GIN in only 4 steps 1\. Create a dataset

```bash
datalad create my_dataset
```

-- 2\. Save data into the dataset

```bash
datalad save -m "add data to dataset"
```

-- 3\.
Add the GIN remote ("sibling")

```bash
datalad siblings add -d . --name gin --url git@gin.g-node.org:/my_username/my_dataset.git
```

-- 4\. Transfer the dataset to GIN

```bash
datalad push --to gin
```

-- Done!<sup>1</sup> 🎉 <sup><sup>1</sup> To be fair, it's a bit more complex than that ... 😇 (details [here](https://handbook.datalad.org/en/latest/basics/101-139-gin.html))</sup> --- # Data sharing on KEEPER <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#https://keeper.mpdl.mpg.de/media/img/catalog/KEEPER_logo.png" alt="<a href="https://keeper.mpdl.mpg.de/" target="_blank">https://keeper.mpdl.mpg.de/</a>" width="35%" /> <p class="caption"><a href="https://keeper.mpdl.mpg.de/" target="_blank">https://keeper.mpdl.mpg.de/</a></p> </div> > "*A free service for all Max Planck employees and project partners with **more than 1TB of storage per user** for your research data. > Profit from safe data storage, seamlessly integrated into your research workflow.*" - \> 1 TB per Max Planck employee - data hosted on MPS servers - configurable as a [DataLad special remote](http://handbook.datalad.org/en/latest/basics/101-139-dropbox.html) ... and after some configuration:

```bash
datalad push --to keeper
```

<!----- #### Suggested alternatives to GIN that can be used with DataLad (selection): - [Keeper](https://keeper.mpdl.mpg.de/) (Seafile) offers all Max Planck employees 1TB(!) of storage (expandable) - [Open Science Framework (OSF)](https://osf.io/), popular in Psychology / Cognitive Neuroscience (see [details](http://docs.datalad.org/projects/osf/en/latest/)) --> --- class: title-slide, center, middle name: workflow-code # Workflow: Project Management <!--the next --- is a horizontal line--> --- --- # Project management next to your data and code -- #### Project infrastructure on hosting services (GitLab / GitHub) - Discuss and plan your work in [issues](https://docs.gitlab.com/ee/user/project/issues/) - Propose changes to code or data using [merge requests](https://docs.gitlab.com/ee/user/project/merge_requests/) - Manage access to your code and data with detailed [permissions and roles](https://docs.gitlab.com/ee/user/permissions.html) - Add documentation to your repo or in a separate [wiki](https://docs.gitlab.com/ee/user/project/wiki/) -- #### GitLab for Max Planck employees - hosted by GWDG: https://gitlab.gwdg.de/users/sign_in - hosted by your institute<sup>1</sup>, e.g., at MPIB: https://git.mpib-berlin.mpg.de .footnote[ <sup>1</sup> Using your Max Planck credentials, you might already have an account! ] --- # Discuss ideas and plan your work: Issues .pull-left[ <img src="data:image/png;base64,#https://gitlab.pavlovia.org/help/user/project/issues/img/new_issue.png" width="100%" /> #### Example Open a new issue in our `highspeed` [project repository](https://git.mpib-berlin.mpg.de/wittkuhn/highspeed/-/issues/new?issue) ] -- .pull-right[ #### Elements of a new issue (details [here](https://docs.gitlab.com/ee/user/project/issues/managing_issues.html#elements-of-the-new-issue-form)) - **Description:** Markdown + HTML support, task lists, etc.
- **Confidentiality**: Issue visible only to team members - **Assignee**: Assign responsibilities to team members - **Milestone**: Add issues to important milestones - **Labels**: Organize issues by labels, e.g., `bug` - **Due date**: Set due dates for issues #### More functions of issues (details [here](https://docs.gitlab.com/ee/user/project/issues/)) - Issues can be combined in [issue boards](https://docs.gitlab.com/ee/user/project/issue_board.html) - Issues can be [sorted](https://docs.gitlab.com/ee/user/project/issues/sorting_issue_lists.html) (by due date, label priority, etc.) - Issues can be [transferred between repositories](https://docs.gitlab.com/ee/user/project/issues/managing_issues.html#moving-issues) - Issues can be [crosslinked](https://docs.gitlab.com/ee/user/project/issues/crosslinking_issues.html), e.g., in commit messages: `git commit -m "add missing data, close #37"` - Issues can send [automated email notifications](https://docs.gitlab.com/ee/user/project/issues/managing_issues.html#new-issue-via-email) ] --- # Proposing changes: Merge / pull requests -- .pull-left[ 1. Clone the repository (i.e., "download the project") 1. Switch to a new branch (i.e., "start a separate version") 1. Make changes to the files and push the new version 1. Open a merge / pull request {{content}} ] .footnote[ <sup>1</sup> *merge* requests on GitLab / *pull* requests on GitHub ] .pull-right[ <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#https://zenodo.org/record/3678226/files/Contributing.jpg?download=1" alt="<a href="https://zenodo.org/record/3695300" target="_blank"><sup>by Scriberia for The Turing Way community (CC-BY 4.0)</sup></a>" width="80%" /> <p class="caption"><a href="https://zenodo.org/record/3695300" target="_blank"><sup>by Scriberia for The Turing Way community (CC-BY 4.0)</sup></a></p> </div> ] -- The maintainer (you) can ... - see what was changed when by whom - add changes to the merge request - run (automated) checks on the contribution {{content}} -- **Examples:** - Your supervisor proposes changes in your manuscript - A collaborator adds new data to your dataset - A colleague fixes several bugs in your analysis pipeline --- class: title-slide, center, middle name: workflow-presentation # Workflow: Code and Data Presentation <!--the next --- is a horizontal line--> --- --- # Project website with main statistical results #### From Wittkuhn & Schuck, 2021, *Nature Communications* (see [Code Availability statement](https://www.nature.com/articles/s41467-021-21970-2#code-availability)): > "*We share all code used in this study. An overview of all the resources is publicly available on our **project website.**"* Project website publicly available at https://wittkuhn.mpib.berlin/highspeed/ -- #### Reproducible reports with [Bookdown](https://bookdown.org/yihui/bookdown/) / [RMarkdown](https://bookdown.org/yihui/rmarkdown/) > *"R Markdown is a file format for making dynamic documents with R. An R Markdown document is written in markdown (an easy-to-write plain text format) and contains chunks of embedded R code [...]"* - Project documentation and main statistical analyses are written in RMarkdown (see [here](https://github.com/lnnrtwttkhn/highspeed-analysis/tree/master/code)) - Documentation pages showcase non-executed code (used in subdatasets) in Python and Bash - Statistical analyses are executed and the website rendered automatically via [Continuous Integration / Deployment (CI/CD)](https://docs.gitlab.com/ee/ci/): 1.
In the [main project repository](https://git.mpib-berlin.mpg.de/wittkuhn/highspeed), all RMarkdown files are [combined](https://git.mpib-berlin.mpg.de/wittkuhn/highspeed/-/blob/master/_bookdown.yml#L22-36) using [bookdown](https://bookdown.org/) (across subdatasets) 1. Input data is [automatically retrieved](https://git.mpib-berlin.mpg.de/wittkuhn/highspeed/-/blob/master/.gitlab-ci.yml#L5-75) from GIN and / or Keeper using DataLad (run in a [Docker container](https://git.mpib-berlin.mpg.de/wittkuhn/highspeed/-/blob/master/.docker/datalad/Dockerfile)) 1. The RMarkdown files are [run in Docker](https://git.mpib-berlin.mpg.de/wittkuhn/highspeed/-/blob/master/.docker/bookdown/Dockerfile) (executing main statistical analyses) and [rendered](https://git.mpib-berlin.mpg.de/wittkuhn/highspeed/-/blob/master/.gitlab-ci.yml#L99) into a static website 1. The static website is [deployed to GitLab pages](https://git.mpib-berlin.mpg.de/wittkuhn/highspeed/-/blob/master/.gitlab-ci.yml#L95-106) → This pipeline is automatically triggered on every push (change) to the main repository. --- class: title-slide, center, middle name: workflow-containers # Workflow: Software Containers <!--the next --- is a horizontal line--> --- --- # Software containers and virtual environments #### Software containers > *"Containers allow a researcher to package up a project with all of the parts it needs - such as libraries, dependencies, and system settings - and ship it all out as one package."* (see [The Turing Way](https://the-turing-way.netlify.app/reproducible-research/renv/renv-containers.html#what-are-containers)) - `highspeed-bids`: containerized conversion of MRI data to BIDS using [HeuDiConv](https://hub.docker.com/r/nipy/heudiconv) - `highspeed-fmriprep`: containerized execution of the pre-processing pipeline [fMRIPrep](https://fmriprep.org/en/stable/singularity.html) - `highspeed-mriqc`: containerized creation of MRI quality reports using [MRIQC](https://mriqc.readthedocs.io/en/stable/docker.html) - `highspeed-analysis`: containerized execution of statistical analyses in a [custom R container](https://git.mpib-berlin.mpg.de/wittkuhn/highspeed/-/blob/master/.docker/bookdown/Dockerfile) - `tools`: a personal collection of commonly used containers in a DataLad dataset (see [details](https://github.com/lnnrtwttkhn/tools)) -- #### Virtual environments (e.g., [in Python](https://docs.python.org/3/tutorial/venv.html)) > *"[...] it may not be possible for one Python installation to meet the requirements of every application. The solution for this problem is to create a virtual environment, a self-contained directory tree that contains a Python installation for a particular version of Python, plus a number of additional packages."*

```bash
pip freeze > requirements.txt    # export pinned dependencies
pip install -r requirements.txt  # restore them on another machine
```

--- class: title-slide, center, middle name: discussion # Discussion <!--the next --- is a horizontal line--> --- --- # Summary, outlook, challenges and discussions > *"He [Jon Claerbout] has also pointed out that we have reached a point where solutions are available - it is now possible to publish computational research that is really reproducible by others.*" Buckheit & Donoho (1995), describing how Jon Claerbout and his team shared CD-ROMs with interactive code that could regenerate the figures in their books - in the *early 90s* *(addition in parentheses)* -- #### The technical solutions are already available!
- Code and data management using Git / DataLad - Reproducible computational environments using software containers - Reliance on community standards for data organization and code style guides - Project management via issue boards and pull / merge requests on GitHub / GitLab - Focus education on reproducible science and technical skills -- #### The long-term challenges are largely non-technical: - moving towards a "culture of reproducibility" (cf. Russ Poldrack, see e.g., [this talk](https://www.youtube.com/watch?v=XjW3t-qXAiE)) - changing incentives / funding schemes - education, education, education - implementing "slow science" (see e.g., [Frith, 2020, *TICS*](https://doi.org/10.1016/j.tics.2019.10.007)) --- # Take-home message #### Science like open-source software development All the tools we need to make science fully transparent and reproducible are available today! -- #### "*What can I do to get started?*" 1. Learn about Git 1. Learn about DataLad 1. Learn about software containers ... and reproducible you may be! ✨ --- # Overview of learning resources #### Learn Git - ["Pro Git"](https://git-scm.com/book/en/v2) by Scott Chacon and Ben Straub - ["Happy Git and GitHub for the useR"](https://happygitwithr.com/) by Jenny Bryan, the STAT 545 TAs, and Jim Hester - ["Version Control"](https://the-turing-way.netlify.app/reproducible-research/vcs.html) by The Turing Way - ["Version Control with Git"](https://swcarpentry.github.io/git-novice/) by Software Carpentry #### Learn DataLad - ["DataLad Handbook"](http://handbook.datalad.org/en/latest/) by the DataLad team - ["Research Data Management with DataLad"](https://www.youtube.com/playlist?list=PLEQHbPfpVqU5sSVrlwxkP0vpoOpgogg5j) | Recording of a full-day workshop on YouTube - [DataLad on YouTube](https://www.youtube.com/c/DataLad) | Recorded workshops, tutorials and talks on DataLad --- # Thank you!
.pull-left[ <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#https://schucklab.gitlab.io/img/group_photo.png" alt="<a href="https://schucklab.gitlab.io/" target="_blank">Schuck lab</a>" width="60%" /> <p class="caption"><a href="https://schucklab.gitlab.io/" target="_blank">Schuck lab</a></p> </div> ] .pull-left[ <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#https://www.mpg.de/assets/og-logo-8216b4912130f3257762760810a4027c063e0a4b09512fc955727997f9da6ea3.jpg" alt="<a href="https://www.mpg.de/en" target="_blank">Max Planck Society</a>" width="50%" /> <p class="caption"><a href="https://www.mpg.de/en" target="_blank">Max Planck Society</a></p> </div> ] .pull-left[ <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#https://secure.gravatar.com/avatar/f49adcdd1c7bb710cdf529ab916c3098?s=800&d=identicon" alt="<a href="https://www.mpib-berlin.mpg.de/mitarbeiter/michael-krause" target="_blank">Michael Krause</a>" width="30%" /> <p class="caption"><a href="https://www.mpib-berlin.mpg.de/mitarbeiter/michael-krause" target="_blank">Michael Krause</a></p> </div> ] .pull-left[ <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#https://www.repronim.org/images/logo-square-256.png" alt="<a href="https://www.repronim.org/" target="_blank">ReproNim</a>" width="30%" /> <p class="caption"><a href="https://www.repronim.org/" target="_blank">ReproNim</a></p> </div> ] --- # Discussion about potential limitations #### "*That's too technical*" - Many research fields are becoming increasingly data-intensive and computation-heavy - Computational / programming skills are increasingly sought-after in the (non-)academic job market - Focus on education and technical support -- #### "*Why these tools? Can't we use something else?*" - Git has been used by software developers for well over a decade - GitHub has > 50 million users worldwide - These tools are well-established, open-source, free to use and **available today** --- class: title-slide, center, middle name: appendix-resources # Appendix: Resources <!--the next --- is a horizontal line--> --- --- class: title-slide, center, middle name: motivation # Introduction: Motivation for Reproducibility <!--the next --- is a horizontal line--> --- --- # Motivation: "Open" Science should just be "Science" .pull-left[ *"An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures."* Buckheit & Donoho (1995), paraphrasing Jon Claerbout ] .pull-right[ <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#https://wiki.seg.org/images/b/b0/Jon_Claerbout_headshot.jpg" alt="<a href="https://wiki.seg.org/wiki/Jon_Claerbout" target="_blank">Jon Claerbout</a></br>Geophysicist at Stanford University</br>(CC-BY-SA)" width="50%" /> <p class="caption"><a href="https://wiki.seg.org/wiki/Jon_Claerbout" target="_blank">Jon Claerbout</a></br>Geophysicist at Stanford University</br>(CC-BY-SA)</p> </div> ] ??? - Jon Claerbout, a distinguished exploration geophysicist at Stanford University - He has also pointed out that we have reached a point where solutions are available - it is now possible to publish computational research that is really reproducible by others. --- exclude: true # Good Scientific Practice?
#### Excerpt from the "Rules of Good Scientific Practice" by the Max Planck Society (November 24, 2000) > *"Scientific examinations, experiments and numerical calculations can only be reproduced or reconstructed if all the important steps are comprehensible. > For this reason, full and adequate reports are necessary, and these reports must be kept for a minimum period of ten years, not least as a source of reference, should the published results be called into question by others."* <sup>1</sup> .footnote[ <sup>1</sup> Full PDF available [here](https://www.mpg.de/16404553/rules-scientific-practice.pdf) ] -- exclude: true #### Excerpt from your employment contract > *"The rules for safeguarding good scientific practice of the Max Planck Society dated November 24, 2000 in its current version* [👆] *are part of the employment contract."* -- exclude: true Do we meet these standards? --- # Reproducible research > *"[...] when the same analysis steps performed on the same dataset consistently produces the same answer."* <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#https://the-turing-way.netlify.app/_images/reproducible-matrix.jpg" alt="<a href="https://the-turing-way.netlify.app/reproducible-research/overview/overview-definitions.html" target="_blank">Table of Definitions for Reproducibility</a> by <i>The Turing Way</i> (CC-BY 4.0)" width="70%" /> <p class="caption"><a href="https://the-turing-way.netlify.app/reproducible-research/overview/overview-definitions.html" target="_blank">Table of Definitions for Reproducibility</a> by <i>The Turing Way</i> (CC-BY 4.0)</p> </div> ??? - **Reproducible:** A result is reproducible when the same analysis steps performed on the same dataset consistently produces the same answer. - **Replicable:** A result is replicable when the same analysis performed on different datasets produces qualitatively similar answers. - **Robust:** A result is robust when the same dataset is subjected to different analysis workflows to answer the same research question (for example one pipeline written in R and another written in Python) and a qualitatively similar or identical answer is produced. Robust results show that the work is not dependent on the specificities of the programming language chosen to perform the analysis. - **Generalisable:** Combining replicable and robust findings allow us to form generalisable results. Note that running an analysis on a different software implementation and with a different dataset does not provide generalised results. There will be many more steps to know how well the work applies to all the different aspects of the research question. Generalisation is an important step towards understanding that the result is not dependent on a particular dataset nor a particular version of the analysis pipeline. --- # Challenges: Many stages in the research cycle <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#https://keeper.mpdl.mpg.de/f/3a1863ac2c2e40809c5f/?dl=1" alt="<a href="https://zenodo.org/record/4906004" target="_blank"><sup>by Scriberia for The Turing Way community (CC-BY 4.0)</sup></a>" width="58%" /> <p class="caption"><a href="https://zenodo.org/record/4906004" target="_blank"><sup>by Scriberia for The Turing Way community (CC-BY 4.0)</sup></a></p> </div> --- # Challenges: Interaction between data and code <img src="data:image/png;base64,#https://keeper.mpdl.mpg.de/f/ead22cde6d724eda81d2/?dl=1" width="50%" style="display: block; margin: auto;" /> ???
- Data is produced through code (e.g., task code)
- Data is manipulated by code and new data is generated
- Mapping between input and output data
- This happens using specific software in specific versions

---

# Challenge: Documentation of methods and provenance

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#https://www.openuphub.eu/media/zoo/images/Sidney%20Harris_60c1243bb770a33f55ab7b012ff3e6dd.jpg" alt="&copy; Sidney Harris" width="60%" />
<p class="caption">© Sidney Harris</p>
</div>

???

- provide information on how data came into existence
- change data through documented code, not manually
- relate changes in data to changes in code

---

# How do scientists save important data?

.pull-left[
<blockquote class="twitter-tweet" width="280" align="center" data-theme="light"><p lang="en" dir="ltr">*Seen*! 😂This wonderful graphic on saving important data resonates with us on every possible level! 😃😁🤗How about you? By <a href="https://twitter.com/ErrantScience?ref_src=twsrc%5Etfw">@ErrantScience</a> = <a href="https://twitter.com/MCeeP?ref_src=twsrc%5Etfw">@MCeeP</a> & <a href="https://twitter.com/MichelleAReeve?ref_src=twsrc%5Etfw">@MichelleAReeve</a> <a href="https://t.co/0JaI6iTcNW">pic.twitter.com/0JaI6iTcNW</a></p>— Max Planck Society (@maxplanckpress) <a href="https://twitter.com/maxplanckpress/status/1431500205044781056?ref_src=twsrc%5Etfw">August 28, 2021</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
]

.pull-right[
<img src="data:image/png;base64,#https://pbs.twimg.com/media/E920a0eWYAENA1V?format=jpg&name=medium" width="90%" style="display: block; margin: auto;" />
]

---

# "Practice" of research code and data management

--

- "*Where is the data?*"
- "*Can I see your code?*"
- "*Which version of the code and data did I use to produce this result?*"
- "*What is the difference between `data_version1_edit.csv` and `data_version8_new_final.csv`?*"
- "*Where did you get this file / code from?*"
- "*I get different results on my machine ...*"
- "*But it worked when I ran it last month?!*"
- *"Which value did you set for the input of this function?"*

---

# The solution?

> **Organize science like open-source software (OSS) development**

--

#### **The tools already exist!**

--

1. **Version-control** and **dependency management**
  - Code, data and computational environments change all the time!
  - Example: Running the same analysis on your laptop, the cluster, or your collaborator's computer
  - Known solutions: Version-control (e.g., [Git](https://git-scm.com/), [DataLad](https://www.datalad.org/)) and software containers ([Docker](https://www.docker.com/), [Singularity](https://singularity.hpcng.org/))

--

2. **Collaboration, communication, acknowledgement and contribution**
  - Raising questions, reporting errors, suggesting ideas via [issues](https://docs.github.com/en/issues/tracking-your-work-with-issues/creating-issues/about-issues)
  - Proposing, discussing, and reviewing changes via [pull](https://docs.github.com/en/github/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/about-pull-requests) (GitHub) or [merge](https://docs.gitlab.com/ee/user/project/merge_requests/) (GitLab) requests
  - Services and infrastructure ([GitHub](https://github.com/), [GitLab](https://about.gitlab.com/), [GIN](https://gin.g-node.org/), [OSF](https://osf.io/) etc.) to share and release research products
  - Contributions (by individuals or projects) can be tracked and categorized

---

class: title-slide, center, middle
name: workflow-git

# Workflow: Version control using Git

<!--the next --- is a horizontal line-->

---

---

# The need for *proper* version-control in a nutshell

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#http://phdcomics.com/comics/archive/phd101212s.gif" alt="<a href="http://phdcomics.com/comics/archive/phd101212s.gif" target="_blank"><sup>&copy; Jorge Cham (phdcomics.com)</sup></a>" width="33%" />
<p class="caption"><a href="http://phdcomics.com/comics/archive/phd101212s.gif" target="_blank"><sup>© Jorge Cham (phdcomics.com)</sup></a></p>
</div>

---

# The need for *proper* version-control in a nutshell

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#http://phdcomics.com/comics/archive/phd052810s.gif" alt="<a href="http://phdcomics.com/comics/archive/phd052810s.gif" target="_blank"><sup>&copy; Jorge Cham (phdcomics.com)</sup></a>" width="55%" />
<p class="caption"><a href="http://phdcomics.com/comics/archive/phd052810s.gif" target="_blank"><sup>© Jorge Cham (phdcomics.com)</sup></a></p>
</div>

---

# What is version control?

.pull-left[
<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#https://zenodo.org/record/3695300/files/VersionControl.jpg?download=1" alt="<a href="https://zenodo.org/record/3695300" target="_blank"><sup>by Scriberia for The Turing Way community (CC-BY 4.0)</sup></a>" width="100%" />
<p class="caption"><a href="https://zenodo.org/record/3695300" target="_blank"><sup>by Scriberia for The Turing Way community (CC-BY 4.0)</sup></a></p>
</div>
]

.pull-right[
<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#https://zenodo.org/record/3695300/files/ProjectHistory.jpg?download=1" alt="<a href="https://zenodo.org/record/3695300" target="_blank"><sup>by Scriberia for The Turing Way community (CC-BY 4.0)</sup></a>" width="100%" />
<p class="caption"><a href="https://zenodo.org/record/3695300" target="_blank"><sup>by Scriberia for The Turing Way community (CC-BY 4.0)</sup></a></p>
</div>
]

--

.center[
- keep files organized
- keep track of changes
- revert changes or go back to previous versions
]
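For a first taste of the last point, here is a minimal sketch of going back to an earlier version of a file with Git (introduced on the next slides; the commit hash is a placeholder):

```bash
# list the project history and find the commit you want to return to
git log --oneline
# restore README.md to its state at that commit (hash is a placeholder)
git checkout a1b2c3d -- README.md
```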
---

# Version-control with Git

> Version control is a systematic approach to record changes made in a [...] set of files, over time. This allows you and your collaborators to track the history, see what changed, and recall specific versions later [...] ([Turing Way](https://the-turing-way.netlify.app/reproducible-research/vcs.html))

--

.pull-left[

#### Basic versioning workflow

1. Create files (text, code, etc.)
1. Work on the files (change, delete or add new content)
1. **Create a snapshot of the file status** (a "commit")

{{content}}

]

<!-- see https://stackoverflow.com/questions/46408057/incremental-slides-do-not-work-with-a-two-column-layout-->

.pull-right[
<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#https://git-scm.com/book/en/v2/images/distributed.png" alt="<a href="https://git-scm.com/book/en/v2/Getting-Started-About-Version-Control" target="_blank">Figure 3: Distributed Version Control Systems</a>" width="75%" />
<p class="caption"><a href="https://git-scm.com/book/en/v2/Getting-Started-About-Version-Control" target="_blank">Figure 3: Distributed Version Control Systems</a></p>
</div>
]

--

#### Git

- most popular **distributed version control system**
- free, [open-source](https://github.com/git) command-line tool
- started by [Linus Torvalds](https://en.wikipedia.org/wiki/Git#History) (creator of Linux) in 2005
- standard tool for any software developer
- Graphical User Interfaces (GUIs) exist, e.g., [GitKraken](https://www.gitkraken.com/)

???

- Back in the day: Software developers used BitKeeper to collaborate on code with colleagues
- Free access to BitKeeper was revoked in 2005 after a licensing dispute
- A new solution was needed, so Linus Torvalds coded it up
- First version after a couple of days

---

# The amazing superpowers of version-control

--

.pull-left[

#### Git as a distributed **version control** system

- keep track of changes in a directory (a "repository")
- take snapshots ("commits") of your repo at any time
- know the history: what was changed when, and by whom
- compare commits and go back to any previous state
- work on "branches" and flexibly "merge" them together

**save one file and all of its history instead of multiple versions of the same file**

{{content}}

]

.pull-right[
<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#https://keeper.mpdl.mpg.de/f/8fda5b269fef4d778007/?dl=1" alt="Screenshot of GitKraken" width="100%" />
<p class="caption">Screenshot of GitKraken</p>
</div>
]

--

#### Git as a **distributed** version control system

- "push" your repo to a "remote" location and share it
- host / share your repo on GitHub, GitLab or BitBucket
- work with others on the same files at the same time
- others can read / copy / edit and suggest changes
- make your repo public and openly share your work

---

# Git mini-tutorial: Create a repository and add content

1\. Open the Terminal / command line on your computer

--

2\. Create a new Git repository

```bash
$ git init my_project
Initialized empty Git repository in ~/my_project/.git/
```

--

3\. Move into the `my_project` directory using `cd` ("**c**hange **d**irectory")

```bash
$ cd my_project
```

--

4\. Create a `README.md` text file that contains the line `hello world`, using `echo`

```bash
$ echo "hello world" >> README.md
```

--

5\. List the contents of the `my_project` directory, using `ls`

```bash
$ ls
README.md
```

---

# Git mini-tutorial: Track contents

6\. Tell Git to track the changes in the `README.md` file, using `git add`

```bash
$ git add README.md
```

--

7\. *Commit* the changes in the `README.md` file to your repository's history, using `git commit`

```bash
$ git commit --message "initial commit"
[master (root-commit) 5118725] initial commit
 1 file changed, 1 insertion(+)
 create mode 100644 README.md
```

---

# Git mini-tutorial: Record changes over time

8\. Add another line to the `README.md` file, again using `echo`

```bash
$ echo "goodbye world" >> README.md
```

--

9\. Tell Git to also track this recent change, again using `git add`

```bash
$ git add README.md
```

--

10\. Commit this additional change to the history of the repository, again using `git commit`:

```bash
$ git commit -m "update README.md"
[master c56c4c0] update README.md
 1 file changed, 1 insertion(+)
```

--

11\. Show the history of the repository using `git log`

```bash
$ git log --oneline
c56c4c0 (HEAD -> master) update README.md
5118725 initial commit
```

???

Show the current status of your repository using `git status`:

```bash
git status
On branch master
nothing to commit, working tree clean
```

---

# Random tips to help you keep track

- Use [tags](https://git-scm.com/book/en/v2/Git-Basics-Tagging) to mark the state ("commit") in your code and data repo that was used to generate the results in the paper

```bash
git tag -a v1.0 -m "version used to generate results in our paper"
```

- Use software containers, e.g., [Docker](https://www.docker.com/) or [Singularity](https://sylabs.io/guides/3.0/user-guide/index.html), as sketched below

> Containers allow a researcher to package up a project with all of the parts it needs - such as libraries, dependencies, and system settings - and ship it all out as one package. Anyone can then open up a container and work within it, viewing and interacting with the project as if the machine they are accessing it from is identical to the machine specified in the container - regardless of what their computational environment actually is. They are designed to make it easier to transfer projects between very different environments. [Turing Way](https://the-turing-way.netlify.app/reproducible-research/renv/renv-containers.html)
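To make the container idea concrete, here is a minimal sketch with Singularity; the image name `analysis.sif`, the definition file `analysis.def` and the script path are hypothetical:

```bash
# build a container image from a definition file
# (building typically requires administrator rights)
singularity build analysis.sif analysis.def
# run the analysis inside the container,
# independent of the software installed on the host machine
singularity exec analysis.sif python code/analysis.py
```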
---

class: title-slide, center, middle
name: appendix-challenges-solutions

# Appendix: Challenges and solutions

<!--the next --- is a horizontal line-->

---

---

# Challenge: Relationship between code and data

- *"Which code produced which data?"*
- *"In which order do I need to execute the code?"*

--

#### Example solutions

- [datalad run](http://docs.datalad.org/en/stable/generated/man/datalad-run.html) (see the sketch at the bottom of this slide)

> `datalad run` *"[...] will record a shell command, and save all changes this command triggered in the dataset – be that new files or changes to existing files."* (see [details](http://handbook.datalad.org/en/latest/basics/basics-run.html) in the DataLad handbook)

- [GNU Make](https://www.gnu.org/software/make/)

> *"Make enables [...] to build and install your package without knowing the details of how that is done -- because these details are recorded in the makefile that you supply."*

> *"Make figures out automatically which files it needs to update, based on which source files have changed. It also automatically determines the proper order for updating files [...]"*
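A minimal `datalad run` sketch (the script and file names are hypothetical):

```bash
# record a command together with its inputs and outputs in the dataset history
datalad run \
  --input "inputs/raw/events.tsv" \
  --output "outputs/events_clean.tsv" \
  --message "clean raw event files" \
  "python code/clean_events.py"
```

The recorded command can later be re-executed from the dataset history with `datalad rerun`.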
---

# Challenge: Implementing a Data User Agreement (DUA)

#### From Wittkuhn & Schuck, 2021, project website (see section on [license information](https://wittkuhn.mpib.berlin/highspeed/#license-information)):

> "*If you download any of the published data, please complete our Data User Agreement (DUA).
> The Data User Agreement (DUA) we use for this study was taken from the Open Brain Consent project, distributed under Creative Commons Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0).*"

- based on templates and recommendations of the [Open Brain Consent](https://open-brain-consent.readthedocs.io/en/stable/) project (licensed [CC BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/deed.en))
- optional for data from Wittkuhn & Schuck, 2021
- Statistics: *N* = 72 accessed the DUA, 0 completed it
- a mandatory DUA cannot be implemented on GIN

---

class: title-slide, center, middle
name: appendix-continuous-integration

# Appendix: Continuous integration

<!--the next --- is a horizontal line-->

---

---

# Pros of continuous integration / deployment (CI/CD)

#### Figures and sourcedata always ready for download

> *"[...] we may request a source data file in Microsoft Excel format or a zipped folder. The source data file should, as a minimum, contain the raw data underlying any graphs and charts [...]"* (see [*Nat. Comms.* submission guidelines](https://www.nature.com/ncomms/submit/how-to-submit))

- Sourcedata and figures created and saved during CI are [available for download](https://git.mpib-berlin.mpg.de/wittkuhn/highspeed/-/jobs/25521/artifacts/browse/highspeed-analysis/) (see [details](https://docs.gitlab.com/ee/ci/pipelines/job_artifacts.html))

---

class: title-slide, center, middle
name: appendix-datalad-overview

# Appendix: DataLad Overview

<!--the next --- is a horizontal line-->

---

---

class: title-slide, center, middle
name: appendix-datalad-yoda

# Appendix: DataLad YODA principles

<!--the next --- is a horizontal line-->

---

---

# P1: *"One thing, one dataset"*

- Structure study elements (data, code, results) in dedicated directories
- Input data in `/inputs`, code in `/code`, results in `/outputs`, execution environments in `/envs`
- Use dedicated projects for multiple different analyses

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#https://handbook.datalad.org/en/latest/_images/dataset_modules.svg" alt="see <a href="https://handbook.datalad.org/en/latest/basics/101-127-yoda.html" target="_blank">DataLad Handbook: YODA: Best practices for data analyses in a dataset</a>" width="60%" />
<p class="caption">see <a href="https://handbook.datalad.org/en/latest/basics/101-127-yoda.html" target="_blank">DataLad Handbook: YODA: Best practices for data analyses in a dataset</a></p>
</div>

---

# P2: *"Record where you got it from, and where it is now"*

- Record where the data came from, or how it is dependent on or linked to other data
- Link re-usable data resource units as DataLad *subdatasets*
- `datalad clone`, `datalad download-url`, `datalad save`

.pull-left[
<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#https://handbook.datalad.org/en/latest/_images/data_origin.svg" alt="see <a href="https://handbook.datalad.org/en/latest/basics/101-127-yoda.html" target="_blank">DataLad Handbook: YODA: Best practices for data analyses in a dataset</a>" width="70%" />
<p class="caption">see <a href="https://handbook.datalad.org/en/latest/basics/101-127-yoda.html" target="_blank">DataLad Handbook: YODA: Best practices for data analyses in a dataset</a></p>
</div>
]

.pull-right[
<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#https://handbook.datalad.org/en/latest/_images/decentralized_publishing.svg" alt="see <a href="https://handbook.datalad.org/en/latest/basics/101-127-yoda.html" target="_blank">DataLad Handbook: YODA: Best practices for data analyses in a dataset</a>" width="120%" />
<p class="caption">see <a href="https://handbook.datalad.org/en/latest/basics/101-127-yoda.html" target="_blank">DataLad Handbook: YODA: Best practices for data analyses in a dataset</a></p>
</div>
]
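To make P1 and P2 concrete, a minimal sketch (the dataset name and clone URL are hypothetical):

```bash
# P1: create an analysis dataset preconfigured with the YODA layout
datalad create -c yoda my_analysis
cd my_analysis
# P2: register an existing dataset as a subdataset under inputs/,
# which records exactly where the data came from
datalad clone -d . https://gin.g-node.org/example/raw-data inputs/raw
```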
target="_blank">DataLad Handbook: YODA: Best practices for data analyses in a dataset</a>" width="120%" /> <p class="caption">see <a href="https://handbook.datalad.org/en/latest/basics/101-127-yoda.html" target="_blank">DataLad Handbook: YODA: Best practices for data analyses in a dataset</a></p> </div> ] --- # P3: *"Record what you did to it, and with what"* - Know how exactly the content of every file came to be that was not obtained from elsewhere - `datalad run` links input data with code execution to output data - `datalad containers-run` allows to do the same *within* software containers (e.g., Docker or Singularity) <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#https://handbook.datalad.org/en/latest/_images/decentralized_publishing.svg" alt="see <a href="https://handbook.datalad.org/en/latest/basics/101-127-yoda.html" target="_blank">DataLad Handbook: YODA: Best practices for data analyses in a dataset</a>" width="50%" /> <p class="caption">see <a href="https://handbook.datalad.org/en/latest/basics/101-127-yoda.html" target="_blank">DataLad Handbook: YODA: Best practices for data analyses in a dataset</a></p> </div> --- # DataLad: Resources, tutorials and teaching materials - The [DataLad Handbook](http://handbook.datalad.org/en/latest/) is an incredibly extensive resource - YouTube video: ["What is DataLad"](https://www.youtube.com/watch?v=IN0vowZ67vs) - YouTube video: Michael Hanke: ["How to introduce data management technology without sinking the ship?"](https://www.youtube.com/watch?v=uH75kYgwLH4) - YouTube playlist: ["Research Data Management with DataLad"](https://www.youtube.com/playlist?list=PLEQHbPfpVqU5sSVrlwxkP0vpoOpgogg5j) (recording of full-day workshop)