Reproducible Research Data Management with DataLad

Joint Lab Meeting (Gluth, Schuck & Schwabe Labs) at University of Hamburg

Slides | Source

Dr. Lennart Wittkuhn

lennart.wittkuhn@uni-hamburg.de

Institute of Psychology, University of Hamburg, Germany

Max Planck Research Group NeuroCode, Max Planck Institute for Human Development, Berlin, Germany

Max Planck UCL Centre for Computational Psychiatry and Ageing Research, Berlin, Germany

November 28, 2024

About

Dr. Lennart Wittkuhn

lennart.wittkuhn@uni-hamburg.de
https://lennartwittkuhn.com/
Mastodon GitHub LinkedIn

About me

I am a Postdoctoral Researcher in Cognitive Neuroscience at the Institute of Psychology at the University of Hamburg (PI: Prof. Nicolas Schuck)

BSc Psychology & MSc Cognitive Neuroscience (TU Dresden), PhD Cognitive Neuroscience (Max Planck Institute for Human Development)

I study the role of fast neural memory reactivation in the human brain, applying machine learning and computational modeling to fMRI data

I am passionate about computational reproducibility, research data management, open science and tools that improve the scientific workflow

Find out more about my work on my website, Google Scholar and ORCiD

About this presentation

Slides: https://lennartwittkuhn.com/talk-uhh-rdm-2024

Source: https://github.com/lnnrtwttkhn/talk-uhh-rdm-2024

Software: Reproducible slides built with Quarto and deployed to GitHub Pages using GitHub Actions for continuous integration & deployment

License: Creative Commons Attribution-ShareAlike 4.0 (CC BY-SA 4.0)

Contact: Feedback or suggestions via email or GitHub issues. Thank you!

Acknowledgements and further reading

Slides and presentations by Dr. Adina Wagner and the DataLad team, e.g., “DataLad - Decentralized Management of Digital Objects for Open Science” (Wagner 2024)

DataLad Handbook
(CC BY-SA 4.0)

The DataLad Handbook by Wagner et al. (2022) is a comprehensive educational resource for data management with DataLad.

Papers

Wilson et al. (2014). Best practices for scientific computing. PLOS Biology.
Wilson et al. (2017). Good enough practices in scientific computing. PLOS Computational Biology.
Lowndes et al. (2017). Our path to better science in less time using open data science tools. Nature Ecology Evolution.

Talks

Richard McElreath (2020). Science as amateur software development. YouTube
Russ Poldrack (2020). Toward a Culture of Computational Reproducibility. YouTube

… and many more!

Agenda

0. Background

1. Scientific workflows with DataLad

1.1 Version Control

1.2 Modularity & Linking

1.3 Provenance

1.4 Collaboration & Interoperability

2. Integrating DataLad with University of Hamburg infrastructure

2.1 UHHCloud (NextCloud)

2.2 UHH Object Storage

2.3 UHH GitLab

3. Integrating DataLad with Max Planck Society infrastructure

3.1 Keeper

3.2 Edmond

3.3 ownCloud / nextCloud

3.4 GitLab

4. Integrating DataLad with third-party infrastructure

4.1 GIN

5. Summary & Discussion

Background

Computational reproducibility

The issue of computational reproducibility in science

“… when the same analysis steps performed on the same dataset consistently produce the same answer.” ¹

by Scriberia for The Turing Way Community (2022) (Link, CC BY 4.0)

The problem

about more than half of research is not reproducible ²
- research data, code, software & materials are often not available “upon reasonable [sic] request”
- if resources are shared, they are often incomplete
90% of researchers: “reproducibility crisis” (N = 1576) ³

Why?

computational reproducibility is hard
researchers lack training
incentives are not (yet) aligned ⁴
Scientific workflows are special ✨ (see next slides)

“… accumulated evidence indicates […] substantial room for improvement with regard to research practices to maximize the efficiency of the research community’s use of the public’s financial investment.” (Munafò et al. 2017)

💡 We need a professional toolkit for digital research!

Scientific building blocks are not static

We need version control

Why we need version control

… for code (text files)

… for data (binary files) © Jorge Cham (phdcomics.com)

If everything is relevant, track everything.

What is version control?

“Version control is a systematic approach to record changes made in a […] set of files, over time. This allows you and your collaborators to track the history, see what changed, and recall specific versions later […]” (Turing Way)

keep track of changes in a directory (a “repository”)

take snapshots (“commits”) of your repo at any time

know the history: what was changed when by whom

compare commits and go back to any previous state

work on parallel “branches” & flexibly “merge” them

by The Turing Way Community and Scriberia (2024) (CC BY 4.0)

“push” your repo to a “remote” location & share it

share repos on platforms like GitHub or GitLab

work together on the same files at the same time

others can read, copy, edit and suggest changes

make your repo public and openly share your work

What are Git and DataLad?

git-scm.com (by Jason Long; CC BY 3.0 Unported)

most popular version control system
free, open-source command-line tool
graphical user interfaces exist, e.g., GitKraken
standard tool in the software industry
100 million GitHub users ⁵

Sadly, Git does not handle large (binary) files well.

datalad.org (from the DataLad Handbook by Wagner et al. (2022); CC BY-SA 4.0)

“Git for (large) data”
free, open-source command-line tool
builds on top of Git and git-annex
allows to version control arbitrarily large datasets ⁶
DataLad Python API: Use DataLad in your Python code
graphical user interface exists: DataLad Gooey

Example Dataset: Brain Imaging Data

Single subject epoch (block) auditory fMRI activation data

mkdir neuro-data
wget https://www.fil.ion.ucl.ac.uk/spm/download/data/MoAEpilot/MoAEpilot.bids.zip \
-O neuro-data.zip
unzip neuro-data.zip -d neuro-data
rm neuro-data.zip
cd neuro-data
mv MoAEpilot/* .
rm -R MoAEpilot

Brain Imaging Data Structure (BIDS) Gorgolewski et al. (2016) (CC BY 4.0)

tree
.
├── CHANGES
├── README
├── dataset_description.json
├── sub-01
│   ├── anat
│   │   └── sub-01_T1w.nii
│   └── func
│       ├── sub-01_task-auditory_bold.nii
│       └── sub-01_task-auditory_events.tsv
└── task-auditory_bold.json

4 directories, 7 files

Dataset from Functional Imaging Laboratory, UCL Queen Square Institute of Neurology, London, UK (Source)

Version Control with DataLad

from the DataLad Handbook by Wagner et al. (2022) (CC BY-SA 4.0)

datalad create neuro-data

View output

create(error): /tmp/neuro-data (dataset) [will not create a dataset in a non-empty directory, use `--force` option to ignore]

Rerun the command using --force:

datalad create --force neuro-data
create(ok): /tmp/neuro-data (dataset)

datalad save -m "save neuro data"

View output

add(ok): CHANGES (file)
add(ok): README (file)
add(ok): dataset_description.json (file)
add(ok): sub-01/anat/sub-01_T1w.nii (file)
add(ok): sub-01/func/sub-01_task-auditory_bold.nii (file)
add(ok): sub-01/func/sub-01_task-auditory_events.tsv (file)
add(ok): task-auditory_bold.json (file)
save(ok): . (dataset)                                              action summary:
  add (ok: 7)
  save (ok: 1)

Data in DataLad datasets are either stored in Git or git-annex

Git

handles small files well (text, code)
file contents are in Git history and will be shared
Shared with every dataset clone
Useful: small, non-binary, frequently modified files

git-annex

handles all types and sizes of files well
file contents are in the annex, not necessarily shared
Can be kept private on a per-file level
Useful: Large files, private files

Science is build from modular units

We need modularity and linking

Version control beyond single repositories

Research as a sequence

Prior works (code development, empirical data, etc.) are combined to produce results with goal of a publication
Aggregation across time and contributors
Aiming for (but often failing) to be reproducible
Often, there is one big project folder

A single repository is not enough!

Research as a cycle

Develop scientific outputs as modular but linked units
Independently update and develop data sources
Manage access to public / private datasets

Nesting of modular DataLad datasets

seamless nesting of modular datasets in hierarchical super-/sub-dataset relationships
based in Git submodules, but mono-repo feel thanks to recursive operations
overcomes scaling issues with large amounts of files (Example: Human Connectome Project)
modularizes research components for transparency, reuse and access management

Example: Intuitive data analysis structure

First, let’s create a new data analysis dataset:

datalad create -c yoda myanalysis
[INFO   ] Creating a new annex repo at /tmp/myanalysis
[INFO   ] Scanning for unlocked files (this may take some time)
[INFO   ] Running procedure cfg_yoda
[INFO   ] == Command start (output follows) =====
[INFO   ] == Command exit (modification check follows) =====
create(ok): /tmp/myanalysis (dataset)

-c yoda initializes useful structure (details here):

tree
.
├── CHANGELOG.md
├── README.md
└── code
    └── README.md
2 directories, 3 files

We install analysis input data as a subdataset to the dataset:

datalad clone -d . https://github.com/datalad-handbook/iris_data.git input/
[INFO   ] Remote origin not usable by git-annex; setting annex-ignore
install(ok): input (dataset)
add(ok): input (dataset)
add(ok): .gitmodules (file)
save(ok): . (dataset)
action summary:
  add (ok: 2)
  install (ok: 1)
  save (ok: 1)

input is a regular folder inside myanalysis

tree
.
├── CHANGELOG.md
├── README.md
├── code
│   └── README.md
└── input
    └── iris.csv
3 directories, 4 files

Modular units with clear provenance

git diff HEAD~1
diff --git a/.gitmodules b/.gitmodules
new file mode 100644
index 0000000..fc69c84
--- /dev/null
+++ b/.gitmodules
@@ -0,0 +1,5 @@
+[submodule "input"]
+       path = input
+       url = https://github.com/datalad-handbook/iris_data.git
+       datalad-id = 5800e71c-09f9-11ea-98f1-e86a64c8054c
+       datalad-url = https://github.com/datalad-handbook/iris_data.git
diff --git a/input b/input
new file mode 160000
index 0000000..b9eb768
--- /dev/null
+++ b/input
@@ -0,0 +1 @@
+Subproject commit b9eb768c145e4a253d619d2c8285e540869d2021

We know exactly where the subdataset comes from
We know exactly which version of the subdataset is installed
We can develop and update each subdataset independently

Science is exploratory and iterative

We need provenance

Reusing previous work is hard

by The Turing Way Community and Scriberia (2024)

“Your number one collaborator is yourself from 6 months ago and they don’t answer emails.”

“Which version of which script produced these outputs from which version of which data?”

Establishing provenance with DataLad

datalad run wraps around anything expressed in a command line call and saves the dataset modifications resulting from the execution.

datalad rerun repeats captured executions. If the outcomes differ, it saves a new state of them.

datalad containers-run executes command line calls inside a tracked software container and saves the dataset modifications resulting from the execution.

datalad containers-run \
  --message "Time series extraction from Locus Coeruleus"
  --container-name nilearn \
  --input 'mri/*_bold.nii' \
  --output 'sub-*/LC_timeseries_run-*.csv' \
  "python3 code/extract_lc_timeseries.py"
 
-- Git commit --
  commit 5a7565a640ff6de67e07292a26bf272f1ee4b00e
  Author:     Adina Wagner adina.wagner@t-online.de
  AuthorDate: Mon Nov 11 16:15:08 2019 +0100

  [DATALAD RUNCMD] Time series extraction from Locus Coeruleus
  === Do not change lines below ===
  {
   "cmd": "singularity exec --bind {pwd} .datalad/environments/nilearn.simg bash..",
   "dsid": "92ea1faa-632a-11e8-af29-a0369f7c647e",
   "inputs": [
    "mri/*.bold.nii.gz",
    ".datalad/environments/nilearn.simg"
   ],
   "outputs": ["sub-*/LC_timeseries_run-*.csv"],
   ...
  }
  ^^^ Do not change lines above ^^^

Enshrine the analysis in a script and record code execution together with input data, output files and software environment in the execution command

Result: Machine readable record about which data, code and software produced a result how, when and why

Use the unique identifier (hash) of the execution record to have a machine recompute and verify past work

datalad rerun 5a7565a640ff6de67
[INFO   ] run commit 5a7565a640ff6de67; (Time series extraction from Locus Coeruleus)
[INFO   ] Making sure inputs are available (this may take some time)
get(ok): mri/sub-01_bold.nii (file)
        [...]
[INFO   ] == Command start (output follows) =====
[INFO   ] == Command exit (modification check follows) =====
add(ok): sub-01/LC_timeseries_run-*.csv(file)

Science is collaborative & distributed

We need interoperability & transport logistics

Data sharing and collaboration with DataLad

“I have a dataset on my computer.
How can I share it or collaborate on it?”

tree
.
├── CHANGES
├── README
├── dataset_description.json
├── sub-01
│   ├── anat
│   │   └── sub-01_T1w.nii
│   └── func
│       ├── sub-01_task-auditory_bold.nii
│       └── sub-01_task-auditory_events.tsv
└── task-auditory_bold.json

4 directories, 7 files

Challenge: Scientific workflows are idiosyncratic across institutions / departments / labs / any two scientists

Interoperability with a range of hosting services

DataLad is built to maximize interoperability and streamline routines across hosting services and storage technology

see DataLad Handbook: “Beyond shared infrastructure”

Separate content in Git vs. git-annex behind the scenes

DataLad datasets are exposed via private or public repositories on a repository hosting service (e.g., GitLab or GitHub)
Data can’t be stored in the repository hosting service but can be kept in almost any third party storage
Publication dependencies automate interactions between both paces

Special cases

Repositories with annex support

gin.g-node.org

Easy: Only one remote repository
Examples: GIN, GitLab with annex support

Special remotes with repositories

DataLad-OSF

Flexible: Full history or single snapshot
Examples: DataLad-OSF

Have access to more data than you have disk-space

Cloned datasets are lean.

datalad clone git@gin.g-node.org:/lnnrtwttkhn/neuro-data.git
install(ok): /tmp/neuro-data (dataset)
cd neuro-data && du -sh
212K

“Metadata” (file names, availability) are present …

tree
.
├── CHANGES
├── README
├── dataset_description.json
├── sub-01
│   ├── anat
│   │   └── sub-01_T1w.nii
│   └── func
│       ├── sub-01_task-auditory_bold.nii
│       └── sub-01_task-auditory_events.tsv
└── task-auditory_bold.json

4 directories, 7 files

… but no file content:

open README
The file /tmp/README does not exist.

File contents can be retrieved on demand:

datalad get .
get(ok): CHANGES (file) [from origin...]
get(ok): README (file) [from origin...]
get(ok): dataset_description.json (file) [from origin...]
get(ok): sub-01/anat/sub-01_T1w.nii (file) [from origin...]
get(ok): sub-01/func/sub-01_task-auditory_bold.nii (file) [from origin...]
get(ok): sub-01/func/sub-01_task-auditory_events.tsv (file) [from origin...]
action summary:
  get (ok: 6)

Let’s check the dataset size again:

du -sh
49M

Drop file content that is not needed:

datalad drop .
drop(ok): CHANGES (file) [locking origin...]
drop(ok): README (file) [locking origin...]
drop(ok): dataset_description.json (file) [locking origin...]
drop(ok): sub-01/anat/sub-01_T1w.nii (file) [locking origin...]
drop(ok): sub-01/func/sub-01_task-auditory_bold.nii (file) [locking origin...]
drop(ok): sub-01/func/sub-01_task-auditory_events.tsv (file) [locking origin...]
drop(ok): . (directory)
action summary:
  drop (ok: 7)

When files are dropped, only “metadata” stays behind, and files can be re-obtained on demand.

Sharing DataLad datasets via UHHCloud (Nextcloud)

cloud.uni-hamburg.de

DataLad NEXT extension allows to push / clone DataLad datasets to / from Nextcloud (via WebDAV)

UHHCloud: “UHH members have a standard quota of 5 terabytes each (students have 100 gigabytes).”

✨ Features of UHHCloud (Nextcloud) ✨

data privacy compliant alternative to Google Drive, Dropbox, etc. (usually hosted on-site)
provided by your institution, so free to use
supports private and public repositories
can be used together with external collaborators
expose datasets for regular download without DataLad

Create a WebDAV sibling:

datalad create-sibling-webdav --dataset . \
  --name uhhcloud --mode filetree \
  'https://cloud.uni-hamburg.de/remote.php/dav/files/USERNAME/neuro-data'

Push dataset to UHHCloud

datalad push --to uhhcloud

Access the dataset on cloud.uni-hamburg.de

Sharing DataLad datasets via UHH Object Storage (~ Amazon S3 Buckets)

UHH Object Storage (Cloudian)

aws.amazon.com/de/s3

See Chapter: “Amazon S3 as a special remote” on how to push / clone DataLad datasets to / from an Object Storage

✨ Features of the UHH Object Storage (Cloudian) ✨

“unlimited” storage for UHH employees
multiple backups across devices and locations
data privacy compliant alternative to Amazon S3 Buckets (hosted on-site, offering 99% compatibility)
provided by your institution, so free to use
supports private and public repositories
can be used together with external collaborators
expose datasets for regular download without DataLad

Create a Object Storage sibling:

git annex initremote uhh-object-storage type=S3 encryption=none \
bucket=neuro-data public=no datacenter=EU \
host=s3-uhh.lzs.uni-hamburg.de protocol=https port=443 autoenable=true

Push dataset to the UHH Object Storage

datalad push --to uhh-object-storage

Access the dataset the UHH Object Storage

Sharing DataLad datasets via GitLab

gitlab.com

“GitLab is open source software to collaborate on code. Manage git repositories with fine-grained access controls that keep your code secure.”

GitLab for UHH students and employees

hosted by UHH: gitlab.rrz.uni-hamburg.de

✨ Features of GitLab ✨

free to use and open-source
supports private and public repositories
use project management infrastructure (merge requests, issue boards, etc.) for your dataset projects

Create a GitLab sibling

datalad siblings add --dataset . --name gitlab \
--url git@gitlab.rrz.uni-hamburg.de:wittkuhn/neuro-data.git

Push dataset metadata to GitLab

datalad push --to gitlab

Sharing DataLad datasets via Keeper

keeper.mpdl.mpg.de

“A free service for all Max Planck employees and project partners with more than 1TB of storage per user for your researchdata.”

✨ Features of Keeper ✨

> 1 TB per Max Planck employee (and expandable):
based on cloud-sharing service Seafile
data hosted on MPS servers
configurable as a DataLad special remote

Configure rclone

rclone config create neuro-data seafile \
url https://keeper.mpdl.mpg.de/ user wittkuhn@mpib-berlin.mpg.de \
library neuro-data pass supersafepassword

Create a library on Keeper and a Keeper sibling

git annex initremote keeper type=external externaltype=rclone \
chunk=50MiB encryption=none target=neuro-data

Push dataset to Keeper

datalad push --to keeper

Sharing DataLad datasets via Edmond

edmond.mpg.de

“Edmond is a research data repository for Max Planck researchers. It is the place to store completed datasets of research data with open access.”

✨ Features of Edmond ✨

based on Dataverse, hosted on MPS servers
use is free of charge
no storage limitation (on datasets or individual files)
flexible licensing

Two modes:

annex mode (default): non-human readable representation of the dataset that includes Git history and annexed data
filetree mode: human readable single snapshot of your dataset “as it currently is” that does not include history of annexed files (but Git history)

Create a Dataverse sibling for Edmond:

datalad add-sibling-dataverse https://edmond.mpg.de/ \
doi:10.17617/3.8LDVXK --mode filetree

Push dataset to Edmond / Dataverse

datalad push --to dataverse

Sharing DataLad datasets via ownCloud / Nextcloud

owncloud.gwdg.de

nextcloud.com

DataLad NEXT extension allows to push / clone DataLad datasets to / from ownCloud & Nextcloud (via WebDAV)

ownCloud GWDG: “50 GByte default storage space per user; flexible increase possible upon request”

✨ Features of ownCloud and Nextcloud ✨

data privacy compliant alternative to Google Drive, Dropbox, etc. (usually hosted on-site)
provided by your institution, so free to use
supports private and public repositories
can be used together with external collaborators
expose datasets for regular download without DataLad

Create a WebDAV sibling:

datalad create-sibling-webdav --dataset . \
  --name owncloud-gwdg --mode filetree \
  'https://owncloud.gwdg.de/remote.php/nonshib-webdav/neuro-data'

Push dataset to ownCloud

datalad push --to owncloud-gwdg

Sharing DataLad datasets via GitLab

gitlab.com

“GitLab is open source software to collaborate on code. Manage git repositories with fine-grained access controls that keep your code secure.”

GitLab for Max Planck employees

hosted by GWDG: gitlab.gwdg.de
hosted by your institute, e.g., git.mpib-berlin.mpg.de

✨ Features of GitLab ✨

free to use and open-source
several MPS instances available (see above)
supports private and public repositories
use project management infrastructure (merge requests, issue boards, etc.) for your dataset projects

Create a GitLab sibling

datalad siblings add --dataset . --name gitlab \
--url git@git.mpib-berlin.mpg.de:wittkuhn/neuro-data.git

Push dataset metadata to GitLab

datalad push --to gitlab

Sharing DataLad datasets via GIN

gin.g-node.org

“GIN is […] a web-accessible repository store of your data based on git and git-annex that you can access securely anywhere you desire while keeping your data in sync, backed up and easily accessible […]“

✨ Features of GIN ✨

free to use and open-source (could be hosted within your institution; for more details, see here)
currently unlimited storage capacity and no restrictions on individual file size
supports private and public repositories
publicly funded by the Federal Ministry of Education and Research (BMBF; details here)
servers on German land (Munich, Germany; cf. GDPR)
provides Digital Object Identifiers (DOIs) (details here) and allows free licensing (details here)

Create a GIN sibling

datalad siblings add --dataset . \
--name gin --url git@gin.g-node.org:/lnnrtwttkhn/neuro-data.git

Push dataset to GIN

datalad push --to gin

Publish and consume datasets like source code

Datasets can comfortably live in multiple locations:

datalad siblings
.: here(+) [git]
.: owncloud-gwdg(+) [git]
.: dataverse(+) [dataverse]
.: gin(+) [https://gin.g-node.org/lnnrtwttkhn/neuro-data (git)]
.: keeper(+) [rclone]
.: gitlab(-) [git@git.mpib-berlin.mpg.de:wittkuhn/neuro-data.git (git)]

Publication dependencies automate update in all places:

datalad siblings configure --name gitlab --publish-depends SIBLING

Redundancy: DataLad gets data from available sources

Clone the dataset from GitLab:

datalad clone https://git.mpib-berlin.mpg.de/wittkuhn/neuro-data

Access to special remotes needs to be configured:

[INFO   ] access to 3 dataset siblings keeper, dataverse-storage,
owncloud-gwdg-storage not auto-enabled, enable with:
|       datalad siblings -d "/tmp/neuro-data" enable -s SIBLING

DataLad retrieves data from available sources (here, GIN):

datalad get .
get(ok): CHANGES (file) [from gin-src...]
get(ok): README (file) [from gin-src...]
[...]

Summary

Summary and discussion

Science is complex

Scientific units are not static: We need version control
Science is modular: We need to link modular datasets
Science is iterative: We need to establish provenance
Science is collaborative and distributed: We want to share our work and integrate with diverse infrastructure

DataLad: Decentralized management of digital objects for open science

DataLad can version control arbitrary datasets
DataLad links modular version-controlled datasets
DataLad establishes provenance and transparency
DataLad integrates with diverse infrastructure

Develop everything like source code

Code and data management using Git and DataLad (free, open-source command-line tools)
Code and data sharing via flexible repository hosting services (GitLab, GitHub, GIN, etc.)
Code and data storage on various infrastructure (GIN, OSF, S3, Keeper, Dataverse, and many more!)
Project-related communication (ideas, problems, discussions) via issue boards on GitLab / GitHub etc.
Transparent contributions to code and data via merge requests on GitLab (i.e., pull requests on GitHub)
Reproducible procedures using datalad run, rerun, and containers-run commands (also Make etc.)
Reproducible computational environments using software containers (e.g., Docker, Apptainer, etc.)

✨ Towards science as distributed, open-source ~~software~~ knowledge development ✨ (cf. McElreath, 2020, 2023)

Overview of learning resources

Learn Git

“Pro Git” by Scott Chacon & Ben Straub
“Happy Git and GitHub for the useR” by Jenny Bryan, the STAT 545 TAs & Jim Hester
“Version Control” by The Turing Way
“Version Control with Git” by The Software Carpentries
“Version control” (chapter 3 of “Neuroimaging and Data Science”) by Ariel Rokem & Tal Yarkoni

Learn DataLad

“Datalad Handbook” by the DataLad team / Wagner et al., 2022, Zenodo
“Research Data Management with DataLad” | Recording of a full-day workshop on YouTube
Datalad on YouTube | Recorded workshops, tutorials and talks on DataLad

Learn both (disclaimer: shameless plug 🙈)

Full-semester course on “Version control of code and data using Git and DataLad” at University of Hamburg (generously funded by the Digital and Data Literacy in Teaching Lab program) with many open educational resources (online guide, quizzes and exercises)

References

Baker, Monya. 2016. “1,500 Scientists Lift the Lid on Reproducibility.” Nature 533 (7604): 452–54. https://doi.org/10.1038/533452a.

Crüwell, Sophia, Deborah Apthorp, Bradley J. Baker, Lincoln Colling, Malte Elson, Sandra J. Geiger, Sebastian Lobentanzer, et al. 2023. “What’s in a Badge? A Computational Reproducibility Investigation of the Open Data Badge Policy in One Issue of Psychological Science.” Psychological Science 34 (4): 512–22. https://doi.org/10.1177/09567976221140828.

Gorgolewski, Krzysztof J., Tibor Auer, Vince D. Calhoun, R. Cameron Craddock, Samir Das, Eugene P. Duff, Guillaume Flandin, et al. 2016. “The Brain Imaging Data Structure, a Format for Organizing and Describing Outputs of Neuroimaging Experiments.” Scientific Data 3 (1). https://doi.org/10.1038/sdata.2016.44.

Hardwicke, Tom E., Manuel Bohn, Kyle MacDonald, Emily Hembacher, Michèle B. Nuijten, Benjamin N. Peloquin, Benjamin E. deMayo, Bria Long, Erica J. Yoon, and Michael C. Frank. 2021. “Analytic Reproducibility in Articles Receiving Open Data Badges at the Journal Psychological Science : An Observational Study.” Royal Society Open Science 8 (1). https://doi.org/10.1098/rsos.201494.

Munafò, Marcus R., Brian A. Nosek, Dorothy V. M. Bishop, Katherine S. Button, Christopher D. Chambers, Nathalie Percie du Sert, Uri Simonsohn, Eric-Jan Wagenmakers, Jennifer J. Ware, and John P. A. Ioannidis. 2017. “A Manifesto for Reproducible Science.” Nature Human Behaviour 1 (1). https://doi.org/10.1038/s41562-016-0021.

Obels, Pepijn, Daniël Lakens, Nicholas A. Coles, Jaroslav Gottfried, and Seth A. Green. 2020. “Analysis of Open Data and Computational Reproducibility in Registered Reports in Psychology.” Advances in Methods and Practices in Psychological Science 3 (2): 229–37. https://doi.org/10.1177/2515245920918872.

Poldrack, Russell A. 2019. “The Costs of Reproducibility.” Neuron 101 (1): 11–14. https://doi.org/10.1016/j.neuron.2018.11.030.

The Turing Way Community. 2022. The Turing Way: A Handbook for Reproducible, Ethical and Collaborative Research. Zenodo. https://doi.org/10.5281/zenodo.3233853.

The Turing Way Community, and Scriberia. 2024. “Illustrations from the Turing Way: Shared Under CC-BY 4.0 for Reuse,” January. https://doi.org/10.5281/ZENODO.3332807.

Wagner, Adina S. 2024. “DataLad: Decrentalized Management of Digital Objects for Open Science.” Zenodo, January. https://doi.org/10.5281/ZENODO.10556597.

Wagner, Adina S., Laura K. Waite, Kyle Meyer, Marisa K. Heckner, Tobias Kadelka, Niels Reuter, Alexander Q. Waite, et al. 2022. “The DataLad Handbook,” April. https://doi.org/10.5281/ZENODO.6463273.

Wicherts, Jelte M., Denny Borsboom, Judith Kats, and Dylan Molenaar. 2006. “The Poor Availability of Psychological Research Data for Reanalysis.” American Psychologist 61 (7): 726–28. https://doi.org/10.1037/0003-066x.61.7.726.

Thank you!

Dr. Lennart Wittkuhn

lennart.wittkuhn@uni-hamburg.de
https://lennartwittkuhn.com/
Mastodon GitHub LinkedIn

Slides: https://lennartwittkuhn.com/talk-uhh-rdm-2024

Source: https://github.com/lnnrtwttkhn/talk-uhh-rdm-2024

Software: Reproducible slides built with Quarto and deployed to GitHub Pages using GitHub Actions for continuous integration & deployment

License: Creative Commons Attribution-ShareAlike 4.0 (CC BY-SA 4.0)

Contact: Feedback or suggestions via email or GitHub issues. Thank you!

Appendix

Example: “Let me just quickly copy those files …”

Without datalad run

Researcher writes some Python code to copy files:

for sourcefile, dest in zip(glob(path_source), glob(path_dest)):
  destination = path.join(dest, Path(sourcefile).name)
  shutil.move(sourcefile, destination)

glob does not sort! 😱

source/
├── sub-01
│   └── sub-01-events.tsv
├── sub-02
│   └── sub-02-events.tsv
├── sub-03
│   └── sub-03-events.tsv
├── sub-04
│   └── sub-04-events.tsv
[...]

destination/
├── sub-01
│   └── sub-03-events.tsv
├── sub-02
│   └── sub-01-events.tsv
├── sub-03
│   └── sub-04-events.tsv
├── sub-04
│   └── sub-02-events.tsv
[...]

Researcher shares analysis with collaborators.

With datalad run

Researcher uses datalad-run to copy files:

$ datalad run -m "Copy event files" \
"for sub in eventfiles;
    do mv ${sub}/events.tsv analysis/${sub}/events.tsv;
done"

empty

Walkthrough: Sharing DataLad datasets via Keeper

Configure rclone:

rclone config
2024/03/19 11:45:32 NOTICE: Config file "/root/.config/rclone/rclone.conf" not found - using defaults
No remotes found, make a new one?
n) New remote
s) Set configuration password
q) Quit config
name> neuro-data

Option Storage.
Type of storage to configure.
Choose a number from below, or type in your own value.
 1 / 1Fichier
   \ (fichier)
 2 / Akamai NetStorage
   \ (netstorage)
 3 / Alias for an existing remote
   \ (alias)
 4 / Amazon S3 Compliant Storage Providers including AWS, Alibaba, ArvanCloud, Ceph, ChinaMobile, Cloudflare, DigitalOcean, Dreamhost, GCS, HuaweiOBS, IBMCOS, IDrive, IONOS, LyveCloud, Leviia, Liara, Linode, Minio, Netease, Petabox, RackCorp, Rclone, Scaleway, SeaweedFS, StackPath, Storj, Synology, TencentCOS, Wasabi, Qiniu and others
   \ (s3)
 5 / Backblaze B2
   \ (b2)
 6 / Better checksums for other remotes
   \ (hasher)
 7 / Box
   \ (box)
 8 / Cache a remote
   \ (cache)
 9 / Citrix Sharefile
   \ (sharefile)
10 / Combine several remotes into one
   \ (combine)
11 / Compress a remote
   \ (compress)
12 / Dropbox
   \ (dropbox)
13 / Encrypt/Decrypt a remote
   \ (crypt)
14 / Enterprise File Fabric
   \ (filefabric)
15 / FTP
   \ (ftp)
16 / Google Cloud Storage (this is not Google Drive)
   \ (google cloud storage)
17 / Google Drive
   \ (drive)
18 / Google Photos
   \ (google photos)
19 / HTTP
   \ (http)
20 / Hadoop distributed file system
   \ (hdfs)
21 / HiDrive
   \ (hidrive)
22 / ImageKit.io
   \ (imagekit)
23 / In memory object storage system.
   \ (memory)
24 / Internet Archive
   \ (internetarchive)
25 / Jottacloud
   \ (jottacloud)
26 / Koofr, Digi Storage and other Koofr-compatible storage providers
   \ (koofr)
27 / Linkbox
   \ (linkbox)
28 / Local Disk
   \ (local)
29 / Mail.ru Cloud
   \ (mailru)
30 / Mega
   \ (mega)
31 / Microsoft Azure Blob Storage
   \ (azureblob)
32 / Microsoft Azure Files
   \ (azurefiles)
33 / Microsoft OneDrive
   \ (onedrive)
34 / OpenDrive
   \ (opendrive)
35 / OpenStack Swift (Rackspace Cloud Files, Blomp Cloud Storage, Memset Memstore, OVH)
   \ (swift)
36 / Oracle Cloud Infrastructure Object Storage
   \ (oracleobjectstorage)
37 / Pcloud
   \ (pcloud)
38 / PikPak
   \ (pikpak)
39 / Proton Drive
   \ (protondrive)
40 / Put.io
   \ (putio)
41 / QingCloud Object Storage
   \ (qingstor)
42 / Quatrix by Maytech
   \ (quatrix)
43 / SMB / CIFS
   \ (smb)
44 / SSH/SFTP
   \ (sftp)
45 / Sia Decentralized Cloud
   \ (sia)
46 / Storj Decentralized Cloud Storage
   \ (storj)
47 / Sugarsync
   \ (sugarsync)
48 / Transparently chunk/split large files
   \ (chunker)
49 / Union merges the contents of several upstream fs
   \ (union)
50 / Uptobox
   \ (uptobox)
51 / WebDAV
   \ (webdav)
52 / Yandex Disk
   \ (yandex)
53 / Zoho
   \ (zoho)
54 / premiumize.me
   \ (premiumizeme)
55 / seafile
   \ (seafile)
Storage> seafile

Option url.
URL of seafile host to connect to.
Choose a number from below, or type in your own value.
 1 / Connect to cloud.seafile.com.
   \ (https://cloud.seafile.com/)
url> https://keeper.mpdl.mpg.de/

Option user.
User name (usually email address).
Enter a value.
user> wittkuhn@mpib-berlin.mpg.de

Option pass.
Password.
Choose an alternative below. Press Enter for the default (n).
y) Yes, type in my own password
g) Generate random password
n) No, leave this optional password blank (default)
y/g/n> y
Enter the password:
password:
Confirm the password:
password:

Option 2fa.
Two-factor authentication ('true' if the account has 2FA enabled).
Enter a boolean value (true or false). Press Enter for the default (false).
2fa> false

Option library.
Name of the library.
Leave blank to access all non-encrypted libraries.
Enter a value. Press Enter to leave empty.
library> neuro-data

Option library_key.
Library password (for encrypted libraries only).
Leave blank if you pass it through the command line.
Choose an alternative below. Press Enter for the default (n).
y) Yes, type in my own password
g) Generate random password
n) No, leave this optional password blank (default)
y/g/n> n

Edit advanced config?
y) Yes
n) No (default)
y/n> n

Configuration complete.
Options:
- type: seafile
- url: https://keeper.mpdl.mpg.de/
- user: wittkuhn@mpib-berlin.mpg.de
- pass: *** ENCRYPTED ***
- library: neuro-data
Keep this "neuro-data" remote?
y) Yes this is OK (default)
e) Edit this remote
d) Delete this remote
y/e/d> y

Current remotes:

Name                 Type
====                 ====
neuro-data           seafile

e) Edit existing remote
n) New remote
d) Delete remote
r) Rename remote
c) Copy remote
s) Set configuration password
q) Quit config
e/n/d/r/c/s/q> q

export KEEPER_PASSWORD=password
rclone config create neuro-data seafile url https://keeper.mpdl.mpg.de/ user wittkuhn@mpib-berlin.mpg.de library neuro-data pass $KEEPER_PASSWORD

git annex initremote keeper type=external externaltype=rclone chunk=50MiB encryption=none target=neuro-data

initremote keeper ok
(recording state in git...)

datalad siblings
.: here(+) [git]
.: keeper(+) [rclone]

datalad push --to keeper
copy(ok): CHANGES (file) [to keeper...]
copy(ok): README (file) [to keeper...]
copy(ok): dataset_description.json (file) [to keeper...]
copy(ok): sub-01/anat/sub-01_T1w.nii (file) [to keeper...]
copy(ok): sub-01/func/sub-01_task-auditory_bold.nii (file) [to keeper...]
copy(ok): sub-01/func/sub-01_task-auditory_events.tsv (file) [to keeper...]
copy(ok): task-auditory_bold.json (file) [to keeper...]            action summary:
  copy (ok: 7)

Walkthrough: Sharing DataLad datasets via Edmond

If you want to publish a dataset to Dataverse, you will need a dedicated location on Dataverse that we will publish our dataset to. For this, we will use a Dataverse dataset⁷.

Go to Edmond, log in, and create a new draft Dataverse dataset via the Add Data header
The New Dataset button takes you to a configurator for your Dataverse dataset. Provide all relevant details and metadata entries in the form⁸. Importantly, don’t upload any of your data files - this will be done by DataLad later.
Once you have clicked Save Dataset, you’ll have a draft Dataverse dataset. It already has a DOI, and you can find it under the Metadata tab as “Persistent identifier”:
Finally, make a note of the URL of your dataverse instance (e.g., https://edmond.mpg.de/), and the DOI of your draft dataset. You will need this information for step 3.

Add a Dataverse sibling to your dataset

We will use the datalad add-sibling-dataverse command. This command registers the remote Dataverse Dataset as a known remote location to your Dataset and will allow you to publish the entire Dataset (Git history and annexed data) or parts of it to Dataverse.

datalad add-sibling-dataverse https://edmond.mpg.de/ doi:10.17617/3.KUKEKI

If you run this command for the first time, you will need to provide an API Token to authenticate against the chosen Dataverse instance in an interactive prompt. This is how this would look:

A dataverse API token is required for access. Find it at https://edmond.mpg.de by clicking on your name at the top right corner and then clicking on API Token
token: 
A dataverse API token is required for access. Find it at https://edmond.mpg.de by clicking on your name at the top right corner and then clicking on API Token
token (repeat): 
Enter a name to save the credential securely for future reuse, or 'skip' to not save the credential
name:

You’ll find this token if you follow the instructions in the prompt under your user account on your Dataverse instance, and you can copy-paste it into the command line.

add_sibling_dataverse.storage(ok): . [dataverse-storage: https://edmond.mpg.de/ (DOI: doi:10.17617/3.KUKEKI)]
[INFO   ] Configure additional publication dependency on "dataverse-storage"

A dataverse API token is required for access. Find it at https://edmond.mpg.de by clicking on your name at the top right corner and then clicking on API Token
token: 
A dataverse API token is required for access. Find it at https://edmond.mpg.de by clicking on your name at the top right corner and then clicking on API Token
token (repeat): 
Enter a name to save the credential securely for future reuse, or 'skip' to not save the credential
name: skip

add_sibling_dataverse(ok): . [dataverse: datalad-annex::?type=external&externaltype=dataverse&encryption=none&exporttree=no&url=https%3A//edmond.mpg.de/&doi=doi:10.17617/3.KUKEKI (DOI: doi:10.17617/3.KUKEKI)]

As soon as you’ve created the sibling, you can push:

datalad push --to dataverse

copy(ok): CHANGES (file) [to dataverse-storage...]
copy(ok): README (file) [to dataverse-storage...]
copy(ok): dataset_description.json (file) [to dataverse-storage...]
copy(ok): sub-01/anat/sub-01_T1w.nii (file) [to dataverse-storage...]
copy(ok): sub-01/func/sub-01_task-auditory_bold.nii (file) [to dataverse-storage...]
copy(ok): sub-01/func/sub-01_task-auditory_events.tsv (file) [to dataverse-storage...]
copy(ok): task-auditory_bold.json (file) [to dataverse-storage...]
publish(ok): . (dataset) [refs/heads/main->dataverse:refs/heads/main [new branch]]
publish(ok): . (dataset) [refs/heads/git-annex->dataverse:refs/heads/git-annex [new branch]]

Walkthrough: Sharing DataLad datasets via ownCloud / nextCloud

Get the WebDAV address

Click on Settings (bottom left)
Copy the WebDAV address, for example: https://owncloud.gwdg.de/remote.php/nonshib-webdav/

datalad create-sibling-webdav \
  --dataset . \
  --name owncloud-gwdg \
  --mode filetree \
1  'https://owncloud.gwdg.de/remote.php/nonshib-webdav/<dataset-name>'

1: Replace <dataset-name> with the name of your dataset, i.e., the name of your dataset folder. In this example, we replace <dataset-name> with neuro-data. The complete command for your example hence looks like this:

datalad create-sibling-webdav \
  --dataset . \
  --name owncloud-gwdg \
  --mode filetree \
  'https://owncloud.gwdg.de/remote.php/nonshib-webdav/neuro-data'

You will be asked to provide your ownCloud account credentials:

1user:
2password:
3password (repeat):

1: Enter the email address of your ownCloud account.
2: Enter the password of your ownCloud account.
3: Repeat the password of your ownCloud account.

create_sibling_webdav.storage(ok): . [owncloud-gwdg-storage: https://owncloud.gwdg.de/remote.php/nonshib-webdav/neuro-data]
[INFO   ] Configure additional publication dependency on "owncloud-gwdg-storage" 
create_sibling_webdav(ok): . [owncloud-gwdg: datalad-annex::?type=webdav&encryption=none&exporttree=yes&url=https%3A//owncloud.gwdg.de/remote.php/nonshib-webdav/neuro-data]

Finally, we can push the dataset to ownCloud:

1datalad push --to owncloud-gwdg

1: Use datalad push to push the dataset contents to ownCloud. For details on datalad push, see the command line reference and this chapter in the DataLad Handbook.

We can now view the files on ownCloud and inspect them through the web browser.

Footnotes

The Turing Way Community (2022), see “Guide on Reproducible Research”
for example, in Psychology: Crüwell et al. (2023); Hardwicke et al. (2021); Obels et al. (2020); Wicherts et al. (2006)
see Baker (2016), Nature
see e.g., Poldrack (2019)
(Source: Wikipedia)
see DataLad dataset of 80TB / 15 million files from the Human Connectome Project (see details)
Dataverse datasets contain digital files (research data, code, …), amended with additional metadata. They typically live inside of dataverse collections.
At least, Title, Description, and Organization are required.

Reproducible Research Data Management with DataLad

About

Dr. Lennart Wittkuhn

About me

About this presentation

Acknowledgements and further reading

Papers

Talks

Agenda

0. Background

1. Scientific workflows with DataLad

2. Integrating DataLad with University of Hamburg infrastructure

3. Integrating DataLad with Max Planck Society infrastructure

4. Integrating DataLad with third-party infrastructure

5. Summary & Discussion

Background

The issue of computational reproducibility in science

The problem

Why?

Scientific building blocks are not static

Why we need version control

What is version control?

What are Git and DataLad?

Example Dataset: Brain Imaging Data

Version Control with DataLad

Data in DataLad datasets are either stored in Git or git-annex

Git

git-annex

Science is build from modular units

Version control beyond single repositories

Research as a sequence

Research as a cycle

Nesting of modular DataLad datasets

Example: Intuitive data analysis structure

Modular units with clear provenance

Science is exploratory and iterative

Reusing previous work is hard

Establishing provenance with DataLad

Science is collaborative & distributed

Data sharing and collaboration with DataLad

Share data like code

Interoperability with a range of hosting services

Separate content in Git vs. git-annex behind the scenes

Special cases

Repositories with annex support

Special remotes with repositories

Have access to more data than you have disk-space

Data sharing using DataLad and data infrastructure of the University of Hamburg

Sharing DataLad datasets via UHHCloud (Nextcloud)

✨ Features of UHHCloud (Nextcloud) ✨

Sharing DataLad datasets via UHH Object Storage (~ Amazon S3 Buckets)

✨ Features of the UHH Object Storage (Cloudian) ✨

Sharing DataLad datasets via GitLab

GitLab for UHH students and employees

✨ Features of GitLab ✨

Data sharing using DataLad and data infrastructure of the Max Planck Society

Sharing DataLad datasets via Keeper

✨ Features of Keeper ✨

Sharing DataLad datasets via Edmond

✨ Features of Edmond ✨

Two modes:

Sharing DataLad datasets via ownCloud / Nextcloud

✨ Features of ownCloud and Nextcloud ✨

Sharing DataLad datasets via GitLab

GitLab for Max Planck employees

✨ Features of GitLab ✨

Data sharing using DataLad and third-party data infrastructure

Sharing DataLad datasets via GIN

✨ Features of GIN ✨

Publish and consume datasets like source code

Summary

Summary and discussion

Science is complex

DataLad: Decentralized management of digital objects for open science

Develop everything like source code

Overview of learning resources

Learn Git

Learn DataLad

Learn both (disclaimer: shameless plug 🙈)

References