Reproducible Research Data Management with DataLad

Joint Lab Meeting (Gluth, Schuck & Schwabe Labs) at University of Hamburg

Slides | Source

License: CC BY-SA 4.0

November 28, 2024

About

About me

I am a Postdoctoral Researcher in Cognitive Neuroscience at the Institute of Psychology at the University of Hamburg (PI: Prof. Nicolas Schuck)

BSc Psychology & MSc Cognitive Neuroscience (TU Dresden), PhD Cognitive Neuroscience (Max Planck Institute for Human Development)

I study the role of fast neural memory reactivation in the human brain, applying machine learning and computational modeling to fMRI data

I am passionate about computational reproducibility, research data management, open science and tools that improve the scientific workflow

Find out more about my work on my website, Google Scholar and ORCiD

About this presentation

Slides: https://lennartwittkuhn.com/talk-uhh-rdm-2024

Source: https://github.com/lnnrtwttkhn/talk-uhh-rdm-2024

Software: Reproducible slides built with Quarto and deployed to GitHub Pages using GitHub Actions for continuous integration & deployment

License: Creative Commons Attribution-ShareAlike 4.0 (CC BY-SA 4.0)

Contact: Feedback or suggestions via email or GitHub issues. Thank you!

Acknowledgements and further reading

Slides by Wagner (2024) (CC BY-SA 4.0)

Slides and presentations by Dr. Adina Wagner and the DataLad team, e.g., “DataLad - Decentralized Management of Digital Objects for Open Science” (Wagner 2024)

The DataLad Handbook by Wagner et al. (2022) is a comprehensive educational resource for data management with DataLad.

Papers

Talks

… and many more!

Agenda

0. Background

1. Scientific workflows with DataLad

1.1 Version Control

1.2 Modularity & Linking

1.3 Provenance

1.4 Collaboration & Interoperability

2. Integrating DataLad with University of Hamburg infrastructure

2.1 UHHCloud (Nextcloud)

2.2 UHH Object Storage

2.3 UHH GitLab

3. Integrating DataLad with Max Planck Society infrastructure

3.1 Keeper

3.2 Edmond

3.3 ownCloud / Nextcloud

3.4 GitLab

4. Integrating DataLad with third-party infrastructure

4.1 GIN

5. Summary & Discussion

Background

Computational reproducibility

The issue of computational reproducibility in science

“… when the same analysis steps performed on the same dataset consistently produce the same answer.” 1

by Scriberia for The Turing Way Community (2022) (Link, CC BY 4.0)

The problem

  • more than half of research is not reproducible 2
    • research data, code, software & materials are often not available “upon reasonable [sic] request”
    • if resources are shared, they are often incomplete
  • 90% of researchers agree there is a “reproducibility crisis” (N = 1,576) 3

Why?

  • computational reproducibility is hard
  • researchers lack training
  • incentives are not (yet) aligned 4
  • scientific workflows are special (see next slides)

… accumulated evidence indicates […] substantial room for improvement with regard to research practices to maximize the efficiency of the research community’s use of the public’s financial investment. (Munafò et al. 2017)

💡 We need a professional toolkit for digital research!

Scientific building blocks are not static

We need version control

Why we need version control

… for code (text files)

… for data (binary files) © Jorge Cham (phdcomics.com)

If everything is relevant, track everything.

What is version control?

“Version control is a systematic approach to record changes made in a […] set of files, over time. This allows you and your collaborators to track the history, see what changed, and recall specific versions later […]” (Turing Way)

keep track of changes in a directory (a “repository”)

take snapshots (“commits”) of your repo at any time

know the history: what was changed when by whom

compare commits and go back to any previous state

work on parallel “branches” & flexibly “merge” them

by The Turing Way Community and Scriberia (2024) (CC BY 4.0)

“push” your repo to a “remote” location & share it

share repos on platforms like GitHub or GitLab

work together on the same files at the same time

others can read, copy, edit and suggest changes

make your repo public and openly share your work

by The Turing Way Community and Scriberia (2024) (CC BY 4.0)

What are Git and DataLad?

git-scm.com (by Jason Long; CC BY 3.0 Unported)
  • most popular version control system
  • free, open-source command-line tool
  • graphical user interfaces exist, e.g., GitKraken
  • standard tool in the software industry
  • 100 million GitHub users 5

Sadly, Git does not handle large (binary) files well.

datalad.org (from the DataLad Handbook by Wagner et al. (2022); CC BY-SA 4.0)

Example Dataset: Brain Imaging Data

Single subject epoch (block) auditory fMRI activation data

mkdir neuro-data
wget https://www.fil.ion.ucl.ac.uk/spm/download/data/MoAEpilot/MoAEpilot.bids.zip \
-O neuro-data.zip
unzip neuro-data.zip -d neuro-data
rm neuro-data.zip
cd neuro-data
mv MoAEpilot/* .
rm -R MoAEpilot
tree
.
├── CHANGES
├── README
├── dataset_description.json
├── sub-01
│   ├── anat
│   │   └── sub-01_T1w.nii
│   └── func
│       ├── sub-01_task-auditory_bold.nii
│       └── sub-01_task-auditory_events.tsv
└── task-auditory_bold.json

4 directories, 7 files

Dataset from Functional Imaging Laboratory, UCL Queen Square Institute of Neurology, London, UK (Source)

Version Control with DataLad

from the DataLad Handbook by Wagner et al. (2022) (CC BY-SA 4.0)
datalad create neuro-data

View output
create(error): /tmp/neuro-data (dataset) [will not create a dataset in a non-empty directory, use `--force` option to ignore]

Rerun the command using --force:

datalad create --force neuro-data
create(ok): /tmp/neuro-data (dataset)

from the DataLad Handbook by Wagner et al. (2022) (CC BY-SA 4.0)
datalad save -m "save neuro data"

View output
add(ok): CHANGES (file)
add(ok): README (file)
add(ok): dataset_description.json (file)
add(ok): sub-01/anat/sub-01_T1w.nii (file)
add(ok): sub-01/func/sub-01_task-auditory_bold.nii (file)
add(ok): sub-01/func/sub-01_task-auditory_events.tsv (file)
add(ok): task-auditory_bold.json (file)
save(ok): . (dataset)
action summary:
  add (ok: 7)
  save (ok: 1)
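
Since a DataLad dataset is a regular Git repository under the hood, the new snapshot can also be inspected with plain Git (a minimal illustration; the commit hash shown is hypothetical):

git log --oneline -n 1
2f9d1c3 save neuro data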

Data in DataLad datasets are either stored in Git or git-annex

from the DataLad Handbook by Wagner et al. (2022) (CC BY-SA 4.0)

Git

  • handles small files well (text, code)
  • file contents are in the Git history and will be shared
  • shared with every dataset clone
  • useful: small, non-binary, frequently modified files

git-annex

  • handles all types and sizes of files well
  • file contents are in the annex, not necessarily shared
  • can be kept private on a per-file level
  • useful: large files, private files
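
Which files go to Git versus git-annex is configurable. A minimal sketch using the built-in text2git configuration procedure, which keeps text files in Git and annexes everything else (the .gitattributes rule shown is what the procedure writes, per the DataLad Handbook):

datalad create -c text2git neuro-data
cat .gitattributes
* annex.largefiles=((mimeencoding=binary)and(largerthan=0))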

Science is built from modular units

We need modularity and linking

Version control beyond single repositories

Research as a sequence

from the DataLad Handbook by Wagner et al. (2022) (CC BY-SA 4.0)
  • Prior work (code development, empirical data, etc.) is combined to produce results, with the goal of a publication
  • Aggregation across time and contributors
  • Aiming (but often failing) to be reproducible
  • Often, there is one big project folder

A single repository is not enough!

Research as a cycle

by The Turing Way Community and Scriberia (2024) (CC BY 4.0)
  • Develop scientific outputs as modular but linked units
  • Independently update and develop data sources
  • Manage access to public / private datasets

Nesting of modular DataLad datasets

from the DataLad Handbook by Wagner et al. (2022) (CC BY-SA 4.0)
  • seamless nesting of modular datasets in hierarchical super-/sub-dataset relationships
  • based on Git submodules, but with a mono-repo feel thanks to recursive operations
  • overcomes scaling issues with large numbers of files (Example: Human Connectome Project) 6
  • modularizes research components for transparency, reuse and access management

Example: Intuitive data analysis structure

from the DataLad Handbook by Wagner et al. (2022) (CC BY-SA 4.0)

First, let’s create a new data analysis dataset:

datalad create -c yoda myanalysis
[INFO   ] Creating a new annex repo at /tmp/myanalysis
[INFO   ] Scanning for unlocked files (this may take some time)
[INFO   ] Running procedure cfg_yoda
[INFO   ] == Command start (output follows) =====
[INFO   ] == Command exit (modification check follows) =====
create(ok): /tmp/myanalysis (dataset)

-c yoda initializes useful structure (details here):

tree
.
├── CHANGELOG.md
├── README.md
└── code
    └── README.md
2 directories, 3 files

We install the input data for the analysis as a subdataset:

datalad clone -d . https://github.com/datalad-handbook/iris_data.git input/
[INFO   ] Remote origin not usable by git-annex; setting annex-ignore
install(ok): input (dataset)
add(ok): input (dataset)
add(ok): .gitmodules (file)
save(ok): . (dataset)
action summary:
  add (ok: 2)
  install (ok: 1)
  save (ok: 1)

input is a regular folder inside myanalysis

tree
.
├── CHANGELOG.md
├── README.md
├── code
│   └── README.md
└── input
    └── iris.csv
3 directories, 4 files

Modular units with clear provenance

git diff HEAD~1
diff --git a/.gitmodules b/.gitmodules
new file mode 100644
index 0000000..fc69c84
--- /dev/null
+++ b/.gitmodules
@@ -0,0 +1,5 @@
+[submodule "input"]
+       path = input
+       url = https://github.com/datalad-handbook/iris_data.git
+       datalad-id = 5800e71c-09f9-11ea-98f1-e86a64c8054c
+       datalad-url = https://github.com/datalad-handbook/iris_data.git
diff --git a/input b/input
new file mode 160000
index 0000000..b9eb768
--- /dev/null
+++ b/input
@@ -0,0 +1 @@
+Subproject commit b9eb768c145e4a253d619d2c8285e540869d2021

from the DataLad Handbook by Wagner et al. (2022) (CC BY-SA 4.0)
  • We know exactly where the subdataset comes from
  • We know exactly which version of the subdataset is installed
  • We can develop and update each subdataset independently
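
For example, to pull new changes from the subdataset’s origin and record the updated version in the superdataset (a sketch; the --how option assumes a recent DataLad version, older versions use --merge):

datalad update -d input --how merge
datalad save -d . -m "Update input subdataset" input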

by The Turing Way Community and Scriberia (2024) (CC BY 4.0)

Science is exploratory and iterative

We need provenance

Reusing previous work is hard

by The Turing Way Community and Scriberia (2024)

Your number one collaborator is yourself from 6 months ago and they don’t answer emails.

by The Turing Way Community and Scriberia (2024)

Which version of which script produced these outputs from which version of which data?

Establishing provenance with DataLad

datalad run wraps around anything expressed in a command line call and saves the dataset modifications resulting from the execution.

datalad rerun repeats captured executions. If the outcomes differ, it saves the new state of the dataset.

datalad containers-run executes command line calls inside a tracked software container and saves the dataset modifications resulting from the execution.

datalad containers-run \
  --message "Time series extraction from Locus Coeruleus" \
  --container-name nilearn \
  --input 'mri/*_bold.nii' \
  --output 'sub-*/LC_timeseries_run-*.csv' \
  "python3 code/extract_lc_timeseries.py"
 
-- Git commit --
  commit 5a7565a640ff6de67e07292a26bf272f1ee4b00e
  Author:     Adina Wagner adina.wagner@t-online.de
  AuthorDate: Mon Nov 11 16:15:08 2019 +0100

  [DATALAD RUNCMD] Time series extraction from Locus Coeruleus
  === Do not change lines below ===
  {
   "cmd": "singularity exec --bind {pwd} .datalad/environments/nilearn.simg bash..",
   "dsid": "92ea1faa-632a-11e8-af29-a0369f7c647e",
   "inputs": [
    "mri/*.bold.nii.gz",
    ".datalad/environments/nilearn.simg"
   ],
   "outputs": ["sub-*/LC_timeseries_run-*.csv"],
   ...
  }
  ^^^ Do not change lines above ^^^

from the DataLad Handbook by Wagner et al. (2022) (CC BY-SA 4.0)
  • Enshrine the analysis in a script and record code execution together with input data, output files and software environment in the execution command
  • Result: Machine readable record about which data, code and software produced a result how, when and why
  • Use the unique identifier (hash) of the execution record to have a machine recompute and verify past work
datalad rerun 5a7565a640ff6de67
[INFO   ] run commit 5a7565a640ff6de67; (Time series extraction from Locus Coeruleus)
[INFO   ] Making sure inputs are available (this may take some time)
get(ok): mri/sub-01_bold.nii (file)
        [...]
[INFO   ] == Command start (output follows) =====
[INFO   ] == Command exit (modification check follows) =====
add(ok): sub-01/LC_timeseries_run-*.csv (file)

Science is collaborative & distributed

We need interoperability & transport logistics

Data sharing and collaboration with DataLad

I have a dataset on my computer.
How can I share it or collaborate on it?

tree
.
├── CHANGES
├── README
├── dataset_description.json
├── sub-01
│   ├── anat
│   │   └── sub-01_T1w.nii
│   └── func
│       ├── sub-01_task-auditory_bold.nii
│       └── sub-01_task-auditory_events.tsv
└── task-auditory_bold.json

4 directories, 7 files

Challenge: Scientific workflows are idiosyncratic across institutions / departments / labs / any two scientists

from the DataLad Handbook by Wagner et al. (2022) (CC BY-SA 4.0)

Share data like code

  • With DataLad, you can share data like you share code: As version-controlled datasets via repository hosting services
  • DataLad datasets can be cloned, pushed and updated from and to a wide range of remote hosting services

from the DataLad Handbook by Wagner et al. (2022) (CC BY-SA 4.0)

Interoperability with a range of hosting services

DataLad is built to maximize interoperability and streamline routines across hosting services and storage technology

see DataLad Handbook: “Beyond shared infrastructure”

Separate content in Git vs. git-annex behind the scenes

  • DataLad datasets are exposed via private or public repositories on a repository hosting service (e.g., GitLab or GitHub)
  • (Annexed) data cannot be stored on the repository hosting service itself, but can be kept in almost any third-party storage
  • Publication dependencies automate interactions between both places

from the DataLad Handbook by Wagner et al. (2022) (CC BY-SA 4.0)

Special cases

Repositories with annex support

from the DataLad Handbook by Wagner et al. (2022) (CC BY-SA 4.0)
  • Easy: Only one remote repository
  • Examples: GIN, GitLab with annex support

Special remotes with repositories

from the DataLad Handbook by Wagner et al. (2022) (CC BY-SA 4.0)
  • Flexible: Full history or single snapshot
  • Examples: DataLad-OSF

Have access to more data than you have disk space

Cloned datasets are lean.

datalad clone git@gin.g-node.org:/lnnrtwttkhn/neuro-data.git
install(ok): /tmp/neuro-data (dataset)
cd neuro-data && du -sh
212K

“Metadata” (file names, availability) are present …

tree
.
├── CHANGES
├── README
├── dataset_description.json
├── sub-01
│   ├── anat
│   │   └── sub-01_T1w.nii
│   └── func
│       ├── sub-01_task-auditory_bold.nii
│       └── sub-01_task-auditory_events.tsv
└── task-auditory_bold.json

4 directories, 7 files

… but no file content:

open README
The file /tmp/README does not exist.

File contents can be retrieved on demand:

datalad get .
get(ok): CHANGES (file) [from origin...]
get(ok): README (file) [from origin...]
get(ok): dataset_description.json (file) [from origin...]
get(ok): sub-01/anat/sub-01_T1w.nii (file) [from origin...]
get(ok): sub-01/func/sub-01_task-auditory_bold.nii (file) [from origin...]
get(ok): sub-01/func/sub-01_task-auditory_events.tsv (file) [from origin...]
action summary:
  get (ok: 6)

Let’s check the dataset size again:

du -sh
49M

Drop file content that is not needed:

datalad drop .
drop(ok): CHANGES (file) [locking origin...]
drop(ok): README (file) [locking origin...]
drop(ok): dataset_description.json (file) [locking origin...]
drop(ok): sub-01/anat/sub-01_T1w.nii (file) [locking origin...]
drop(ok): sub-01/func/sub-01_task-auditory_bold.nii (file) [locking origin...]
drop(ok): sub-01/func/sub-01_task-auditory_events.tsv (file) [locking origin...]
drop(ok): . (directory)
action summary:
  drop (ok: 7)

When files are dropped, only “metadata” stays behind, and files can be re-obtained on demand.
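
Where file content is available from can be queried at any time with git-annex (a minimal illustration; the location log is abbreviated and the UUID is hypothetical):

git annex whereis sub-01/anat/sub-01_T1w.nii
whereis sub-01/anat/sub-01_T1w.nii (1 copy)
        6b9a7b3e-... -- origin
ok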

Data sharing using DataLad and data infrastructure of the University of Hamburg

Sharing DataLad datasets via UHHCloud (Nextcloud)

The DataLad NEXT extension makes it possible to push / clone DataLad datasets to / from Nextcloud (via WebDAV)
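
The extension is distributed as a separate Python package; a minimal setup sketch (the git config line follows the extension’s documentation):

pip install datalad-next
git config --global --add datalad.extensions.load next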

UHHCloud: UHH members have a standard quota of 5 terabytes each (students have 100 gigabytes).

Features of UHHCloud (Nextcloud)

  • data privacy compliant alternative to Google Drive, Dropbox, etc. (usually hosted on-site)
  • provided by your institution, so free to use
  • supports private and public repositories
  • can be used together with external collaborators
  • expose datasets for regular download without DataLad
  1. Create a WebDAV sibling:
datalad create-sibling-webdav --dataset . \
  --name uhhcloud --mode filetree \
  'https://cloud.uni-hamburg.de/remote.php/dav/files/USERNAME/neuro-data'
  2. Push the dataset to UHHCloud:
datalad push --to uhhcloud

Access the dataset on cloud.uni-hamburg.de
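
A collaborator could clone the dataset directly from the WebDAV sibling via a datalad-annex:: URL, following the URL format that create-sibling-webdav reports (a sketch modeled on the ownCloud walkthrough in the appendix; USERNAME is a placeholder):

datalad clone 'datalad-annex::?type=webdav&encryption=none&exporttree=yes&url=https%3A//cloud.uni-hamburg.de/remote.php/dav/files/USERNAME/neuro-data'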

Sharing DataLad datasets via UHH Object Storage (~ Amazon S3 Buckets)

See Chapter: “Amazon S3 as a special remote” on how to push / clone DataLad datasets to / from an Object Storage

Features of the UHH Object Storage (Cloudian)

  • “unlimited” storage for UHH employees
  • multiple backups across devices and locations
  • data privacy compliant alternative to Amazon S3 Buckets (hosted on-site, offering 99% compatibility)
  • provided by your institution, so free to use
  • supports private and public repositories
  • can be used together with external collaborators
  • expose datasets for regular download without DataLad
  1. Create an Object Storage sibling:
git annex initremote uhh-object-storage type=S3 encryption=none \
bucket=neuro-data public=no datacenter=EU \
host=s3-uhh.lzs.uni-hamburg.de protocol=https port=443 autoenable=true
  2. Push the dataset to the UHH Object Storage:
datalad push --to uhh-object-storage

Access the dataset on the UHH Object Storage
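
On the consumer side, git-annex needs S3 credentials before content can be retrieved from a non-public bucket (a sketch; the environment variables are git-annex’s standard S3 mechanism, the values are placeholders):

export AWS_ACCESS_KEY_ID=<access-key>
export AWS_SECRET_ACCESS_KEY=<secret-key>
datalad get .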

Sharing DataLad datasets via GitLab

GitLab is open source software to collaborate on code. Manage git repositories with fine-grained access controls that keep your code secure.

GitLab for UHH students and employees

Features of GitLab

  • free to use and open-source
  • supports private and public repositories
  • use project management infrastructure (merge requests, issue boards, etc.) for your dataset projects
  1. Create a GitLab sibling
datalad siblings add --dataset . --name gitlab \
--url git@gitlab.rrz.uni-hamburg.de:wittkuhn/neuro-data.git
  2. Push dataset metadata to GitLab:
datalad push --to gitlab

Access the dataset on GitLab

Data sharing using DataLad and data infrastructure of the Max Planck Society

Sharing DataLad datasets via Keeper

A free service for all Max Planck employees and project partners with more than 1 TB of storage per user for your research data.

Features of Keeper

  • > 1 TB per Max Planck employee (and expandable)
  • based on cloud-sharing service Seafile
  • data hosted on MPS servers
  • configurable as a DataLad special remote
  1. Configure rclone
rclone config create neuro-data seafile \
url https://keeper.mpdl.mpg.de/ user wittkuhn@mpib-berlin.mpg.de \
library neuro-data pass supersafepassword
  2. Create a library on Keeper and a Keeper sibling:
git annex initremote keeper type=external externaltype=rclone \
chunk=50MiB encryption=none target=neuro-data
  3. Push the dataset to Keeper:
datalad push --to keeper

Sharing DataLad datasets via Edmond

Edmond is a research data repository for Max Planck researchers. It is the place to store completed datasets of research data with open access.

Features of Edmond

  • based on Dataverse, hosted on MPS servers
  • use is free of charge
  • no storage limitation (on datasets or individual files)
  • flexible licensing

Two modes:

  1. annex mode (default): non-human readable representation of the dataset that includes Git history and annexed data

  2. filetree mode: human-readable single snapshot of your dataset “as it currently is” that does not include the history of annexed files (but does include the Git history)

  1. Create a Dataverse sibling for Edmond:
datalad add-sibling-dataverse https://edmond.mpg.de/ \
doi:10.17617/3.8LDVXK --mode filetree
  2. Push the dataset to Edmond / Dataverse:
datalad push --to dataverse

Sharing DataLad datasets via ownCloud / Nextcloud

The DataLad NEXT extension makes it possible to push / clone DataLad datasets to / from ownCloud & Nextcloud (via WebDAV)

ownCloud GWDG: 50 GB default storage space per user; flexible increase possible upon request

Features of ownCloud and Nextcloud

  • data privacy compliant alternative to Google Drive, Dropbox, etc. (usually hosted on-site)
  • provided by your institution, so free to use
  • supports private and public repositories
  • can be used together with external collaborators
  • expose datasets for regular download without DataLad
  1. Create a WebDAV sibling:
datalad create-sibling-webdav --dataset . \
  --name owncloud-gwdg --mode filetree \
  'https://owncloud.gwdg.de/remote.php/nonshib-webdav/neuro-data'
  2. Push the dataset to ownCloud:
datalad push --to owncloud-gwdg

Access the dataset on owncloud.gwdg.de

Sharing DataLad datasets via GitLab

GitLab is open source software to collaborate on code. Manage git repositories with fine-grained access controls that keep your code secure.

GitLab for Max Planck employees

Features of GitLab

  • free to use and open-source
  • several MPS instances available (see above)
  • supports private and public repositories
  • use project management infrastructure (merge requests, issue boards, etc.) for your dataset projects
  1. Create a GitLab sibling
datalad siblings add --dataset . --name gitlab \
--url git@git.mpib-berlin.mpg.de:wittkuhn/neuro-data.git
  2. Push dataset metadata to GitLab:
datalad push --to gitlab

Access the dataset on GitLab

Data sharing using DataLad and third-party data infrastructure

Sharing DataLad datasets via GIN

“GIN is […] a web-accessible repository store of your data based on git and git-annex that you can access securely anywhere you desire while keeping your data in sync, backed up and easily accessible […]”

Features of GIN

  • free to use and open-source (could be hosted within your institution; for more details, see here)
  • currently unlimited storage capacity and no restrictions on individual file size
  • supports private and public repositories
  • publicly funded by the Federal Ministry of Education and Research (BMBF; details here)
  • servers located in Germany (Munich; cf. GDPR)
  • provides Digital Object Identifiers (DOIs) (details here) and allows free licensing (details here)
  1. Create a GIN sibling
datalad siblings add --dataset . \
--name gin --url git@gin.g-node.org:/lnnrtwttkhn/neuro-data.git
  2. Push the dataset to GIN:
datalad push --to gin

Access the dataset on GIN
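
Consumers can then clone the dataset and fetch content on demand, just as in the earlier example (same repository URL as above):

datalad clone https://gin.g-node.org/lnnrtwttkhn/neuro-data
cd neuro-data && datalad get .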

Publish and consume datasets like source code

Datasets can comfortably live in multiple locations:

datalad siblings
.: here(+) [git]
.: owncloud-gwdg(+) [git]
.: dataverse(+) [dataverse]
.: gin(+) [https://gin.g-node.org/lnnrtwttkhn/neuro-data (git)]
.: keeper(+) [rclone]
.: gitlab(-) [git@git.mpib-berlin.mpg.de:wittkuhn/neuro-data.git (git)]

Publication dependencies automate updates in all places:

datalad siblings configure --name gitlab --publish-depends SIBLING
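
For example, with keeper (configured above) as a publication dependency of gitlab, a single push updates both locations (a sketch using the sibling names from this presentation):

datalad siblings configure --name gitlab --publish-depends keeper
datalad push --to gitlab

Annexed file content is then copied to keeper first, and the Git history is pushed to gitlab afterwards.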

Redundancy: DataLad gets data from available sources

from the DataLad Handbook by Wagner et al. (2022) (CC BY-SA 4.0)

Clone the dataset from GitLab:

datalad clone https://git.mpib-berlin.mpg.de/wittkuhn/neuro-data

Access to special remotes needs to be configured:

[INFO   ] access to 3 dataset siblings keeper, dataverse-storage,
owncloud-gwdg-storage not auto-enabled, enable with:
|       datalad siblings -d "/tmp/neuro-data" enable -s SIBLING

DataLad retrieves data from available sources (here, GIN):

datalad get .
get(ok): CHANGES (file) [from gin-src...]
get(ok): README (file) [from gin-src...]
[...]

Summary

Summary and discussion

Science is complex

  • Scientific units are not static: We need version control
  • Science is modular: We need to link modular datasets
  • Science is iterative: We need to establish provenance
  • Science is collaborative and distributed: We want to share our work and integrate with diverse infrastructure

DataLad: Decentralized management of digital objects for open science

  • DataLad can version control arbitrary datasets
  • DataLad links modular version-controlled datasets
  • DataLad establishes provenance and transparency
  • DataLad integrates with diverse infrastructure

Develop everything like source code

  • Code and data management using Git and DataLad (free, open-source command-line tools)
  • Code and data sharing via flexible repository hosting services (GitLab, GitHub, GIN, etc.)
  • Code and data storage on various infrastructure (GIN, OSF, S3, Keeper, Dataverse, and many more!)
  • Project-related communication (ideas, problems, discussions) via issue boards on GitLab / GitHub etc.
  • Transparent contributions to code and data via merge requests on GitLab (i.e., pull requests on GitHub)
  • Reproducible procedures using datalad run, rerun, and containers-run commands (also Make etc.)
  • Reproducible computational environments using software containers (e.g., Docker, Apptainer, etc.)
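
As a sketch of the container workflow (assuming the datalad-container extension; the container image URL is hypothetical):

pip install datalad-container
datalad containers-add nilearn --url docker://nilearn/nilearn:latest
datalad containers-run -n nilearn "python3 code/extract_lc_timeseries.py"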

Towards science as distributed, open-source software knowledge development (cf. McElreath, 2020, 2023)

Overview of learning resources

Learn Git

Learn DataLad

Learn both (disclaimer: shameless plug 🙈)

Full-semester course on “Version control of code and data using Git and DataLad” at University of Hamburg (generously funded by the Digital and Data Literacy in Teaching Lab program) with many open educational resources (online guide, quizzes and exercises)

References

Baker, Monya. 2016. “1,500 Scientists Lift the Lid on Reproducibility.” Nature 533 (7604): 452–54. https://doi.org/10.1038/533452a.
Crüwell, Sophia, Deborah Apthorp, Bradley J. Baker, Lincoln Colling, Malte Elson, Sandra J. Geiger, Sebastian Lobentanzer, et al. 2023. “What’s in a Badge? A Computational Reproducibility Investigation of the Open Data Badge Policy in One Issue of Psychological Science.” Psychological Science 34 (4): 512–22. https://doi.org/10.1177/09567976221140828.
Gorgolewski, Krzysztof J., Tibor Auer, Vince D. Calhoun, R. Cameron Craddock, Samir Das, Eugene P. Duff, Guillaume Flandin, et al. 2016. “The Brain Imaging Data Structure, a Format for Organizing and Describing Outputs of Neuroimaging Experiments.” Scientific Data 3 (1). https://doi.org/10.1038/sdata.2016.44.
Hardwicke, Tom E., Manuel Bohn, Kyle MacDonald, Emily Hembacher, Michèle B. Nuijten, Benjamin N. Peloquin, Benjamin E. deMayo, Bria Long, Erica J. Yoon, and Michael C. Frank. 2021. “Analytic Reproducibility in Articles Receiving Open Data Badges at the Journal Psychological Science : An Observational Study.” Royal Society Open Science 8 (1). https://doi.org/10.1098/rsos.201494.
Munafò, Marcus R., Brian A. Nosek, Dorothy V. M. Bishop, Katherine S. Button, Christopher D. Chambers, Nathalie Percie du Sert, Uri Simonsohn, Eric-Jan Wagenmakers, Jennifer J. Ware, and John P. A. Ioannidis. 2017. “A Manifesto for Reproducible Science.” Nature Human Behaviour 1 (1). https://doi.org/10.1038/s41562-016-0021.
Obels, Pepijn, Daniël Lakens, Nicholas A. Coles, Jaroslav Gottfried, and Seth A. Green. 2020. “Analysis of Open Data and Computational Reproducibility in Registered Reports in Psychology.” Advances in Methods and Practices in Psychological Science 3 (2): 229–37. https://doi.org/10.1177/2515245920918872.
Poldrack, Russell A. 2019. “The Costs of Reproducibility.” Neuron 101 (1): 11–14. https://doi.org/10.1016/j.neuron.2018.11.030.
The Turing Way Community. 2022. The Turing Way: A Handbook for Reproducible, Ethical and Collaborative Research. Zenodo. https://doi.org/10.5281/zenodo.3233853.
The Turing Way Community, and Scriberia. 2024. “Illustrations from the Turing Way: Shared Under CC-BY 4.0 for Reuse,” January. https://doi.org/10.5281/ZENODO.3332807.
Wagner, Adina S. 2024. “DataLad: Decentralized Management of Digital Objects for Open Science.” Zenodo, January. https://doi.org/10.5281/ZENODO.10556597.
Wagner, Adina S., Laura K. Waite, Kyle Meyer, Marisa K. Heckner, Tobias Kadelka, Niels Reuter, Alexander Q. Waite, et al. 2022. “The DataLad Handbook,” April. https://doi.org/10.5281/ZENODO.6463273.
Wicherts, Jelte M., Denny Borsboom, Judith Kats, and Dylan Molenaar. 2006. “The Poor Availability of Psychological Research Data for Reanalysis.” American Psychologist 61 (7): 726–28. https://doi.org/10.1037/0003-066x.61.7.726.

Thank you!

Slides: https://lennartwittkuhn.com/talk-uhh-rdm-2024

Source: https://github.com/lnnrtwttkhn/talk-uhh-rdm-2024

Software: Reproducible slides built with Quarto and deployed to GitHub Pages using GitHub Actions for continuous integration & deployment

License: Creative Commons Attribution-ShareAlike 4.0 (CC BY-SA 4.0)

Contact: Feedback or suggestions via email or GitHub issues. Thank you!

Appendix

Example: “Let me just quickly copy those files …”

Without datalad run

Researcher writes some Python code to copy files:

import shutil
from glob import glob
from os import path
from pathlib import Path

for sourcefile, dest in zip(glob(path_source), glob(path_dest)):
    destination = path.join(dest, Path(sourcefile).name)
    shutil.move(sourcefile, destination)

glob does not sort! 😱

source/
├── sub-01
│   └── sub-01-events.tsv
├── sub-02
│   └── sub-02-events.tsv
├── sub-03
│   └── sub-03-events.tsv
├── sub-04
│   └── sub-04-events.tsv
[...]
destination/
├── sub-01
│   └── sub-03-events.tsv
├── sub-02
│   └── sub-01-events.tsv
├── sub-03
│   └── sub-04-events.tsv
├── sub-04
│   └── sub-02-events.tsv
[...]

Researcher shares analysis with collaborators.
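
In hindsight, a one-line fix would have been to sort both glob results before pairing them (same hypothetical path_source and path_dest as above):

for sourcefile, dest in zip(sorted(glob(path_source)), sorted(glob(path_dest))):
    destination = path.join(dest, Path(sourcefile).name)
    shutil.move(sourcefile, destination)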

With datalad run

Researcher uses datalad run to copy files:

datalad run -m "Copy event files" \
  'for sub in sub-*; do
     mv "${sub}/events.tsv" "analysis/${sub}/events.tsv";
   done'


Walkthrough: Sharing DataLad datasets via Keeper

Configure rclone:

rclone config
2024/03/19 11:45:32 NOTICE: Config file "/root/.config/rclone/rclone.conf" not found - using defaults
No remotes found, make a new one?
n) New remote
s) Set configuration password
q) Quit config
name> neuro-data

Option Storage.
Type of storage to configure.
Choose a number from below, or type in your own value.
 1 / 1Fichier
   \ (fichier)
 2 / Akamai NetStorage
   \ (netstorage)
 3 / Alias for an existing remote
   \ (alias)
 4 / Amazon S3 Compliant Storage Providers including AWS, Alibaba, ArvanCloud, Ceph, ChinaMobile, Cloudflare, DigitalOcean, Dreamhost, GCS, HuaweiOBS, IBMCOS, IDrive, IONOS, LyveCloud, Leviia, Liara, Linode, Minio, Netease, Petabox, RackCorp, Rclone, Scaleway, SeaweedFS, StackPath, Storj, Synology, TencentCOS, Wasabi, Qiniu and others
   \ (s3)
 5 / Backblaze B2
   \ (b2)
 6 / Better checksums for other remotes
   \ (hasher)
 7 / Box
   \ (box)
 8 / Cache a remote
   \ (cache)
 9 / Citrix Sharefile
   \ (sharefile)
10 / Combine several remotes into one
   \ (combine)
11 / Compress a remote
   \ (compress)
12 / Dropbox
   \ (dropbox)
13 / Encrypt/Decrypt a remote
   \ (crypt)
14 / Enterprise File Fabric
   \ (filefabric)
15 / FTP
   \ (ftp)
16 / Google Cloud Storage (this is not Google Drive)
   \ (google cloud storage)
17 / Google Drive
   \ (drive)
18 / Google Photos
   \ (google photos)
19 / HTTP
   \ (http)
20 / Hadoop distributed file system
   \ (hdfs)
21 / HiDrive
   \ (hidrive)
22 / ImageKit.io
   \ (imagekit)
23 / In memory object storage system.
   \ (memory)
24 / Internet Archive
   \ (internetarchive)
25 / Jottacloud
   \ (jottacloud)
26 / Koofr, Digi Storage and other Koofr-compatible storage providers
   \ (koofr)
27 / Linkbox
   \ (linkbox)
28 / Local Disk
   \ (local)
29 / Mail.ru Cloud
   \ (mailru)
30 / Mega
   \ (mega)
31 / Microsoft Azure Blob Storage
   \ (azureblob)
32 / Microsoft Azure Files
   \ (azurefiles)
33 / Microsoft OneDrive
   \ (onedrive)
34 / OpenDrive
   \ (opendrive)
35 / OpenStack Swift (Rackspace Cloud Files, Blomp Cloud Storage, Memset Memstore, OVH)
   \ (swift)
36 / Oracle Cloud Infrastructure Object Storage
   \ (oracleobjectstorage)
37 / Pcloud
   \ (pcloud)
38 / PikPak
   \ (pikpak)
39 / Proton Drive
   \ (protondrive)
40 / Put.io
   \ (putio)
41 / QingCloud Object Storage
   \ (qingstor)
42 / Quatrix by Maytech
   \ (quatrix)
43 / SMB / CIFS
   \ (smb)
44 / SSH/SFTP
   \ (sftp)
45 / Sia Decentralized Cloud
   \ (sia)
46 / Storj Decentralized Cloud Storage
   \ (storj)
47 / Sugarsync
   \ (sugarsync)
48 / Transparently chunk/split large files
   \ (chunker)
49 / Union merges the contents of several upstream fs
   \ (union)
50 / Uptobox
   \ (uptobox)
51 / WebDAV
   \ (webdav)
52 / Yandex Disk
   \ (yandex)
53 / Zoho
   \ (zoho)
54 / premiumize.me
   \ (premiumizeme)
55 / seafile
   \ (seafile)
Storage> seafile

Option url.
URL of seafile host to connect to.
Choose a number from below, or type in your own value.
 1 / Connect to cloud.seafile.com.
   \ (https://cloud.seafile.com/)
url> https://keeper.mpdl.mpg.de/

Option user.
User name (usually email address).
Enter a value.
user> wittkuhn@mpib-berlin.mpg.de

Option pass.
Password.
Choose an alternative below. Press Enter for the default (n).
y) Yes, type in my own password
g) Generate random password
n) No, leave this optional password blank (default)
y/g/n> y
Enter the password:
password:
Confirm the password:
password:

Option 2fa.
Two-factor authentication ('true' if the account has 2FA enabled).
Enter a boolean value (true or false). Press Enter for the default (false).
2fa> false

Option library.
Name of the library.
Leave blank to access all non-encrypted libraries.
Enter a value. Press Enter to leave empty.
library> neuro-data

Option library_key.
Library password (for encrypted libraries only).
Leave blank if you pass it through the command line.
Choose an alternative below. Press Enter for the default (n).
y) Yes, type in my own password
g) Generate random password
n) No, leave this optional password blank (default)
y/g/n> n

Edit advanced config?
y) Yes
n) No (default)
y/n> n

Configuration complete.
Options:
- type: seafile
- url: https://keeper.mpdl.mpg.de/
- user: wittkuhn@mpib-berlin.mpg.de
- pass: *** ENCRYPTED ***
- library: neuro-data
Keep this "neuro-data" remote?
y) Yes this is OK (default)
e) Edit this remote
d) Delete this remote
y/e/d> y

Current remotes:

Name                 Type
====                 ====
neuro-data           seafile

e) Edit existing remote
n) New remote
d) Delete remote
r) Rename remote
c) Copy remote
s) Set configuration password
q) Quit config
e/n/d/r/c/s/q> q
Alternatively, configure rclone non-interactively:

export KEEPER_PASSWORD=password
rclone config create neuro-data seafile url https://keeper.mpdl.mpg.de/ user wittkuhn@mpib-berlin.mpg.de library neuro-data pass $KEEPER_PASSWORD
Then register Keeper as a git-annex special remote (via rclone):

git annex initremote keeper type=external externaltype=rclone chunk=50MiB encryption=none target=neuro-data
initremote keeper ok
(recording state in git...)
datalad siblings
.: here(+) [git]
.: keeper(+) [rclone]
datalad push --to keeper
copy(ok): CHANGES (file) [to keeper...]
copy(ok): README (file) [to keeper...]
copy(ok): dataset_description.json (file) [to keeper...]
copy(ok): sub-01/anat/sub-01_T1w.nii (file) [to keeper...]
copy(ok): sub-01/func/sub-01_task-auditory_bold.nii (file) [to keeper...]
copy(ok): sub-01/func/sub-01_task-auditory_events.tsv (file) [to keeper...]
copy(ok): task-auditory_bold.json (file) [to keeper...]            action summary:
  copy (ok: 7)

Walkthrough: Sharing DataLad datasets via Edmond

If you want to publish a dataset to Dataverse, you need a dedicated location on Dataverse to publish it to. For this, we will use a Dataverse dataset 7.

  1. Go to Edmond, log in, and create a new draft Dataverse dataset via the Add Data header
  2. The New Dataset button takes you to a configurator for your Dataverse dataset. Provide all relevant details and metadata entries in the form 8. Importantly, don’t upload any of your data files; this will be done by DataLad later.
  3. Once you have clicked Save Dataset, you’ll have a draft Dataverse dataset. It already has a DOI, and you can find it under the Metadata tab as “Persistent identifier”:
  4. Finally, make a note of the URL of your dataverse instance (e.g., https://edmond.mpg.de/), and the DOI of your draft dataset. You will need this information for step 3.

Add a Dataverse sibling to your dataset

We will use the datalad add-sibling-dataverse command. This command registers the remote Dataverse Dataset as a known remote location to your Dataset and will allow you to publish the entire Dataset (Git history and annexed data) or parts of it to Dataverse.

datalad add-sibling-dataverse https://edmond.mpg.de/ doi:10.17617/3.KUKEKI

If you run this command for the first time, you will need to provide an API Token to authenticate against the chosen Dataverse instance in an interactive prompt. This is how this would look:

A dataverse API token is required for access. Find it at https://edmond.mpg.de by clicking on your name at the top right corner and then clicking on API Token
token: 
A dataverse API token is required for access. Find it at https://edmond.mpg.de by clicking on your name at the top right corner and then clicking on API Token
token (repeat): 
Enter a name to save the credential securely for future reuse, or 'skip' to not save the credential
name: 

You’ll find this token if you follow the instructions in the prompt under your user account on your Dataverse instance, and you can copy-paste it into the command line.

add_sibling_dataverse.storage(ok): . [dataverse-storage: https://edmond.mpg.de/ (DOI: doi:10.17617/3.KUKEKI)]
[INFO   ] Configure additional publication dependency on "dataverse-storage"
A dataverse API token is required for access. Find it at https://edmond.mpg.de by clicking on your name at the top right corner and then clicking on API Token
token: 
A dataverse API token is required for access. Find it at https://edmond.mpg.de by clicking on your name at the top right corner and then clicking on API Token
token (repeat): 
Enter a name to save the credential securely for future reuse, or 'skip' to not save the credential
name: skip
add_sibling_dataverse(ok): . [dataverse: datalad-annex::?type=external&externaltype=dataverse&encryption=none&exporttree=no&url=https%3A//edmond.mpg.de/&doi=doi:10.17617/3.KUKEKI (DOI: doi:10.17617/3.KUKEKI)]

As soon as you’ve created the sibling, you can push:

datalad push --to dataverse
copy(ok): CHANGES (file) [to dataverse-storage...]
copy(ok): README (file) [to dataverse-storage...]
copy(ok): dataset_description.json (file) [to dataverse-storage...]
copy(ok): sub-01/anat/sub-01_T1w.nii (file) [to dataverse-storage...]
copy(ok): sub-01/func/sub-01_task-auditory_bold.nii (file) [to dataverse-storage...]
copy(ok): sub-01/func/sub-01_task-auditory_events.tsv (file) [to dataverse-storage...]
copy(ok): task-auditory_bold.json (file) [to dataverse-storage...]
publish(ok): . (dataset) [refs/heads/main->dataverse:refs/heads/main [new branch]]
publish(ok): . (dataset) [refs/heads/git-annex->dataverse:refs/heads/git-annex [new branch]]  

Walkthrough: Sharing DataLad datasets via ownCloud / Nextcloud

Get the WebDAV address

  1. Click on Settings (bottom left)
  2. Copy the WebDAV address, for example: https://owncloud.gwdg.de/remote.php/nonshib-webdav/
datalad create-sibling-webdav \
  --dataset . \
  --name owncloud-gwdg \
  --mode filetree \
  'https://owncloud.gwdg.de/remote.php/nonshib-webdav/<dataset-name>'

Replace <dataset-name> with the name of your dataset, i.e., the name of your dataset folder. In this example, we replace <dataset-name> with neuro-data. The complete command for this example hence looks like this:
datalad create-sibling-webdav \
  --dataset . \
  --name owncloud-gwdg \
  --mode filetree \
  'https://owncloud.gwdg.de/remote.php/nonshib-webdav/neuro-data'

You will be asked to provide your ownCloud account credentials:

user:
password:
password (repeat):

Enter the email address of your ownCloud account, then enter and repeat the password of your ownCloud account.
create_sibling_webdav.storage(ok): . [owncloud-gwdg-storage: https://owncloud.gwdg.de/remote.php/nonshib-webdav/neuro-data]
[INFO   ] Configure additional publication dependency on "owncloud-gwdg-storage" 
create_sibling_webdav(ok): . [owncloud-gwdg: datalad-annex::?type=webdav&encryption=none&exporttree=yes&url=https%3A//owncloud.gwdg.de/remote.php/nonshib-webdav/neuro-data]

Finally, we can push the dataset to ownCloud:

datalad push --to owncloud-gwdg

Use datalad push to push the dataset contents to ownCloud. For details on datalad push, see the command line reference and this chapter in the DataLad Handbook.

We can now view the files on ownCloud and inspect them through the web browser.

Footnotes

  1. The Turing Way Community (2022), see “Guide on Reproducible Research”

  2. for example, in Psychology: Crüwell et al. (2023); Hardwicke et al. (2021); Obels et al. (2020); Wicherts et al. (2006)

  3. see Baker (2016), Nature

  4. see e.g., Poldrack (2019)

  5. (Source: Wikipedia)

  6. see DataLad dataset of 80TB / 15 million files from the Human Connectome Project (see details)

  7. Dataverse datasets contain digital files (research data, code, …), amended with additional metadata. They typically live inside of dataverse collections.

  8. At least, Title, Description, and Organization are required.