Session 8: Summary & Outlook

Track, organize and share your work: Version control of code & data with Git & DataLad

Course at AUDICTIVE Priority Program

Slides | Source

Dr. Lennart Wittkuhn

lennart.wittkuhn@tutanota.com

16:00

1 Summary

How are you now?

Schedule

No	Time	Title	Contents
1	09:30 - 10:00	Introduction to Version Control	Logistics and course admin; Introduction to reproducibility; Introduction to version control; Introduction to Git
2	10:00 - 10:45	Basics of the Command Line	File systems and navigation; Benefits of the command line; Basic command line commands
3	10:45 - 11:00	Setup & configuration of Git	Setup & configuration of Git
4	11:00 - 12:00	Basics of Git	Initializing a Git repository; Practicing basic Git commands; Tracking changes with Git; Ignoring files with `.gitignore`; Good commit messages
5	12:00 - 13:00	Lunch Break	Enjoy your lunch!
6	13:00 - 14:00	Integration with GitHub / GitLab	Introduction to remote repositories; Managing repositories on GitHub / GitLab; Pushing and pulling changes; Cloning a remote repository
7	14:00 - 15:00	Version Control of Data with DataLad	Version control of (large) data with DataLad; Nesting modular datasets with DataLad; Establishing provenance and reproducibility with DataLad
8	16:00 - 16:30	Summary & Outlook	Summary of course contents; Outlook to more related topics; Discussing open questions

Learning Objectives

Introduction to Version Control

💡 You know what version control is.
💡 You can argue why version control is useful (for research).
💡 You can name benefits of Git compared to other approaches to version control.
💡 You can explain the difference between Git and GitHub.

Basics of the Command Line

💡 You can name the advantages of command-line interfaces for Git.
💡 You can navigate directories using absolute and relative paths.
💡 You can use shortcuts like the tilde or dots to navigate your file system.
💡 You can apply arguments and flags to customize command-line commands.
💡 You can use wildcards (*) for file selection.
💡 You can combine command-line commands.
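
As a quick recap, the command-line objectives above can be sketched in a few lines. This is a minimal example, assuming a Unix-like shell (for example, Bash); the folder and file names (`project`, `sub-01.csv`, and so on) are made up for illustration.

```bash
# Work in a throwaway directory so nothing on your system is changed.
workdir="$(mktemp -d)"
cd "$workdir"

# Create a small (hypothetical) project layout.
mkdir -p project/data project/code
touch project/data/sub-01.csv project/data/sub-02.csv project/code/analysis.py

# Relative path: move from the current directory into a subfolder.
cd project/data
# Two dots refer to the parent directory.
cd ..
# Absolute path: starts at the root of the file system.
cd "$workdir/project"
# The tilde (~) is a shortcut for your home directory.
echo ~

# Flags customize a command: -l prints a long listing, -a includes hidden files.
ls -l -a

# The wildcard * matches any number of characters in a file name.
ls data/*.csv

# Pipes (|) combine commands: list the CSV files, count the lines,
# and strip any padding spaces from the count.
n_csv="$(ls data/*.csv | wc -l | tr -d ' ')"
echo "Number of CSV files: $n_csv"
```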

Setup

💡 You know how to set up Git for the first time.
💡 You have set up Git on your computer.
💡 You understand the difference between the three Git configuration levels.
💡 You know how to configure your username and email address in Git.
💡 You have set up your preferred text editor when working with Git.
💡 You can escape the command-line text editor Vim.

Learning Objectives (continued)

First steps with Git

💡 You can initialize a Git repository.
💡 You can check the status of a Git repository.
💡 You understand the difference between the staging area and a commit.
💡 You can stage and commit changes.
💡 You understand the difference between a commit message and a description.
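
The first steps above can be sketched as follows. This is a minimal example, assuming Git is installed; the folder name `my-project` and the name and email address are placeholders. The identity is set with `--local`, so it applies to this repository only and leaves your global configuration untouched.

```bash
# Start from an empty throwaway directory.
cd "$(mktemp -d)"
mkdir my-project
cd my-project

# Initialize a Git repository in the current directory.
git init

# Set the identity for this repository only (placeholder values).
git config --local user.name "Jane Doe"
git config --local user.email "jane.doe@example.com"

# Create a file and check the status: the file is listed as untracked.
echo "# My project" > README.md
git status

# Stage the file: it is now in the staging area but not yet committed.
git add README.md
git status

# Commit the staged changes. The first -m is the commit message (subject line),
# the second -m is the longer description (body).
git commit -m "Add README file" -m "Describe the purpose of the project for new contributors."

# Show the last commit with its message and description.
git log -1
```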

Git Essentials

💡 You know how to explore the commit history.
💡 You can compare different commits.
💡 You know how to use and create a .gitignore file.
💡 You can discuss which files can (not) be tracked well with Git and why.
💡 You know how to track empty folders in Git repositories.
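
A short sketch of these essentials, assuming Git is installed. All file names are made up, and the placeholder file name `.gitkeep` is a common convention rather than a Git feature.

```bash
# Set up a throwaway repository with a local (placeholder) identity.
cd "$(mktemp -d)"
git init
git config --local user.name "Jane Doe"
git config --local user.email "jane.doe@example.com"

# Create two commits so that there is a history to explore.
echo "first line" > notes.txt
git add notes.txt
git commit -m "Add notes"
echo "second line" >> notes.txt
git add notes.txt
git commit -m "Extend notes"

# Explore the commit history (one line per commit).
git log --oneline

# Compare the previous commit (HEAD~1) with the latest commit (HEAD).
git diff HEAD~1 HEAD

# Files matching a pattern in .gitignore are not tracked.
echo "*.log" > .gitignore
echo "temporary output" > debug.log

# Git does not track empty folders. A placeholder file inside the folder
# makes the folder part of the repository.
mkdir results
touch results/.gitkeep

git add .gitignore results/.gitkeep
git commit -m "Ignore log files and keep the results folder"

# debug.log does not show up as untracked because it is ignored.
git status
```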

Integration with GitHub / GitLab

💡 You can create a remote repository.
💡 You can connect your local Git repository to a remote repository service like GitHub or GitLab.
💡 You can push changes to and pull changes from a remote repository.
💡 You can clone a repository from a remote repository.
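
The remote workflow can be tried without a GitHub or GitLab account. The sketch below uses a bare repository in a local folder as a stand-in for the remote; this is an assumption made so that the example runs offline. With a real service, the path to `remote.git` would be replaced by the URL of your repository. All names are placeholders.

```bash
workdir="$(mktemp -d)"
cd "$workdir"

# Stand-in for a repository on GitHub / GitLab: a bare repository.
git init --bare remote.git

# A local repository with one commit.
git init local
cd local
git config --local user.name "Jane Doe"
git config --local user.email "jane.doe@example.com"
echo "# Demo" > README.md
git add README.md
git commit -m "Add README file"

# Connect the local repository to the remote and list the configured remotes.
git remote add origin "$workdir/remote.git"
git remote -v

# Push the current branch and remember the remote branch as its upstream.
git push -u origin HEAD

# Clone the remote repository into a second folder (e.g., a collaborator's copy).
cd "$workdir"
git clone remote.git copy
cd copy
git config --local user.name "John Doe"
git config --local user.email "john.doe@example.com"

# Change a tracked file, commit it (-a stages modified tracked files), and push.
echo "More details." >> README.md
git commit -a -m "Extend README file"
git push

# Back in the first repository: pull the collaborator's change.
cd "$workdir/local"
git pull
```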

Learning Objectives (continued)

Introduction to DataLad

💡 You know how to configure your username and email address in Git.
💡 You can create a new DataLad dataset.
💡 You know how to check the status of a DataLad dataset.
💡 You can save data in a DataLad dataset.
💡 You know about different configurations of DataLad datasets.

Nesting with DataLad

💡 You can install an existing DataLad dataset as a subdataset.
💡 You can get and drop data in a DataLad dataset as needed.
💡 You know how to navigate nested DataLad datasets.
💡 You know how to access data in nested DataLad datasets recursively.

Provenance with DataLad

💡 You can link analyses to inputs and outputs using DataLad.
💡 You can rerun a previous analysis with DataLad.
💡 You know how to establish provenance and reproducibility using DataLad.

2 There’s more …

Rewriting history

See chapter “Rewriting History”

Tags, releases, DOIs: Integration with Zenodo

“Zenodo, a CERN service, is an open dependable home for the long-tail of science, enabling researchers to share and preserve any research outputs in any size, any format and from any science.” – from the Zenodo GitHub README

Integrate your repository on GitHub with Zenodo

“To make your repositories easier to reference in academic literature, you can create persistent identifiers, also known as Digital Object Identifiers (DOIs). You can use the data archiving tool Zenodo to archive a repository on GitHub.com and issue a DOI for the archive.” – Details in the GitHub documentation

Navigate to the login page for Zenodo.
Click Log in with GitHub.
Review the information about access permissions, then click Authorize zenodo.
Navigate to the Zenodo GitHub page.
To the right of the name of the repository you want to archive, toggle the button to On.

See our book chapter on “Tags & Releases”.
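
Releases on GitHub are based on Git tags. A minimal sketch of creating an annotated tag, assuming Git is installed; the version number `v1.0.0`, the file name, and the messages are placeholders.

```bash
# Set up a throwaway repository with one commit.
cd "$(mktemp -d)"
git init
git config --local user.name "Jane Doe"
git config --local user.email "jane.doe@example.com"
echo "print('analysis')" > analysis.py
git add analysis.py
git commit -m "Add analysis script"

# Create an annotated tag that marks the current commit as a version.
git tag -a v1.0.0 -m "Version used for the (hypothetical) preprint"

# List all tags and show the details of the new tag without the diff.
git tag --list
git show -s v1.0.0

# With a remote configured, the tag could be shared with:
# git push origin v1.0.0
```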

“Making your project citable” by CodeRefinery (CC BY 4.0)

Graphical User Interfaces (GUIs) for Git

Integrated Development Environments (IDEs)

RStudio

MATLAB

Git Clients

GitKraken

GitHub Desktop

Mobile

Working Copy (iOS)

Continuous Integration & Deployment (CI/CD)

Example: Lennart’s `recipes` repo

Automated spell check
Rebuilding of project website

https://lennartwittkuhn.com/recipes/

3 Discussion

Navigating towards open and reproducible research

Reflect on the following discussion questions:

What are your experiences with (non-)reproducible research?
What has helped you in the past to make your research reproducible?
What are personal and general hurdles for reproducible research?
What can you do to address them?

Science as distributed open-source knowledge development ¹

How can we do better science?

The long-term challenges are non-technical

open-source, avoiding commercial vendor lock-in
adopting new practices and upgrading workflows
moving towards a “culture of reproducibility” ²
changing incentives, policies & funding schemes

Technical solutions already exist!

Version control of digital research outputs (e.g., Git, DataLad)
Integration with flexible infrastructure (e.g., GitLab)
Systematic contributions & review (e.g., pull/merge requests)
Automated integration & deployment (e.g., CI/CD)
Reproducible computational environments (e.g., Docker)
Transparent execution and build systems (e.g., GNU Make)
Project communication next to code & data (e.g., Issues)

Source: “Strategy for Cultural Change” (2019) by the Center for Open Science

In science, we try to generate knowledge about the world.
We do this for the sake of insight or explanation, but also to perform evidence-based interventions.
Problem: We need to integrate our work with the work of other people into a common body of knowledge.
This is a process of continuous integration.
It’s fair to say that the way this is done in science is fairly chaotic.
Other disciplines with an analogous problem have professionalized the process of continuous integration.
Primary analogy: software development.
Why? A lot of contemporary science involves digital research data and software development (or code) to analyze these data (or perhaps disciplines could really benefit from this).
This might be shocking to some (you want to study the brain, but now you need to code), but this is the way it is.
Software development is a standard part of being a scientist.
We have to understand the tools that we use to do our job.
Software development has a lot of tools that allow us to handle continuous integration professionally.
Distributed and asynchronous work in large, international teams.
Main work includes working with data using code.
When you train as a software developer, you learn a common stack of tools.
Testing: writing code to test whether code works.

Reproducibility is a spectrum and a journey

“Reproducibility Scale” by Heidi Seibold and Rabea Müller and The Digital Research Academy Community and The BERD Academy (License: CC BY 4.0)

by Scriberia for The Turing Way Community (2022) (Link, CC BY 4.0)

4 Feedback

Feedback

Please complete the feedback survey: https://version-control-feedback.formr.org/
This should not take much longer than 15 minutes.

5 Questions?

References

The Turing Way Community. (2022). The Turing Way: A handbook for reproducible, ethical and collaborative research. Zenodo. https://doi.org/10.5281/zenodo.3233853

Footnotes

¹ Inspired by Richard McElreath’s “Science as Amateur Software Development” (2023)
² See “Towards a culture of computational reproducibility” by Russ Poldrack, Stanford University
