Session 8: Summary & Outlook

Track, organize and share your work: Version control of code & data with Git & DataLad

Course at AUDICTIVE Priority Program


License: CC BY 4.0

16:00

1 Summary

How are you now?

Schedule

No | Time | Title | Contents
1 | 09:30 - 10:00 | Introduction to Version Control | Logistics and course admin; Introduction to reproducibility; Introduction to version control; Introduction to Git
2 | 10:00 - 10:45 | Basics of the Command Line | File systems and navigation; Benefits of the command line; Basic command line commands
3 | 10:45 - 11:00 | Setup & configuration of Git | Setup & configuration of Git
4 | 11:00 - 12:00 | Basics of Git | Initializing a Git repository; Practicing basic Git commands; Tracking changes with Git; Ignoring files with .gitignore; Good commit messages
5 | 12:00 - 13:00 | Lunch Break | Enjoy your lunch!
6 | 13:00 - 14:00 | Integration with GitHub / GitLab | Introduction to remote repositories; Managing repositories on GitHub / GitLab; Pushing and pulling changes; Cloning a remote repository
7 | 14:00 - 15:00 | Version Control of Data with DataLad | Version control of (large) data with DataLad; Nesting modular datasets with DataLad; Establishing provenance and reproducibility with DataLad
8 | 16:00 - 16:30 | Summary & Outlook | Summary of course contents; Outlook to more related topics; Discussing open questions

Learning Objectives

Introduction to Version Control

💡 You know what version control is.
💡 You can argue why version control is useful (for research).
💡 You can name benefits of Git compared to other approaches to version control.
💡 You can explain the difference between Git and GitHub.

Basics of the Command Line

💡 You can name the advantages of command-line interfaces for Git.
💡 You can navigate directories using absolute and relative paths.
💡 You can use shortcuts like the tilde or dots to navigate your file system.
💡 You can apply arguments and flags to customize command-line commands.
💡 You can use wildcards (*) for file selection.
💡 You can combine command-line commands.
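
As a quick refresher, the command-line objectives above might look like this in a POSIX shell. This is only a sketch: the scratch directory and all file names are made up for illustration.

```shell
# Work in a throwaway directory so nothing on your system is touched.
scratch="$(mktemp -d)"
cd "$scratch"

# Create a small example project (hypothetical names).
mkdir -p project/data
touch project/data/sub-01.csv project/data/sub-02.csv project/notes.txt

cd project/data          # relative path: resolved from the current directory
cd ..                    # two dots: go up one level, back to "project"
cd "$scratch/project"    # absolute path: works from anywhere
ls ~ > /dev/null         # tilde: shortcut for your home directory

ls -l data               # flag -l asks for a long listing; "data" is the argument
ls data/*.csv            # wildcard: every file in data/ that ends in .csv

# Combine commands with pipes: list the CSV files, count the lines,
# and strip the padding that some versions of wc add.
n_csv="$(ls data/*.csv | wc -l | tr -d ' ')"
echo "CSV files: $n_csv"
```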

Setup

💡 You know how to set up Git for the first time.
💡 You have set up Git on your computer.
💡 You understand the difference between the three Git configuration levels.
💡 You know how to configure your username and email address in Git.
💡 You have set up your preferred text editor when working with Git.
💡 You can exit the command-line text editor Vim.
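
The setup objectives above could be recapped roughly as follows. The sketch uses a scratch repository and the local configuration level so that your personal settings stay untouched; the name, email address, and editor are placeholders.

```shell
# Scratch repository, so your real configuration is not modified.
repo="$(mktemp -d)"
cd "$repo"
git init --quiet

# Local level: applies to this repository only (stored in .git/config).
git config --local user.name "Ada Example"
git config --local user.email "ada@example.com"

# Global level: applies to all repositories of your user account.
# Shown as comments because they would change your personal settings:
#   git config --global user.name "Ada Example"
#   git config --global user.email "ada@example.com"
#   git config --global core.editor "nano"
# System level (--system): applies to every user on the machine.

# The more specific level wins: local overrides global, global overrides system.
git config user.name
git config --list --show-origin | grep "user\."

# Stuck in Vim? Press Esc, type :q! and press Enter to quit without saving
# (or :wq to save and quit).
```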

Learning Objectives (continued)

First steps with Git

💡 You can initialize a Git repository.
💡 You can check the status of a Git repository.
💡 You understand the difference between the staging area and a commit.
💡 You can stage and commit changes.
💡 You understand the difference between a commit message and a description.
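
A minimal sketch of these first steps, run in a scratch repository. The file name, identity, and commit texts are placeholders, not part of the course material.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init --quiet
git config user.name "Ada Example"        # placeholder identity
git config user.email "ada@example.com"

echo "# My project" > README.md
git status --short      # "?? README.md": the file is untracked

git add README.md       # stage: select the change for the next commit
git status --short      # "A  README.md": the file is staged

# The first -m becomes the commit message (subject line),
# the second -m becomes the longer description (body).
git commit --quiet -m "Add README" \
    -m "Describe the purpose of the project for new collaborators."

git log -1 --format="%s"   # print only the message
git log -1 --format="%b"   # print only the description
```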

Git Essentials

💡 You know how to explore the commit history.
💡 You can compare different commits.
💡 You know how to use and create a .gitignore file.
💡 You can discuss which files can (not) be tracked well with Git and why.
💡 You know how to track empty folders in Git repositories.
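
The essentials above can be sketched in a scratch repository like this. File names and ignore patterns are illustrative assumptions. Note that `.gitkeep` is a common naming convention for a placeholder file, not a Git feature: Git tracks files, not directories, so a folder needs at least one tracked file.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init --quiet
git config user.name "Ada Example"        # placeholder identity
git config user.email "ada@example.com"

echo "first draft" > notes.txt
git add notes.txt
git commit --quiet -m "Add notes"

echo "second draft" > notes.txt
git commit --quiet -am "Revise notes"     # -a stages changes to tracked files

git log --oneline            # explore the commit history
git diff HEAD~1 HEAD         # compare the previous commit with the latest one

# Ignore log files and a temporary results folder.
printf '%s\n' '*.log' 'results/tmp/' > .gitignore
echo "debug output" > run.log
git status --short           # lists .gitignore, but not run.log

# Keep an otherwise empty folder in the repository via a placeholder file.
mkdir figures
touch figures/.gitkeep
git add .gitignore figures/.gitkeep
git commit --quiet -m "Ignore log files and keep empty figures folder"
```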

Integration with GitHub / GitLab

💡 You can create a remote repository.
💡 You can connect your local Git repository to a remote repository service like GitHub or GitLab.
💡 You can push changes to and pull changes from a remote repository.
💡 You can clone a repository from a remote repository.
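
The remote workflow can be tried without any account. In the sketch below, a bare repository on the local disk stands in for a repository hosted on GitHub or GitLab (an assumption made for illustration; with a real service you would use its URL instead of the local path). All names are placeholders.

```shell
work="$(mktemp -d)"
cd "$work"

# Stand-in for the remote repository that GitHub / GitLab would host.
git init --quiet --bare remote.git

# Your local repository with one commit.
git init --quiet local
cd local
git config user.name "Ada Example"
git config user.email "ada@example.com"
echo "hello" > README.md
git add README.md
git commit --quiet -m "Add README"

# Connect the local repository to the remote and push the current branch.
git remote add origin "$work/remote.git"
git push --quiet -u origin HEAD

# A collaborator clones the remote, commits a change, and pushes it.
cd "$work"
git clone --quiet remote.git colleague
cd colleague
git config user.name "Ben Example"
git config user.email "ben@example.com"
echo "a second line" >> README.md
git commit --quiet -am "Extend README"
git push --quiet

# Back in your repository: pull the collaborator's change.
cd "$work/local"
git pull --quiet
```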

Learning Objectives (continued)

Introduction to DataLad

💡 You know how to configure your username and email address in Git.
💡 You can create a new DataLad dataset.
💡 You know how to check the status of a DataLad dataset.
💡 You can save data in a DataLad dataset.
💡 You know about different configurations of DataLad datasets.

Nesting with DataLad

💡 You can install an existing DataLad dataset as a subdataset.
💡 You can get and drop data in a DataLad dataset as needed.
💡 You know how to navigate nested DataLad datasets.
💡 You know how to access data in nested DataLad datasets recursively.

Provenance with DataLad

💡 You can link analyses to inputs and outputs using DataLad.
💡 You can rerun a previous analysis with DataLad.
💡 You know how to establish provenance and reproducibility using DataLad.
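
A rough sketch of the basic DataLad workflow from these sessions, assuming DataLad and git-annex are installed (the example skips itself otherwise). The dataset name, file names, toy "analysis", and the identity set through environment variables are all placeholders.

```shell
if command -v datalad > /dev/null 2>&1; then
    # Placeholder identity for this shell session only.
    export GIT_AUTHOR_NAME="Ada Example" GIT_AUTHOR_EMAIL="ada@example.com"
    export GIT_COMMITTER_NAME="Ada Example" GIT_COMMITTER_EMAIL="ada@example.com"

    work="$(mktemp -d)"
    cd "$work"

    # Create a dataset; the text2git configuration keeps text files in Git.
    datalad create -c text2git my-analysis
    cd my-analysis

    echo "one two three" > input.txt
    datalad status                      # input.txt is reported as untracked
    datalad save -m "Add input data"    # record the current state

    # Run a command and record its inputs, outputs, and the command itself.
    datalad run -m "Count words in input" \
        --input input.txt --output word_count.txt \
        "wc -w < input.txt > word_count.txt"

    # Re-execute the recorded command of the most recent commit.
    datalad rerun
else
    echo "DataLad is not installed; skipping this example."
fi
```

After the run, `git log` in the dataset shows a commit whose message contains the recorded command together with its inputs and outputs.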

2 There’s more …

Rewriting history

See chapter “Rewriting History”

Credit: tech_kody via TikTok

Tags, releases, DOIs: Integration with Zenodo

“Zenodo, a CERN service, is an open dependable home for the long-tail of science, enabling researchers to share and preserve any research outputs in any size, any format and from any science.” – from the Zenodo GitHub README

Integrate your repository on GitHub with Zenodo

“To make your repositories easier to reference in academic literature, you can create persistent identifiers, also known as Digital Object Identifiers (DOIs). You can use the data archiving tool Zenodo to archive a repository on GitHub.com and issue a DOI for the archive.” – Details in the GitHub documentation

  1. Navigate to the login page for Zenodo.
  2. Click Log in with GitHub.
  3. Review the information about access permissions, then click Authorize zenodo.
  4. Navigate to the Zenodo GitHub page.
  5. To the right of the name of the repository you want to archive, toggle the button to On.

See our book chapter on “Tags & Releases”.
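
Once the integration is switched on, Zenodo archives the releases you publish on GitHub, and a release is built on a Git tag. Creating such a tag locally might look like the sketch below. The version number, file, and messages are placeholders, and the push to GitHub is only shown as a comment because the scratch repository has no remote.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init --quiet
git config user.name "Ada Example"        # placeholder identity
git config user.email "ada@example.com"

echo "final analysis" > analysis.txt
git add analysis.txt
git commit --quiet -m "Add final analysis"

# Annotated tag: stores a message, the tagger, and a date.
git tag -a v1.0.0 -m "First public release"

git tag --list                  # list all tags
git show --no-patch v1.0.0      # tag details plus the commit it points to

# With a remote configured, share the tag and then create the release
# from it on GitHub:
#   git push origin v1.0.0
```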

Graphical User Interfaces (GUIs) for Git

Integrated Development Environments (IDEs)

RStudio

MATLAB

Git Clients

GitKraken

GitHub Desktop

Mobile

Working Copy (iOS)

Continuous Integration & Deployment (CI/CD)

from Suresoft

Example: Lennart’s recipes repo

  • Automated spell check
  • Rebuilding of project website

https://lennartwittkuhn.com/recipes/

3 Discussion

Science as distributed open-source knowledge development 1

How can we do better science?

The long-term challenges are non-technical

  • open-source, avoiding commercial vendor lock-in
  • adopting new practices and upgrading workflows
  • moving towards a “culture of reproducibility” 2
  • changing incentives, policies & funding schemes

Technical solutions already exist!

  • Version control of digital research outputs (e.g., Git, DataLad)
  • Integration with flexible infrastructure (e.g., GitLab)
  • Systematic contributions & review (e.g., pull/merge requests)
  • Automated integration & deployment (e.g., CI/CD)
  • Reproducible computational environments (e.g., Docker)
  • Transparent execution and build systems (e.g., GNU Make)
  • Project communication next to code & data (e.g., Issues)

Reproducibility is a spectrum and a journey

4 Feedback

5 Questions?

Footnotes

  1. inspired by Richard McElreath’s “Science as Amateur Software Development” (2023)

  2. see “Towards a culture of computational reproducibility” by Russ Poldrack, Stanford University