Effective Progress Tracking and Collaboration: An Introduction to Version Control of Code and Data

University of Hamburg

Version Control
Git
Reproducibility
Open Science
Datalad
Teaching
Course

Winter 2023/24  Local  6 attendees

Published

October 20, 2023

Course Description

The digital objects on our computers are in a constant state of flux. Manuscripts, programming code and research data change continuously over long periods of time, often in close collaboration with others. Systematic documentation of these changes forms the basis for controlled and reproducible work on code and data. In this seminar with hands-on exercises, participants will learn how to track progress and collaborate effectively through version control of code and data.

Version control is the notebook for a digital world, and Git is probably the best-known version control system. Git makes it possible to document changes to digital objects precisely and thus to track who changed what in a file, when, how and why. Changes can be revised, and versions can be compared and restored. It is also possible to work on the same file simultaneously and to integrate parallel versions systematically. Beyond that, Git enables effective collaboration: via platforms such as GitHub or GitLab, code and data can be shared with the world, viewed transparently by others, reused and developed collaboratively. In this way, version control helps to ensure that knowledge generated from data is transparent, accessible and verifiable. As an effective method for storing and manipulating code and data, version control is thus a core competency of data literacy.
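To make the idea of tracking "who changed what, when, how and why" more concrete, here is a minimal sketch in Python. It is a toy, in-memory model of a file history and does not reflect how Git works internally; the names `Revision` and `History`, the authors and the file contents are all hypothetical and chosen only for illustration.

```python
import difflib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)
class Revision:
    author: str          # who made the change
    message: str         # why the change was made
    timestamp: datetime  # when the change was recorded
    content: str         # what the file looked like after the change


class History:
    """Toy history of a single text file (hypothetical, not Git's data model)."""

    def __init__(self) -> None:
        self.revisions: List[Revision] = []

    def commit(self, author: str, content: str, message: str) -> int:
        """Record a new version and return its index in the history."""
        revision = Revision(author, message, datetime.now(timezone.utc), content)
        self.revisions.append(revision)
        return len(self.revisions) - 1

    def diff(self, old: int, new: int) -> List[str]:
        """Compare two recorded versions line by line (how the file changed)."""
        before = self.revisions[old].content.splitlines()
        after = self.revisions[new].content.splitlines()
        return list(
            difflib.unified_diff(
                before, after, fromfile=f"rev{old}", tofile=f"rev{new}", lineterm=""
            )
        )

    def restore(self, index: int) -> str:
        """Return the content of an earlier version."""
        return self.revisions[index].content


# Two hypothetical collaborators edit the same small script.
history = History()
first = history.commit("alice", "x = 1\ny = 2\n", "Add initial analysis parameters")
second = history.commit("bob", "x = 1\ny = 3\n", "Correct value of y")

changes = history.diff(first, second)
print("\n".join(changes))
print(history.revisions[second].author, "-", history.revisions[second].message)
```

Each recorded revision keeps the author, a message and a timestamp alongside the content, so any two versions can be compared and any earlier one restored. A real version control system such as Git adds far more on top of this, for example branching and merging of parallel versions, which this sketch deliberately leaves out.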

Because Git was developed for versioning small, text-based files (such as programming code), its usability with larger binary files (such as image or video data) is limited for technical reasons. The open-source software DataLad extends the features of Git to provide version control for large data sets (up to several terabytes). DataLad can be applied to arbitrary data structures and is independent of any central infrastructure or third-party vendor. Like Git, DataLad tracks changes to data and can restore previous versions of a data set. Furthermore, DataLad can capture the digital provenance of data so that analyses of the data can be reproduced accurately. Finally, DataLad makes it possible to publish data on a variety of platforms and to use existing data.