In this chapter, you will learn how to explore the commit history, compare different commits and let Git ignore certain files. You will also learn more about which files can (not) be tracked well with Git and why.
Learning Objectives
💡 You know how to explore the commit history 💡 You can compare different commits 💡 You know how to use and create a .gitignore file 💡 You can discuss which files can (not) be tracked well with Git and why 💡 You know how to track empty folders in Git repositories
Now that you know how to create a Git repository, stage files, and make commits, it’s time to discuss some good practices for Git commits.
6.1.1 Commit often
Creating frequent commits helps to capture your work progress. This makes it easier to track changes, identify when errors were introduced, or revert to a previous version of your project if something goes wrong. You can think of committing like saving your game in a video game (or rock climbing, see Note 6.1). You want to create saves (commits) regularly so that if you encounter an issue, you don’t lose too much progress.
Note 6.1: The rock climbing analogy of Git commits
Using Git commits can be compared to rock climbing:
Using a Git commit is like using anchors and other protection when climbing. If you’re crossing a dangerous rock face you want to make sure you’ve used protection to catch you if you fall. Commits play a similar role: if you make a mistake, you can’t fall past the previous commit. Coding without commits is like free-climbing: you can travel much faster in the short-term, but in the long-term the chances of catastrophic failure are high! Like rock climbing protection, you want to be judicious in your use of commits. Committing too frequently will slow your progress; use more commits when you’re in uncertain or dangerous territory. Commits are also helpful to others, because they show your journey, not just the destination.
You should not consider a commit as a general “Save” button, that you use to save all your recent changes in one go. Instead, each commit should ideally contain one isolated and complete change. This makes your project history easier to follow.
For example, if you want to rename a variable and add a new enhancement, put the variable rename in one commit and the enhancement in another commit.
This approach to commits helps you down the line. For example, if you want to keep the enhancement but revert the renaming of the variable, you can revert the specific commit that contained the variable rename. If you put the variable rename and the enhancement in the same commit or spread the variable rename across multiple commits, you would spend more effort reverting your changes.
6.1.3 Use meaningful commit messages
Git allows for a commit message with a 72 character limit and a description without such a limit. In general, you should aim to write clear and short commit messages (ideally under 50 characters), yet descriptive enough to clarify the purpose of the commit. There are different conventions across projects/persons but also some general guidelines, you can stick to.
It’s helpful to think about future you or others looking at this project later. Will they understand what this commit was about without having to look at the actual changes?
For example, Fix typo in README would be more descriptive as a commit message than Update files. If more detail is necessary, you can use the option to add a longer description to the commit (as we discussed earlier).
6.1.4 Write commits that explain why changes were made
Instead of just stating what you did in a commit, it’s even more helpful to explain why you made those changes. The “why” gives important context, especially when reviewing older commits.
For example, use Adjust participant exclusion criteria to improve data quality instead of Change participant exclusion criteria.
6.1.5 Start with an imperative verb
It is standard to start the commit message with an imperative verb to indicate the purpose of the commit, for example, “Add”, “Fix” or “Improve”. Here is an example:
Example
git commit -m"Implement user registration featureThis commit adds the user registration functionality to the application. It includes the following changes:- Created a new 'register' route and view for user registration.- Added form validation for user input."
An example of commit messages, which are not really optimal, are illustrated in Figure 6.1.
Commit messages examples
Some examples for commit messages, which stick to the etiquette:
"Refactor database query functions for efficiency"
"Update installation instructions in README"
6.1.6 Stage changes selectively
Only stage and commit files you actually want Git to track. You can use git status to double-check what files you have staged. You can stage files selectively using git add to avoid committing unnecessary files (like temporary files or configuration files specific to your system). Using a .gitignore file to prevent certain files from being tracked by Git is also a good idea (for details, see below).
In principle, any file can be tracked with Git. However, to make use of the full potential of Git, it is recommended to mainly track plain text files with Git. For recommendations on which file to track or not track with Git, see Tip 6.1 and Tip 6.2, respectively.
Tip 6.1: Which files can I track with Git?
Code files
Tracking code files is the most common and original use case for version control with Git. Git is well-suited for tracking changes in source code, and it’s widely used by developers and teams for this purpose. Whether it’s a single developer maintaining a personal project or a large team collaborating on a complex software system, Git excels at tracking code files throughout the development lifecycle.
Plain textfiles
Git is a useful tool even if your project contains few or no code at all, especially in comparison to a setup that uses emails or shared Dropbox folders for “version control” (see the introduction chapter). However, to really be able to use the full set of features of Git, you should rely on plain text files for your project. This is because .docx files are saved as binary files, which makes meaningful outputs of the text inside it impossible for Git. “Plain textfiles” does not mean you have to use .txt files. Instead you can use formattable Markdown (.md) files. Markdown (.md) files are plain text files that contain formatting elements, making them more versatile than plain .txt files. Markdown allows you to add simple formatting like headings, lists, links, and images.
Tip 6.2: Which files should I not track with Git?
Microsoft Office files
As said earlier, it is not recommended to track Microsoft office files using Git, since Git treats .docx files as binary files. Binary files lack the inherent structure that allows Git to capture and display changes effectively. Git relies on understanding the differences between versions of files. With binary files, you won’t benefit from the ability to view detailed textual differences (git diff) or effectively use branches. Collaborative work on binary files, such as simultaneous editing of a Word document, may result in complex merge conflicts that are challenging to resolve.
So while it is possible to track .docx files with Git, you will not be able to use the many features of Git, which rely on Git being able to display the file, since Git can only output the “zeros and ones” and not the text inside the .docx files.
If you ever choose to track them regardless, you should use detailed commit messages, since you will not be able to easily look at the text content of past versions. You will also need to know about temporary word files. For it’s own version control system, Word creates temporary files in the folder of your word file. You should not track these files using Git, since Word creates a lot of them. To not stage or commit these files by accident, you can use the .gitignore file.
6.1.7 Review your changes before committing
Before committing your changes, it’s a good habit to review the differences between your working directory and the staged files using git diff (for details, see below). This ensures that you’re aware of every change you’re about to commit.
6.1.8 Reference issue numbers
If your project is connected to an issue tracker system (like GitHub Issues), it’s helpful to reference issue numbers in your commit messages. This makes it easy to trace which commit addressed a particular issue. For an intro about GitHub Issues, refer to our Issues chapter. For example:
Code
git commit -m"Fix login bug (fixes #42)"
This message indicates that the commit addresses issue number 42 in your issue tracker.
6.1.9 Avoid committing sensitive information
Be careful not to commit sensitive information such as passwords or personal data. Once something is committed, it becomes part of the repository’s history, making it difficult to completely remove later.
You can use .gitignore files to exclude sensitive files from being tracked (details below).
6.2 Logging commits
To look at the history of your past commits, you can use the git log command.
Code
git log
You should see output similar to the following:
Output
1commit e9ea80781ceed7cc3d6bff0c7bfa71f320ec1f60 (HEAD -> main)2Author: Jane Doe <jane@example.com>3Date: Thu Jun 29 12:23:53 2023 +02004Add filename.txt file
1
This line shows the unique identifier (commit hash) for this specific change in the code. The HEAD -> main part indicates that this commit is the latest one on the main branch of the repository.
2
This line tells us who made the change and the email address of the author. This is the information specified using git config (for details, see the setup chapter).
3
This line provides the date and time when the change was made.
4
This line is the commit message provided by the author, which briefly describes what this change does.
git log is a very useful command because it provides a clear and organized view of a repository’s commit history. It allows you to understand the evolution of a project over time by showing detailed information about each commit, including changes made, authors, and timestamps.
To exit the git logview and return to the regular Terminal, press the qq key. This applies to both macOS and Windows systems.
Common git log flags
--oneline: Provides a condensed output with each commit displayed on a single line, showing the abbreviated commit hash and commit message.
-n or --max-count=: Limits the number of commits shown to the specified . For example, git log -n 5 will display the latest 5 commits.
--since=: Shows commits made after the specified . You can use various date formats, such as specific dates or relative expressions like “2 weeks ago” or “yesterday”.
--until=: Shows commits made before the specified .
--author=: Filters commits by the author’s name or email using a specified .
What is HEAD?
HEAD can be thought as the current spot in your project’s timeline, like a bookmark. When you see HEAD -> main, it means your bookmark is on the main branch, usually at the most recent update. It moves with each commit you make, staying always at the forefront of your project’s development. When you switch branches with a command like git switch, HEAD moves to point to the tip of the new branch, changing the context of your working directory to reflect the latest commit on that branch. HEAD is used in many Git commands to reference the current commit. For example, git reset HEAD is a command used to unstage changes by moving the current branch’s tip back to HEAD, without altering the working directory. Similarly, git checkout HEAD~1 will move HEAD back one commit as a way to look at or revert to a previous state of the project without making any changes to the branch itself.
6.3 Comparing versions
Another very handy feature is the git diff command. It allows you to compare two different versions of your file(s).
By default it shows you any uncommitted changes since the last commit. You can explore this by pasting any additional text in your .txt file, for example:
Code
Lorem ipsum dolor sit amet, consectetur adipisici elit, sed eiusmod tempor incidunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquid ex ea commodi consequat. Quis aute iure reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint obcaecat cupiditat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
You can then look at the changes using:
Code
git diff
You should see an output similar to the following:
Output
+++ b/filename.txt@@-0,0 +1 @@+Lorem ipsum dolor sit amet, consectetur adipisici elit, sed eiusmod tempor incidunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquid ex ea commodi consequat. Quis aute iure reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint obcaecat cupiditat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\ No newline at end of file
The output of git diff includes several pieces of information:
+++ and --- indicate the paths to the files being compared.
@@ -0,0 +1 @@ is a unified diff header. It shows where the changes occurred. Note, that this line will look different for each change or commit.
+ indicates added lines and - indicates removed lines. Lines that are identical between the two versions are not explicitly shown.
You can also use git diff to compare two specific commits. For example, if you want to compare the current state of your file with a commit from a few commits ago:
Replace <filename> with the actual name of the file that you want to inspect.
This command can also be combined with additional flags and options.
Common git diff flags
--cached or --staged: Shows the changes that are staged (added) but not yet committed.
<commit>: Displays the difference between the current working directory and a specific commit. For example, git diff abc123 shows the changes compared to commit abc123.
<commit> .. <commit>: Shows the difference between two specific commits. For example, git diff abc123 def456 displays the changes between commit abc123 and commit def456.
--name-only: Outputs only the names of the files that have differences, without showing the actual changes.
--name-status: Displays the names of files along with a status identifier that indicates if a file was added, modified, or deleted.
--color-words: Highlights the differences at the word level, providing more granular detail.
6.4 Commit hashes
To compare the changes between two commits, you can use git diff followed by the commit hashes of the two points in history you’re interested in comparing. A commit hash is akin to a unique fingerprint for each commit you make to your repository. This hash is a long string of characters generated by Git using a cryptographic hash function (specifically SHA-1, though Git is transitioning to SHA-256 for enhanced security). The purpose of this hash is to uniquely identify every single change set or commit in the history of your repository. Think of it as a precise identifier that Git uses to track the specific state of your project at the moment of each commit. Every commit hash is unique. Even the slightest change results in a completely different hash.
6.5 Amending a commit
The git commit --amend command is used to modify (or “amend”) your most recent Git commit in your repository. It allows you to edit the commit message, add more changes to the commit, or simply adjust the last commit without creating a new commit. This flag can be useful if you forgot to include something in your last commit, made a typo or want to change your commit message. You can also combine --amend with other flags. To change the commit message of your last commit (without opening an editor) try:
Replace "changed commit message" with your actual improved commit message.
If you want to to keep your commit message but include new changes to your file(s) in you last commit, you can use:
Code
git commit --amend--no-edit
The --no-edit flag, lets you keep the commit message of your most recent commit.
6.5.0.1 The repeated --amend workflow
Let’s consider the following part of the rock climbing analogy for commits again (for the full analogy, see Note 6.1):
use more commits when you’re in uncertain or dangerous territory.
The “repeated amend” is a Git workflow where you avoid cluttering your history with numerous tiny commits. Instead, you gradually build up a “good” commit by amending it repeatedly. Continue making small changes and amending the existing commit, refining it over time. This method keeps your Git history useful and looking good, making sure your commits tell a straightforward story of how your project evolved.
This workflow can be useful when you’re working on a new feature or a significant change and you might make several small adjustments to get it right. Instead of cluttering your history with numerous tiny commits for each tweak, you can use the “repeated amend” to gradually build up a well-polished commit that includes the entire feature.
6.6 Partial commits
Git allows you to make partial commits by staging only specific parts of your file before committing. This can be achieved using git add with the -p or --patch option, which allows you to interactively choose which changes to stage.
To try this make some changes to your file(s) and use:
Code
git add -p
This will prompt you with each change, giving you options to stage, skip, or split the changes. You’ll see a series of prompts like this:
Output
+ Example text ...+++(1/x)Stage this hunk [y,n,q,a,d,e,?]?
These prompt options respectively stand for:
y: Stage this hunk.
n: Do not stage this hunk.
q: Quit. Do not stage this hunk or any remaining hunks.
a: Stage this hunk and all later hunks in the file.
d: Do not stage this hunk or any later hunks in the file.
/: Search for a hunk matching the given regex.
e: Manually edit the current hunk.
?: Print help.
Type one of the symbols in the command line to proceed in the desired manner.
What is a “hunk”?
In Git, a “hunk” refers to a distinct block of code changes within a file. It represents a cohesive set of added, modified, or deleted lines in a specific location. Git automatically divides changes into hunks to facilitate easier review, selective staging, and conflict resolution during version control operations.
Our recommendation: Use a GUI for partial commits
In our opinion, using partial commits on the command line is a bit of a hassle. This would be a good use case for a Git Graphical User Interface (GUI). To checkout how to do partial commits using GitKraken checkout the GUI chapter in this book.
6.7 Ignoring files and folders: .gitignore
.gitignore is a special file. It is used to specify files and folders that should be ignored and not tracked by Git. When you create a .gitignore file, place it inside your Git repository and specify names of files or folders inside it. Git will then exclude these files and directories from being staged or committed, for example when using git add -A which normally stages all your files. The .gitignore file does not have a file extension like .txt and is simply named .gitignore.
The .gitignore file is useful to prevent certain files or directories that are not essential for the project or generated during the development process from being included in the version history. Files that can recreated from the code, like for example .png plots from R code, should also be included in your project’s .gitignore file. Including only essential files in version control keeps your repository clean, making it easier for collaborators to focus on the important project files without distractions from unnecessary files. .gitignore can also help you avoid accidentally committing sensitive information like passwords, API keys, or personal data, which can lead to serious privacy breaches. Huge files, like big data files should also not be committed, since they can slow down working with your repository. To version control files with a big size, tools like DataLad are more appropriate.
To create a .gitignore file from the command line, navigate to your repository and use the touch command. Alternatively, create the file in your favorite text editor.
Code
touch .gitignore
.gitignore will be a hidden file, so to make it show up in the command line, you will have to use the -a flag in the ls command. For details on listing of files and folders, see the chapter on the command line.
Code
ls-a
Now you can write a filename or folder name inside it, to prevent Git from tracking it. If you specify a folder, git will also start to ignore every subfolder in it.
It’s recommended to add a project-specific .gitignore file to your repository using the regular stage-commit workflow that you learned about in this chapter, for example:
git add .gitignoregit commit -m"Add .gitignore"
6.7.1 Global .gitignore file
Instead of specifying files you want Git to ignore for each folder, you can also create a global .gitignore file. To do this, you create a file named .gitignore_global (or any name you prefer) in your users’s home directory (located, for example, at /Users/yourusername). Then, you configure Git to use this global file by running:
After this setup, you can add common files to this list, that you want to ignore. These rules will apply across to all Git repositories on your computer.
Common wildcards
As discussed in the command line chapter wildcards are special characters that represent patterns of filenames or directory names. They can be used to specify multiple files or directories that should be ignored by Git when tracking changes. Wildcards allow you to match multiple files with a single rule, making it more convenient to exclude specific types of files or patterns.
*: Matches any number of characters within a filename. For example, *.txt will match all files with the extension .txt in any directory.
?: Matches a single character within a filename. For example, image?.png will match files like image1.png or imageA.png.
/: Matches the root directory of the repository. For example, /config will match a directory named config only in the root of the repository.
[] (Square Brackets): Matches any single character within the brackets. For example, file[123].txt will match file1.txt, file2.txt, or file3.txt.
!: Negates a pattern and includes files that would otherwise be ignored. For example, !important.txt will exclude important.txt from being ignored, even if there’s a wildcard pattern that matches it.
Common files to ignore in scientific settings
Temporary files and output: Ignore files generated during analysis, like log files, temporary files, or intermediate data files.
Data folders: Exclude large datasets or data stored locally.
Environment-specific files: Exclude environment-specific files like .env files or venv folders used for local development setups.
System-specific files: In macOS, ignore .DS_Store files, and in Windows, ignore Thumbs.db files.
R: Ignore .Rdata files and /Rplots.pdf generated by R for plotting.
Python: Ignore .pyc (Python compiled) files and pycache folders.
LaTeX: Ignore auxiliary files like .aux, .log, .bbl, and .blg files generated during LaTeX compilation.
Example .gitignore file for an R project
# R-specific*.Rproj.user/*.Rhistory.RData.Rproj# R package specific.Rcheck/man/*.RdNAMESPACE# Temporary files*.bak*.csv~*.html*.pdf# RMarkdown-specific*.knit.md*_cache/*_files/# R Markdown Notebook*.nb.html# RStudio Project Files*.Rproj# R Environment Variables.Renviron# R dcf fileDESCRIPTION.meta# Mac-specific.DS_Store
What is the .DS_Store file (macOS only)?
.DS_Store is a hidden file created by macOS in every folder to store custom attributes of the folder, such as the position of icons or the choice of a background image. The filename stands for “Desktop Services Store” and it helps macOS Finder maintain the folder’s view settings. .DS_Store is very commonly ignored from version control using the .gitignore file, for various reasons:
Unnecessary for the project: .DS_Store files contain metadata that are only relevant to the local macOS file system and do not provide any useful information for the project itself. Including them in the repository adds unnecessary clutter.
Avoiding conflicts: Since .DS_Store files are automatically generated by macOS, different users may have different versions of .DS_Store files based on their personal Finder settings. Should .DS_Store be tracked by Git, this can lead to unnecessary version conflicts when collaborating on a project.
Irrelevant for collaborators with another operating system: Anyone else who is not using macOS (for example, Windows or Linux users) do not need these files and they serve no purpose for them. Including .DS_Store in the repository could confuse contributors who are not using macOS.
To add .DS_Store to .gitignore, you can include the following line in your .gitignore file:
Code
.DS_Store
This will ensure that any .DS_Store files in the project directory and its subdirectories are ignored by Git and not tracked in the repository.
6.8 Renaming files with Git
To rename files in Git, use the git mv command. This command not only renames the file but also stages the change for you, so the renaming is ready to be committed. This method is preferable to renaming the file manually, as it allows Git to track the change seamlessly and eliminates the need to stage both the removal of the old file and the addition of the new file separately. Here is an example:
Code
git mv old_filename.txt new_filename.txt
After running this command, Git stages the file rename, allowing you to commit the change like any other.
6.9 Tracking empty folders
Normally, Git does not track empty folders in a repository. However, tracking an empty directory in a Git repository can be useful in some scenarios. For example, you want to track an empty folder as a placeholder for future content or as a way to organize the project more clearly.
A workaround, to get Git to track your empty folder, is adding a hidden .gitkeep file in this folder 1. This not an official Git feature but rather a convention adopted by software developers. That means you could name this file anything, but .gitkeep is used by convention to explicitly state its purpose for keeping otherwise empty folders in Git.
Use-cases for empty folders in Git repositories for scientific projects
As discussed above, Git is not suitable for tracking a lot of large files, as is often the case when working with research data. Imagine that you use Git for your code that you use to analyze data from a research project. Usually, your analysis code needs access to the research data but you think that the research data is too large to be tracked with Git.
Here is what you could do:
Create an empty subfolder in your analysis Git repo, called data.
Add a .gitkeep file to the /data folder and commit it.
Add the /data folder to your .gitignore file.
Add your research data to the /data folder.
This is the result: Your analysis code and research data can now live in the same repository. Your analysis code can access your research data in the /data folder. However, the contents of /data are not committed to the repository, thanks to the .gitignore file. If you later share that repository with collaborators, they will only receive a repository with an empty /data folder, thanks to .gitkeep. The advantage is that your collaborators can easily understand where to put the research data. Note, however, that you still need to make the research data accessible to your collaborators in another way.
If you are interested in version control of larger datasets, you can check out DataLad.
6.10 Cheatsheet
Command
Description
git log
Views past commits
git diff
Views made changes compared to the last commit
git mv
Renames or moves files and automatically stages the changes
6.11 Acknowledgements & further reading
We would like to express our gratitude to the following resources, which have been essential in shaping this chapter. We recommend these references for further reading: