Git is ignorant by design

Nov 08, 2017 — Tags: Musings

This article explores a very particular problem of Git, namely its ignorance. The main motivation for doing so is that the described problem is a regular source of confusion for me personally. At the same time, I have not been able to find an official term for it, or even just an article describing it, so I wrote one myself.

An example

The concept is best explained using an example. Consider the following output of git show, which shows a change to a piece of Python code:

commit 9dadc2.........
Author: ......
Date: ......

    Refactoring: move method_c and method_d
    
    These methods are moved from OneClass to AnotherClass, as the latter
    is the correct location because of some reasons that would certainly
    exist if this wasn't a toy example.

diff --git a/example.py b/example.py
index 1cde8fb..f487c7e 100644
--- a/example.py
+++ b/example.py
@@ -2,17 +2,17 @@ class AnotherClass(object):
     def method_a(self):
         pass  # implementation of method_a ...
 
-
-class OneClass(object):
-    def method_b(self):
-        pass  # implementation of method_b ...
-
     def method_c(self):
         pass  # implementation of method_c ...
 
     def method_d(self):
         pass  # implementation of method_d ...
 
+
+class OneClass(object):
+    def method_b(self):
+        pass  # implementation of method_b ...
+
     def method_e(self):
         pass  # implementation of method_e

How should we interpret this diff?

Let’s start by reading what the author ostensibly intended. The commit message talks about moving the two methods method_c and method_d from OneClass to AnotherClass.

However, if we look at the actual diff, we see something else entirely: the removal of the definition of OneClass and its method method_b in one place, and the reintroduction of these elements in another place.

It might seem that either the commit message or the actual contents of the commit are wrong. It turns out that neither is the case. Instead, the source of confusion lies in the way git works¹.

Snapshots on text files

Git views the world as a collection of files (usually: text files), of which snapshots are made each time a commit is made. This means that, for the example above, git got to see the following two versions:

Before the commit:

class AnotherClass(object):
    def method_a(self):
        pass  # implementation of method_a ...


class OneClass(object):
    def method_b(self):
        pass  # implementation of method_b ...

    def method_c(self):
        pass  # implementation of method_c ...

    def method_d(self):
        pass  # implementation of method_d ...

    def method_e(self):
        pass  # implementation of method_e

After the commit:

class AnotherClass(object):
    def method_a(self):
        pass  # implementation of method_a ...

    def method_c(self):
        pass  # implementation of method_c ...

    def method_d(self):
        pass  # implementation of method_d ...


class OneClass(object):
    def method_b(self):
        pass  # implementation of method_b ...

    def method_e(self):
        pass  # implementation of method_e

First, note that these two versions actually correspond to the commit message: the two methods method_c and method_d have indeed moved from OneClass to AnotherClass.

Now, why does git present a diff that seems to tell a different story? To answer this, we must first understand what git diff and git show do in the first place: given two versions of a file, these commands attempt to calculate the smallest difference between them and present it.

Git makes very few assumptions about the meaning of a file’s contents, and most of them concern “plain text files”. For example, it knows that newline characters separate lines. Using this knowledge, the goal of finding a minimal difference is interpreted to mean “a difference with a minimal number of changed lines”.
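To make “a minimal number of different lines” concrete, here is a minimal sketch. It assumes that “smallest” means the fewest removed plus added lines, and it uses a textbook longest-common-subsequence table, not Git’s actual implementation. The function names and the sample texts are made up for illustration.

```python
def lcs_length(old, new):
    """Length of the longest common subsequence of two lists of lines."""
    # table[i][j] holds the LCS length of old[:i] and new[:j].
    table = [[0] * (len(new) + 1) for _ in range(len(old) + 1)]
    for i, old_line in enumerate(old, start=1):
        for j, new_line in enumerate(new, start=1):
            if old_line == new_line:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(old)][len(new)]


def minimal_changed_lines(old_text, new_text):
    """Fewest removed-plus-added lines that turn old_text into new_text."""
    old, new = old_text.splitlines(), new_text.splitlines()
    common = lcs_length(old, new)  # lines that can stay untouched
    return (len(old) - common) + (len(new) - common)


# Hypothetical sample: the line "beta" moves to the end of the file.
old_text = "alpha\nbeta\ngamma\ndelta\n"
new_text = "alpha\ngamma\ndelta\nbeta\n"
print(minimal_changed_lines(old_text, new_text))  # 2: remove "beta", add "beta"
```

Note that the lines are compared as opaque strings: nothing in this computation knows or cares what the lines mean.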

Given these goals, the diff as presented in the original example is exactly what you’d expect: removing OneClass and method_b in one place and adding them in another is simply more efficient than the alternative of removing and adding method_c and method_d. The former affects 5 removed lines and 5 added lines, for a total of 10, while the latter would affect 2 × 6 = 12 lines. (In both cases, empty lines contribute to the numbers.) Given that Git attempts to show the smallest diff, it prefers the former option over the latter.
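The arithmetic can be checked with a small sketch. It treats the “before” version as a list of lines and applies the two competing edits; move_block is a hypothetical helper written for this illustration, and the line indices are counted by hand from the listing above.

```python
BEFORE = """\
class AnotherClass(object):
    def method_a(self):
        pass  # implementation of method_a ...


class OneClass(object):
    def method_b(self):
        pass  # implementation of method_b ...

    def method_c(self):
        pass  # implementation of method_c ...

    def method_d(self):
        pass  # implementation of method_d ...

    def method_e(self):
        pass  # implementation of method_e
""".splitlines()


def move_block(lines, start, end, dest):
    """Move lines[start:end] so it sits just before original index dest.

    Returns the new list and the cost in diff lines (removed + added).
    dest must not fall inside the half-open range start..end.
    """
    block = lines[start:end]
    rest = lines[:start] + lines[end:]
    if dest >= end:
        dest -= len(block)  # account for the lines taken out above dest
    return rest[:dest] + block + rest[dest:], 2 * len(block)


# What the diff shows: OneClass and method_b (plus blank lines) move down.
shown, shown_cost = move_block(BEFORE, 4, 9, 15)
# What the author did: method_c and method_d move up.
meant, meant_cost = move_block(BEFORE, 9, 15, 4)

print(shown == meant)          # True: both edits give the same file
print(shown_cost, meant_cost)  # 10 12
```

Both edits produce the same file, so a line-based tool has nothing to go on except the cost, and the cheaper edit is the one that moves the class definition.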

The problem defined

Given that git diff behaves according to specification, what’s the problem? Can we make the suggestion that the original example is confusing more precise? In fact we can!

The key to this is in realizing the following: Our example deals with a piece of Python code, in which concepts such as “classes” and “methods” exist. These concepts have a hierarchical nature: classes are structures that may contain other structures such as methods. When editing the file, we are aware of this structure – in fact, it forms the very reason we edit the file, an intent that the author even captured in the commit message.

Moving part of a hierarchical structure to a different container is quite a useful and natural thing to be able to express. For comparison, consider the following phrases: “moving files to another directory” and “moving marbles to a different vase”. Contrast this with the unnatural alternative of saying that the containers have, so to speak, “wrapped around newly contained items”: We don’t usually say “we’ve moved a directory-ending somewhat, so that it now contains 3 more files”, nor do we say “a vase has been moved to a location where it happens to contain more marbles”.

Git, however, occasionally does precisely that, as demonstrated in the original example: it moves a class definition relative to its contents rather than vice versa.

This is because Git remains agnostic of the structure of the files under its control, noting only that they are text files. Given that, it can do no better than calculate the difference with the smallest number of changed lines. In our example, those lines happen to include the class definition rather than the lower-level methods contained in it.

It is for this reason that we can say “Git is ignorant” – it’s ignorant of the structure of the files it’s diffing.
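For contrast, here is a rough sketch of what a structure-aware comparison could report for the same change. It uses Python’s standard ast module to look at classes and methods instead of lines. The helper names are hypothetical, the method bodies are shortened to keep the listing small, and the sketch assumes that method names are unique across classes. It says nothing about how such a tool should really be built.

```python
import ast

BEFORE = """\
class AnotherClass(object):
    def method_a(self): pass

class OneClass(object):
    def method_b(self): pass
    def method_c(self): pass
    def method_d(self): pass
    def method_e(self): pass
"""

AFTER = """\
class AnotherClass(object):
    def method_a(self): pass
    def method_c(self): pass
    def method_d(self): pass

class OneClass(object):
    def method_b(self): pass
    def method_e(self): pass
"""


def method_owners(source):
    """Map each method name to the top-level class that defines it."""
    owners = {}
    for node in ast.parse(source).body:
        if isinstance(node, ast.ClassDef):
            for item in node.body:
                if isinstance(item, ast.FunctionDef):
                    owners[item.name] = node.name
    return owners


def moved_methods(before, after):
    """Return {method: (old_class, new_class)} for methods that changed class."""
    old, new = method_owners(before), method_owners(after)
    return {
        name: (old[name], new[name])
        for name in old
        if name in new and old[name] != new[name]
    }


print(moved_methods(BEFORE, AFTER))
# {'method_c': ('OneClass', 'AnotherClass'), 'method_d': ('OneClass', 'AnotherClass')}
```

This output tells the same story as the commit message, which is exactly the story the line-based diff fails to tell.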

Closing remarks

This concludes the main discussion of the problem. Some final thoughts are below.

First, one might wonder how serious of a problem this really is. On this I can only speak from personal experience: I personally run into some version of this problem approximately once a week. Admittedly, the most regular occurrences are also less confusing than the example at the top of this article, which was contrived specifically to be maximally confusing.

A more regular occurrence is a diff like the one below:

--- a/example.py
+++ b/example.py
@@ -1,3 +1,8 @@
 @decorated_function
+def newfunction():
+    pass  # this is newfunction
+
+
+@decorated_function
 def oldfunction():
     pass  # this is oldfunction

In the above example, the more logical diff would identify the first of the two decorators (@decorated_function) as the newly added one. However, git cannot distinguish between the two identical occurrences of that line, and identifies the second one as the new one.

If we read this diff like a story, git seems to tell us the following: the decorator of the function oldfunction was repurposed for a newly added function newfunction. The function oldfunction was then given a new (identical) decorator to compensate for its loss. A simpler and more correct story would be: a new function newfunction was added with its own decorator.
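That both stories describe the same change can be verified with a few lines of Python. This is a small sketch in which insert_at is a hypothetical helper and the file contents are taken from the diff above.

```python
OLD = [
    "@decorated_function",
    "def oldfunction():",
    "    pass  # this is oldfunction",
]

NEW_FUNCTION = [
    "def newfunction():",
    "    pass  # this is newfunction",
    "",
    "",
]
DECORATOR = ["@decorated_function"]


def insert_at(lines, index, block):
    """Return a copy of lines with block inserted before position index."""
    return lines[:index] + block + lines[index:]


# Git's story: the added lines start after the existing decorator.
as_shown = insert_at(OLD, 1, NEW_FUNCTION + DECORATOR)
# The simpler story: a decorated function is added at the top.
as_meant = insert_at(OLD, 0, DECORATOR + NEW_FUNCTION)

print(as_shown == as_meant)  # True
```

Both insertions add five lines and yield the same file, so nothing in the text itself favours one reading over the other.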

A final example uses lists, in which it would be more logical to show the earlier opening bracket as the new addition. Such an approach would lead to a diff with balanced brackets, as opposed to the close-then-open we get now:

--- a/matrix.py
+++ b/matrix.py
@@ -5,6 +5,11 @@ matrix = [
         3,
     ],
     [
+        7,
+        8,
+        9,
+    ],
+    [
         4,
         5,
         6,

In short: examples of the problem are easily found. However, the most regular occurrences are also generally the least confusing.

As of yet, I have not been able to find other people reporting a similar experience. Whether this is because others don’t experience this phenomenon as a problem, or whether they have resigned themselves to Git’s set of assumptions, I do not know. You are invited to let me know, possibly by presenting any of your own real-life examples in the comment section below.

Heuristics and fundamental solutions

As noted above, git’s ignorance is a fundamentally unsolvable problem: diffs that are based on text alone can never know what the underlying structures of those text files were. This does not mean that finding better heuristics for presenting diffs is entirely impossible.

The option --indent-heuristic, introduced in Git version 2.11, attempts to do just that: it uses indentation as a heuristic for hierarchical structure; this corresponds to the reality in many programming languages and data formats.
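To give a feel for the idea, here is a toy sketch applied to the matrix example above. It is emphatically not Git’s actual algorithm: the scoring rule (add up the indentation on both sides of the block’s upper edge and prefer the lowest total) is invented for this illustration, and all function names are hypothetical. NEW contains only the lines visible in the diff.

```python
NEW = [
    "        3,",
    "    ],",
    "    [",
    "        7,",
    "        8,",
    "        9,",
    "    ],",
    "    [",
    "        4,",
    "        5,",
    "        6,",
]


def indent(line):
    """Number of leading spaces."""
    return len(line) - len(line.lstrip(" "))


def equivalent_placements(new, start, end):
    """All (start, end) ranges of added lines that describe the same change.

    A block can slide up one line when the line above it equals its last
    line, and down one line when its first line equals the line below it.
    """
    placements = [(start, end)]
    s, e = start, end
    while s > 0 and new[s - 1] == new[e - 1]:
        s, e = s - 1, e - 1
        placements.append((s, e))
    s, e = start, end
    while e < len(new) and new[s] == new[e]:
        s, e = s + 1, e + 1
        placements.append((s, e))
    return sorted(placements)


def boundary_score(new, start):
    """Toy score: indentation on both sides of the block's upper edge."""
    above = indent(new[start - 1]) if start > 0 else 0
    return above + indent(new[start])


# The diff above marks NEW[3:8] as added; find the equivalent alternatives.
candidates = equivalent_placements(NEW, 3, 8)
best = min(candidates, key=lambda p: boundary_score(NEW, p[0]))
print(candidates)  # [(1, 6), (2, 7), (3, 8)]
print(best)        # (2, 7)
```

With this toy rule the preferred block is NEW[2:7], which opens with a bracket and closes with its matching bracket, the balanced reading that the list example asks for.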

However, for obvious reasons, a solution using heuristics can only lead to limited results. Thoughts on more fundamental solutions which require no heuristics at all are the subject of the rest of this website.

¹ Git, being the most popular Version Control System, is singled out here; most remarks in this article apply in fact to any modern VCS that ignores the underlying structure of the (text) files under its control.