Lost commit history

It was already evening when a developer contacted me: a patch had gone missing from the master branch, the commit deadbeef.







I was shown proof: the output of two commands. The first one,

 git show deadbeef

showed changes to a file, let's call it Page.php: the canBeEdited method and its usage had been added to it.



And in the output of the second command,

 git log -p Page.php

there was no deadbeef commit. Nor was there a canBeEdited method in the current version of Page.php.



Failing to find a solution quickly, we made another patch to master and deployed the changes, and I decided to come back to the problem with a fresh mind.





Was it done on purpose? Was the file renamed?



I started by asking for help in the chat of the release engineering team. Among other things, they are responsible for hosting repositories and automating Git-related processes. To be honest, they probably could have removed the patch, but they would have done it without leaving a trace.





One of the release engineers suggested running git log with the --follow option: perhaps the file had been renamed, and that was why Git did not show some of the changes.

--follow

Continue listing the history of a file beyond renames (works only for a single file).




The output of

 git log --follow Page.php

did contain deadbeef, but no file had been deleted or renamed. Nor could I see the canBeEdited method being removed anywhere. The --follow option seemed to play some role in this story, but where the changes had gone was still unclear.



Unfortunately, the repository in question is one of the largest we have: between the moment the first patch was introduced and the moment it disappeared, there were 21,000 commits. Luckily, the file in question was edited in only ten of them. I studied them all and found nothing interesting.



We are looking for witnesses! We need a livebear



Stop! Weren't we just looking for deadbeef? Let's think logically: there must be a commit, let's call it livebear, after which deadbeef is no longer shown in the file history. Perhaps this will not solve anything, but it will give us food for thought.



Git has a command for searching through history: git bisect. According to the documentation, it lets you find the commit in which a bug first appeared. In practice, it can be used to find any moment in history, as long as you can tell whether that moment has already arrived. Our bug was the absence of changes in the code, and I could check for it with another command, git grep: all I needed to know was whether Page.php contained a canBeEdited method. A bit of debugging and reading the documentation:
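A session of this kind can be sketched on a throwaway repository. This is only an illustration under assumptions, not the commands actually run: the linear toy history, the commit messages and the variable names are made up, and only Page.php and canBeEdited come from the story. git grep -q exits with 0 when the method is present (bisect treats that as "good") and with 1 when it is missing ("bad").

```shell
#!/bin/sh
set -e

# Throwaway repository (hypothetical; everything here is local to it)
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example"
git config user.email "example@example.com"

printf '<?php\n' > Page.php
git add Page.php
git commit -q -m "init"

# The commit that introduces the method: our known "good" point
printf '<?php\nfunction canBeEdited() {}\n' > Page.php
git commit -q -am "add canBeEdited"
good=$(git rev-parse HEAD)

echo "notes" > README
git add README
git commit -q -m "unrelated change"

# The commit we pretend not to know about: the method disappears here
printf '<?php\n' > Page.php
git commit -q -am "drop canBeEdited"

echo "more notes" >> README
git commit -q -am "another unrelated change"

# Bad = current HEAD (method missing), good = commit where it was added
git bisect start HEAD "$good" >/dev/null
git bisect run sh -c 'git grep -q canBeEdited -- Page.php' >/dev/null

# After a finished run, refs/bisect/bad points at the first bad commit
first_bad=$(git rev-parse refs/bisect/bad)
subject=$(git log -1 --format=%s "$first_bad")
git bisect reset >/dev/null 2>&1

echo "first commit without canBeEdited: $subject"
```

On a history with merges the result depends on which side of each merge bisect happens to test, which may be why repeated runs in a large repository can point at different commits.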



livebear [build]: Merge branch origin/XXX into build_web_yyyy.mm.dd.hh



It looks like a normal merge commit of a task branch with a release branch. But with this commit I managed to reproduce the problem:



$ git checkout -b test livebear^1 2>/dev/null
$ grep -c canBeEdited Page.php
2
$ git merge --no-edit --no-stat livebear^2
Removing …
…
Removing …
Merge made by the 'recursive' strategy.

$ grep -c canBeEdited Page.php
0
$ git log -p Page.php | grep -c canBeEdited
0

      
      





True, I found nothing interesting in livebear itself, and its connection with our problem remained unclear. After some thought, I sent the results of my search to the developer. We agreed that even if we got to the bottom of it, the reproduction scenario would be too complicated and we would not be able to protect ourselves against something like this in the future. So we officially decided to stop the search.



However, my curiosity remained unsatisfied.



Persistence is not a vice, it is far worse



Several more times I came back to the problem, ran git bisect and found more and more commits. They were all suspicious and all merges, but that gave me nothing. I think one commit turned up more often than the others, but I am not sure it was the one that proved to be the culprit in the end.



Of course, I tried other search methods as well. For example, several times I went through the 21,000 commits made during the period in question. It was not very exciting, but I came across an interesting pattern. On each commit I ran the same command:



git grep -c canBeEdited {commit} -- Page.php
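Such a pass over the history can be sketched as a loop over git rev-list. This is a minimal illustration on a throwaway repository, not the script actually used: the toy history and the commit messages (which merely borrow the names deadbeef and changekiller) are assumptions, and merge -s ours stands in for a conflict resolved by keeping one's own version of the file.

```shell
#!/bin/sh
set -e

# Throwaway repository with a hypothetical miniature of the situation
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example"
git config user.email "example@example.com"

printf '<?php\n// page\n' > Page.php
git add Page.php
git commit -q -m "init"

# Task branch edits the same file
git checkout -q -b task
printf '<?php\n// page, task version\n' > Page.php
git commit -q -am "task: edit Page.php"

# Meanwhile master gets the patch with the new method
git checkout -q master
printf '<?php\n// page\nfunction canBeEdited() {}\n' > Page.php
git commit -q -am "deadbeef: add canBeEdited"
deadbeef=$(git rev-parse HEAD)

# master is merged into the task branch, keeping only the task version
git checkout -q task
git merge -q -s ours -m "changekiller: merge master into task" master

# Later the task branch lands in master
git checkout -q master
git merge -q --no-ff -m "merge task into master" task

# Check every commit that is in master but not in deadbeef's history
missing=0
for c in $(git rev-list "$deadbeef"..master); do
    if git grep -q canBeEdited "$c" -- Page.php; then
        state="present"
    else
        state="missing"
        missing=$((missing + 1))
    fi
    echo "$state: $(git log -1 --format=%s "$c")"
done
echo "commits without canBeEdited: $missing"
```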
      
      





It turned out that the "bad" commits, the ones that did not have the required code, were all in the same branch! And a search along this branch quickly led me to a clue:



changekiller Merge branch 'master' into TICKET-XXX_description



This was also a merge of two branches. When I tried to repeat it locally, there was a conflict in the very file we needed, Page.php. Judging by the state of the repository, the developer had kept his own version of the file and discarded the changes from master (and those were exactly the changes that got lost). A lot of time had passed, and the developer did not remember what exactly had happened, but in practice the situation was reproduced by a simple sequence:



git checkout -b test changekiller^1
git merge -s ours changekiller^2

      
      





It remained to understand how a legitimate sequence of actions could have led to such a result. Finding nothing about it in the documentation, I went into the source code.



Is the killer Git?





The documentation said that git log takes several commits as input and shows the user their ancestors, excluding the ancestors of the commits given with a ^ in front of them. So git log A ^B should show the commits that are ancestors of A but not ancestors of B.
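A small sketch of this rule on a throwaway repository. The branch name feature and the commit messages are made up for the illustration. The same range can also be written as master..feature.

```shell
#!/bin/sh
set -e

# Throwaway repository (hypothetical branch names and messages)
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example"
git config user.email "example@example.com"

git commit -q --allow-empty -m "init"

git checkout -q -b feature
git commit -q --allow-empty -m "feature 1"
git commit -q --allow-empty -m "feature 2"

git checkout -q master
git commit -q --allow-empty -m "master 1"

# Commits reachable from feature, minus everything reachable from master
only_feature=$(git log --format=%s feature ^master)

# The two-dot range is shorthand for the same set
same_range=$(git log --format=%s master..feature)

printf '%s\n' "$only_feature"
```

Neither "init" (reachable from both branches) nor "master 1" (not reachable from feature) appears in the output.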



The command code turned out to be quite complex. There were a lot of different optimizations for working with memory, and in general, reading C code never seemed to me a very pleasant experience. The basic logic can be represented with the following pseudocode:



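// Simplified pseudocode of the main git log loop
commit commit;
rev_info revs;

// parse the command line into the traversal state
revs = setup_revisions(revisions_range);
// take commits one at a time and print each of them
while (commit = get_revision(revs)) {
	log_tree_commit(commit);
}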

      
      





Here the get_revision function takes revs, a set of control flags, as input. Each call is supposed to return the next commit to process, in the right order (or nothing once we have reached the end). There is also the setup_revisions function, which fills in the revs structure, and log_tree_commit, which prints the information to the screen.



I had a feeling that I knew where to look for the problem. I had passed a specific file (Page.php) to the command because I was only interested in its changes. This means git log must have some logic for filtering out the "extra" commits. The setup_revisions and get_revision functions are used in many places, so the problem was unlikely to be in them. That left log_tree_commit.



To my unspeakable joy, in this function there really was some code that calculates what changes were made in a particular commit. I thought the general logic should look something like this:



void log_tree_commit(commit) {
	if (tree_has_changed(commit, commit->parents)) {
		log_tree_commit_1(commit);
	}
}

      
      





But the longer I looked at the real code, the more I realized that I was wrong. This function only printed messages. So much for trusting your feelings!



I went back to the setup_revisions and get_revision functions. Their logic was hard to follow: a "fog" of helper functions got in the way, some of which were only needed to handle pointers and memory correctly. It all looked as if the main logic was a simple breadth-first traversal of the commit tree, that is, a fairly standard algorithm:



rev_info setup_revisions(revisions_range, ...) {
	rev_info revs;
	commit commit;

	// put the starting commits of the requested range into the list
	for (commit = get_commit_from_range(revisions_range)) {
		revs->commits = commit_list_append(commit, revs->commits);
	}
	return revs;
}

commit get_revision(rev_info revs) {
	commit c;
	commit l;

	// take the next commit and queue its parents for later processing
	c = get_revision_1(revs);
	for (l = c->parents; l; l = l->next) {
		commit_list_insert(l, &revs->commits);
	}
	return c;
}

commit get_revision_1(rev_info revs) {
	return pop_commit(revs->commits);
}

      
      





A list (revs->commits) is created, and the first (topmost) element of the commit tree is put into it. Then commits are taken one by one from the head of this list, and their parents are added to its end.



Reading the code, I found that hidden in the "fog" of helper functions was the complex commit-filtering logic I had been looking for all along. It lives in the get_revision_1 function:



commit get_revision_1(rev_info revs) {
	commit commit;
	commit = pop_commit(revs->commits);
	try_to_simplify_commit(revs, commit);
	return commit;
}

void try_to_simplify_commit(rev_info revs, commit commit) {
	commit parent;

	for (parent = commit->parents; parent; parent = parent->next) {
		// the file is identical in this parent: keep it as the only parent
		if (rev_compare_tree(revs, parent, commit) == REV_TREE_SAME) {
			parent->next = NULL;
			commit->parents = parent;
			return;
		}
	}
}

      
      





When several branches are merged and the file ends up in the same state as in one of them, there is no point in considering the other branches. If the file is identical in several of them, only the first such branch is kept.



Example. Let zero denote the commits in which the file did not change, one denote those in which it did, and X the merge of the branches.







In this situation, the code will not consider the feature branch - there are no changes in it. If the file was changed there, then in X the changes were "thrown out", which means that their history is not very relevant: this code is no longer there.



Something similar happened with us. Two developers made changes in one file - Page.php, one in the master branch, in the deadbeef commit, and the second in their task branch.



When the second developer merged the changes from master into the task branch, a conflict occurred, and while resolving it he simply threw out the changes from master. Time passed, he finished working on the task, and the task branch was merged into master, thus removing the changes made in the deadbeef commit.



The commit itself remained. But if you run git log with Page.php as an argument, you will not see the deadbeef commit in the output.



Optimization is a thankless job



I rushed to study the rules for submitting changes and bug reports to Git itself. After all, I thought I had found a really serious problem: just think, some commits simply disappear from the output, and this is the default behavior! Fortunately, the rules turned out to be voluminous, the hour was late, and by the next morning my enthusiasm had burned out.



I realized that this optimization greatly speeds up Git on large repositories like ours. It is also documented in man git-rev-list, and the behavior can be turned off very easily, for example with the --full-history option.
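The whole story can be reproduced in miniature on a throwaway repository. The sketch below is an illustration under assumptions rather than the real history: the commit messages only borrow the names deadbeef and changekiller, and merge -s ours stands in for a conflict resolved by keeping one's own version of the file. It compares the default git log output for Page.php with the output under --full-history.

```shell
#!/bin/sh
set -e

# Throwaway repository with a hypothetical miniature of the situation
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example"
git config user.email "example@example.com"

printf '<?php\n// page\n' > Page.php
git add Page.php
git commit -q -m "init"

# Second developer: edits Page.php in a task branch
git checkout -q -b task
printf '<?php\n// page, task version\n' > Page.php
git commit -q -am "task: edit Page.php"

# First developer: adds the method in master
git checkout -q master
printf '<?php\n// page\nfunction canBeEdited() {}\n' > Page.php
git commit -q -am "deadbeef: add canBeEdited"

# master is merged into the task branch, discarding master's version
git checkout -q task
git merge -q -s ours -m "changekiller: merge master into task" master

# The finished task is merged into master
git checkout -q master
git merge -q --no-ff -m "merge task into master" task

# Default history simplification: follow only the parent with the same file
default_hits=$(git log --format=%s -- Page.php | grep -c deadbeef || true)

# --full-history: walk all parents of every merge
full_hits=$(git log --full-history --format=%s -- Page.php | grep -c deadbeef || true)

echo "default log mentions deadbeef: $default_hits time(s)"
echo "--full-history mentions deadbeef: $full_hits time(s)"
```

In the default mode both merges have the same Page.php as their task-side parent, so the walk never visits the master side, where deadbeef lives.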



By the way, how is --follow involved in this story?



In fact, there are many ways to influence how this logic works. As for the --follow flag specifically, the Git code contains a comment written 13 years ago:



Can't prune commits with rename following: the paths change.






PS

I myself have been part of Badoo's release engineering team for several years now, and many in the company believe that we understand Git.





(Comic: xkcd.com/1597)



Because of this, we have to deal with the problems that arise in this system, and some of them seem quite curious to me, such as the one described in this article. Very often the problems are solved quickly: we have already run into many things, and others are well described in the documentation. This case was an exception.



In fact, the documentation did have a History Simplification section, but only for the git rev-list command, and I did not think to look there. Six months ago this section was included in the manual of the git log command, but our case happened a little earlier; I simply had not managed to finish this article in time. (*)



And finally, a small bonus for those who have read to the end: a very small repository in which the problem is reproduced:



$ git clone https://github.com/Md-Cake/lost-changes.git
Cloning into 'lost-changes'...
…

$ git log --oneline test.php
edfd6a4 master: print 3 between 1 and 2
096d4cf init

$ git log --oneline --full-history test.php
afea493 (HEAD -> master, origin/master, origin/HEAD) Merge branch 'changekiller'
57041b8 (origin/changekiller) print 4 between 1 and 2
edfd6a4 master: print 3 between 1 and 2
096d4cf init

      
      





Thanks for your attention!



(*) UPD: It turned out that the History Simplification section had been in the documentation of the git log command for much longer than six months, and I had simply missed it. Thanks to youROCK for pointing this out!


