Git stores diffs, not snapshots. A commit is a changeset, literally a patch that you can export with `git diff`. Ordering of commits is layered on top of that and informs things like merges.
Internally, git objects are snapshots -- the differences are computed on demand. That is a critical difference, because it has implications for how merging can work. For example: https://tahoe-lafs.org/~zooko/badmerge/simple.html
Even when using git am or send-email or whatever, yes, you're sending a patch -- but the way to apply that patch is to turn it into a commit and then cherry pick or rebase or merge or manually fix conflicts or whatever. In darcs and pijul, the model is _always_ that set of patches.
So, I didn't believe you because my internal mental model of a git repo is a DAG of changesets. And indeed, that is a good way to think about it, because almost all of git's operations behave like this. Commit history is almost always presented to the user as a series of diffs.
But, you are correct. Internally, git stores the full contents of files and computes the diffs on the fly.
For others' benefit, if you want to test for yourself, create a new repo, add a file and make a series of commits with changes. Git objects are compressed with DEFLATE (zlib) so gunzip and unzip won't work. I used https://github.com/jezell/zlibber because I was too lazy to write my own quick zlib wrapper. Then doing
for o in .git/objects/*/*; do cat "$o" | inflate ; echo ""; done
lists the contents of all the git objects. Notice that there are no changesets, only full copies of the file you modified at different states.
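If you'd rather not install a separate tool, Python's standard zlib module can do the inflating. What follows is a minimal sketch, not git's actual code: it assumes the classic SHA-1 loose-object layout (a `<type> <size>` header, a NUL byte, then the full content), and the helper names `encode_loose_object` and `decode_loose_object` are made up for illustration. It builds two versions of a file in memory, so it runs without a repo.

```python
import hashlib
import zlib


def encode_loose_object(kind: str, payload: bytes) -> tuple[str, bytes]:
    """Build (object id, compressed bytes) for a loose object.

    Assumed layout: "<kind> <size>", a NUL byte, then the payload.
    """
    store = f"{kind} {len(payload)}".encode() + b"\x00" + payload
    return hashlib.sha1(store).hexdigest(), zlib.compress(store)


def decode_loose_object(compressed: bytes) -> tuple[str, bytes]:
    """Inflate a loose object and split the header from the payload."""
    store = zlib.decompress(compressed)
    header, _, payload = store.partition(b"\x00")  # split on the first NUL only
    kind, size = header.decode().split(" ")
    if int(size) != len(payload):
        raise ValueError("size in header does not match payload length")
    return kind, payload


# Two states of the same (hypothetical) file, as two commits would see it.
version_1 = b"hello\n"
version_2 = b"hello\nworld\n"

objects = dict(encode_loose_object("blob", v) for v in (version_1, version_2))

for object_id, compressed in objects.items():
    kind, payload = decode_loose_object(compressed)
    print(object_id[:8], kind, payload)
```

Both objects hold the complete file; neither is a diff against the other. To try it on a real repo, read one of the files under .git/objects/ in binary mode and pass the bytes to `decode_loose_object`.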
This was surprising to me, since I had a very different model mentally. I still think the DAG of diffs is the better model mentally, but it is worth understanding that this is not what git is actually doing under the hood. It explains issues that arise doing rebases, cherry-picks, etc.
I now also understand the motivation behind Pijul. If I understand correctly, Pijul does use a collection of changesets as the underlying model. Like you say, that can be a critical difference.
That is the most basic storage format in git, but it also has packfiles, where it uses deltas. Those deltas are not necessarily the diffs between a commit and its parent, though!
I thought git stored diffs too. But I don't understand what difference it makes, isn't the commit storage format an implementation detail? If you want to get to B from A, then I can store B or store B-A. When I have B and want to show a diff, I can calculate B-A. When I have B-A and want to show a diff, it's a no-op. When I have B and want to work, it's a no-op, but when I have B-A and want to work, I have to populate the working tree with A + B-A. It doesn't feel like there's a fundamental difference in the model either way.
But if you don't have A in the first place, then there's a real problem: how to take B-A and reconstruct B without knowing what A is. You have to find A, and there may be multiple acceptable A's. It seems like the fundamental difference here would be not having a strict "parent" for any given patch. I can see why that would make some workflows a little nicer, not being forced to rebase, but I don't see any massive advantages -- as a user what does this really buy me? Does it enable some things that are impossible with git? Or does it mainly make some advanced git workflows easier?
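To make the B versus B-A point concrete, here is a rough sketch using Python's difflib. The `make_delta` and `apply_delta` helpers are hypothetical names, and the delta format (line ranges of A plus replacement lines) is invented for illustration; it is not what git, darcs, or Pijul store. It shows that the two representations are interconvertible as long as you have the right A, and that B-A applied to some other A gives you something else.

```python
from difflib import SequenceMatcher


def make_delta(a: list[str], b: list[str]) -> list[tuple[int, int, list[str]]]:
    """Compute "B - A": which line ranges of A to replace, and with what."""
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (i1, i2, b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def apply_delta(a: list[str], delta: list[tuple[int, int, list[str]]]) -> list[str]:
    """Compute "A + (B - A)". Only rebuilds B if `a` is the A the delta came from."""
    out, pos = [], 0
    for start, end, new_lines in delta:
        out.extend(a[pos:start])  # unchanged lines before this hunk
        out.extend(new_lines)     # lines that replace a[start:end]
        pos = end
    out.extend(a[pos:])           # unchanged tail
    return out


a = ["one", "two", "three"]
b = ["one", "2", "three", "four"]

delta = make_delta(a, b)               # storing B-A instead of B
assert apply_delta(a, delta) == b      # with the right A, B comes back exactly

other_a = ["zero", "one", "two", "three"]
print(apply_delta(other_a, delta))     # with a different A, the result is not B
```

The line numbers baked into the delta are what tie it to one particular A, which is the "strict parent" issue in miniature.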
The difference is more than just the storage format. You could compute snapshots on demand and just store the diffs (that'd be impossibly slow, but you could), but you still would have git's DAG instead of darcs' theory of patches, with all of the consequences of that. (I'd explain them here, but I've already explained it in sibling comments on this thread).
Maybe lvh or one of the others who has more experience with darcs or Pijul will chime in. They've probably spent more time thinking about this.
One difference is that by storing the patches only you can understand more clearly what the intended change was. When you store the whole file it is easy to compute the difference between A and B, but may be impossible to compute the correct differences between A, B, and C. By storing the whole file you now have to consider all the possible differences between them, not just the ones introduced by the commits you are trying to merge.
I would have to play around with it, but I know there are scenarios involving rebase, revert, and cherry-picking commits that can cause trouble in git, and I now understand that the trouble comes from git storing contents, not diffs.
One that I've run into regularly is cherry-picking commits from a dev branch into a master branch to hot-fix bugs directly in a prod release instead of waiting until dev gets merged as part of our regular process. If I have commit A on dev and cherry-pick it to master, it creates a totally new commit A-1 that becomes part of the history of master. We lose the fact that A and A-1 represent the exact same changeset. Depending on what the changes are, and what further changes happen on dev afterwards, this can cause failed merges requiring manual resolution when dev does finally get merged into master.
I imagine that would not be a problem for Pijul.
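The reason A and A-1 end up as unrelated commits can be sketched in a few lines. A commit id is a hash over the commit's tree, its parents, and its metadata, so the same change recorded on top of a different parent gets a different id. The sketch below assumes the SHA-1 object format and a simplified commit body; the tree and parent ids, the author string, and the helper name `commit_id` are all made-up placeholders.

```python
import hashlib


def commit_id(tree: str, parents: list[str], message: str,
              who: str = "Dev <dev@example.com> 0 +0000") -> str:
    """Hash a simplified commit object: tree, parents, author, committer, message."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines += [f"author {who}", f"committer {who}", "", message, ""]
    body = "\n".join(lines).encode()
    store = f"commit {len(body)}".encode() + b"\x00" + body
    return hashlib.sha1(store).hexdigest()


# Placeholder ids standing in for real trees and parent commits.
dev_parent = "a" * 40
master_parent = "b" * 40
tree_on_dev = "1" * 40
tree_on_master = "2" * 40

a = commit_id(tree_on_dev, [dev_parent], "Fix the bug")          # commit A on dev
a_1 = commit_id(tree_on_master, [master_parent], "Fix the bug")  # cherry-picked A-1

print(a)
print(a_1)
assert a != a_1  # same message, same change, unrelated ids

# Even with an identical tree, a different parent alone changes the id.
assert commit_id(tree_on_dev, [master_parent], "Fix the bug") != a
```

Nothing in either object records that the two commits carry the same patch. `git patch-id` can recompute that after the fact from the diff, but the history itself doesn't carry it.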
I think you've hit the nail on the head :)
Those reasons make sense. Would it be fair to summarize all that as "better merging with fewer merge conflicts"?
Git certainly has some room to improve in the merge conflict department. I looked at the bad merge example you posted -- I suspect I've hit that before. It's rare, but yeah it's there.
I also frequently notice that git complains about merge conflicts, while the custom diff tool I use to resolve them says there's no conflict, and I don't actually have to do anything. Good reason to use a custom merge tool with git.
But, given all this, is this really all an outcome of patches vs snapshots, or is this just git's merge algorithm being suboptimal? Certainly git could selectively ignore the DAG when merging, couldn't it? Even after reading the other comments here, it still seems to me like git has more information when merging than the "patch-based" workflow of darcs & Pijul.
It seems to me like there's a language problem with trying to draw a distinction between patches and snapshots. Git is still storing and transferring patches at the tree level, even if it's not happening at the file level. Git does not store a commit as a zip snapshot of the entire tree; a commit only adds new objects for the files that changed (plus the trees that contain them), and everything else is shared with earlier commits. It would be fair (but not standard or common) to call the overlay of changed files a "patch" or a "diff". People do still use git format-patch, and email git "patches" to each other. So it's inherently confusing & problematic to talk about git and say that it doesn't use patches.
What does make sense to me is the distinction of having a strict DAG vs not having one -- is that actually what people mean when they talk about snapshots vs patches? Am I tripping on it because I'm being too pedantic about what a "patch" is?
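One way to see what git does at the tree level: each commit names a complete tree, but objects are keyed by the hash of their content, so an unchanged file maps to the same object in both commits and is stored once. A minimal sketch, assuming SHA-1 blob ids and flattening the tree into a plain path-to-id dict (real trees are nested objects); the file names and the `snapshot` helper are made up.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Content-addressed id of a file, using the assumed SHA-1 blob format."""
    store = f"blob {len(content)}".encode() + b"\x00" + content
    return hashlib.sha1(store).hexdigest()


def snapshot(files: dict[str, bytes]) -> dict[str, str]:
    """Flattened stand-in for a tree: every path mapped to its blob id."""
    return {path: blob_id(content) for path, content in files.items()}


commit_1 = snapshot({"README": b"docs\n", "main.c": b"int main(void) { return 0; }\n"})
commit_2 = snapshot({"README": b"docs\n", "main.c": b"int main(void) { return 1; }\n"})

# Both snapshots list every file...
assert commit_1.keys() == commit_2.keys()

# ...but only the changed file needs a new object in the store.
new_objects = set(commit_2.values()) - set(commit_1.values())
changed = [path for path in commit_2 if commit_1[path] != commit_2[path]]
print(changed, len(new_objects))
```

So both descriptions are partly right: the commit is a full snapshot by reference, and only the changed content is new on disk. The list of what changed is still computed by comparing two snapshots, not read out of the commit.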
Right, the raw representations are convertible (at some point a patch-oriented darcs/pijul has to build a snapshot so that it can build a working tree; at various times you want to see the diffs in git or format a patch file to email). It does have more to do with the representation of change context both between patches (strict DAG versus algebraic sets with looser change context models), and even to some extent within a patch (in a classic diff the tools use hardcoded line numbers; in something like darcs/pijul even the line numbers of a patch aren't necessarily taken as a given and are a part of the context of the change).
A video referred to by another user on this thread puts it very clearly: https://news.ycombinator.com/item?id=13645102
What other operations behave as if it's at its core diffs and not snapshots?
merging, rebasing, committing... all operate on refs. You might think you're transplanting changes (and you are), but the inputs are refs, and the outputs are refs. As you mentioned, refs are unambiguously snapshots.
Git does store snapshots, not diffs. Look into the files in .git/objects/* ;)
Not sure why you're getting downvoted, you're describing git correctly.
If anyone is questioning this, just play with `git rebase -i [some old changeset id]` and you'll see that it's just an ordering of patches.
git rebase -i does not demonstrate that git's internal model is an ordering of patches. The model involves a DAG of snapshots (objects); rebase only demonstrates that git is sometimes willing to move the contents of one of those snapshots around to create a new snapshot. That is very different from a patches-always model, as I have illustrated in a sibling comment with an example link.
rebase turns the snapshots temporarily into a stack of patches, lets you play with them, then turns them back into snapshots.
In fact, this is why rebase is an out-of-band tool that has odd effects on shared history - specifically because it's inverting git's model into something more like pijul's, and therefore isn't really native-to-git.