Notes on Subproject Support
===========================
Junio C Hamano

Scenario
--------

The examples in the following discussion show how this proposal
plans to help in a scenario like this:

. A project to build an embedded Linux appliance "gadget" is
  maintained with git.

. The project uses linux-2.6 kernel as its subcomponent.  It
  starts from a particular version of the mainline kernel, but
  adds its own code and build infrastructure to fit the
  appliance's needs.

. The working tree of the project is laid out this way:
+
------------
 Makefile       - Builds the whole thing.
 linux-2.6/     - The kernel, perhaps modified for the project.
 appliance/     - Applications that run on the appliance, and
                  other bits.
------------

. The project is willing to maintain its own changes outside the
  tree of the Linux kernel project, but would want to be able to
  feed the changes upstream, and incorporate upstream changes into
  its own tree, taking advantage of the fact that both itself and
  the Linux kernel project are version controlled with git.

. To make the story a bit more interesting, later in the history
  of development, `linux-2.6/` and `appliance/` directories will
  be renamed to `kernel/` and `gadget/`.

The idea here is to:

. Keep `linux-2.6/` part as an independent project.  The work by
  the project on the kernel part can be naturally exchanged with
  the other kernel developers this way.  Specifically, a tree
  object contained in commit objects belonging to this sub-project
  does *not* have `linux-2.6/` directory at the top.

. Keep the `appliance/` part as another independent project.
  Applications are supposed to be more or less independent from
  the kernel version, but some other bits might be tied to a
  specific kernel version.  Again, a tree object contained in
  commit objects belonging to this sub-project does *not* have
  `appliance/` directory at the top.

. Have another project that combines the whole thing together,
  so that the project can keep track of which versions of the
  parts are built together.  The Makefile is illustrated above,
  but there might be other files and directories.

We will call the project that binds things together the
'toplevel project'.  Other projects that hold `linux-2.6/` part
and `appliance/` part are called 'subprojects'.


Setting up
----------

Let's say we have been working on the appliance software,
independently version controlled with git.  Also the kernel part
has been version controlled separately, like this:

------------
$ ls -dF current/*/.git current/*
current/Makefile    current/appliance/.git/  current/linux-2.6/.git/
current/appliance/  current/linux-2.6/
------------

Now we would want to get a combined project.  First we would
clone from these repositories (which is not strictly needed --
we could use `$GIT_ALTERNATE_OBJECT_DIRECTORIES` instead):

------------
$ mkdir combined && cd combined
$ cp ../current/Makefile .
$ git init-db
$ mkdir -p .git/refs/subs/{kernel,gadget}/{heads,tags}
$ kernel_commit=$(git clone-pack ../current/linux-2.6/ master |
	{ read commit junk; echo "$commit"; })
$ gadget_commit=$(git clone-pack ../current/appliance/ master |
	{ read commit junk; echo "$commit"; })
------------

We will introduce a new command to set up a combined project:

------------
$ git bind-projects \
	$kernel_commit linux-2.6/ \
	$gadget_commit appliance/
------------

This would probably do an equivalent of:

------------
$ rm -f "$GIT_DIR/index"
$ git read-tree --prefix=linux-2.6/ $kernel_commit
$ git read-tree --prefix=appliance/ $gadget_commit
$ git update-index --bind $kernel_commit linux-2.6/
$ git update-index --bind $gadget_commit appliance/
------------
[NOTE]
============
Earlier outlines sent to the git mailing list talked
about `$GIT_DIR/bind` to record which subprojects are bound to
which subtrees in the current working tree and index.  This
proposal instead records that information in the index file
with the `update-index --bind` command.

Also note that in this round of the proposal, there are no separate
branches that keep track of heads of subprojects.

`update-index --bind` is not implemented on the core side yet;
it would involve backward incompatible changes to the index
format.
============
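
The `read-tree` half of this can already be tried with the tools we
have today.  What follows is only a sketch under a few assumptions:
the two tiny repositories and their `README` files are made-up
stand-ins for the real subprojects, `git fetch` is used in place of
`clone-pack` to bring the objects in, and the proposed
`update-index \--bind` step is left out because it does not exist.

```shell
#!/bin/sh
# Sketch only: graft two independent histories into one index using
# `git read-tree --prefix`.  Repository names and contents are
# hypothetical; the proposed `update-index --bind` is NOT reproduced.
set -e
GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
export GIT_AUTHOR_NAME GIT_AUTHOR_EMAIL GIT_COMMITTER_NAME GIT_COMMITTER_EMAIL

work=$(mktemp -d)
cd "$work"

# Two stand-in subproject repositories, each with a single commit.
for sub in linux-2.6 appliance
do
	mkdir -p "current/$sub"
	git init -q "current/$sub"
	echo "$sub source" >"current/$sub/README"
	git -C "current/$sub" add README
	git -C "current/$sub" commit -q -m "initial $sub"
done

git init -q combined
cd combined

# Fetch each subproject head so that its objects are available here.
git fetch -q ../current/linux-2.6 HEAD
kernel_commit=$(git rev-parse FETCH_HEAD)
git fetch -q ../current/appliance HEAD
gadget_commit=$(git rev-parse FETCH_HEAD)

# Read each subproject tree into the index under its own prefix.
git read-tree --prefix=linux-2.6/ "$kernel_commit"
git read-tree --prefix=appliance/ "$gadget_commit"

# The index now has entries under both prefixes.
git ls-files
```

The index should then list paths under both `appliance/` and
`linux-2.6/`, even though neither subproject's own tree has such a
directory at its top.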

Let's not forget to add the `Makefile`, and check the whole
thing out from the index file.

------------
$ git add Makefile
$ git checkout-index -f -u -q -a
------------

Now our directory should be identical with the `current`
directory.  After making sure of that, we should be able to
commit the whole thing:

------------
$ diff -x .git -r ../current ../combined
$ git commit -m 'Initial toplevel project commit'
------------

Which should create a new commit object that records what is in
the index file as its tree, with `bind` lines to record which
subproject commit objects are bound at what subdirectory, and
updates the `$GIT_DIR/refs/heads/master`.  Such a commit object
might look like this:

------------
tree 04803b09c300c8325258ccf2744115acc4c57067
bind 5b2bcc7b2d546c636f79490655b3347acc91d17f linux-2.6/
bind 0bdd79af62e8621359af08f0afca0ce977348ac7 appliance/
author Junio C Hamano <junio@kernel.org> 1137965565 -0800
committer Junio C Hamano <junio@kernel.org> 1137965565 -0800

Initial toplevel project commit
------------

Notice that `Makefile` at the top is part of the toplevel
project in this example, but it is not necessary.  We could
instead have the appliance subproject include this file.  In
such a setup, the appliance subproject would have had `Makefile`
and `appliance/` directory at the toplevel.  The `bind` line for
that project would have said "the rest is bound at `/`" and
`write-tree \--exclude=linux-2.6/` would have been used to write
the tree for that subproject out of the combined index.

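Tools that need the binding information would read it from the
header part of the commit object.  A minimal sketch of that, assuming
the `bind <commit> <path>` header layout shown in the example commit
above (the function names `list_binds` and `bound_at` are made up
for illustration):

```shell
#!/bin/sh
# Sketch only: extract the proposed `bind` header lines from the text
# of a commit object.  The header layout is the one proposed in this
# note; it is not understood by any existing git command.

# The example commit object from this section, as a here-document.
commit_object () {
	cat <<'EOF'
tree 04803b09c300c8325258ccf2744115acc4c57067
bind 5b2bcc7b2d546c636f79490655b3347acc91d17f linux-2.6/
bind 0bdd79af62e8621359af08f0afca0ce977348ac7 appliance/
author Junio C Hamano <junio@kernel.org> 1137965565 -0800
committer Junio C Hamano <junio@kernel.org> 1137965565 -0800

Initial toplevel project commit
EOF
}

# Print "<commit> <path>" for every bind line in the header.
# The header ends at the first blank line, so stop reading there
# and never look at the commit message.
list_binds () {
	sed -n -e '/^$/q' -e 's/^bind \([0-9a-f]\{40\}\) \(.*\)$/\1 \2/p'
}

# Print the commit bound at the path given as $1, if any.
bound_at () {
	list_binds | awk -v path="$1" '$2 == path { print $1 }'
}

commit_object | list_binds
commit_object | bound_at linux-2.6/
```

With a real repository, `git cat-file commit HEAD` would take the
place of the `commit_object` helper.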

Making further commits
----------------------

The easiest case is when you updated the Makefile without
changing anything in the subprojects.  In such a case, we just
need to create a new commit object that records the new tree
with the current `HEAD` as its parent, and with the same set of
`bind` lines.

When we have changes to the subproject part, we would make a
separate commit to the subproject part and then record the whole
thing by making a commit to the toplevel project.  The user
interaction might go this way:

------------
$ git commit
error: you have changes to the subproject bound at linux-2.6/.
$ git commit --subproject linux-2.6/
$ git commit
------------

With the new `\--subproject` option, the directory structure
rooted at `linux-2.6/` is written out as a tree, and a new
commit object is created that records that tree, with the commit
currently bound to that portion of the tree (`5b2bcc7b` in the
above example) as its parent.  The final `git commit` would then
record the whole tree with an updated `bind` line for the
`linux-2.6/` part.

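The tree-writing and commit-creating half of `git commit
\--subproject` can be approximated with plumbing that exists today.
This is only a sketch: the file names are made up, the first
`commit-tree` call stands in for the subproject commit that would
already be bound, and updating the `bind` information in the index is
not attempted.

```shell
#!/bin/sh
# Sketch only: write the tree rooted at linux-2.6/ out of a combined
# index and commit it on top of the previously bound commit.
set -e
GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
export GIT_AUTHOR_NAME GIT_AUTHOR_EMAIL GIT_COMMITTER_NAME GIT_COMMITTER_EMAIL

repo=$(mktemp -d)
cd "$repo"
git init -q .

# A combined working tree with one toplevel file and one subproject.
mkdir linux-2.6
echo v1 >linux-2.6/version
echo 'all:' >Makefile
git add Makefile linux-2.6/version

# Stand-in for the subproject commit that is currently bound.
old_tree=$(git write-tree --prefix=linux-2.6/)
bound=$(echo 'kernel v1' | git commit-tree "$old_tree")

# Change something inside the subproject part and stage it.
echo v2 >linux-2.6/version
git add linux-2.6/version

# Write only the subtree, and commit it with the bound commit as parent.
new_tree=$(git write-tree --prefix=linux-2.6/)
new_bound=$(echo 'kernel v2' | git commit-tree "$new_tree" -p "$bound")

# The subproject tree has no linux-2.6/ directory at its top.
git ls-tree --name-only "$new_bound"
```

Because the tree is written with the prefix stripped, the resulting
commit can be exchanged with people who work on the kernel alone.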

Checking out
------------

When cloning such a toplevel project, `git clone` without the `-n`
option would check out the working tree.  This is done by
reading the tree object recorded in the commit object (which
records the whole thing), and adding the information from the
`bind` lines to the index file.

------------
$ cd ..
$ git clone -n combined cloned ;# clone the one we created earlier
$ cd cloned
$ git checkout
------------

This round of proposal does not maintain separate branch heads
for subprojects.  The bound commits and their subdirectories
are recorded in the index file from the commit object, so there
is no need to do anything other than updating the index and the
working tree.


Switching branches
------------------

Along with the traditional two-way merge by `read-tree -m -u`,
we would need to look at:

. `bind` lines in the current `HEAD` commit.

. `bind` lines in the commit we are switching to.

. subproject binding information in the index file.

to make sure we do sensible things.

Just as, until very recently, we did not allow switching
branches when a two-way merge would lose local changes, we can
start by refusing to switch branches when the subprojects bound
in the index do not match what is recorded in the `HEAD` commit.

Because in this round of the proposal we use neither the
`$GIT_DIR/bind` file nor separate branches to keep track of
heads of the subprojects, nothing other than the working tree
and the index file needs to be updated when switching branches.

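The initial "refuse when they do not match" rule amounts to comparing
two sets of bindings.  A minimal sketch, assuming both the `HEAD`
commit and the index can be dumped as `<commit> <path>` lines into
files (that dump format and the function names are made up for
illustration):

```shell
#!/bin/sh
# Sketch only: decide whether a branch switch may proceed by comparing
# the bindings recorded in HEAD with those recorded in the index.

# Succeed when both files list the same bindings, in any order.
binds_match () {
	[ "$(sort "$1")" = "$(sort "$2")" ]
}

may_switch () {
	if binds_match "$1" "$2"
	then
		echo "ok"
	else
		echo "refuse"
	fi
}

head_binds=$(mktemp)
index_binds=$(mktemp)

cat >"$head_binds" <<'EOF'
5b2bcc7b2d546c636f79490655b3347acc91d17f linux-2.6/
0bdd79af62e8621359af08f0afca0ce977348ac7 appliance/
EOF

# Same bindings, listed in a different order.
cat >"$index_binds" <<'EOF'
0bdd79af62e8621359af08f0afca0ce977348ac7 appliance/
5b2bcc7b2d546c636f79490655b3347acc91d17f linux-2.6/
EOF

may_switch "$head_binds" "$index_binds"
```

A subproject commit made with `git commit \--subproject` but not yet
recorded in the toplevel project would change one line of the index
side, and the switch would be refused.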

Merging
-------

Merging two branches of the toplevel project can use the
traditional merging mechanism mostly unchanged.  The merge base
computation can be done using the `parent` ancestry information
taken from the two toplevel project branch heads being merged,
and merging of the whole tree can be done with a three-way merge
of the whole tree using the merge base and two head commits.
For reasons described later, we would not merge the subproject
parts of the trees during this step, though.

When the two branch heads use different versions of a subproject,
things get a bit tricky.  First, let's forget for a moment about
the case where they bind the same project at different locations.
We would refuse the merge unless both heads have the same number
of `bind` lines, binding something at the same subdirectories.

------------
$ git merge 'Merge in a side branch' HEAD side
error: the merged heads have subprojects bound at different places.
 ours:
	linux-2.6/
	appliance/
 theirs:
	kernel/
	gadget/
	manual/
------------

Such renaming can be handled by first moving the bind points in
our branch, and redoing the merge (this is a rare operation
anyway).  It might go like this:

------------
$ git reset
$ git update-index --unbind linux-2.6/
$ git update-index --unbind appliance/
$ git update-index --bind $kernel_commit kernel/
$ git update-index --bind $gadget_commit gadget/
$ git commit -m 'Prepare for merge with side branch'
$ git merge 'Merge in a side branch' HEAD side
error: the merged heads have subprojects bound at different places.
 ours:
	kernel/
	gadget/
 theirs:
	kernel/
	gadget/
	manual/
------------
[NOTE]
============
Again, `update-index --unbind` is not implemented yet
on the core side.
============

Their branch added another subproject, so this did not work (or
it could be the other way around -- we might have been the one
with `manual/` subproject while they didn't).  This suggests
that we may want an option to `git merge` to allow taking a
union of subprojects.  Again, this is a rare operation, and
always taking a union would have created a toplevel project that
had both `kernel/` and `linux-2.6/` bound to the same Linux
kernel project of possibly different vintages, so it would be
prudent to require the set of bound subprojects to exactly match
and give the user an option to take a union.

------------
$ git merge --union-subprojects 'Merge in a side branch' HEAD side
error: the subproject at 'kernel/' needs to be merged first.
------------

Here, the version of the Linux kernel project in the `side`
branch was different from what our branch had on our `bind`
line.  On what kind of difference should we give this error?
Initially, I think we could require that one be a fast-forward of
the other (ours might be ahead of theirs, or the other way
around), and take the descendant.

Or we could do an independent merge of subprojects heads, using
the `parent` ancestry of the bound subproject heads to find
their merge-base and doing a three-way merge.  This would leave
the merge result in the subproject part of the working tree and
the index.

[NOTE]
This is the reason we did not do the whole-tree three way merge
earlier.  The subproject commit bound to the merge base commit
used for the toplevel project may not be the merge base between
the subproject commits bound to the two toplevel project
commits.

So let's first deal with the case of merging only a subproject
part into our tree.

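The fast-forward rule suggested earlier in this section can be
checked with the existing `git merge-base \--is-ancestor`.  The
following is only a sketch: the two empty commits stand in for the
subproject commits bound on each side, and `pick_descendant` is a
made-up name.

```shell
#!/bin/sh
# Sketch only: given two bound subproject commits, take the descendant
# when one is a fast-forward of the other, and fail otherwise.
set -e
GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
export GIT_AUTHOR_NAME GIT_AUTHOR_EMAIL GIT_COMMITTER_NAME GIT_COMMITTER_EMAIL

repo=$(mktemp -d)
cd "$repo"
git init -q .

# Stand-ins for the subproject commits bound by the two toplevel heads;
# here "theirs" is one commit ahead of "ours".
git commit -q --allow-empty -m 'kernel work 1'
ours=$(git rev-parse HEAD)
git commit -q --allow-empty -m 'kernel work 2'
theirs=$(git rev-parse HEAD)

# Print the descendant of the two commits, or return non-zero when
# neither is an ancestor of the other (a real merge would be needed).
pick_descendant () {
	if git merge-base --is-ancestor "$1" "$2"
	then
		echo "$2"
	elif git merge-base --is-ancestor "$2" "$1"
	then
		echo "$1"
	else
		return 1
	fi
}

picked=$(pick_descendant "$ours" "$theirs")
echo "$picked"
```

When `pick_descendant` fails, the histories have diverged and the
"needs to be merged first" error would be the appropriate response.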

Merging subprojects
-------------------

An operation of more practical importance is to be able to merge
in changes done outside to the projects bound to our toplevel
project.

------------
$ git pull --subproject=kernel/ git://git.kernel.org/.../linux-2.6/
------------

might do:

. fetch the current `HEAD` commit from Linus.
. find the subproject commit bound at the `kernel/` subtree.
. perform the usual three-way merge of these two commits, in
  the `kernel/` part of the working tree.

After that, `git commit` with the `\--subproject` option would be
needed to make a commit.

[NOTE]
This suggests that we would need to have something similar to
`MERGE_HEAD` for merging the subproject part.  In the case of
merging two toplevel project commits, we probably can read the
`bind` lines from the `MERGE_HEAD` commit and either our `HEAD`
commit or our index file.  Further, we probably would require
that the latter two must match, just as we currently require that
the index file match our `HEAD` commit before `git merge`.

Just like the current `pull = fetch + merge` semantics, the
subproject aware version `git pull \--subproject=frotz/` would be
a `git fetch \--subproject=frotz/` followed by a `git merge
\--subproject=frotz/`.  So the above would be:

. Fetch the head.
+
------------
$ git fetch --subproject=kernel/ git://git.kernel.org/.../linux-2.6/
------------
+
which would fetch the commit chain from the remote repository, and
write something like this to `FETCH_HEAD`:
+
------------
3ee68c4...\tfor-merge-into kernel/\tbranch 'master' of git://.../linux-2.6
------------

. Run `git merge`.
+
------------
$ git merge --subproject=kernel/ \
    'Merge git://.../linux-2.6 into kernel/' HEAD 3ee68c4...
------------

. In case it does not cleanly automerge, `git merge` would write
the necessary information for a later `git commit` to use in
`MERGE_HEAD`.  It may look like this:
+
------------
3ee68c4af3fd7228c1be63254b9f884614f9ebb2	kernel/
------------
+
Similarly, `MERGE_MSG` file will hold the merge message.

With this, a later invocation of `git commit` to record the
result of hand resolving would be able to notice that:

. We should be first resolving `kernel/` subproject, not the
  whole thing.
. The remote `HEAD` is `3ee68c4\...` commit.
. The merge message is `Merge git://\.../linux-2.6 into kernel/`.

and would make a merge commit, and register that resulting
commit in the index file using `update-index \--bind` instead of
updating *any* branch head.

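How a later `git commit` might tell a subproject merge from a
toplevel one can be sketched as follows.  This assumes the
tab-separated `MERGE_HEAD` layout suggested above, with the bound
subdirectory as an optional second field; the function name
`describe_merge` and the temporary file are made up for illustration.

```shell
#!/bin/sh
# Sketch only: read the proposed MERGE_HEAD layout, "<commit>" followed
# by an optional tab and the subdirectory being merged.

tab=$(printf '\t')

# Print what kind of merge the file named by $1 describes.
describe_merge () {
	# Split the first line on tabs only; a missing second field
	# leaves $path empty.
	IFS=$tab read -r commit path <"$1" || return 1
	if [ -n "$path" ]
	then
		echo "subproject $path $commit"
	else
		echo "toplevel $commit"
	fi
}

# A stand-in for $GIT_DIR/MERGE_HEAD left by a conflicted
# subproject merge.
merge_head=$(mktemp)
printf '%s\t%s\n' 3ee68c4af3fd7228c1be63254b9f884614f9ebb2 kernel/ >"$merge_head"

describe_merge "$merge_head"
```

A `MERGE_HEAD` written by a merge of two toplevel commits, which has
no second field, would be reported as a toplevel merge.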

Management of Subprojects
-------------------------

While the above as a mechanism would support version controlling
of subprojects as a part of *one* larger toplevel project, it
probably is worth pointing out that having a separate repository
to manage the subproject independently would be a good idea.
The same subproject can be incorporated into more than one
toplevel project, and after all, a subproject should be
something that can stand on its own.  In our example scenario,
the `kernel/` project is used as a subproject for the "gadget"
product, but at the same time, the organization that runs the
"gadget" project may use Linux on their development machines,
and have their own kernel hackers, not necessarily related to
the use of the kernel in the "gadget" product.

What this suggests is that not only do we need to be able to pull
the kernel development history *into* the subproject of the
"gadget" project, but we also need to be able to push the
development history of the kernel part alone *out* *of* the
"gadget" project to another repository that deals only with the
kernel part.

It might go this way.  First the setup:

------------
$ git clone git://git.kernel.org/.../linux-2.6 Linux
$ ls -dF *
cloned/      combined/    current/     Linux/
------------

That is, in addition to the `combined/` which we have been using
to develop the "gadget" product in, we now have a repository for
the kernel, cloned from Linus.  In the previous section, we have
outlined how we update the kernel subproject part of `combined/`
repository from the `kernel.org` repository.  The same procedure
would work for pulling from `Linux/` repository here.

We now go the other way, and propagate the kernel work done
in the "gadget" project repository `combined/` back to `Linux/`.
We might do this at the lowest level:

------------
$ cd combined
$ git cat-file commit HEAD |
  sed -ne 's|^bind \([0-9a-f]*\) kernel/$|\1|p' >.git/refs/heads/linux26
$ git push ../Linux linux26:master
------------

Or, more realistically, since the `Linux` project might already
have its own commits on its `master`:

------------
$ cd Linux
$ git pull ../combined linux26
------------

Either way we would need an easy way to maintain the `linux26`
branch in the above example, and that will have to be part of
the wrapper scripts like `git commit` (more likely, that would
be a job for `git commit \--subproject`) for usability's
sake; in other words, the `cat-file commit` piped to `sed` above
is not something the end user would do, but something that is
done by the wrapper scripts.

Hopefully the people who work in `Linux/` repository would run
`format-patch` and feed their changes back to the kernel
community.