From: Junio C Hamano <junkio@cox.net>
Subject: Re: Make "git clone" less of a deathly quiet experience
Date: Sun, 12 Feb 2006 19:36:41 -0800
Message-ID: <7v4q3453qu.fsf@assigned-by-dhcp.cox.net>
References: <Pine.LNX.4.64.0602102018250.3691@g5.osdl.org>
	<7vwtg2o37c.fsf@assigned-by-dhcp.cox.net>
	<Pine.LNX.4.64.0602110943170.3691@g5.osdl.org>
	<1139685031.4183.31.camel@evo.keithp.com> <43EEAEF3.7040202@op5.se>
	<1139717510.4183.34.camel@evo.keithp.com>
	<46a038f90602121806jfcaac41tb98b8b4cd4c07c23@mail.gmail.com>
Content-Type: text/plain; charset=us-ascii
Cc: Keith Packard <keithp@keithp.com>, Andreas Ericsson <ae@op5.se>,
	Linus Torvalds <torvalds@osdl.org>,
	Git Mailing List <git@vger.kernel.org>,
	Petr Baudis <pasky@suse.cz>
Return-path: <git-owner@vger.kernel.org>
In-Reply-To: <46a038f90602121806jfcaac41tb98b8b4cd4c07c23@mail.gmail.com>
	(Martin Langhoff's message of "Mon, 13 Feb 2006 15:06:42 +1300")

Martin Langhoff <martin.langhoff@gmail.com> writes:

> +1... there should be an easy-to-compute threshold trigger to say --
> hey, let's quit being smart and send this client the packs we got and
> get it over with. Or perhaps a client flag so large projects can
> recommend that users do their initial clone with --gimme-all-packs?

What upload-pack does boils down to:

    * find out the latest of what the client has and what the
      client asked for.

    * run "rev-list --objects ^client ours" to make a list of
      objects the client needs.  The actual command line has
      multiple "^client"s to exclude what need not be sent, and
      multiple "ours" to include the refs asked for.  When you
      are doing a full clone, ^client is empty and ours is
      essentially --all.

    * feed that output to "pack-objects --stdout" and send out
      the result.
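
To make the second step concrete, here is a rough sketch in
Python of how such a command line could be put together.  It is
an illustration only, not how upload-pack is implemented; the
function name and its parameters are made up for this example.

```python
def rev_list_args(wants, haves=()):
    """Build a rev-list command line (hypothetical helper).

    wants: refs or object names the client asked for ("ours").
    haves: what the client already has; each one is excluded
           by prefixing it with "^".
    """
    args = ["git-rev-list", "--objects"]
    args += list(wants)                # multiple "ours"
    args += ["^" + h for h in haves]   # multiple "^client"
    return args


# A full clone: nothing to exclude, everything to include.
full_clone = rev_list_args(["--all"])

# An incremental fetch: exclude what the client already has.
incremental = rev_list_args(["refs/heads/master"], ["refs/heads/origin"])
```

The resulting list is what would be fed to "pack-objects
--stdout" in the third step.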

If you run this command:

	$ git-rev-list --objects --all |
	  git-pack-objects --stdout >/dev/null

It prints progress messages as it goes, and they show the phases
of the operation:

	Generating pack...
	Counting objects XXXX...
	Done counting XXXX objects.
	Packing XXXXX objects.....

Phase (1).  Between the time it says "Generating pack..." and
"Done counting XXXX objects.", the time is spent by rev-list
enumerating all the objects to be sent out.

Phase (2). After that, it decides which object to delta against
which other object, while twenty or so dots are printed after
"Packing XXXXX objects." (see the #git IRC log from a couple of
days ago; Linus describes how pack building works).

Phase (3). After the dots stop, the program becomes silent.
That is where it actually does the delta compression and writes
out the result.

You would notice that quite a lot of time is spent in all
phases.
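
For reference, what phase (1) hands to pack-objects is one
object per line: the object name, followed by a pathname for
trees and blobs.  The pathname is what lets phase (2) consider
objects from the same path together.  A small Python sketch of
reading that output follows; the object names in the sample are
shortened fakes and the function name is made up for
illustration.

```python
from collections import defaultdict


def group_by_path(rev_list_output):
    """Group object names by the pathname hint on each line.

    Each line is "<name>" or "<name> <path>".  Objects without
    a path (e.g. commits) end up under the empty string.
    """
    groups = defaultdict(list)
    for line in rev_list_output.splitlines():
        if not line:
            continue
        name, _, path = line.partition(" ")
        groups[path].append(name)
    return dict(groups)


# Fake, shortened object names; real ones are 40 hex digits.
sample = "c0ffee\nfeed01 Makefile\nfeed02 Makefile\nbeef01 cache.h\n"
groups = group_by_path(sample)
```

Here the two versions of Makefile land in the same group, which
is the kind of grouping the delta selection benefits from.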

There is an internal hook to create a full-repository pack
inside upload-pack (which is what runs on the other end when you
run fetch-pack or clone-pack), but it works slightly differently
from what you are suggesting, in that it still tries to do the
"correct" thing.  It still runs "rev-list --objects --all", so
"dangling objects" are never sent out.

We could cheat in all phases to speed things up, at the expense
of ending up sending excess objects.  So let's pretend we
decided to treat everything in .git/objects/pack/pack-* (and
the packs found in alternates as well) as containing objects
that are interesting to the cloner.
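
Finding those packs is the easy part.  A minimal Python sketch,
assuming the alternates file lists one object directory per
line and ignoring alternates of alternates; the function name
is made up for illustration.

```python
import glob
import os


def pack_index_files(objects_dir):
    """List pack-*.idx files for a repository (hypothetical helper).

    Looks in objects_dir/pack and in the pack directory of every
    object directory named in objects_dir/info/alternates.
    Assumption: one path per line; relative paths are taken
    relative to objects_dir.  Nested alternates are not followed.
    """
    dirs = [objects_dir]
    alt_file = os.path.join(objects_dir, "info", "alternates")
    if os.path.exists(alt_file):
        with open(alt_file) as f:
            for line in f:
                line = line.strip()
                if line:
                    # os.path.join keeps an absolute path as-is.
                    dirs.append(os.path.normpath(
                        os.path.join(objects_dir, line)))
    found = []
    for d in dirs:
        found += sorted(glob.glob(os.path.join(d, "pack", "pack-*.idx")))
    return found
```

Calling pack_index_files(".git/objects") would give the list of
index files whose object names we would take at face value.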

(1) This part unfortunately cannot be totally eliminated.  By
    assuming all packs are interesting, we could take the object
    names from the pack index, which is a lot cheaper than
    rev-list object traversal.  To cover the rest, we still need
    to run rev-list --objects --all --unpacked to pick up the
    loose objects, which we cannot find by looking at the pack
    index.

    This however needs to be done in conjunction with the second
    phase change.  pack-objects depends on the hint rev-list
    --objects output gives it to group the blobs and trees with
    the same pathnames together, and that greatly affects the
    packing efficiency.  Unfortunately the pack index does not
    have that information -- it knows neither types nor
    pathnames.  Type is relatively cheap to obtain but pathnames
    for blob objects are inherently unavailable.

(2) This part can be mostly eliminated for already packed
    objects, because we have already decided to cheat by sending
    everything, so we can just reuse how objects are deltified
    in existing packs.  It still needs to be done for loose
    objects we collected to fill the gap in (1).

(3) This also can be sped up by reusing what is already in
    packs.  The pack index records the starting (but not the
    ending) offset of each object in the pack, so we can sort
    by offset to find out which part of the existing pack
    corresponds to which object, and then reorder the objects
    in the final pack.  This needs to be done somewhat
    carefully to preserve the locality of objects (again, see
    the #git log).  The deltifying and compressing for loose
    objects cannot be avoided.

    While we are writing things out in (3), we need to keep a
    running SHA1 sum of what we write out so that we can fill
    in the correct checksum at the end, but I am guessing that
    is relatively cheap compared to the deltification and
    compression cost we are currently paying in this phase.
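
The two bookkeeping pieces of (3) can be sketched in Python as
below.  This is only an illustration under stated assumptions:
the index entries are taken to be (name, offset) pairs already
read from a pack index, the pack is assumed to end with a
20-byte SHA1 trailer, and both function names are made up.

```python
import hashlib


def object_spans(index_entries, pack_size, trailer_len=20):
    """Map each object name to its (start, end) byte range.

    The index only has start offsets, so the end of an object
    is the start of the next one in offset order; the last
    object ends where the trailing checksum begins.
    """
    by_offset = sorted(index_entries, key=lambda e: e[1])
    ends = [off for _, off in by_offset[1:]] + [pack_size - trailer_len]
    return {name: (off, end) for (name, off), end in zip(by_offset, ends)}


def write_with_checksum(chunks, out):
    """Write chunks to out while hashing them, then append the
    SHA1 of everything written so far as the trailer."""
    sha = hashlib.sha1()
    for chunk in chunks:
        sha.update(chunk)
        out.write(chunk)
    out.write(sha.digest())
    return sha.hexdigest()


# Index order is by name, not by offset, hence the sort above.
spans = object_spans([("a", 40), ("b", 12), ("c", 25)], pack_size=80)
```

With the spans in hand, already-packed objects could be copied
byte for byte in whatever order we choose, and the hashing adds
only one pass over data we are writing anyway.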

NB. In the #git log, Linus made it sound like I am clueless
about how a pack is generated, but if you check commit 9d5ab96,
the "recency of delta is inherited from base", one of the tricks
that has a big performance impact, was done by me ;-).