Idea: How to get data streams into and out of a git pack file

Jonathan Fine jfine2358 at gmail.com
Thu Jun 24 15:25:04 CEST 2021


Hi

This is a follow-up to my post yesterday, which introduced the large-scale
basics of git pack files. This post is about how to get data into git, and
retrieve it, without using git as a version control system. Here we
consider only one data stream at a time.

Here's yesterday's post (which contains the details for tonight's TeX Hour,
6:30pm UK time).
Idea: Git as basis for future CTAN and TeX Live.
https://tug.org/pipermail/tex-live/2021-June/047161.html

A BASIC EXAMPLE
Every file known to kpsewhich has a git hash, computed as follows.
$ kpsewhich story.tex
/usr/local/texlive/2019/texmf-dist/tex/plain/knuth-lib/story.tex
$ cat $(kpsewhich story.tex) | git hash-object --stdin
fcbaa4151af32191e6b3e35c90cea8af8ad8fa03
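
For readers wondering where the hash comes from, here is a minimal sketch. It assumes git's default SHA-1 object format and that sha1sum (GNU coreutils) is installed. The sample text is only a stand-in for the real story.tex, so the hash it prints is not the one shown above.

```shell
#!/bin/sh
set -e

# Stand-in data; in the post this would be: cat "$(kpsewhich story.tex)"
data=$(mktemp)
printf 'hello world\n' > "$data"

# Size of the content in bytes (tr strips padding that some wc versions add).
size=$(wc -c < "$data" | tr -d ' ')

# SHA-1 over the header "blob <size>", a NUL byte, then the content itself.
manual=$( { printf 'blob %s\0' "$size"; cat "$data"; } | sha1sum | cut -d ' ' -f 1)

# The same value as reported by git (no repository is needed without -w).
viagit=$(git hash-object --stdin < "$data")

echo "manual: $manual"
echo "git:    $viagit"
rm -f "$data"
```

The two lines printed should agree. The key depends only on the bytes of the data, not on the file name or location.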

Let's store this data in a Content Addressable Store. First create the store
$ git init cas-example.git
Initialised empty Git repository in /home/jfine/cas-example.git/.git/
$ cd cas-example.git/

Now add the data to the store. The -w means write the file to the store.
cas-example.git$ cat $(kpsewhich story.tex) | git hash-object -w --stdin
fcbaa4151af32191e6b3e35c90cea8af8ad8fa03

Here's the data we just added
cas-example.git$ find .git/objects/ -type f
.git/objects/fc/baa4151af32191e6b3e35c90cea8af8ad8fa03

Now let's get the data back again. We use the git hash as the key.
cas-example.git$ git cat-file -p fcbaa4151af32191e6b3e35c90cea8af8ad8fa03
\hrule
\vskip 1in
\centerline{\bf A SHORT STORY}
\vskip 6pt
\centerline{\sl    by A. U. Thor} % !`?`?! (modified)
\vskip .5cm
Once upon a time, in a distant
  galaxy called \"O\"o\c c,
there lived a computer
named R.~J. Drofnats.

Mr.~Drofnats---or ``R. J.,'' as
he preferred to be called---% error has been fixed!
was happiest when he was at work
typesetting beautiful documents.
\vskip 1in
\hrule
\vfill\eject
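
The store and retrieve steps can be checked end to end. What follows is a sketch only: it works in a temporary directory, and the one-line sample text is a stand-in for the real story.tex.

```shell
#!/bin/sh
set -e

# Work in a scratch directory so nothing existing is touched.
work=$(mktemp -d)
cd "$work"
git init -q cas-example.git
cd cas-example.git

# Stand-in for the output of: cat "$(kpsewhich story.tex)"
printf 'Once upon a time\n' > ../story.tex

# Store the data; the hash that git prints is the key.
key=$(git hash-object -w --stdin < ../story.tex)

# Stream the object back out and compare it byte for byte with the input.
if git cat-file blob "$key" | cmp -s - ../story.tex; then
    roundtrip=identical
else
    roundtrip=different
fi

# Object type and size can be queried without streaming the content.
type=$(git cat-file -t "$key")
size=$(git cat-file -s "$key")
echo "$key $type $size $roundtrip"
```

Here cat-file is asked for a blob explicitly, rather than using -p, so that the bytes come back exactly as stored.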

DISCUSSION
From \input story.tex, TeX obtains an input stream. In TeX Live there are
two steps. The first step is to use kpsewhich to produce a file path to the
data stream. The second step is to open the file for reading. This is the
input stream that TeX reads.

This process relies on the data being available for streaming, from a local
store or otherwise. Here, that local store is part of the ordinary
filesystem on the computer.

Here's another way to resolve \input story.tex into an input stream. It
requires some preparation up-front (as does kpsewhich - the ls-R database).
First we store all the data we wish to stream in a git content addressable
store. Second, we create a database with entries such as
story.tex fcbaa4151af32191e6b3e35c90cea8af8ad8fa03

This second database is analogous to the ls-R database. By the way, git can
store this database as a tree object, which has a git hash of its own.
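
One possible way to build such a database as a tree object is with git mktree, which reads lines in the format that git ls-tree prints. This is a sketch under assumptions: the sample text stands in for the real story.tex, and the mode 100644 (ordinary file) is my choice.

```shell
#!/bin/sh
set -e

work=$(mktemp -d)
cd "$work"
git init -q cas-example.git
cd cas-example.git

# Stand-in for the real story.tex data.
blob=$(printf 'Once upon a time\n' | git hash-object -w --stdin)

# One line per file: <mode> SP <type> SP <hash> TAB <name>
tree=$(printf '100644 blob %s\tstory.tex\n' "$blob" | git mktree)

# Name -> hash lookup, the analogue of an ls-R query.
found=$(git rev-parse "$tree:story.tex")

# Name -> data stream.
content=$(git cat-file blob "$tree:story.tex")

echo "tree:  $tree"
echo "found: $found"
echo "data:  $content"
```

The tree hash is then the only value that has to be known in advance. Everything else is reached from it by name.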

It's important to note that the data is decompressed on the fly. And
because each object is stored under its git hash, identical data is stored
only once. Thus, two trees that differ in only a few files can be stored
together efficiently.
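
A small sketch of the no-duplication point. The file names and contents are hypothetical, and the two trees stand for two snapshots that differ in a single file.

```shell
#!/bin/sh
set -e

work=$(mktemp -d)
cd "$work"
git init -q cas-example.git
cd cas-example.git

# Hypothetical contents: macros.tex is unchanged between the snapshots.
macros=$(printf 'shared macros\n' | git hash-object -w --stdin)
story1=$(printf 'first draft\n' | git hash-object -w --stdin)
story2=$(printf 'second draft\n' | git hash-object -w --stdin)

# Storing identical data again yields the same key and no new object.
again=$(printf 'shared macros\n' | git hash-object -w --stdin)

# Two snapshots that differ only in story.tex.
fmt='100644 blob %s\tmacros.tex\n100644 blob %s\tstory.tex\n'
tree1=$(printf "$fmt" "$macros" "$story1" | git mktree)
tree2=$(printf "$fmt" "$macros" "$story2" | git mktree)

# 3 blobs + 2 trees, although the snapshots list 4 files between them.
count=$(git count-objects | cut -d ' ' -f 1)
echo "objects stored: $count"
```

The unchanged file costs nothing extra in the second snapshot, because both trees point at the same blob.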

Expert users will know that kpsewhich can modify its search depending on
the 'engine', such as tex, pdftex, latex, luatex. This can be accommodated
by having the engine as the parent tree entry. Indeed, one can even use the
release year as a grandparent. This would allow ALL DATA FILES for ALL
ENGINES and ALL YEARS to be stored in a single git repository.

The command line samples above could be extended to produce such a
repository and associated year-engine-kpsewhich access trees.
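
As a rough sketch of what such an extension might look like, trees can be nested with git mktree. The layout year/engine/file, the names 2021 and pdftex, and the sample text are all assumptions for illustration.

```shell
#!/bin/sh
set -e

work=$(mktemp -d)
cd "$work"
git init -q cas-example.git
cd cas-example.git

# Stand-in for the real story.tex data.
blob=$(printf 'Once upon a time\n' | git hash-object -w --stdin)

# Innermost tree: file names, as kpsewhich would resolve them for one engine.
files=$(printf '100644 blob %s\tstory.tex\n' "$blob" | git mktree)

# Parent tree: one entry per engine.
engines=$(printf '040000 tree %s\tpdftex\n' "$files" | git mktree)

# Grandparent tree: one entry per release year.
root=$(printf '040000 tree %s\t2021\n' "$engines" | git mktree)

# Resolve year/engine/name to a key, then to a data stream.
found=$(git rev-parse "$root:2021/pdftex/story.tex")
git cat-file blob "$found"
```

If two engines or two years share the same set of files, their entries can point at the same inner tree, so that sharing is stored once as well.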

We don't yet know how big this single git repository will be. My immediate
estimate is 10 GB, of which 5 GB will be fonts. Recall that about 5 GB is
the recommended repository size limit on GitHub.

Further work would be required to support random access to a data stream. I
don't know which programs in the TeX suite use such random access, and when.

-- 
Jonathan