[twg-tds] storing scripts in the texmf tree

Fri Feb 13 21:43:41 CET 2004

Paul Vojta writes:
> On Fri, Feb 13, 2004 at 07:51:30PM +0100, Olaf Weber wrote:

>> > -  a service (on some port)
>> 
>> Not implemented (yet).
>> 
>> > -  shared memory for subprocesses
>> 
>> As currently implemented, the datastructures are not suitable for
>> this.  In test versions I have been playing with datastructures that
>> are (they have to be independent of the address at which the data is
>> loaded).

> Of course, the logical next step would be a file system (kpsefs).

> :-)

:-P

One possibility I'm exploring is using a 'texmf.zip' for the texmf
tree.  Again we have the issue that, as with building the hash table
from the ls-R file, building the index into the zip will be a fairly
expensive operation, which could conceivably benefit from things like
daemonizing or sharing.
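
To make the indexing step concrete, here is a minimal Python sketch
that walks a zip archive and builds a basename-to-directories index,
much as the hash table is built from ls-R.  The names build_zip_index
and demo_archive, and the archive layout, are assumptions made up for
illustration; nothing like this exists in kpathsea as described here.

```python
import io
import zipfile
from collections import defaultdict


def build_zip_index(archive):
    """Map each file's basename to the directories that contain it.

    Hypothetical helper: 'archive' is a path or a file-like object.
    """
    index = defaultdict(list)
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            if name.endswith("/"):
                continue  # skip explicit directory entries
            directory, _, base = name.rpartition("/")
            index[base].append(directory)
    return dict(index)


def demo_archive():
    """Build a tiny in-memory zip standing in for a texmf.zip."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("tex/latex/base/article.cls", "dummy")
        zf.writestr("tex/plain/base/plain.tex", "dummy")
    buf.seek(0)
    return buf


index = build_zip_index(demo_archive())
```

Every lookup afterwards is a dictionary access, but the full pass over
the archive's directory has to happen once per process unless the
result is shared or served by a daemon.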

It is also interesting to see just how large the data structures have
become.  Take the ls-R file in texmf-dist of TeX Live:

texmf-dist$ wc ls-R
 53964  50188 643296 ls-R

(That's lines, words, and bytes.)

To a first approximation, storing every string in the file takes
643296 bytes.  Each "word" is an entry, at 16 bytes per hash bucket
<hash,key,value,next>.  Add to that an array that indexes to the first
bucket in each chain (currently 15991 entries; let's say 16384, the
next power of two), at 4 bytes per entry.

   643296 + 16 * 50188 + 4 * 16384
=  643296 +     803008 +     65536
=  643296 +     868544
= 1511840

The hash table and chains require more room than the actual data, and
we're above a megabyte in total.
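
The estimate above can be reproduced with a few lines of Python.  This
is only a sketch of the arithmetic; the function name
lsr_memory_estimate and its parameter names are made up for
illustration, and the 16-byte bucket and 4-byte table entry are the
sizes stated above.

```python
def lsr_memory_estimate(string_bytes, entries, table_slots,
                        bucket_bytes=16, slot_bytes=4):
    """Return (strings, buckets, table, total) in bytes.

    string_bytes: size of the ls-R file, as a stand-in for string storage
    entries:      number of entries ("words" in the wc output)
    table_slots:  size of the array of chain heads
    """
    buckets = bucket_bytes * entries
    table = slot_bytes * table_slots
    total = string_bytes + buckets + table
    return string_bytes, buckets, table, total


strings, buckets, table, total = lsr_memory_estimate(643296, 50188, 16384)
overhead = buckets + table  # everything that is not string data
```

With the texmf-dist figures, overhead comes to 868544 bytes against
643296 bytes of strings, for a total of 1511840.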

(Note: I store the hash in the table because I'm using a different
hash function which should result in a better 32-bit key.  With a
power-of-two table size, I can mask bits to get a bucket number, and
compare hashes before I have to compare strings when looking for the
right bucket in the chain.  This may be overkill, as the average chain
length is a bit over 3 in this case.)
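
A minimal Python sketch of that lookup scheme follows.  The hash
function is not named above, so 32-bit FNV-1a is used purely as a
stand-in; the class name ChainedTable and its methods are likewise
hypothetical.  Each bucket is a <hash,key,value,next> tuple.

```python
def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a; an assumed stand-in for the unnamed hash function."""
    h = 0x811C9DC5
    for byte in data:
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF
    return h


class ChainedTable:
    def __init__(self, size_log2=14):
        # Power-of-two size (2**14 = 16384), so a bit mask picks the slot.
        self.mask = (1 << size_log2) - 1
        self.heads = [None] * (1 << size_log2)

    def insert(self, key, value):
        h = fnv1a_32(key.encode())
        slot = h & self.mask
        # New bucket goes at the head of the chain: <hash,key,value,next>.
        self.heads[slot] = (h, key, value, self.heads[slot])

    def lookup(self, key):
        """Return all values stored under key (a file may occur in
        several directories)."""
        h = fnv1a_32(key.encode())
        bucket = self.heads[h & self.mask]
        found = []
        while bucket is not None:
            b_hash, b_key, b_value, b_next = bucket
            # Cheap integer compare first; strings only on a hash match.
            if b_hash == h and b_key == key:
                found.append(b_value)
            bucket = b_next
        return found


table = ChainedTable()
table.insert("article.cls", "tex/latex/base")
table.insert("plain.tex", "tex/plain/base")
```

With chains averaging about three buckets, the stored hash saves at
most a couple of string comparisons per lookup, which is why it may
not be worth its 4 bytes per entry.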

-- 
Olaf Weber

               (This space left blank for technical reasons.)