Long (font) filenames (Was: Re: fontinst with 8y.etx)

Lars Hellström Lars.Hellstrom@math.umu.se
Thu, 18 Jun 1998 17:49:10 +0200 (MET DST)


Lately there has been some debate on the topic of long (longer than eight
characters) filenames, and in particular about fonts with long filenames
(not very surprising, considering where the debate is held).

I personally believe that filenames in the TeX world will throw off this
increasingly antiquated boundary; it is only a matter of when this will be,
and furthermore I would say the sooner this happens the better. There are
however several who argue against doing this now, but I feel that the only
argument (of the ones I have read) that is not simply of the kind "this
particular piece of technology does not allow that at the moment" came from
Rowland, who wrote:
>
>I'd say not: the 8+3 filename limit applies to ISO9660 CD-ROMs, as well as
>MS-DOS.  MS-DOS is still widely used, especially in `developing nations'
>which have plenty of obsolete computers still in use.  I wouldn't be happy
>with excluding people just because they happen to live in the `wrong'
>country.
>
My experience is that technological problems usually get solved, given
that there is a will to do this, but if the problem rather is of a
financial nature then the prognosis is not that good.

Well then, would it be possible to do a smooth transition to filenames
longer than eight characters, so that people who are stuck on old systems
can still participate in the TeX world, while people who are not can go on
and use long filenames? One way to achieve this would be to somehow emulate
long filenames on the old 8+3 character systems, MS-DOS in particular. A
scheme for doing this would at least have to satisfy the following
requirements:

  1. Uniqueness. Files that have different long names must be different
files even to a file system which only handles short names.
  2. Compatibility with software. TeX and related software must be able to
use the long filename files, after being only slightly modified, or not
changed at all.
  3. Compatibility with institutions. It must be possible to integrate the
scheme easily with existing TeX institutions, most notably the CTAN archives.

Below, I will describe a scheme for doing this and give some analysis on
how well it meets these requirements.

In a second section, I describe an outline I made a few days ago of how an
updated Berry-type scheme for font names might look, if it is to take
advantage of filenames longer than eight characters. It is my hope and
intention that it might be the starting point of a discussion about this.


A scheme for emulating long filenames:
======================================

Starting with trying to solve the uniqueness problem, one solution would be
to split the intended long filename over more than one directory level.
More precisely, if the intended name of a file is for example

  abcdefghijklmnopqrstuvwx.tex

then one would separate the first eight characters from the filename and
take that as the name of a directory in which the file system would place a
file whose contents are the contents of the intended file
abcdefghijklmnopqrstuvwx.tex and whose name consists of the intended
filename minus the first eight characters that were used for the directory
name, that is, the above will have become

  abcdefgh\ijklmnopqrstuvwx.tex

(I am using backslashes for separating directory and content here, as the
discussion mainly pertains to MS-DOS.) As this still leaves the file name
too long, however, the process is better applied recursively, yielding

  abcdefgh\ijklmnop\qrstuvwx.tex

which actually is a legal MS-DOS name.

(Note that I do not suggest that the above procedure should be used with
file systems that can handle filenames longer than eight characters.  In
such a system, abcdefghijklmnopqrstuvwx.tex should still be
abcdefghijklmnopqrstuvwx.tex.)  So far the only thing this procedure must be
able to do is ensure that files with different intended names will be
treated as separate files by the file system.

One problem with the above is that there could also be a perfectly good
reason to already have a file qrstuvwx.tex in the abcdefgh\ijklmnop\
subdirectory, so this is not yet quite unique (although it works better
than a simple truncation).  One way of improving this would be to add a
specific suffix to the names of directories introduced by this separation
procedure; to exemplify this I have chosen "et", Latin for "and". This would
change the above into

  abcdefgh.et\ijklmnop.et\qrstuvwx.tex

Directories with suffixes are in my experience very rare, but fully legal,
and reserving one suffix out of the more than 40000 that can be formed from
three letters or digits can, in my opinion, hardly be considered a serious
restraint on the naming of files. If it should turn out that "et" is
already used for suffixes of some other type of directories however, we
should of course choose some other suffix.

The above splitting up works fine for files, but I believe doing the same
thing with names of directories would not work as smoothly.  If the
directory abcdefghij was split up in the same way, it would become

  abcdefgh.et\ij

with the side effect that whilst

  abcdefghij\..\a.tex

would be the same thing as a.tex,

  abcdefgh.et\ij\..\a.tex

would correspond to the file abcdefgha.tex.  To avoid such problems, it is
most likely wise to apply the splitting up procedure described above only
to files.

If the above procedure is looked at from the point of view of filenames as
strings, the split filename is constructed from the intended one by
inserting the string ".et\" after every eighth character in the intended
filename until there are at most eight characters left before the suffix.
This description might be simpler to comprehend.
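
The string description above can be sketched in a few lines of Python. This
is only an illustration: the function name split_long_name and its
parameters are made up for this example, and it assumes that the suffix is
whatever follows the last dot in the name.

```python
def split_long_name(filename, chunk=8, dir_suffix=".et", sep="\\"):
    """Split a long filename into nested directory components.

    Illustrative sketch only: inserts dir_suffix + sep after every
    `chunk` characters of the base name until at most `chunk`
    characters are left before the extension.
    """
    # Separate the base name from the extension at the last dot, if any.
    base, dot, ext = filename.rpartition(".")
    if not dot:
        base, ext = filename, ""
    pieces = []
    # Peel off leading chunks; each becomes a directory named "<chunk>.et".
    while len(base) > chunk:
        pieces.append(base[:chunk] + dir_suffix)
        base = base[chunk:]
    # What is left (at most `chunk` characters) keeps the extension.
    pieces.append(base + dot + ext)
    return sep.join(pieces)


print(split_long_name("abcdefghijklmnopqrstuvwx.tex"))
```

Names that already fit within eight characters are returned unchanged, so
the same function can be applied to every file regardless of name length.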

Theoretically the above procedure could convert arbitrarily long filenames,
but I would suggest that a recommendation should be issued that "long" file
names should not be made longer than 24 characters (although any software
used to split up long filenames should not impose this restriction), and
that is for the following reasons:

  1. Very few words are longer than, say, 20 characters (but many are
longer than eight characters), so the demand for filenames longer than 24
characters is not likely to be very large anyway.
  2. 24 is a multiple of 8. Choosing a limit that is not a multiple of
eight is not as effective in terms of name space gained / storage space
used.
  3. I do not know of any major file system that imposes a name length
limit between 8 and 24, but I do know of one (HFS for MacOS) that imposes a
length limit between 24 and 32 (which would be the next multiple of eight).

Of course, the relevance of the above reasons might be argued about, but I
feel the suggestion made is quite reasonable.


Anyway, the above procedure does ensure uniqueness, but can it be used
together with existing software?  Most tool programs used with TeX should
not be concerned at all since they simply work on the files whose names are
explicitly given to them; it does not matter if the files happen to be in a
directory whose suffix is "et".

DVI and VF files are a bit special, as they actually contain names of files
(fonts), but these names are supposed to be able to consist partly of names
of directories anyway.  This could complicate exchange of these files
between users on eight character filesystems which have split some
filenames and users on more-than-eight character filesystems which have
not, but the file formats are simple enough that writing a program to
perform the necessary conversion would be no more than a "simple programming
exercise".  (Filenames in \specials are somewhat more complicated, but then
the users must also have agreed on which DVI driver they are using.
Conversion gets more complicated, but not unreasonably so.)

The problems start instead with TeX itself, since conversion of input files
by some external program would be much too hard (theoretically impossible,
or at least close to being impossible).  The neat solution would of course
be that the TeX implementations operating on filesystems which impose an
eight character limit would be slightly rewritten so that, instead of
truncating long filenames (or whatever they do), they would split them up as
described above.  It is however almost certain that this will not be done
with all implementations, as some are bound to be commercial and written by
companies which either do not exist anymore or are unwilling to make the
rewrite.  Thus it is interesting to look at alternative solutions as well.

The simplest is that users with implementations that do not automatically
split up filenames instead do that manually.  This is probably the optimal
solution from a result/work point of view, but there are most likely cases
in which it would not be good at all.

It is possible to write systems of macros which can act like \input, \font,
\openin, and \openout and which parse their filename parameter in such a
way that long names are split as described above, but these would most
likely need to make assignments and can therefore not simply be substituted
for the primitives in general.  They could however be a valuable complement
to manual splitting up of filenames.

LaTeX already has some filename parsing built in, but it is hard to tell
whether it could perform the splitting up as described above reliably.
LaTeX does however handle file names on a fairly high level, so if
splitting up of long filenames indeed would become a common practice, then
I believe the introduction of support for this in LaTeX would be sure to
follow.

fontinst and docstrip are two other programs that I feel are worth
mentioning in this context, as they have (at least indirectly) generated a
fair amount of the files we see out there.  Being programs, adjusting these
for splitting up filenames in systems where this is required should be no
problem (as for fontinst, this could be handled through a few redefinitions
in the fontinst.rc file).

This does however introduce the question of what various TeX
implementations do when they are told to create a file in a directory which
does not exist.  If the directory is created, things would be fine, but
are there TeXs which behave differently?  Does anyone reading this have any
experience in this matter?


The final topic to examine would seem to be how well the CTAN archives
would cope with such transitions.  Not knowing much about the inner
workings of these, I might of course be talking through my hat here, but it
seems to me as if the problems would be relatively small.

What I have been thinking about for the FTP archives is for the current
archive to have a mirror image in which all filenames longer than eight
characters have been split up as described above, then people with 8+3
character filesystems can get their files from the mirror and people who do
not can get their files from the non-split original.  As to how this would
be implemented, there seem to be several solutions (and I have most likely
not found the best).  One would be to have the split up mirror image
generated dynamically by the FTP server.  This does not seem to be more
advanced than the dynamic generation of packed files many FTP servers
offer, but it could of course still be quite complex.  Another solution
seems to be to have both the split up mirror image and the original archive
exist as real directory structures in the underlying filesystem, but
have all the files identified (by using links or some other suitable
mechanism).  In any case, the most limiting factor to take into account is
probably how intelligently the other FTP sites mirroring CTAN do this, as
they could end up with two copies of every file if they are unlucky.
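
The second alternative (two real directory trees sharing their files through
links) might look roughly like the Python sketch below. Everything here is
an assumption made for illustration: the function names, the use of hard
links with a plain copy as fallback, and the requirement that the mirror
root lies outside the source tree. Directory names are left unsplit, as
recommended earlier.

```python
import os
import shutil


def split_components(filename, chunk=8, dir_suffix=".et"):
    """Return the path components a long filename is split into."""
    base, dot, ext = filename.rpartition(".")
    if not dot:
        base, ext = filename, ""
    parts = []
    while len(base) > chunk:
        parts.append(base[:chunk] + dir_suffix)
        base = base[chunk:]
    parts.append(base + dot + ext)
    return parts


def build_split_mirror(source_root, mirror_root):
    """Populate mirror_root with a split-name image of source_root.

    Hypothetical helper: mirror_root must not lie inside source_root.
    """
    for dirpath, _dirnames, filenames in os.walk(source_root):
        rel = os.path.relpath(dirpath, source_root)
        target_dir = os.path.normpath(os.path.join(mirror_root, rel))
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            source = os.path.join(dirpath, name)
            target = os.path.join(target_dir, *split_components(name))
            # Create the intermediate ".et" directories as needed.
            os.makedirs(os.path.dirname(target), exist_ok=True)
            try:
                # A hard link identifies the two files without copying.
                os.link(source, target)
            except OSError:
                # Fall back to a copy where links are not available.
                shutil.copy2(source, target)
```

Whether links survive being mirrored by other FTP sites is exactly the open
question raised above; the sketch does not address it.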

Given such a split up mirror image of CTAN, the CDs containing CTAN (which
are often mentioned as another important argument against filenames longer
than eight characters) could simply consist of a download of the
mirror image instead.  Of course, people who do not have 8+3 character
filesystems would like the filenames not to be split up, but writing a
program which copies file trees and recombines filenames in the process
would just be another "simple programming exercise".
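
The recombining step of such a program could be sketched as follows in
Python. The function name join_split_path is made up for this example, and
it assumes that every directory whose name ends in ".et" was introduced by
the splitting procedure and that the path names a file.

```python
def join_split_path(path, dir_suffix=".et", sep="\\"):
    """Recombine a split path into the intended long filename.

    Illustrative sketch: each directory component ending in dir_suffix
    is merged with the component that follows it.
    """
    out = []
    pending = ""  # characters collected from ".et" directories so far
    for part in path.split(sep):
        if part.endswith(dir_suffix) and len(part) > len(dir_suffix):
            # Strip the suffix and keep collecting the long name.
            pending += part[:-len(dir_suffix)]
        else:
            out.append(pending + part)
            pending = ""
    if pending:
        raise ValueError("path ends inside a split name: " + path)
    return sep.join(out)


print(join_split_path("abcdefgh.et\\ijklmnop.et\\qrstuvwx.tex"))
```

Ordinary directories pass through untouched, so the function can be applied
to complete paths as well as to bare filenames.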


In summary, the scheme described above for emulating long filenames in
file systems which only allow short ones ensures uniqueness of files for
practical situations, works fine with tools used with TeX, is not completely
trouble-free when used with TeX itself (although this can be remedied
with a simple update of the TeX implementation), and seems to be possible
to use for the CTAN archives as well.  It seems to me then, that it should
be worth considering.

(Considering how long the above is, there most likely is something wrong
somewhere.  In any case, an open discussion should show whether this
expected error is a serious one or not. :-)


Long filenames for fonts
========================

Filenames longer than eight characters do offer some additional
flexibility, and one thing that I think would clearly benefit from using
this would be Karl Berry's naming system for fonts.  I like the basic idea
of this system, which I interpret to be to name fonts according to all their
attributes rather than, as is common in many other instances of
computerised typesetting and related activities, simply naming them so that
every font has a unique name (but one which may ignore several interesting
characteristics of the font).  How, then, would this system be changed to
benefit from not having to stick to the eight character limit?

What follows is one suggestion on how the character positions could be
distributed among the various attributes and some of my reflections on it.
The character positions are taken in order, from left to right. In many
cases, the assignments are probably overkill.

Supplier: 1 character
  This would basically be the same as in the present Berry scheme.

Typeface (family): 4 characters
  Four characters is probably overkill, as this allows for more than 1.6
  million families if both letters and digits are used, but at least three
  characters have become necessary by now. One use for a fourth could be
  to group the fonts into subfamilies; taking computer modern as an
  example one can consider the Roman fonts as one, the sans serifs as
  another, the typewriter fonts as a third, and the math fonts as a fourth.

Weight: 1 character
  This would be as in the present Berry scheme. Although some fonts exist
  in many weights, there do not seem to be _that_ many weights.

Width: 1 character
  This would also be as in the present Berry scheme, although it would
  never be omitted (which is presently quite common). It makes sense to me
  to place it next to the weight, as these are more closely related to each
  other than to the rest of the attributes.

Encoding: 2 characters
  There seem to be quite a few of these around, so two characters are
  probably needed; besides, I see no reason to change the names that are
  already in use. In addition to these, codes would have to be included
  for such messages as "User specific encoding nr. N" and "Haven't got the
  faintest idea about what this encoding is called".
    One thing to consider however, and which has much to do with
  encodings, is that if the font naming scheme is to have a major
  update, would it not be of interest to consider widening it so that it
  includes fonts for non-European languages as well?  Would two characters
  be enough for all encodings if for example Japanese fonts were to be
  included?

Size: 5 characters
  This is most certainly overkill, but at least it allows for a simple
  interpretation, as the ec fonts describe size in units of 0.01pt and
  fonts for sizes above 99pt are uncommon but do exist (even though I have
  no idea about whether any has ever been used with TeX). Not caring about
  that for the moment, one may observe that one advantage of having the
  size field in a fixed position is that letters can be used here to convey
  other information about size than absolute size. One could for example
  choose to specify that a size field of 0p917 would mean that the font
  has been scaled to 0.917 of its original size (this is Melissa's font at
  11/12 of its original size). Not to be forgotten either is the code for
  "No particular size", which could simply be xxxxx.

Variants: As many characters as it takes
  With this field being of highly variable length, I considered it better
  to have it last (this ought to make things easier for fontinst's
  \latinfamily and other similar software which relies on interpreting font
  names). As for many of the other fields, I see no reason to change the
  interpretation of the variants, although some would become redundant (r
  is no longer needed as a placeholder, the encodings no longer interfere).
    Another topic here is how the variants should be ordered (a program
  that attempts to find a font by generating its name should benefit from a
  clear rule in this case). To me, the simplest rule seems to be to order
  the variant letters alphabetically, but should there exist two-character
  variants in the future too then this might not be the case (in that case
  I would prefer deglex ordering: first by length, then lexicographically).
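
A deglex ordering of variant codes is easy to express as a sort key. The
Python sketch below is only an illustration; the function name and the
sample variant codes are made up.

```python
def deglex_sorted(variants):
    """Order variant codes first by length, then lexicographically."""
    return sorted(variants, key=lambda code: (len(code), code))


print(deglex_sorted(["sc", "x", "i", "ab"]))
```

With only one-character variants this reduces to plain alphabetical order,
so the two rules agree for the variants in use today.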

If the lengths of all fixed-size fields above are summed up, this gives a
total of 14 characters. Hence one can include as many as 10 characters for
variants without passing the recommended upper bound of 24 mentioned in the
first section. This should be sufficient for most cases.
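
To make the field layout concrete, here is a Python sketch that cuts a name
into the fields listed above and interprets the size field. The sample name
"ptimebn8t01000ci" is invented for the example and does not denote a real
font, and the zero-padded reading of the size digits (01000 for 10pt) is an
assumption; only the field widths, the 0.01pt unit, the 0p917 form and
xxxxx come from the outline.

```python
from typing import NamedTuple

# Fixed-width fields in order, with the widths proposed above.
FIELDS = (("supplier", 1), ("family", 4), ("weight", 1),
          ("width", 1), ("encoding", 2), ("size", 5))
FIXED_LENGTH = sum(width for _, width in FIELDS)  # 1+4+1+1+2+5 = 14


class FontName(NamedTuple):
    supplier: str
    family: str
    weight: str
    width: str
    encoding: str
    size: str
    variants: str


def parse_font_name(name):
    """Cut a font name into its fixed fields plus trailing variants."""
    if len(name) < FIXED_LENGTH:
        raise ValueError("name shorter than the fixed-width fields")
    values = []
    pos = 0
    for _, width in FIELDS:
        values.append(name[pos:pos + width])
        pos += width
    return FontName(*values, variants=name[pos:])


def interpret_size(field):
    """Interpret a 5-character size field (assumed conventions)."""
    if field == "xxxxx":
        return ("no particular size", None)
    if field.isdigit():
        # Units of 0.01pt, assumed zero-padded on the left.
        return ("size in pt", int(field) / 100)
    if "p" in field:
        # "p" stands in for the decimal point of a scale factor.
        return ("scale factor", float(field.replace("p", ".")))
    raise ValueError("unrecognised size field: " + field)


font = parse_font_name("ptimebn8t01000ci")
print(font, interpret_size(font.size))
```

Because all variable-length information sits at the end, the parser needs no
lookup tables to find the field boundaries.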

Even if a discussion should end in the conclusion that long filenames are
not practical in the TeX world at the moment, I still believe that an
extended Berry scheme somewhat similar to the above would be useful to have
made up, although it might not be possible to use it for filenames. It
could for example be enlightening to have such a code as this one included
as a new field in the lists of TeX font names and corresponding printer font
names some people have mentioned now and then. I for one would find it
interesting to know what the original size of a PS font is (as many are
digitisations of fonts cut in metal, at one specific size).


But it is most likely best to stop here and let the discussion begin...

Lars Hellström