Font encoding specifications (Was: wrong ligatures in ae and zett fonts)

Lars Hellström Lars.Hellstrom@math.umu.se
Mon, 25 Jun 2001 20:47:41 +0200


At 01.11 +0200 01-06-24, Uwe Koloska wrote:
>On Saturday, 23 June 2001 22:40, Lars Hellström wrote:
>> I'm not so sure it is questionable (anymore; I too thought it was when I
>> wrote the monowidth/typewriter comments in the v1.9 fontinst sources).
>> The main problem is that `--' _is_ allowed input for generating an endash
>> in the T1 encoding, and hence all fonts with that encoding should work
>> like that.
>
>I thought that an encoding describes only the mapping of a number to a
>glyph.  Isn't this true?
>
>If ligatures are part of an encoding, then all the sources I know that
>claim to describe T1 (or LY1, or AdobeStandard) are at least incomplete.

It all boils down to what an encoding really is and how it is specified. It
just so happens that a few days ago I completed a paper (encspecs.tex) on
the subject, which can be found at

  http://abel.math.umu.se/~lars/encodings/

I had intended it for discussion on the LaTeX-L mailing list, but at the
time of writing I haven't yet received any replies there. If I don't get
any within the next few days, then maybe the discussion should be moved
here, as it seems to have begun here by itself anyway. The reason I thought
about having it on LaTeX-L first is that a fundamental ingredient in the
view expressed in the aforementioned paper is the existence of an internal
representation format for characters---something which LaTeX has (the LICR)
but e.g. PlainTeX lacks---but I suspect the discussion could be held anyway.

The following excerpt from encspecs.tex could help explain my basic
approach to the matter:
%%%%%%%%%%%%%%%%%%%%
On its way out of \LaTeX\ towards the printed text, a character passes
through a number of stages. The following five seem to cover what is
relevant for the present discussion:
\begin{enumerate}
  \item \emph{\LaTeX\ Internal Character Representation} (LICR)~%
    \cite{LICR}. At this point the character is a character token
    (e.g.~|a|), a text command (e.g.~|\ss|), or a combination
    (e.g.~|\H{o}|).
  \item \emph{Horizontal material;} this is what the character is
    en route from \TeX's mouth to its stomach. For most characters
    this is equivalent to a single |\char| command (e.g.\ |a| is
    equivalent to |\char|\,|97|), but some require more than one, some
    are combined using the |\accent| and |\char| commands, some
    involve rules and\slash or kerns, and some are built using boxes
    that arbitrarily combine the above elements.
  \item \emph{DVI commands;} these are the DVI file commands that
    produce the printed representation of the character.
  \item \emph{Printed text;} this is the graphical representation of
    the character, e.g. as ink on paper or as a pattern on a computer
    screen. Here the text consists of glyphs.
  \item \emph{Interpreted text;} this is essentially printed text
    modulo equivalence of interpretation, hence the text doesn't really
    reach this stage until someone reads it. Here the text consists of
    characters.
\end{enumerate}

In theory there is a universal mapping from LICR to interpreted text,
but various technical restrictions make it impossible to simultaneously
support the entire mapping. A \LaTeX\ encoding selects a restriction
of this mapping to a limited set which will be ``well supported''
(meaning kerning and such between characters in the set works), whereas
elements outside this set at best can be supported through temporary
encoding changes. The encoding also specifies a decomposition of the
mapping into one part which maps LICR to horizontal material and one
part which maps horizontal material to interpreted text. The first
part is realized by the text command definitions usually found in the
\meta{enc}\texttt{enc.def} file for the encoding. The second part is
the font encoding, the specification of which is the topic of this
paper. It is also worth noticing that an actual font is a mapping of
horizontal material to printed text.

An alternative decomposition of the mapping from LICR to interpreted
text would be at the DVI command level, but even though this
decomposition is realized in most \TeX\ implementations, it has very
little relevance for the discussion of encodings. The main reason for
this is that it depends not only on the encoding of a font, but
also on its metrics. Furthermore it is worth noticing that in pdf\TeX\
there need not be a DVI command level.
%%%%%%%%%%%%%%%%%%%%
In short for the present case: since `--' in LICR (or maybe it's rather
horizontal material) gets mapped to an endash in the interpreted text, it
follows that this should be part of the encoding specification.
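
To make that concrete, here is a small test document (my own sketch, not
taken from encspecs.tex) showing three routes to the same interpreted
character in a T1-encoded roman font. It assumes the standard T1 layout, in
which the endash sits in slot 21; the third line bypasses the LICR entirely
and is therefore the only one that depends on the slot number.

```latex
\documentclass{article}
\usepackage[T1]{fontenc}
\begin{document}
% Three ways to reach an endash between the digits:
1--2            % two hyphens, combined by the font's ligature table

1\textendash 2  % LICR text command, mapped to a slot by the T1 definitions

1\char21 2      % horizontal material: T1 slot 21 addressed directly
\end{document}
```

Only the first route relies on a ligature stored in the font metrics, which
is why a font lacking that ligature does not behave the same for all valid
T1 input.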

Of course it's silly to begin with to have a conversion from -- to
endash built into the fonts, but it is now an established standard and
there is nothing we can do about it. Such conversions (to the extent they
are at all desirable) could be handled differently in Omega, so hopefully
their inclusion in font encodings is a disappearing practice.

>> It is not very useful for typewriter fonts (LaTeX's \verb and
>> verbatim do a bunch of special declarations to escape these ligatures),
>
>For use with T1?

For all encodings, even though most of the declarations were added because
of T1. The ?` and !` ligatures have to be escaped even in OT1. The LaTeX
macro which lists the characters to be redeclared is \verbatim@nolig@list
(it is run through \@noligs).
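
A quick way to see this for yourself is the following sketch (mine, and
only a test document, not kernel code). It writes the current contents of
\verbatim@nolig@list to the log and then typesets the same input with and
without the protection. Whether the \texttt line actually shows ligatures
depends on what the typewriter font's metrics contain, so I make no claim
about the output for any particular font.

```latex
\documentclass{article}
\usepackage[T1]{fontenc}
\makeatletter
% Write the list to the log file for inspection; each \do is followed
% by one character that could start a ligature.
\typeout{nolig list: \meaning\verbatim@nolig@list}
\makeatother
\begin{document}
\texttt{?` !` << >> ,, --}  % ligatures apply if the font has them

\verb+?` !` << >> ,, --+    % verbatim: each character stays as typed
\end{document}
```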

>> but it is nonetheless a standard in the encoding. The reason there is no
>> such ligature in OT1 typewriter fonts is rather that OT1 is a highly
>> questionable identification of what is at least five different encodings.
>
>I am curious:  cmr, cmtt, cmmi, cmsy, cmex?
>But the last three are math encodings (more or less ;-))

No, the math encodings aren't even involved. Basically there are three
different coding schemes used for fonts declared as OT1, viz.

   TEX TEXT                            e.g. for cmr10
   TEX TEXT WITHOUT F-LIGATURES        e.g. for cmcsc10
   TEX TYPEWRITER TEXT                 e.g. for cmtt10

and all three have different encodings (e.g. the TEX TEXT WITHOUT
F-LIGATURES fonts contain `<' and `>', whereas the TEX TEXT fonts do not).
On top of that, the italic CM fonts are encoded differently from their
nonitalic counterparts, as the former have a sterling sign where the latter
have a dollar sign. (The OT1 definitions of \textdollar and \textsterling
sometimes fetch the character from a different font shape.) All that adds
up to five different encodings (there are italic versions of cmr10 and
cmtt10, but not of cmcsc10). More odd facts: the variable width typewriter
font cmvtt10 has the same encoding as cmr10 and cmr5 has the same encoding
as cmcsc10!
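
The differences are easy to provoke by addressing slots directly, as in
the sketch below (my own test document). It assumes the default Computer
Modern fonts, so that \upshape, \itshape and \scshape select cmr10, cmti10
and cmcsc10 respectively; the comments say what I expect to see, not what
I have verified on every installation.

```latex
\documentclass{article}
\usepackage[OT1]{fontenc}
\begin{document}
% Slot "24 addressed directly, bypassing the LICR:
{\upshape \char"24}\quad % cmr10: expected to show a dollar sign
{\itshape \char"24}      % cmti10: expected to show a sterling sign

% Slot "3C differs between TEX TEXT and TEX TEXT WITHOUT F-LIGATURES:
{\upshape \char"3C}\quad % cmr10: expected to show an inverted exclamation mark
{\scshape \char"3C}      % cmcsc10: expected to show a less-than sign

% The text commands hide the difference by picking a suitable shape:
{\itshape \textdollar\ \textsterling}
\end{document}
```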

>Maybe it would be a good idea to have subdivisions for (TeX) encodings?
>Then the problem would vanish by having T1(.normal) and T1.tt that only
>differ in the used ligatures ...

There is probably nothing that can be done about OT1 as it is far too
messy, but it would be rather easy to define a typewriter companion, T1T
say, of T1. ectt wouldn't have that encoding, but the necessary font could
easily be created as a VF.


At 00.06 +0200 01-06-24, Sebastian Rahtz wrote:
>and indeed, that shows the mess Knuth got us into, by mixing so many
>concepts in a single place.

It's not so much Knuth who is to blame here (even though it wasn't a good
idea to use ligatures for these things to start with, and he ought to have
provided some useful mechanism for escaping unwanted ligatures) as the
authors of NFSS, I suspect. Neither in PlainTeX nor in NFSS1 is there any
encoding concept---instead it is up to the user to make sure that things
work as intended---but in NFSS2 (which got into LaTeX2e) there is an
encoding concept (undoubtedly an improvement) which got flawed by the
identification of the five different OT1 encodings.

At 01.50 +0200 01-06-24, Sebastian Rahtz wrote:
>Berthold K.P. Horn writes:
> > Knuth used the "ligature" mechanism for three or so purposes:
> >
> > (i) real ligatures e.g.  f + i => fi
> > (ii) trick to access unusual glyphs not in plain ASCII keyboard
> > e.g. quoteleft + exclam => exclamdown
> > (iii) trick to access various quotation marks, and dashes.
>
>right. he should have stuck to (i), or at least let us control (ii)
>and (iii) from TeX

Agreed, but there is nothing that can be done about that now. We're stuck
with the convention, and the best thing we can do is avoid incorporating it
into future encodings.
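
For reference, the three uses look like this in practice. This is a sketch
of mine, assuming a T1 roman font and a traditional TeX engine such as
pdfTeX, where an empty group between two characters keeps the font's
ligature program from seeing them as adjacent. That workaround is the
nearest thing to "control from TeX" that we have, and it has to be applied
by hand at every occurrence.

```latex
\documentclass{article}
\usepackage[T1]{fontenc}
\begin{document}
% (i) a real ligature, then the same letters kept apart
fi \quad f{}i

% (ii) keyboard tricks for glyphs missing from ASCII
!` \quad ?`

% (iii) keyboard tricks for quotation marks and dashes
``quoted'' \quad -- \quad ---

% Breaking adjacency when the substitution is unwanted
-{}- \quad `{}`
\end{document}
```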

>I feel an attack of wanting to use LY1 coming on.

Now that wouldn't do you any good, would it? If you accept the principle
that some ligatures are required by the encoding then it applies equally
well to both LY1 and T1 with the same consequences, as both have the same
ligatures in non-typewriter fonts. If you don't accept the principle then
there's no greater need to follow it in LY1 than in T1.

At 01.28 +0200 01-06-24, Uwe Koloska wrote:
>On Saturday, 23 June 2001 14:12, you wrote:
>>
>> Any monospaced font installed by fontinst yields the
>> same result:  With OT1 encoding, "--" gives "--",
>> with T1 it gives "-".
>
>But what about this excerpt from 8r.etx:
>
>\setslot{hyphen}
>   \ifisint{typewriter}\then\Else
>      \ligature{LIG}{hyphen}{endash}
>%       \ligature{LIG}{alternate-hyphen}{endash}
>   \Fi
>   \comment{The hyphen `-'.}
>\endsetslot
>
>Not that I really understand what's going on in fontinst ;-)
>
>But, based on the answer to the question "are ligatures part
>of an encoding":  wouldn't it be a good idea to copy this behaviour
>to t1.etx?

I thought so when I wrote it, but now I tend to doubt it, for the reasons
mentioned above. It is also worth noticing that the test used to be for the
monowidth integer (which is set if the original AFM had IsFixedPitch true),
and in many cases the test should still be for this, as _all_ glyphs in
such fonts really have the same width. In particular, the letters in the fi
and fl glyphs are half the width of the normal letters. Furthermore the
endash is usually a dash half a character wide, but centered in the normal
character width. In comparison with the hyphen it looks positively bizarre.
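
For comparison, the older form of the test would read roughly as follows.
This is only a sketch, obtained by swapping the integer name in the 8r.etx
excerpt quoted above; it is not a copy of any released ETX file, and it is
a fragment meant to sit inside an existing encoding definition.

```tex
\setslot{hyphen}
   % Skip the ligature when the AFM said IsFixedPitch true,
   % i.e. when the monowidth integer has been set.
   \ifisint{monowidth}\then\Else
      \ligature{LIG}{hyphen}{endash}
   \Fi
   \comment{The hyphen `-'.}
\endsetslot
```

The difference is in what triggers the exception: monowidth is a property
of the font being installed, whereas typewriter is a choice made by
whoever runs the installation.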

Lars Hellström

PS: The aforementioned directory also contains a file t1draft.etx, which is
an attempt to specify the T1 encoding.

PPS: I thought I sent this yesterday, but apparently I didn't. Oh well.