[XeTeX] LaTeX Font Warning: Encoding `OT1' has changed to `U' for symbol font

Jonathan Kew jonathan_kew at sil.org
Sat May 13 10:21:32 CEST 2006


On 13 May 2006, at 8:49 am, Will Robertson wrote:

> I haven't pushed for the creation of our very own font encoding for
> XeTeX+fontspec for the reason that "font encoding" doesn't mean the
> same thing when you're using unicode fonts. In LaTeX proper, a font
> encoding is supposed to mean (although it doesn't always, to the
> developers' chagrin) that the font contains *exactly* some set of
> glyphs. Then,
> if a character is requested that cannot be typeset with this font,
> corrective measures are taken.
>
> But consider what this means: every non-ascii character is active and
> assigned a "LaTeX Internal Character Representation" (or something,
> can't ever remember what the acronym stands for). E.g. "é" -> \eacute
> -> slot XXX in font with encoding YYY. With XeTeX, we know we're
> always using unicode fonts. So this step is largely unnecessary. "é"
> in the source will be typeset directly by XeTeX.
>
> In order to fulfil the LaTeX paradigm, we'd need to set up a mapping
> for every character in unicode to a control sequence, back to a glyph
> in a unicode font. We'd then need to start representing fonts by
> which subset of unicode they contain, and this will never be fully
> consistent across fonts.

Right.... we really don't want to go down that road (IMO). Bear in  
mind that there are around 100,000 characters defined in Unicode, and  
growing.... it just doesn't make sense to try to have an Internal  
Character Representation for each, separate from the character code  
itself.

The whole font encoding system arose because in the byte-oriented  
world, it was necessary to have fonts that supported many different  
collections of glyphs (as no single 256-glyph collection could  
include everything that people wanted to typeset). So this meant that  
a given byte (character) code meant different things, depending on  
the font associated with it; and that the code required to access a  
given glyph depended on which font you were using. So \eacute might need  
to be output as code 140 in one font, code 200 in another, and  
composed using \accent in a third.
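To make that font-dependent layer concrete, here is roughly how the standard LaTeX encoding definition files express it (a sketch; the slot numbers follow the stock OT1/T1 definitions):

```latex
% How LaTeX's encoding layer resolves \'e differently per encoding:
\DeclareTextAccent{\'}{OT1}{19}        % OT1: compose with \accent, accent in slot 19
\DeclareTextComposite{\'}{T1}{e}{233}  % T1: use the precomposed glyph in slot 233
```

So the same input, \'e, yields an \accent construction under OT1 but a single glyph under T1 -- exactly the per-font variability described above.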

Moving to Unicode as the single character encoding standard means  
that \eacute is always U+00E9, and this entire font-dependent mapping  
layer can go away.

Actually, in one sense it doesn't go away, but it moves from within  
TeX to inside the font. Glyphs in fonts are *really* rendered via  
glyph IDs (TrueType) or names (Type 1) -- but the mapping of Unicode  
character codes (universal) to glyphs (font-specific) is defined  
within the font itself. So the text processing software doesn't need  
to be concerned with it; we simply pass Unicode character codes to  
the font renderer.
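As a toy model of that in-font mapping (the glyph IDs below are made up for illustration; real values live in each font's cmap table and differ from font to font):

```python
# Toy model of a font's cmap table: Unicode code point -> font-internal
# glyph ID. The renderer consults this table; TeX never needs to see it.
cmap_font_a = {0x00E9: 142}  # é -> glyph 142 in one hypothetical font
cmap_font_b = {0x00E9: 517}  # same character, different glyph ID in another

def glyph_for(cmap, ch):
    """Return the glyph ID for a character, or None (a 'lost character')."""
    return cmap.get(ord(ch))

print(glyph_for(cmap_font_a, "é"))  # 142
print(glyph_for(cmap_font_b, "é"))  # 517
print(glyph_for(cmap_font_a, "ß"))  # None: this font has no mapping for ß
```

The None case is what a missing-glyph warning (see below) would report.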

> In the scheme above, you could perform error checking in the stage
> of going from command name to unicode glyph, but it would be
> terribly inefficient (consider that Jonathan Kew hasn't wanted to
> implement this *in the source*. This would be much slower.)

Actually, I'm considering a change here, to support generating  
warnings for character codes that are not supported by the font in  
use. This would be under the control of \tracinglostchars, just like  
TeX's warnings for legacy TFM-based fonts. (Though as most of the  
current font encodings fill all 256 slots, people probably aren't  
used to seeing those messages very often. IIRC, they're only written  
to the log by default, not to the console, so they're rather easy to  
overlook anyway.)
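For reference, the relevant knobs in a document would look something like this (\tracingonline echoes diagnostics to the terminal as well as the log, which addresses the "easy to overlook" problem):

```latex
\tracinglostchars=1  % write "Missing character: ..." notes to the log
\tracingonline=1     % echo diagnostics to the terminal, not just the log
```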

JK


