[XeTeX] How to use EC font encoding in XeTeX?
jonathan_kew at sil.org
Sat Jun 10 10:35:41 CEST 2006
On 9 Jun 2006, at 11:25 pm, Mojca Miklavec wrote:
> Just wondering: how does XeTeX deal with hyphenation when working with
> EC-encoded fonts?
More seriously... this is a fundamental problem in TeX, due to the
lack of a character/glyph distinction. Text-processing operations
such as hyphenation are carried out in the same code-space as font
access, which means that TeX is dealing with text in a font-
dependent, glyph-encoded way. Packages such as inputenc or extensions
like encTeX allow transcoding between codes in an external input and
"character" codes within TeX, but there's no way to do such a mapping
between the "character" codes TeX processes (e.g., hyphenates) and
the font access codes that end up in the DVI (or PDF) file.
So the result you get will depend on the encoding used by the
patterns. Hyphenation patterns are necessarily expressed in terms of
a specific encoding, and in TeX that has to be the "font encoding" --
the character codes that are being used by TeX to interact with
fonts, whether that means TFM files or OpenType fonts. You can use
TeX macros to remap codes between the input file and the application
of hyphenation patterns, but not between the application of patterns
and the font access codes.
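
To make this concrete, here is a toy model in Python. It is a sketch, not how TeX itself is implemented. It treats patterns as runs of numeric codes and shows that a fragment stored in one font encoding is not found in the same word typed in another. The slots for ß are the T1 and OT1 positions cited in this thread plus the Unicode code point. The function names and the fragment "aß" are made up for illustration.

```python
# Toy model: hyphenation patterns are keyed by numeric codes, not by abstract characters.
# Slot of sharp s in three encodings (T1 and OT1 as cited in this thread).
SHARP_S = {"T1": 0xFF, "OT1": 0x19, "Unicode": 0xDF}

def encode(word, encoding):
    """Return the tuple of codes seen for `word` in `encoding`.

    Simplification: every letter except sharp s keeps its ASCII code.
    """
    return tuple(SHARP_S[encoding] if ch == "\u00df" else ord(ch) for ch in word)

def knows_fragment(pattern_fragments, word_codes):
    """True if any stored fragment occurs as a contiguous run in the word's codes."""
    n = len(word_codes)
    return any(
        word_codes[i:i + len(frag)] == frag
        for frag in pattern_fragments
        for i in range(n - len(frag) + 1)
    )

# A hypothetical pattern fragment "aß", stored in T1 codes.
t1_fragments = {encode("a\u00df", "T1")}

print(knows_fragment(t1_fragments, encode("stra\u00dfe", "T1")))   # True
print(knows_fragment(t1_fragments, encode("stra\u00dfe", "OT1")))  # False
```

The second lookup fails only because ß sits in a different slot, even though the text is the same word.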
So any given set of patterns loaded in TeX is specific to a certain
*font* encoding, and will not work correctly with other encodings.
Some of the standard patterns (e.g., German) fudge this issue
slightly, by including duplicate patterns for two encodings (e.g.,
patterns with ^^ff [T1] for ß, and also patterns using ^^Y [OT1] for
ß). They get away with this because in T1 text, ^^Y will not occur in
words, and in OT1 text, ^^ff will not occur, so the extra patterns
never match in practice. But it's a hack, and is fundamentally a dead-
end approach that cannot extend to cover the full range of possible
font encodings, only a few cases with very limited differences.
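
A rough Python sketch of that duplication trick, under the simplifying assumption that T1 and OT1 differ only in the ß slot (the helper names and sample fragments are hypothetical, not taken from the real German pattern file):

```python
# Toy model of the "duplicate patterns" trick: store each sharp-s pattern twice.
T1_SHARP_S, OT1_SHARP_S = 0xFF, 0x19  # ^^ff and ^^Y

def pattern_codes(text, sharp_s_code):
    """Encode a pattern fragment, placing sharp s at the given slot."""
    return tuple(sharp_s_code if ch == "\u00df" else ord(ch) for ch in text)

def build_dual_patterns(fragments):
    """Return one pattern set holding both the T1 and the OT1 variant of each fragment."""
    dual = set()
    for frag in fragments:
        dual.add(pattern_codes(frag, T1_SHARP_S))
        dual.add(pattern_codes(frag, OT1_SHARP_S))
    return dual

dual = build_dual_patterns(["a\u00df", "\u00dfe"])  # hypothetical fragments
print(len(dual))  # 4
```

Fragments without ß encode identically in both variants, so the set stores them once. Every additional encoding to be covered would add another variant of each affected pattern, which is why this does not scale.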
The problem becomes much more obvious in cases such as Russian, where
the hyphenation files have to be configured at format-compile time to
load according to the particular Cyrillic font encoding you want to
use, and if you subsequently decide to use a different encoding, they
simply won't work right.
This cannot be solved at the TeX macro level, unless we were to build
format files with patterns loaded multiple times under different
\language codes, and then use these different codes according to the
font encoding chosen at runtime.
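
The bookkeeping such a format would need can be sketched as follows. This is Python rather than TeX macros and purely illustrative: the class is hypothetical, and T2A, T2B and X2 are used only as example Cyrillic font encodings.

```python
# Toy registry: one \language number per (language, font encoding) pair.
class PatternRegistry:
    def __init__(self):
        self._numbers = {}

    def load(self, language, encoding):
        """Register patterns for this pair and return its language number."""
        key = (language, encoding)
        if key not in self._numbers:
            self._numbers[key] = len(self._numbers)
        return self._numbers[key]

    def select(self, language, encoding):
        """Return the language number to use at runtime; KeyError if never loaded."""
        return self._numbers[(language, encoding)]

registry = PatternRegistry()
for enc in ("T2A", "T2B", "X2"):  # example Cyrillic font encodings
    registry.load("russian", enc)

print(registry.select("russian", "T2B"))  # 1
```

The cost is visible even in the toy: the same language occupies one slot per encoding, and an encoding that was not loaded when the format was built cannot be selected later.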
I believe that a better way forward is to move away from processing
text in the old custom font encodings as quickly as possible, towards
use of Unicode as *the* standard encoding for text data. Obviously,
we can't achieve that in an instant, but it's the long-term answer,
and that's the goal that I'd prefer to put effort towards. So in the
XeTeX formats that I create, I aim to load patterns as Unicode
wherever possible, so that hyphenation will work correctly for
Unicode text. Because Unicode and Latin1 character codes match for
letters with codes < 256, the patterns should also behave correctly
when used with Latin1-encoded text/fonts, which is adequate for many
of the major western European languages. But they can't
simultaneously support all the other, differently-arranged font
encodings. You can, of course, load patterns in XeTeX using other
encodings if you wish; but what you can't have is a single set of
patterns loaded in the format file that correctly hyphenate text in
every font encoding at once.
It would be possible to extend TeX so as to support some kind of
mapping between the "character" codes in the text to be hyphenated
and the pattern codes; perhaps a new table \unicode, similar to
\lccode, could map font-encoded characters (in the text) to Unicode
values (in the hyphenation tables) on the fly. This could then be
changed at runtime to match the current font encoding. This would
need support from LaTeX/ConTeXt packages as well as an extension in
the program itself, though; I'm not sure it's the most worthwhile way
to invest effort.
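
As a sketch of what such a table would do (Python, illustrative only; no such primitive exists, and the table contents here are an assumption covering just the ß slot):

```python
# Toy model of the proposed mapping: font-encoded codes -> Unicode before pattern lookup.
def make_unicode_table(overrides):
    """Identity mapping for codes 0..255, with per-encoding overrides."""
    table = {code: code for code in range(256)}
    table.update(overrides)
    return table

# Only the sharp-s slot is remapped here; a real table would need many more entries.
T1_TO_UNICODE = make_unicode_table({0xFF: 0xDF})
OT1_TO_UNICODE = make_unicode_table({0x19: 0xDF})

def to_pattern_codes(word_codes, table):
    """Translate font-encoded codes into the codes used by the (Unicode) patterns."""
    return tuple(table[c] for c in word_codes)

t1_word = (ord("a"), 0xFF)   # "aß" in T1
ot1_word = (ord("a"), 0x19)  # "aß" in OT1
print(to_pattern_codes(t1_word, T1_TO_UNICODE) == to_pattern_codes(ot1_word, OT1_TO_UNICODE))  # True
```

With the table swapped to match the current font encoding, both inputs reach the pattern matcher as the same Unicode sequence, so a single set of Unicode patterns would serve both.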