[XeTeX] How to use EC font encoding in XeTeX?

Sat Jun 10 10:35:41 CEST 2006

On 9 Jun 2006, at 11:25 pm, Mojca Miklavec wrote:

> Hello,
>
> Just wondering: how does XeTeX deal with hyphenation when working with
> EC-encoded fonts?

Incorrectly!

More seriously... this is a fundamental problem in TeX, due to the  
lack of a character/glyph distinction. Text-processing operations  
such as hyphenation are carried out in the same code-space as font  
access, which means that TeX is dealing with text in a font- 
dependent, glyph-encoded way. Packages such as inputenc or extensions  
like encTeX allow transcoding between codes in an external input and  
"character" codes within TeX, but there's no way to do such a mapping  
between the "character" codes TeX processes (e.g., hyphenates) and  
the font access codes that end up in the DVI (or PDF) file.

So the result you get will depend on the encoding used by the  
patterns. Hyphenation patterns are necessarily expressed in terms of  
a specific encoding, and in TeX that has to be the "font encoding" --  
the character codes that are being used by TeX to interact with  
fonts, whether that means TFM files or OpenType fonts. You can use  
TeX macros to remap codes between the input file and the application  
of hyphenation patterns, but not between the application of patterns  
and the font access codes.

So any given set of patterns loaded in TeX is specific to a certain
*font* encoding, and will not work correctly with other encodings.
Some of the standard patterns (e.g., German) fudge this issue  
slightly, by including duplicate patterns for two encodings (e.g.,  
patterns with ^^ff [T1] for ß, and also patterns using ^^Y [OT1] for  
ß). They get away with this because in T1 text, ^^Y will not occur in  
words, and in OT1 text, ^^ff will not occur, so the extra patterns  
never match in practice. But it's a hack, and is fundamentally a dead- 
end approach that cannot extend to cover the full range of possible  
font encodings, only a few cases with very limited differences.
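
As a rough illustration of why the duplicated patterns never collide,
here is a small Python sketch (not TeX, and not how the German pattern
files are actually built). It decodes the two ^^ notations mentioned
above and checks a pattern fragment against the same word stored under
each encoding. The word, the fragment "aße" and the helper names are
made up for the example; only the two ß slots (^^ff in T1, ^^Y in OT1)
come from the text above.

```python
def caret_code(token: str) -> int:
    """Decode TeX's ^^ notation into a character code (sketch: hex and 7-bit forms)."""
    body = token[2:]
    hex_digits = "0123456789abcdef"
    if len(body) == 2 and all(c in hex_digits for c in body):
        return int(body, 16)      # ^^ff -> 0xFF
    return ord(body) ^ 0x40       # ^^Y  -> 0x59 xor 0x40 = 0x19


# Slot of the sharp s in each font encoding, taken from the notation above.
ESZETT = {"T1": caret_code("^^ff"), "OT1": caret_code("^^Y")}


def font_codes(word: str, encoding: str) -> list:
    """Codes TeX would hyphenate and send to the font (ASCII letters plus sharp s only)."""
    return [ESZETT[encoding] if ch == "\u00df" else ord(ch) for ch in word]


def contains(codes: list, fragment: list) -> bool:
    """True if `fragment` occurs as a contiguous run inside `codes`."""
    n = len(fragment)
    return any(codes[i:i + n] == fragment for i in range(len(codes) - n + 1))


# Hypothetical pattern fragment "aße", stored with the T1 code for the sharp s.
t1_fragment = font_codes("a\u00dfe", "T1")

print(contains(font_codes("stra\u00dfe", "T1"), t1_fragment))   # True
print(contains(font_codes("stra\u00dfe", "OT1"), t1_fragment))  # False
```

Real patterns also carry inter-letter digits, which are left out here;
the point is only that the T1-coded fragment can never match OT1-coded
text, and vice versa.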

The problem becomes much more obvious in cases such as Russian, where  
the hyphenation files have to be configured at format-compile time to  
load according to the particular Cyrillic font encoding you want to  
use, and if you subsequently decide to use a different encoding, they  
simply won't work right.

This cannot be solved at the TeX macro level, unless we were to build  
format files with patterns loaded multiple times under different  
\language codes, and then use these different codes according to the  
font encoding chosen at runtime.
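
The bookkeeping this would involve can be sketched in Python (purely
as a model; none of this is an existing TeX or LaTeX interface). The
pattern strings, the language name and the function names are
assumptions made for the example, and real patterns would also contain
inter-letter digits.

```python
# Slot of the sharp s in each font encoding (T1 and OT1 only, as a minimal case).
ESZETT = {"T1": 0xFF, "OT1": 0x19}


def transcode(pattern: str, encoding: str) -> tuple:
    """Turn a Unicode pattern string into font-encoded codes (ASCII plus sharp s only)."""
    return tuple(ESZETT[encoding] if ch == "\u00df" else ord(ch) for ch in pattern)


def build_format(master_patterns, encodings, language="german"):
    """Model of format-build time: load the patterns once per encoding,
    each copy under its own language number."""
    tables, numbers = {}, {}
    for number, enc in enumerate(encodings):
        tables[number] = {transcode(p, enc) for p in master_patterns}
        numbers[(language, enc)] = number
    return tables, numbers


# Made-up pattern fragments, loaded twice: once for T1 and once for OT1.
tables, numbers = build_format(["a\u00dfe", "ste"], ["T1", "OT1"])


def select_language(language: str, encoding: str) -> int:
    """Model of runtime: pick the language number matching the current font encoding."""
    return numbers[(language, encoding)]


print(select_language("german", "T1"))   # 0
print(select_language("german", "OT1"))  # 1
```

The cost is visible even in the model: every additional font encoding
means another full copy of the patterns in the format.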

I believe that a better way forward is to move away from processing  
text in the old custom font encodings as quickly as possible, towards  
use of Unicode as *the* standard encoding for text data. Obviously,  
we can't achieve that in an instant, but it's the long-term answer,  
and that's the goal that I'd prefer to put effort towards. So in the  
XeTeX formats that I create, I aim to load patterns as Unicode  
wherever possible, so that hyphenation will work correctly for  
Unicode text. Because Unicode and Latin1 character codes match for  
letters with codes < 256, the patterns should also behave correctly  
when used with Latin1-encoded text/fonts, which is adequate for many  
of the major western European languages. But they can't  
simultaneously support all the other, differently-arranged font  
encodings. You can, of course, load patterns in XeTeX using other  
encodings if you wish; but what you can't have is a single set of  
patterns loaded in the format file that correctly hyphenates text in
conflicting encodings.
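
The Latin1 claim is easy to check outside TeX. The following Python
sketch uses only the standard latin-1 codec; the helper name is made
up, and the T1 slot for ß (0xFF) is the one discussed earlier.

```python
def latin1_matches_unicode(code: int) -> bool:
    """True if Latin-1 byte `code` denotes the Unicode character with the same number."""
    return bytes([code]).decode("latin-1") == chr(code)


# Holds for the whole 8-bit range, so Unicode-coded patterns also fit Latin-1 text.
assert all(latin1_matches_unicode(c) for c in range(256))

# The sharp s is U+00DF in Unicode and Latin-1, but T1 puts it at 0xFF,
# which in Unicode and Latin-1 is a different letter (y with diaeresis).
T1_ESZETT = 0xFF
print(hex(ord("\u00df")))             # 0xdf
print(chr(T1_ESZETT) == "\u00df")     # False
print(chr(T1_ESZETT) == "\u00ff")     # True
```

So a pattern that mentions ß by its Unicode value matches Latin1 text
but not T1 text, even though both are 8-bit encodings.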

It would be possible to extend TeX so as to support some kind of  
mapping between the "character" codes in the text to be hyphenated  
and the pattern codes; perhaps a new table \unicode, similar to  
\lccode, could map font-encoded characters (in the text) to Unicode  
values (in the hyphenation tables) on the fly. This could then be  
changed at runtime to match the current font encoding. This would  
need support from LaTeX/ConTeXt packages as well as an extension in  
the program itself, though; I'm not sure it's the most worthwhile way  
to invest effort.
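
To show what such a table would do, here is a Python model of the
idea. It is only a sketch: no \unicode primitive exists, and the table
contents, the pattern fragment and the function names are assumptions
made for the example. The font codes themselves are left untouched;
the mapping is applied only to form the key used for pattern lookup.

```python
# Hypothetical per-encoding tables: font code -> Unicode code point.
# Only the sharp s is listed; codes absent from a table map to themselves.
TO_UNICODE = {
    "T1": {0xFF: 0x00DF},
    "OT1": {0x19: 0x00DF},
}

# Made-up pattern fragment, stored once, in Unicode.
UNICODE_PATTERNS = {"a\u00dfe"}


def hyphenation_key(font_codes, encoding):
    """Map font codes to Unicode for pattern lookup only; the input list is not modified."""
    table = TO_UNICODE[encoding]
    return "".join(chr(table.get(c, c)) for c in font_codes)


def matches_any(font_codes, encoding):
    """True if any Unicode pattern fragment occurs in the remapped word."""
    key = hyphenation_key(font_codes, encoding)
    return any(fragment in key for fragment in UNICODE_PATTERNS)


t1_word = [115, 116, 114, 97, 0xFF, 101]    # "straße" as T1 font codes
ot1_word = [115, 116, 114, 97, 0x19, 101]   # "straße" as OT1 font codes

print(matches_any(t1_word, "T1"))    # True
print(matches_any(ot1_word, "OT1"))  # True
```

Switching the active table whenever the font encoding changes is the
part that would need cooperation from the macro packages.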

JK