[XeTeX] xelatex: problem with dumping format

Fri Jun 4 16:04:57 CEST 2004

On 4 Jun 2004, at 2:11 pm, Gilles Pérez-Lambert wrote:

> Hello,
>
> I tried to create the format for xelatex (with fmtutil) after
> installing it in my texlive~2003 distribution and I got:
>         .
>         .
> ....many things...
> (/usr/local/texlive2003/texmf/tex/generic/hyphen/hrhyph.tex)
> \l@hungarian=\language16
>
> (/usr/local/texlive2003/texmf/tex/generic/hyphen/huhyph.tex
> Hungarian hyphenation patterns
> ! Nonletter.
> l.51 1b<C3><AF><C2><88><C2><B1>
>           be
> ?
> ! Emergency stop.
> l.51 1b<C3><AF><C2><88><C2><B1>
>           be
> End of file on the terminal!
>         .
>         .
> For now, I suppressed the Hungarian and the Russian hyphenation patterns
> in language.dat for xelatex to work.
>
> Any idea?

This is a known issue, mentioned on the XeTeX FAQ page at:

	http://scripts.sil.org/xetex_faq

See the question "XeTeX is installed, but there's no xelatex.xfmt 
format so I can't use it with LaTeX files. And fmtutil can't seem to 
build this format. What's wrong?".

XeTeX reads all input files as Unicode text. This means that:

(a) Plain ASCII files are fine, because they're also valid UTF-8-encoded 
Unicode.

(b) 8-bit files with non-ASCII characters cannot be read, in general, 
and even if they can be read, they probably won't be interpreted as 
expected.
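
Points (a) and (b) can be illustrated outside TeX with a small Python sketch. This is only an illustration of UTF-8 decoding in general, not of XeTeX's actual reader; the helper name `readable_as_utf8` and the sample strings are made up for the example.

```python
def readable_as_utf8(data: bytes) -> bool:
    """Return True if the byte string is well-formed UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# (a) Pure ASCII is always valid UTF-8.
ascii_bytes = b"hyphenation"

# (b) A Latin-1 file with a literal 8-bit character: the byte 0xE9
# looks like the start of a multi-byte UTF-8 sequence that never completes.
latin1_bytes = "caf\u00e9".encode("latin-1")      # b"caf\xe9"

# (b, second half) Some 8-bit byte pairs happen to be valid UTF-8,
# but then decode to a different character than the 8-bit author meant.
# In Latin-1 these two bytes are two characters (U+00C3, U+00A9).
lucky_bytes = "\u00c3\u00a9".encode("latin-1")    # b"\xc3\xa9"
misread = lucky_bytes.decode("utf-8")             # one character, U+00E9

print(readable_as_utf8(ascii_bytes))
print(readable_as_utf8(latin1_bytes))
print(len(misread), hex(ord(misread)))
```

The third case is the "probably won't be interpreted as expected" situation: no error is raised, but two 8-bit characters silently become one Unicode character.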

There are some additional considerations that apply to some of the 
multilingual hyphenation patterns (as well as to your input text files, 
of course):

(c) Many of the 8-bit codes in TeX-Latin1 (Cork) encoding correspond 
directly to Unicode codepoints. Therefore, if a file uses these 
character codes, expressed as ^^xx sequences, it will work fine. (But 
if it uses the literal 8-bit characters, XeTeX will try to interpret 
them as UTF-8 sequences, and fail.)

(d) There are exceptions to (c); in particular, the codes 0x80..0xBF 
don't match, nor do 0xDF and 0xFF, if I recall correctly. So the fact 
that such a multilingual file can be read by the program doesn't 
necessarily mean that it will be correct for the Unicode environment.
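
As a rough sketch of what (c) means in practice, the following Python helper rewrites literal 8-bit bytes as TeX's ^^xx notation (two lowercase hex digits), leaving ASCII alone. The function name `to_caret_notation` is hypothetical and this is not the procedure used for the adapted pattern files. It only makes the file readable: per (d), whether each resulting code is the right Unicode code point still has to be checked separately.

```python
def to_caret_notation(data: bytes) -> str:
    """Rewrite bytes >= 0x80 as TeX ^^xx sequences; keep ASCII unchanged."""
    pieces = []
    for byte in data:
        if byte < 0x80:
            pieces.append(chr(byte))
        else:
            # TeX's ^^xx form requires lowercase hexadecimal digits.
            pieces.append("^^%02x" % byte)
    return "".join(pieces)


# A hypothetical 8-bit pattern line containing the literal byte 0xE9.
raw_line = b"1b\xe9be"
print(to_caret_notation(raw_line))
```

The output is pure ASCII, so it is also valid UTF-8 and can be read without triggering a decoding error.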

I have adapted some of the pattern files that currently give trouble, 
including Hungarian and Russian, to be readable in XeTeX and to load 
the correct Unicode-encoded patterns; I expect these will be included 
with the next package I release. And they'll provide a model for how 
others can be updated, too. (I've tried to do this in such a way that 
the same files can still be used in standard TeX as well as in XeTeX, 
despite the different encodings in use.)

Note, however, that loading correct Unicode patterns will NOT give the 
expected hyphenation if you try to use them in conjunction with text 
in some other encoding! XeTeX is really designed to be purely a Unicode 
system; it does try to continue working with older non-standard 
encodings, to the extent that these can be treated as though they were 
simply Unicode values re-used, but a mixed-encoding world is a messy 
place to live.
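
To see why Unicode patterns cannot match text in a legacy encoding, compare the character codes for the same word in both. The sketch below uses KOI8-R purely as an illustrative 8-bit Cyrillic encoding (that choice is an assumption, not something stated above) and an arbitrary sample word.

```python
# The same Russian word, once as Unicode code points and once as
# the byte values it would have in an 8-bit KOI8-R file.
word = "\u043f\u0440\u0438\u0432\u0435\u0442"

unicode_codes = [ord(ch) for ch in word]
legacy_codes = list(word.encode("koi8_r"))

print([hex(c) for c in unicode_codes])
print([hex(c) for c in legacy_codes])

# Patterns keyed on the first list can never match text that
# presents the second list of codes to the hyphenation routine.
print(set(unicode_codes).isdisjoint(legacy_codes))
```

Every Unicode value here lies in the Cyrillic block (U+0400 and up), while every legacy value is a single byte, so the two sets of codes do not overlap at all.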

> By the way, is there a way to have babel work with xetex?

I know essentially nothing about babel, but my impression is that it 
is, partly at least, a solution for working with multiple input and 
font encodings in the legacy 8-bit world, and so I suspect the marriage 
of babel with XeTeX will be an untidy affair at best.

But someone who actually knows about it, and also understands Unicode 
issues, may be able to answer more fully.

Hope this is helpful,

Jonathan