[tex-live] packages with characters > 127

Ulrike Fischer news3 at nililand.de
Tue Dec 29 09:59:48 CET 2009

Am Mon, 28 Dec 2009 19:57:50 +0100 schrieb Manuel Pégourié-Gonnard:

>> If the file is 8-bit-encoded and contains non-ASCII characters, xetex will
>> probably stumble over an invalid sequence. In this case you get a
>> message:
>> !Invalid UTF-8 byte or sequence at line 4 replaced by U+FFFD.!
>> and xetex continues to read this file in the so-called «bytes-mode».

> Is it still true (the "bytes mode")? I seem to remember that it was
> dropped at some point (so that XeTeX just continues in utf-8), though
> I have no reference at hand.

You are right. It was even done during the same discussion where the
continuation in "bytes-mode" was introduced.

The discussion also describes the behaviour of luatex at the time:

> In such cases, luatex gives a "... contains an invalid utf-8 sequence"
> error, replaces the culprit with U+FFFD, and continues hoping
> to find proper utf-8 from then on.

"I've just modified the implementation in XeTeX so that it no longer
switches to "bytes" mode; it merely generates a warning message and
reads the invalid byte(s) as U+FFFD. This means that you may get
lots of warnings rather than just one, but it eliminates the
problem of "garbage" in comments affecting how the rest of the "real
data" in the file is interpreted (as seen in that ConTeXt
hyphenation file)."

So it looks as if non-ASCII chars in comments shouldn't be a problem
in luatex either.
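
For illustration only, here is a small sketch of the "replace and
continue" strategy described above. It uses Python's UTF-8 decoder as a
stand-in and is not the XeTeX or LuaTeX code; the function name
read_lenient and the sample file content are made up for the example.
It shows that a Latin-1 byte in a comment becomes U+FFFD while the
following ASCII line is read unchanged:

```python
# Stand-in for "replace invalid bytes with U+FFFD and keep reading as UTF-8".
# Assumption: Python's decoder is used here purely for illustration; it is
# not how XeTeX or LuaTeX are implemented.

def read_lenient(raw):
    """Decode raw bytes as UTF-8, mapping each invalid byte to U+FFFD.

    Returns the decoded text and the number of U+FFFD characters in it.
    """
    text = raw.decode("utf-8", errors="replace")
    return text, text.count("\ufffd")


# Hypothetical file content: a Latin-1 encoded comment, then ASCII data.
sample = "% P\u00e9gouri\u00e9\n\\hyphenation{ta-ble}\n".encode("latin-1")

text, replaced = read_lenient(sample)
print(replaced)            # number of bytes that were replaced
print(text.splitlines())   # the second line is untouched
```

Each stray 8-bit byte would correspond to one warning in the engines,
which matches the "lots of warnings rather than just one" remark in the
quote.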

Ulrike Fischer 

