[XeTeX] handling malformed UTF-8 input
Jonathan Kew
jonathan_kew at sil.org
Thu Feb 21 19:37:36 CET 2008
On 21 Feb 2008, at 10:50 am, Taco Hoekwater wrote:
>
> Will Robertson wrote:
>> On 21/02/2008, at 8:42 PM, Jonathan Kew wrote:
>>
>>> What do others think about this -- should "invalid UTF-8 byte
>>> sequence" be an error rather than a warning and fallback?
>
> In such cases, luatex gives a "... contains an invalid utf-8 sequence"
> error, replaces the culprit with U+FFFD, and continues hoping
> to find proper utf-8 from then on.
I've just modified the implementation in XeTeX so that it no longer
switches to "bytes" mode; it merely generates a warning message and
reads the invalid byte(s) as U+FFFD. This means that you may get lots
of warnings rather than just one, but it eliminates the problem of
"garbage" in comments affecting how the rest of the "real data" in the
file is interpreted (as seen in that ConTeXt hyphenation file).
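
As a rough illustration of the behaviour described above (this is not XeTeX's actual code, just a minimal Python sketch), the following decodes a byte string as UTF-8, prints one warning per invalid sequence, substitutes U+FFFD, and carries on decoding the rest as UTF-8. The names warn_and_replace, "warn_replace" and read_lenient are made up for this example, and the wording of the warning is an assumption.

```python
import codecs
import sys


def warn_and_replace(err):
    """Hypothetical error handler: warn, substitute U+FFFD, keep decoding."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    # One warning per invalid sequence, so a messy file may produce many.
    print(
        f"warning: invalid UTF-8 byte sequence {bad.hex()} "
        f"at offset {err.start}",
        file=sys.stderr,
    )
    # Resume right after the offending byte(s); the decoder stays in UTF-8.
    return ("\ufffd", err.end)


# Register the handler under an illustrative name (an assumption).
codecs.register_error("warn_replace", warn_and_replace)


def read_lenient(data: bytes) -> str:
    """Decode as UTF-8, mapping invalid byte(s) to U+FFFD with a warning."""
    return data.decode("utf-8", errors="warn_replace")
```

For instance, read_lenient(b"% caf\xe9\n\\real{data}") emits a single warning for the stray byte in the comment and still reads the following line as ordinary UTF-8, rather than reinterpreting the remainder of the file as raw bytes.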
I can see some attraction to making this an error rather than a
warning, but have not done this for the time being. Maybe at some
point, however.
JK