[XeTeX] handling malformed UTF-8 input

Thu Feb 21 19:37:36 CET 2008

On 21 Feb 2008, at 10:50 am, Taco Hoekwater wrote:

>
> Will Robertson wrote:
>> On 21/02/2008, at 8:42 PM, Jonathan Kew wrote:
>>
>>> What do others think about this -- should "invalid UTF-8 byte
>>> sequence" be an error rather than a warning and fallback?
>
> In such cases, luatex gives a "... contains an invalid utf-8 sequence"
> error, replaces the culprit with U+FFFD, and continues hoping
> to find proper utf-8 from then on.

I've just modified the implementation in XeTeX so that it no longer
switches to "bytes" mode; it merely generates a warning message and
reads the invalid byte(s) as U+FFFD. This means that you may get lots
of warnings rather than just one, but it eliminates the problem of
"garbage" in comments affecting how the rest of the "real data" in the
file is interpreted (as seen in that ConTeXt hyphenation file).
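
To illustrate the behaviour described above, here is a minimal Python
sketch. It is not XeTeX's actual code; the function name
decode_with_replacement and the warning text are made up for this
example. Each invalid sequence becomes U+FFFD and produces one warning,
and decoding then carries on as UTF-8 instead of falling back to bytes.

```python
def decode_with_replacement(data: bytes):
    """Decode UTF-8, replacing each invalid sequence with U+FFFD.

    Hypothetical helper for illustration only. Returns the decoded
    text plus one warning string per invalid sequence found.
    """
    pieces = []
    warnings = []
    pos = 0
    while pos < len(data):
        try:
            # Try to decode everything that is left in one go.
            pieces.append(data[pos:].decode("utf-8"))
            break
        except UnicodeDecodeError as err:
            # err.start and err.end are offsets into the slice data[pos:].
            pieces.append(data[pos:pos + err.start].decode("utf-8"))
            pieces.append("\ufffd")
            warnings.append(
                f"invalid UTF-8 byte sequence at offset {pos + err.start}"
            )
            # Skip the bad bytes and keep reading as UTF-8.
            pos += err.end
    return "".join(pieces), warnings


# A Latin-1 byte (0xE9) in a comment, followed by valid UTF-8 data.
sample = b"% caf\xe9 comment\n\\real data: \xc3\xa9\n"
text, warnings = decode_with_replacement(sample)
for message in warnings:
    print("warning:", message)
```

The stray byte in the comment costs one warning and one replacement
character, and the valid UTF-8 on the following line is still read
correctly.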

I can see some attraction to making this an error rather than a
warning, but have not done so for the time being. I may revisit
this at some point.

JK