[XeTeX] handling malformed UTF-8 input

Jonathan Kew jonathan_kew at sil.org
Wed Feb 20 13:25:46 CET 2008


On 20 Feb 2008, at 12:07 pm, Marcin Woliński wrote:

> Jonathan,
>
>> OK, motivated by this I have just committed a patch to the xetex
>> repository that checks for valid UTF-8 sequences (when reading a file
>
> Thank you very much for the quick reaction.
>
>> as UTF-8, of course). If an invalid sequence is encountered, it will
>> give a warning (in the log, unless \tracingonline is positive), and
>> read the remainder of the file as "bytes". This will often be wrong,
>
> What exactly does "bytes" mean?  That the rest of the file will be
> interpreted as ISO 8859-1?

It means that the byte values 0..255 will be read as character codes
0..255, with no attempt to map through a codepage. (Because ISO 8859-1
corresponds to the first 256 Unicode code points, the result is the
same as interpreting the file as ISO 8859-1, except that strictly
speaking I don't think ISO 8859-1 defines the "C1 Controls" range
0x80..0x9F, leaving this to standards such as ISO 6429.)
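
To make the "bytes" reading concrete, here is a rough Python sketch (not
the XeTeX code itself). The names read_as_bytes and read_with_fallback
are hypothetical, and the sketch assumes the switch happens exactly at
the first invalid byte; where XeTeX actually switches within a file is
not stated here.

```python
def read_as_bytes(data: bytes) -> str:
    """Map each byte value 0..255 to the character code with the same number."""
    return "".join(chr(b) for b in data)


def read_with_fallback(data: bytes) -> tuple[str, bool]:
    """Hypothetical helper: decode as UTF-8 and, at the first invalid
    sequence, read the remainder as bytes. Returns (text, fell_back)."""
    try:
        return data.decode("utf-8"), False
    except UnicodeDecodeError as err:
        # Everything before err.start was valid UTF-8.
        head = data[:err.start].decode("utf-8")
        return head + read_as_bytes(data[err.start:]), True


# Reading as bytes gives the same characters as Python's "latin-1" codec.
all_bytes = bytes(range(256))
print(read_as_bytes(all_bytes) == all_bytes.decode("latin-1"))

# A stray 0xE9 byte in an otherwise UTF-8 file: the valid UTF-8 that
# follows it is then read byte by byte, which is why the result will
# often be wrong.
text, fell_back = read_with_fallback(b"ok \xc3\xa9 then \xe9 and \xc3\xa9")
print(repr(text), fell_back)
```

In the second example the text after the invalid byte is no longer
decoded as UTF-8, so a later, perfectly valid two-byte sequence turns
into two separate characters.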

Note in particular that this is NOT the same as the Windows "Western"  
codepage 1252, which encodes various other printable characters in  
the 0x80..0x9F range.
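
The difference can be checked with a small Python sketch, using Python's
own "cp1252" codec as a stand-in for the Windows codepage. The helper
names are hypothetical, and errors="replace" is used only because
Python's codec leaves a few bytes in this range unassigned.

```python
def as_bytes(b: int) -> str:
    """The "bytes" reading: byte value b becomes character code b."""
    return chr(b)


def as_cp1252(b: int) -> str:
    """The same byte interpreted through Python's cp1252 codec."""
    # Unassigned bytes come back as U+FFFD instead of raising an error.
    return bytes([b]).decode("cp1252", errors="replace")


# Collect every byte value where the two readings disagree.
differing = [b for b in range(256) if as_bytes(b) != as_cp1252(b)]
print(f"differing bytes: 0x{differing[0]:02X}..0x{differing[-1]:02X}")

for b in (0x80, 0x93, 0x99):
    print(f"0x{b:02X}: bytes -> U+{ord(as_bytes(b)):04X}, "
          f"cp1252 -> U+{ord(as_cp1252(b)):04X}")
```

With this codec, the two readings agree everywhere outside 0x80..0x9F
and disagree on every byte inside it.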

JK


