[XeTeX] handling malformed UTF-8 input

Tue Feb 19 18:30:46 CET 2008

On Tue, 19 Feb 2008 17:27:39 +0100, Marcin Woliński wrote:

> I'd like to report a funny problem with (mis)interpretation of malformed
> utf-8 input files.   A few days ago a user of my document classes mwcls
> 
>     \else\ifnum#1<\previous@toc@level
>         \addpenalty\@secpenalty % czy to dobra wartość?
>      \fi\fi
> 
> The file is ISO Latin-2 encoded 

I'm not sure I would describe a Latin-2-encoded file as "malformed"
UTF-8. Isn't it simply a non-UTF-8 file?

> (that is: comments include a few Latin-2
> characters, the code proper is pure ASCII) and XeTeX tries to interpret
> it as UTF-8.  The character ć (cacute) is encoded as byte 1110 0110, so
> XeTeX considers it a start of a 3-byte sequence and ignores two
> following bytes, the second of which is an endline, so the next line
> gets commented out.
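
To make the byte arithmetic above concrete, here is a small Python
sketch. It is only an illustration of the bit patterns, not XeTeX's
actual reader; the helper name utf8_lead_length is made up for this
example.

```python
def utf8_lead_length(byte: int) -> int:
    """Sequence length suggested by the bit pattern of a lead byte,
    or 0 if the pattern cannot start a sequence (e.g. 10xxxxxx)."""
    if byte < 0x80:              # 0xxxxxxx: plain ASCII
        return 1
    if (byte & 0xE0) == 0xC0:    # 110xxxxx
        return 2
    if (byte & 0xF0) == 0xE0:    # 1110xxxx
        return 3
    if (byte & 0xF8) == 0xF0:    # 11110xxx
        return 4
    return 0

# The comment line from the class file, encoded as ISO Latin-2.
line = "% czy to dobra wartość?\n".encode("iso8859_2")

cacute = line.index(0xE6)        # position of the Latin-2 byte for cacute
print(format(line[cacute], "08b"), utf8_lead_length(line[cacute]))

# A reader that trusts the lead byte blindly consumes these three bytes,
# which include the '?' and the end-of-line character.
print(line[cacute:cacute + 3])
```

The last slice is the lead byte followed by "?" and the newline, which
is why the end of the comment disappears.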

> I think however, that XeTeX could be more careful when reading malformed
> UTF-8 files.  Since continuation bytes in UTF-8 sequences have to be of
> the form 10xxxxxx it would be safer to gobble only such bytes or at
> least not to treat ASCII characters as parts of UTF-8 sequences.  That
> way the endline would be always interpreted as an endline and comments
> would always end where they should.
> Is that a change worth introducing?
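
The proposal can be sketched in a few lines of Python. This is a
minimal illustration under my own assumptions, not a patch for XeTeX:
the function name decode_lenient is hypothetical, bad input is replaced
by U+FFFD (one possible policy among several), and overlong encodings
are not rejected.

```python
REPLACEMENT = "\ufffd"

def decode_lenient(data: bytes) -> str:
    """Decode UTF-8, consuming only 10xxxxxx bytes as continuations.

    An incomplete sequence yields U+FFFD, and decoding resumes at the
    first byte that was not a continuation, so ASCII bytes such as the
    end-of-line character are never swallowed.
    """
    out = []
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:                       # ASCII passes through
            out.append(chr(b))
            i += 1
            continue
        if (b & 0xE0) == 0xC0:
            need, cp = 1, b & 0x1F
        elif (b & 0xF0) == 0xE0:
            need, cp = 2, b & 0x0F
        elif (b & 0xF8) == 0xF0:
            need, cp = 3, b & 0x07
        else:                              # stray continuation or invalid byte
            out.append(REPLACEMENT)
            i += 1
            continue
        j, got = i + 1, 0
        while got < need and j < n and (data[j] & 0xC0) == 0x80:
            cp = (cp << 6) | (data[j] & 0x3F)
            j += 1
            got += 1
        valid = (got == need and cp <= 0x10FFFF
                 and not 0xD800 <= cp <= 0xDFFF)
        out.append(chr(cp) if valid else REPLACEMENT)
        i = j                              # resume at the first unconsumed byte
    return "".join(out)

latin2 = "% czy to dobra wartość?\n\\fi\\fi\n".encode("iso8859_2")
text = decode_lenient(latin2)
```

With this policy the Latin-2 comment still ends at its own newline, so
the following line is read as code. Python's built-in
data.decode("utf-8", errors="replace") behaves in much the same way.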

On my PC (MiKTeX 2.7), XeTeX crashed because of an umlaut in a comment
in one of my styles. So I think it would be good if XeTeX at least
issued a meaningful error message when a file contains byte sequences
that cannot occur in a UTF-8-encoded file.
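
As an illustration of what such a message could contain, here is a
rough Python sketch that checks a file's bytes before they are handed
to XeTeX. The function name check_utf8 and the message format are my
own invention; XeTeX itself offers nothing like this as far as I know.

```python
from typing import Optional

def check_utf8(data: bytes, name: str = "<input>") -> Optional[str]:
    """Return None if data is valid UTF-8, else a message that names
    the line and byte offset of the first offending byte."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # Lines are counted by the newlines that precede the bad byte.
        line_no = data.count(b"\n", 0, err.start) + 1
        return (f"{name}:{line_no}: byte 0x{data[err.start]:02X} at offset "
                f"{err.start} is not valid UTF-8 ({err.reason})")
    return None

sample = b"\\fi\\fi\n" + "% czy to dobra wartość?\n".encode("iso8859_2")
message = check_utf8(sample)
if message is not None:
    print(message)
```

Run on a Latin-2 file, this points at the line holding the first
non-UTF-8 byte, which is usually enough to spot the wrong encoding.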

-- 
Ulrike Fischer