[XeTeX] handling malformed UTF-8 input
Ulrike Fischer
news2 at nililand.de
Tue Feb 19 18:30:46 CET 2008
On Tue, 19 Feb 2008 17:27:39 +0100, Marcin Woliński wrote:
> I'd like to report a funny problem with (mis)interpretation of malformed
> utf-8 input files. A few days ago a user of my document classes mwcls
>
> \else\ifnum#1<\previous@toc@level
> \addpenalty\@secpenalty % czy to dobra wartość?
> \fi\fi
>
> The file is ISO Latin-2 encoded
I don't know if I would describe a Latin-2-encoded file as "malformed"
UTF-8. Isn't it simply a non-UTF-8 file?
> (that is: comments include a few Latin-2
> characters, the code proper is pure ASCII) and XeTeX tries to interpret
> it as UTF-8. The character ć (cacute) is encoded as byte 1110 0110, so
> XeTeX considers it a start of a 3-byte sequence and ignores two
> following bytes, the second of which is an endline, so the next line
> gets commented out.
> I think however, that XeTeX could be more careful when reading malformed
> UTF-8 files. Since continuation bytes in UTF-8 sequences have to be of
> the form 10xxxxxx it would be safer to gobble only such bytes or at
> least not to treat ASCII characters as parts of UTF-8 sequences. That
> way the endline would be always interpreted as an endline and comments
> would always end where they should.
> Is that a change worth introducing?
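
To make the proposal concrete, here is a minimal Python sketch of the two behaviours described above. It is not XeTeX's actual reader; the function names and the `careful` switch are made up for illustration. With `careful=False` the decoder trusts the lead byte and swallows as many bytes as it announces; with `careful=True` it only consumes following bytes of the form 10xxxxxx.

```python
REPLACEMENT = "\ufffd"  # stand-in for anything that is not valid UTF-8


def expected_length(lead):
    """Sequence length announced by a UTF-8 lead byte, or 0 if it cannot start one."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0  # stray continuation byte or invalid lead byte


def decode(data, careful):
    """Decode bytes as UTF-8, replacing invalid sequences (illustrative only)."""
    out, i = [], 0
    while i < len(data):
        n = max(expected_length(data[i]), 1)
        end = i + 1
        while end < min(i + n, len(data)):
            # The careful variant stops at the first byte that is not 10xxxxxx,
            # so an ASCII byte such as the line end is never swallowed.
            if careful and data[end] & 0xC0 != 0x80:
                break
            end += 1
        try:
            out.append(data[i:end].decode("utf-8"))
        except UnicodeDecodeError:
            out.append(REPLACEMENT)
        i = end
    return "".join(out)


# The comment line from the report, encoded as ISO Latin-2, plus the next line.
sample = "% czy to dobra wartość?\n\\fi\\fi\n".encode("latin2")

print(repr(decode(sample, careful=False)))  # "?" and the line end are swallowed
print(repr(decode(sample, careful=True)))   # the line end survives
```

In the first case the `\fi\fi` ends up on the comment line; in the second the comment ends where it should, and only the two Latin-2 bytes are replaced.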
On my PC (MiKTeX 2.7), XeTeX crashed because of an umlaut in a comment in
one of my styles. So I think it would be good if XeTeX at least issued a
meaningful error message when a file contains byte sequences that cannot
be part of a UTF-8-encoded file.
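
As a rough illustration of what such a message could contain, here is a small Python sketch that checks a file's bytes before use. The function name `check_utf8` and the message format are only assumptions for this example, not anything XeTeX provides.

```python
def check_utf8(data, name="<input>"):
    """Return None if data is valid UTF-8, otherwise a readable diagnostic."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # Count the line ends before the offending byte to get a line number.
        line = data.count(b"\n", 0, err.start) + 1
        return (
            f"{name}:{line}: byte 0x{data[err.start]:02X} at offset "
            f"{err.start} is not valid UTF-8 ({err.reason})"
        )
    return None


# A style file with an umlaut in a comment, saved as Latin-1 (hypothetical content).
style = "\\relax\n% Grüße\n".encode("latin-1")
print(check_utf8(style, "mystyle.sty"))
print(check_utf8(b"plain ASCII is fine\n"))
```

A message of this kind names the file, the line, and the offending byte, which is usually enough to see that the file is simply in another encoding.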
--
Ulrike Fischer