[XeTeX] handling malformed UTF-8 input

Marcin Woliński wolinski at mimuw.edu.pl
Tue Feb 19 17:27:39 CET 2008


Hi,

I'd like to report a funny problem with the (mis)interpretation of
malformed UTF-8 input files.  A few days ago a user of my document
classes mwcls (e.g.,
http://www.tug.org/texlive/devsrc/Master/texmf-dist/tex/latex/mwcls/mwart.cls)
reported being unable to process a document with XeTeX.  A quick
examination revealed that the source of the problem is the comment in
the fragment below, which makes XeTeX miss the next line, the one
containing the two \fi tokens:

    \else\ifnum#1<\previous@toc@level
        \addpenalty\@secpenalty % czy to dobra wartość?
     \fi\fi

The file is encoded in ISO Latin-2 (that is, the comments include a few
Latin-2 characters; the code proper is pure ASCII), and XeTeX tries to
interpret it as UTF-8.  The character ć (cacute) is encoded as the byte
1110 0110 (0xE6), so XeTeX treats it as the start of a 3-byte sequence
and swallows the two following bytes.  The first of these is the
question mark and the second is the line end, so the comment runs on
and the next line gets commented out.
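
The mechanism can be reproduced outside TeX with a short Python sketch.
This is only an illustration of a decoder that trusts the lead byte
alone; it is not XeTeX's actual code, and the names `announced_length`
and `naive_decode`, as well as the use of U+FFFD as a placeholder for
every non-ASCII sequence, are assumptions made for the example.

```python
def announced_length(lead: int) -> int:
    """Number of bytes a UTF-8 lead byte claims for its sequence."""
    if lead < 0x80:
        return 1                  # plain ASCII
    if lead >> 5 == 0b110:
        return 2                  # 110xxxxx
    if lead >> 4 == 0b1110:
        return 3                  # 1110xxxx
    if lead >> 3 == 0b11110:
        return 4                  # 11110xxx
    return 1                      # stray continuation or invalid byte


def naive_decode(data: bytes) -> str:
    """Hypothetical careless decoder: it skips as many bytes as the lead
    byte announces without checking that they look like 10xxxxxx.
    Every non-ASCII sequence is shown as U+FFFD, since only the number
    of swallowed bytes matters here."""
    out = []
    i = 0
    while i < len(data):
        n = announced_length(data[i])
        out.append(chr(data[i]) if data[i] < 0x80 else "\ufffd")
        i += n
    return "".join(out)


# The comment line and the line after it, stored as ISO Latin-2.
source = (
    "\\addpenalty\\@secpenalty % czy to dobra wartość?\n"
    " \\fi\\fi\n"
).encode("iso8859_2")

decoded = naive_decode(source)
lines = decoded.splitlines()
```

Here `lines` holds a single line: the byte for ć (0xE6) announces three
bytes, so the question mark and the line end vanish and the \fi tokens
land inside the comment.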

Other instances of this mechanism are illustrated in the attached file
(try running it with and without \XeTeXinputencoding set).

This of course could be considered a bug in mwart.cls and obviously I'm
going to correct it there.

I think, however, that XeTeX could be more careful when reading
malformed UTF-8 files.  Since continuation bytes in UTF-8 sequences
have to be of the form 10xxxxxx, it would be safer to gobble only such
bytes, or at least not to treat ASCII characters as parts of UTF-8
sequences.  That way the line end would always be interpreted as a line
end and comments would always end where they should.
Is that a change worth introducing?
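
A minimal Python sketch of the suggested behaviour, for illustration
only: it consumes trailing bytes only when they have the form 10xxxxxx
and otherwise replaces just the offending lead byte.  The names
`announced_length` and `careful_decode` and the choice of U+FFFD as the
replacement character are assumptions of this sketch, not a description
of how XeTeX is or would be implemented.

```python
REPLACEMENT = "\ufffd"


def announced_length(lead: int) -> int:
    """Number of bytes a UTF-8 lead byte claims for its sequence."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    return 1  # stray continuation or invalid byte: handled on its own


def is_continuation(byte: int) -> bool:
    """True for bytes of the form 10xxxxxx."""
    return byte & 0xC0 == 0x80


def careful_decode(data: bytes) -> str:
    """Decode UTF-8, never letting a bad lead byte swallow ASCII."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        n = announced_length(lead)
        tail = data[i + 1:i + n]
        if n > 1 and len(tail) == n - 1 and all(map(is_continuation, tail)):
            try:
                out.append(data[i:i + n].decode("utf-8"))
            except UnicodeDecodeError:
                # Well-shaped but illegal (overlong form, surrogate, ...).
                out.append(REPLACEMENT)
            i += n
        elif lead < 0x80:
            out.append(chr(lead))
            i += 1
        else:
            # Malformed: drop only this byte, keep whatever follows it.
            out.append(REPLACEMENT)
            i += 1
    return "".join(out)


latin2 = "% czy to dobra wartość?\n \\fi\\fi\n".encode("iso8859_2")
utf8 = "% czy to dobra wartość?\n \\fi\\fi\n".encode("utf-8")

recovered = careful_decode(latin2)
unchanged = careful_decode(utf8)
```

With the Latin-2 bytes, `recovered` keeps both line ends and the
question mark, so the comment stops where it should; with proper UTF-8
input, `unchanged` equals the original text.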

Best regards,
Marcin
-- 
Marcin Woliński, PhD
Institute of Computer Science, Polish Academy of Sciences
http://www.ipipan.waw.pl/~wolinski/

-------------- next part --------------
A non-text attachment was scrubbed...
Name: comment-problem.tex
Type: text/x-tex
Size: 401 bytes
Desc: not available
Url : http://tug.org/pipermail/xetex/attachments/20080219/40d2a1b7/attachment.bin 

