[XeTeX] handling malformed UTF-8 input

Tue Feb 19 19:19:19 CET 2008

Hi Marcin,

On 19 Feb 2008, at 4:27 pm, Marcin Woliński wrote:

> Hi,
>
> I'd like to report a funny problem with (mis)interpretation of
> malformed utf-8 input files. A few days ago a user of my document
> classes mwcls (e.g.,
> http://www.tug.org/texlive/devsrc/Master/texmf-dist/tex/latex/mwcls/mwart.cls)
> reported being unable to process a document with XeTeX. A quick
> examination revealed that the source of the problem is the following
> comment, which makes XeTeX not see the following line with \fi-s:
>
>     \else\ifnum#1<\previous@toc@level
>         \addpenalty\@secpenalty % czy to dobra wartość?
>      \fi\fi
>
> The file is ISO Latin-2 encoded (that is: comments include a few
> Latin-2 characters, the code proper is pure ASCII) and XeTeX tries to
> interpret it as UTF-8. The character ć (cacute) is encoded as byte
> 1110 0110, so XeTeX considers it a start of a 3-byte sequence and
> ignores two following bytes, the second of which is an endline, so the
> next line gets commented out.
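
The mechanism described above can be reproduced outside TeX. The
following is only a sketch in Python, not XeTeX's actual reader: the
function name naive_utf8_decode and the sample bytes are made up for
illustration, and the decoder is assumed to trust the lead byte without
checking that the bytes after it have the form 10xxxxxx.

```python
def naive_utf8_decode(data: bytes) -> str:
    """Hypothetical lenient decoder: the lead byte alone decides how many
    bytes are consumed; the following bytes are never validated."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            n, cp = 1, b                # plain ASCII
        elif b >= 0xF0:
            n, cp = 4, b & 0x07         # 1111 0xxx: 4-byte sequence
        elif b >= 0xE0:
            n, cp = 3, b & 0x0F         # 1110 xxxx: 3-byte sequence
        elif b >= 0xC0:
            n, cp = 2, b & 0x1F         # 110x xxxx: 2-byte sequence
        else:
            n, cp = 1, 0xFFFD           # stray continuation byte
        # Blindly fold in the next n-1 bytes, whatever they are.
        for k in range(1, n):
            if i + k < len(data):
                cp = (cp << 6) | (data[i + k] & 0x3F)
        if cp > 0x10FFFF:
            cp = 0xFFFD                 # keep chr() within range
        out.append(chr(cp))
        i += n
    return "".join(out)


# The two lines from mwart.cls, encoded as ISO Latin-2 (c-acute is 0xE6).
latin2_source = (
    "        \\addpenalty\\@secpenalty % czy to dobra wartość?\n"
    "     \\fi\\fi\n"
).encode("iso-8859-2")

decoded = naive_utf8_decode(latin2_source)
print(latin2_source.count(b"\n"), "line ends in the file")
print(decoded.count("\n"), "line ends after naive decoding")
```

With this input the byte 0xE6 swallows the "?" and the line end after
it, so the comment runs on into the line holding the \fi-s.
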
>
> Other instances of this mechanism are illustrated in the attached file
> (try running it with and without \XeTeXinputencoding set).
>
> This of course could be considered a bug in mwart.cls and obviously
> I'm going to correct it there.

Well, it's not really a bug, but it does lead to an unnecessary
incompatibility with programs (like xetex) that try to process it as
UTF-8. And unfortunately there's no real standard for tagging
plain-text files with encoding information; there are various
conventions but none of them are universal. So keeping "code" such as
TeX macros in plain ASCII wherever possible is the safest and most
portable option, I think.
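
One way to check that a class or package file really is plain ASCII is
to scan it for bytes with the high bit set. This is only a small sketch
in Python; the helper name find_non_ascii and the sample text are
invented for the example.

```python
def find_non_ascii(data: bytes):
    """Return (line, column, byte) for every byte >= 0x80, 1-based."""
    hits = []
    for lineno, line in enumerate(data.split(b"\n"), start=1):
        for col, byte in enumerate(line, start=1):
            if byte >= 0x80:
                hits.append((lineno, col, byte))
    return hits


# Illustrative input: a Latin-2 comment between two ASCII-only lines.
sample = "\\else\n% wartość?\n\\fi\\fi\n".encode("iso-8859-2")

for lineno, col, byte in find_non_ascii(sample):
    print(f"line {lineno}, column {col}: byte 0x{byte:02X}")
```

For a real file the bytes would come from something like
open(path, "rb").read(); an empty result means the file is safe to read
under any ASCII-compatible encoding, UTF-8 included.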

>
> I think however, that XeTeX could be more careful when reading
> malformed UTF-8 files. Since continuation bytes in UTF-8 sequences
> have to be of the form 10xxxxxx it would be safer to gobble only such
> bytes or at least not to treat ASCII characters as parts of UTF-8
> sequences. That way the endline would be always interpreted as an
> endline and comments would always end where they should.
> Is that a change worth introducing?

Yes, you're right; XeTeX is not careful about this, and should be  
made more robust. This is something that's been nagging at my mind,  
as I know it's a potential problem, so this gives an added incentive  
to fix it.
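
To make the suggestion concrete, here is a rough sketch in Python of a
reader that gobbles only bytes of the form 10xxxxxx after a lead byte.
It is not the code that will go into XeTeX; the names
careful_utf8_decode and expected_length are hypothetical, and replacing
a broken sequence with U+FFFD is just one possible policy.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def expected_length(lead: int) -> int:
    """Sequence length announced by a lead byte; 0 if it cannot start one."""
    if lead < 0x80:
        return 1
    if 0xC0 <= lead < 0xE0:
        return 2
    if 0xE0 <= lead < 0xF0:
        return 3
    if 0xF0 <= lead < 0xF8:
        return 4
    return 0  # continuation byte (10xxxxxx) or invalid lead


def careful_utf8_decode(data: bytes) -> str:
    out = []
    i = 0
    while i < len(data):
        n = expected_length(data[i])
        if n == 1:
            out.append(chr(data[i]))    # ASCII always passes through
            i += 1
            continue
        if n == 0:
            out.append(REPLACEMENT)     # stray byte: consume only itself
            i += 1
            continue
        # Consume at most n-1 bytes, and only while they are 10xxxxxx.
        j = i + 1
        while j < i + n and j < len(data) and (data[j] & 0xC0) == 0x80:
            j += 1
        if j == i + n:
            try:
                out.append(data[i:j].decode("utf-8"))
            except UnicodeDecodeError:  # overlong form, surrogate, etc.
                out.append(REPLACEMENT)
        else:
            out.append(REPLACEMENT)     # sequence cut short
        i = j                           # resume at the first unconsumed byte
    return "".join(out)


latin2_line = "% czy to dobra wartość?\n     \\fi\\fi\n".encode("iso-8859-2")
print(repr(careful_utf8_decode(latin2_line)))
```

Because an ASCII byte is never taken as part of a multi-byte sequence,
the line end after the comment survives and the \fi-s are seen again.
Python's own bytes.decode("utf-8", errors="replace") also leaves the
ASCII bytes intact on this input.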

Thanks,

JK