[XeTeX] handling malformed UTF-8 input

Thu Feb 21 11:12:50 CET 2008

On 21 Feb 2008, at 8:26 am, Mojca Miklavec wrote:

> On Thu, Feb 21, 2008 at 1:48 AM, Akira Kakuto wrote:
>>> For those who are happy rebuilding xetex from source, I'd appreciate
>>> knowing of any problems with your (real) UTF-8 files after this patch
>>> is applied. As far as I know, valid files should be processed
>>> unchanged.
>>
>>  The new one fails to create ConTeXt format:
>>  It stops when it is reading 'lang-cz.pat' with an
>>  error message '!Nonletter'. Probably 'lang-cz.pat'
>>  is not a utf-8 file.
>
> The content is valid UTF-8, but there are a few latin2 (I guess)
> characters in comments at the beginning of the file.

Yes, it looks that way.

>
> That file is autogenerated (comments taken out of some other non-utf
> file). The content/patterns should be OK, but thanks for the warning -
> that can/should be fixed.

Indeed it should. Mixing encodings in a "plain text" file is a
no-no... there's no reliable way for processes to know how to interpret
the bytes they find. You may get away with it in TeX files if the
misinterpreted garbage happens to follow a '%' byte, but that doesn't
make it acceptable. Suppose someone tries to print a verbatim listing
of the file...

(Try opening lang-cz.pat in a text editor. Either it'll be read as  
UTF-8, giving you garbage in the "samples", or as some other  
encoding, in which case the patterns themselves will appear as  
garbage. Sorry to sound harsh, but the file is fundamentally broken.  
That needs to be fixed in ConTeXt, not worked around in XeTeX.)
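
As a rough illustration of how such a file could be checked before it reaches TeX, here is a minimal Python sketch that reports which lines are not valid UTF-8. It is not part of XeTeX or ConTeXt; the function name and the sample bytes are made up for this example, and the sample only imitates a pattern file with a latin2 comment.

```python
def invalid_utf8_lines(data: bytes) -> list[int]:
    """Return the 1-based numbers of lines that are not valid UTF-8.

    Splitting on the newline byte is safe here: 0x0A never occurs
    inside a well-formed multi-byte UTF-8 sequence.
    """
    bad = []
    for number, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8")  # strict decoding; raises on bad bytes
        except UnicodeDecodeError:
            bad.append(number)
    return bad


# Hypothetical sample: a latin2-encoded comment followed by UTF-8 patterns.
sample = b"% \xe8esk\xe9 vzory\n\\patterns{\n.a\xc4\x8d5\n}\n"
bad_lines = invalid_utf8_lines(sample)
```

Run over a real file (read in binary mode), a non-empty result means the file mixes encodings and should be regenerated rather than worked around.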

>
> (Ulrike's

(I think that was Ross, actually.)

> suggestion to recognise end of lines seems OK to me as that
> would tolerate problems in comments, while rest would be intact, but
> in any case: garbage in->garbage out.

No, I can't agree with this. If the file is broken w.r.t. encoding,  
we should either do a one-time switch to "raw bytes" mode, so as to  
try and continue processing in a default "simplified" mode (which may  
or may not lead to subsequent errors, of course), or stop immediately.
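
To make the two options concrete, here is a minimal Python sketch. It is not the XeTeX patch and works on a whole buffer rather than line by line; the name read_input and the strict flag are hypothetical. It also assumes that "raw bytes" mode means each remaining byte becomes the character with the same code (0-255), which is one possible reading of the proposal.

```python
def read_input(data: bytes, strict: bool = False) -> tuple[str, bool]:
    """Decode `data` as UTF-8, with a fallback on malformed input.

    Returns (text, switched), where `switched` is True if the reader
    gave up on UTF-8 partway through.
    """
    try:
        return data.decode("utf-8"), False
    except UnicodeDecodeError as err:
        if strict:
            # Option 2: stop immediately on the first invalid sequence.
            raise
        # Option 1: keep the valid UTF-8 prefix, then switch once to
        # raw bytes for everything from the first bad byte onwards.
        head = data[:err.start].decode("utf-8")
        tail = data[err.start:].decode("latin-1")  # byte n -> U+00nn
        return head + tail, True
```

Note that after the switch even valid UTF-8 later in the input is read byte by byte, so this fallback can still produce garbage further on; it only guarantees that the decision is made once and is visible to the caller.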

> Stopping processing the file
> with an error if it's not a valid UTF-8 would be just as OK to me,
> even though it might sound a bit radical.

I wondered about making it an error rather than a warning; maybe that  
would be better.

> There are *lots* of warnings
> in TeX files, and one can easily miss that one and miss the fact that
> there are some broken characters somewhere in the last pages of a book
> because of some problematic comments in the middle.)

What do others think about this -- should "invalid UTF-8 byte  
sequence" be an error rather than a warning and fallback?

JK