[XeTeX] handling malformed UTF-8 input

Wed Feb 20 10:35:36 CET 2008

On 20 Feb 2008, at 9:23 am, Ulrike Fischer wrote:

> On Wed, 20 Feb 2008 11:16:12 +1100, Ross Moore wrote:
>
>>> OK, motivated by this I have just committed a patch to the xetex
>>> repository that checks for valid UTF-8 sequences (when reading a file
>>> as UTF-8, of course). If an invalid sequence is encountered, it will
>>> give a warning (in the log, unless \tracingonline is positive), and
>>> read the remainder of the file as "bytes".
>>
>> Isn't this going a bit far?
>> Cannot you still recognise the ends of lines, and treat as "bytes"
>> until the end-of-line?  Reset to UTF-8 starting with the next line.
>> At the end, report the number of the lines which had problems.
>>
>> If there are too many such lines, then give up and switch
>> to "bytes" for the rest of it.

I think that sort of "flip-flopping" between encodings would be quite  
confusing. The file is broken (or has been mis-identified), and we  
have essentially no hope of processing it "correctly" (whatever that  
means in such a case), so let's immediately fall back to a "safe"  
mode that does not attempt any special interpretation, and leave it  
to the user to correct the problem.
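
As a rough illustration of the fallback described above (a sketch only,
not the actual XeTeX patch, which is not shown in this thread): decode the
input as UTF-8, and at the first invalid sequence emit one warning and
treat every remaining byte as a character code 0..255. The function name
`read_input` and the warning text are hypothetical.

```python
from typing import Optional, Tuple


def read_input(data: bytes) -> Tuple[str, Optional[str]]:
    """Decode as UTF-8; on the first invalid sequence, warn once and
    read the remainder of the input as raw bytes (no switching back)."""
    try:
        return data.decode("utf-8"), None
    except UnicodeDecodeError as err:
        # Everything before err.start decoded cleanly as UTF-8.
        good = data[:err.start].decode("utf-8")
        # Latin-1 maps each byte 0..255 to the code point of the same
        # value, which stands in here for "bytes" mode.
        rest = data[err.start:].decode("latin-1")
        warning = (
            f"invalid UTF-8 sequence at byte offset {err.start}; "
            "reading the remainder as bytes"
        )
        return good + rest, warning
```

Note that once the fallback is triggered, later well-formed UTF-8
sequences are also read byte by byte; that is the "no flip-flopping"
behaviour argued for above.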

>
>> Surely it is most likely that someone has Copy/Pasted
>> something with the wrong encoding into a portion of
>> an otherwise valid file.

Much more likely, IMO, is that they're trying to process an 8-bit,  
non-Unicode file and didn't tell XeTeX the right encoding.

>
> I can't see how this can happen unless you are using some hex editor and
> copy the bytes directly. In normal editors, if you copy from one file to
> another and then _save the file_ the editor will save the complete file
> in one encoding, not in two. The meaning of the copied chars will perhaps
> be wrong but not their encoding. So if xetex encounters one non-utf8
> continuation byte it is very probable that the whole file is non-utf8.

Yes, that's what I thought too. If you're using a Unicode editor,  
saving as UTF-8, then whatever you paste in must get mapped in some  
way to Unicode character values (whether they're the ones you wanted  
or not!), and the saved text will be valid UTF-8. And if you're using  
a byte-oriented editor, it's unlikely that you are dealing with UTF-8  
text files.
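
The two cases can be seen in a small sketch. It assumes a legacy file
encoded as Latin-1 and a Unicode editor that misreads pasted bytes as
MacRoman; both encodings are chosen only for illustration, and
`is_valid_utf8` is a hypothetical helper name.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes form a well-formed UTF-8 sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Case 1: an 8-bit, non-Unicode file handed to a UTF-8 reader.
legacy_bytes = "café".encode("latin-1")

# Case 2: a Unicode editor misinterprets those same bytes (here as
# MacRoman), maps them to some Unicode characters, and saves as UTF-8.
pasted_text = legacy_bytes.decode("mac_roman")
saved_bytes = pasted_text.encode("utf-8")

legacy_ok = is_valid_utf8(legacy_bytes)   # the raw 8-bit file
saved_ok = is_valid_utf8(saved_bytes)     # the editor's output
```

Here `legacy_ok` is false and `saved_ok` is true: the pasted character is
not the intended one, but the saved file is still well-formed UTF-8.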

JK