[XeTeX] handling malformed UTF-8 input

Ulrike Fischer news2 at nililand.de
Wed Feb 20 10:23:02 CET 2008


On Wed, 20 Feb 2008 11:16:12 +1100, Ross Moore wrote:

>> OK, motivated by this I have just committed a patch to the xetex
>> repository that checks for valid UTF-8 sequences (when reading a file
>> as UTF-8, of course). If an invalid sequence is encountered, it will
>> give a warning (in the log, unless \tracingonline is positive), and
>> read the remainder of the file as "bytes".
> 
> Isn't this going a bit far?
> Can you not still recognise the ends of lines, and treat the input as
> "bytes" only until the end of the line? Then reset to UTF-8 starting
> with the next line.
> At the end, report the number of the lines which had problems.
> 
> If there are too many such lines, then give up and switch
> to "bytes" for the rest of it.

> Surely it is most likely that someone has Copy/Pasted
> something with the wrong encoding into a portion of
> an otherwise valid file.
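
To make the two proposals above concrete, here is a minimal sketch in
Python. It is not the code of the xetex patch: the function names, the
max_bad_lines threshold and the use of Latin-1 to stand in for "bytes"
are all assumptions made for illustration, and the first function falls
back from the start of the offending line rather than from the exact
invalid byte.

```python
def decode_rest_as_bytes(data: bytes):
    """Patch-style strategy (simplified to line granularity): decode as
    UTF-8 until the first invalid line, then treat that line and
    everything after it as bytes. Latin-1 maps each byte to exactly one
    character, so it is used here as a stand-in for "bytes"."""
    out = []
    fallback_from = None  # line number where the fallback started
    for number, raw in enumerate(data.split(b"\n"), start=1):
        if fallback_from is None:
            try:
                out.append(raw.decode("utf-8"))
                continue
            except UnicodeDecodeError:
                fallback_from = number
        out.append(raw.decode("latin-1"))
    return out, fallback_from


def decode_per_line(data: bytes, max_bad_lines: int = 3):
    """Per-line strategy: only an invalid line is read as bytes, and
    UTF-8 decoding resumes on the next line. Once more than
    max_bad_lines (a hypothetical threshold) lines have failed, the rest
    of the input is read as bytes."""
    out = []
    bad_lines = []  # line numbers to report at the end
    gave_up = False
    for number, raw in enumerate(data.split(b"\n"), start=1):
        if not gave_up:
            try:
                out.append(raw.decode("utf-8"))
                continue
            except UnicodeDecodeError:
                bad_lines.append(number)
                if len(bad_lines) > max_bad_lines:
                    gave_up = True
        out.append(raw.decode("latin-1"))
    return out, bad_lines


# A UTF-8 file with one Latin-1 line in the middle.
sample = (
    "grün\n".encode("utf-8")
    + "Straße\n".encode("latin-1")
    + "über".encode("utf-8")
)
```

With this sample the first function reads the third line as bytes too,
so its "über" comes out garbled, while the second function recovers it
and reports only line 2.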

I can't see how this can happen unless you are using a hex editor and
copying the bytes directly. In normal editors, if you copy from one file
to another and then _save the file_, the editor will save the complete
file in one encoding, not in two. The meaning of the copied characters
may be wrong, but not their encoding. So if xetex encounters even one
byte sequence that is not valid UTF-8, it is very probable that the
whole file is not UTF-8.
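
A small Python sketch of this argument, under the assumption that an
editor keeps its buffer as characters and encodes the whole buffer once
when saving. The helper is_valid_utf8 and the sample strings are made up
for illustration, and Python's strict UTF-8 decoder stands in for
whatever check xetex performs.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The editor buffer holds characters; pasted text is just more characters.
buffer = "Grüße aus Köln\n" + "pasted: Straße\n"

# Saving encodes the complete buffer in one encoding.
saved_utf8 = buffer.encode("utf-8")
saved_latin1 = buffer.encode("latin-1")

# Text pasted after being read in the wrong encoding has the wrong
# meaning, but it is still saved as valid UTF-8.
garbled = "Straße".encode("utf-8").decode("latin-1")
saved_garbled = garbled.encode("utf-8")
```

Here is_valid_utf8(saved_utf8) and is_valid_utf8(saved_garbled) are both
true, while is_valid_utf8(saved_latin1) is false: the Latin-1 file fails
as a whole, not just in the pasted part.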


 
-- 
Ulrike Fischer 


