[XeTeX] handling malformed UTF-8 input

Wed Feb 20 00:18:11 CET 2008

On 19 Feb 2008, at 4:27 pm, Marcin Woliński wrote:

> I think, however, that XeTeX could be more careful when reading
> malformed UTF-8 files.  Since continuation bytes in UTF-8 sequences
> have to be of the form 10xxxxxx, it would be safer to gobble only such
> bytes, or at least not to treat ASCII characters as parts of UTF-8
> sequences.  That way the endline would always be interpreted as an
> endline and comments would always end where they should.
> Is that a change worth introducing?
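
For anyone following along, the continuation-byte test described in the quoted message comes down to a two-bit mask. A minimal Python sketch, purely for illustration (the function name and the sample line are made up here and have nothing to do with the XeTeX source):

```python
def is_continuation_byte(b: int) -> bool:
    """True if b has the bit pattern 10xxxxxx (a UTF-8 continuation byte)."""
    return (b & 0xC0) == 0x80

# Hypothetical Latin-1 input: 0xE9 (e-acute) has the bit pattern of a
# 3-byte UTF-8 lead byte, but the two bytes after it are plain ASCII.
line = b"% caf\xe9\n\\foo"
lead = line.index(0xE9)
following = line[lead + 1 : lead + 3]   # the newline and the backslash

print([is_continuation_byte(b) for b in following])   # [False, False]
```

A reader that insists on 10xxxxxx after a lead byte would leave the newline alone here, so the comment would end where it should.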

OK, motivated by this, I have just committed a patch to the xetex
repository that checks for valid UTF-8 sequences (when reading a file
as UTF-8, of course). If an invalid sequence is encountered, XeTeX will
give a warning (in the log, unless \tracingonline is positive), and
read the remainder of the file as "bytes". This will often be wrong,
of course, but it's as good a guess as any if the user hasn't
specified the proper encoding.
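
To make the fallback described above concrete, here is a rough Python sketch of the idea. It is not the patch itself; the function name is hypothetical, and it assumes that reading as "bytes" means each byte becomes the character with the same code (which is what Python's latin-1 codec does):

```python
import warnings

def decode_with_byte_fallback(data: bytes) -> str:
    """Decode as UTF-8; at the first invalid sequence, warn and treat
    every remaining byte as a character with the same code (0-255)."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as err:
        warnings.warn(
            f"invalid UTF-8 at byte offset {err.start}; "
            "reading the rest of the input as bytes"
        )
        valid_part = data[:err.start].decode("utf-8")   # known-good prefix
        byte_part = data[err.start:].decode("latin-1")  # one char per byte
        return valid_part + byte_part
```

With this sketch a valid UTF-8 input comes back unchanged, while something like `b"abc\xe9\ndef"` keeps its newline as a newline and only the text from the stray 0xE9 onward is read byte by byte.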

(Depending on the exact nature of the first "bad" byte or sequence,  
you may get one occurrence of U+FFFD REPLACEMENT CHARACTER when XeTeX  
encounters a problem; then the remainder of the text will be read  
byte by byte.)

For those who are happy rebuilding xetex from source, I'd appreciate  
knowing of any problems with your (real) UTF-8 files after this patch  
is applied. As far as I know, valid files should be processed unchanged.

JK