[XeTeX] handling malformed UTF-8 input

Sat Feb 23 04:34:55 CET 2008

On Fri, Feb 22, 2008 at 09:09:16PM -0500, Mike Maxwell wrote:
> Ross Moore wrote:
> > If there was to be malformed data in the name field,
> > this should *not* cause correctly formed UTF8 data in the
> > subsequent address field to be displayed in a "bytes" mode.
>
> Can you reliably recover from such an error in UTF-8 data?  That is,
> assume that there is a mal-formed byte where you're expecting the first
> byte of a UTF-8 character.  How do you know where the next (and possibly
> correct, possibly incorrect) UTF-8 character should begin?

Such resynchronization is quite straightforward with UTF-8; this is one of
the significant advantages of this encoding system.  All continuation bytes
have the form 10xxxxxx, so one simply reads ahead until one finds a byte
that matches the pattern 11xxxxxx or 0xxxxxxx (that is, anything other than
a continuation byte) and begins decoding again from that point.
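
As a rough illustration of the read-ahead step, here is a minimal sketch in
Python.  The function name resync and its arguments are made up for this
example; it assumes the caller has already stepped past the byte that
triggered the error, and it says nothing about how XeTeX does or should
handle this.

```python
def resync(data: bytes, pos: int) -> int:
    """Hypothetical helper: return the index of the first byte at or
    after pos that is NOT a continuation byte (10xxxxxx), or len(data)
    if no such byte remains."""
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1  # skip a stray continuation byte
    return pos


# Two stray continuation bytes sit between "ab" and "cd".
data = b"ab\x80\x80cd"
restart = resync(data, 2)   # index of the byte 'c'
```

If the rest of the input consists only of continuation bytes, the function
returns len(data), i.e. there is nothing left to decode.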

Of course, the program may have to discard arbitrarily many bytes, up to
and including the entire remainder of the input source, before it can
resynchronize.

Also note that not all bytes that match 11xxxxxx are legal as the first
byte in a UTF-8 encoding of a Unicode character; in particular, one
shouldn't ever see 11111xxx (0xF8-0xFF).  Under RFC 3629 the bytes 0xC0,
0xC1, and 0xF5-0xF7 are never valid either.  See the Wikipedia article on
UTF-8 for more details.
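
To make the lead-byte restriction concrete, here is a small sketch that
classifies a single byte according to the ranges in RFC 3629.  The function
name lead_byte_length is invented for this example, and it checks only the
first byte; a real decoder would also have to validate the continuation
bytes that follow (overlong forms, surrogates, and so on).

```python
def lead_byte_length(b: int) -> int:
    """Hypothetical helper: length of the sequence announced by byte b
    under RFC 3629, or 0 if b can never start a UTF-8 sequence."""
    if b < 0x80:
        return 1            # 0xxxxxxx: ASCII
    if 0xC2 <= b <= 0xDF:
        return 2            # 110xxxxx, excluding 0xC0 and 0xC1
    if 0xE0 <= b <= 0xEF:
        return 3            # 1110xxxx
    if 0xF0 <= b <= 0xF4:
        return 4            # 11110xxx, excluding 0xF5-0xF7
    return 0                # continuation byte or never-valid byte
```

So a byte such as 0xF8 matches 11xxxxxx and is a plausible place to stop
reading ahead, but decoding cannot actually restart there.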

I'll leave the question of whether this resynchronization is the Right
Thing for XeTeX to others who've thought more about its consequences than I
have.

Richard