[XeTeX] handling malformed UTF-8 input

Ross Moore ross at ics.mq.edu.au
Sat Feb 23 06:13:04 CET 2008


Hi Mike,

On 23/02/2008, at 1:09 PM, Mike Maxwell wrote:

> Ross Moore wrote:
>> If there was to be malformed data in the name field,
>> this should *not* cause correctly formed UTF8 data in the
>> subsequent address field to be displayed in a "bytes" mode.
>
> Can you reliably recover from such an error in UTF-8 data?  That is,
> assume that there is a mal-formed byte where you're expecting the first
> byte of a UTF-8 character.  How do you know where the next (and possibly
> correct, possibly incorrect) UTF-8 character should begin?

Absolutely; UTF8 is designed as follows.

Any byte starting:
   with  0   is an ascii 7-bit character;
   with  10  is a data-byte for a 2+ byte sequence;
   with  11...10  is a header byte for a 2+ byte
    sequence, where the number of consecutive 1s
    tells how many bytes are involved.
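
As a rough illustration of those three cases, here is a
minimal Python sketch. The name classify_byte is made up
for this example, and it looks only at the leading bit
pattern; it does not reject lead bytes such as 0xC0 that
fit the pattern but can never occur in valid UTF8.

```python
def classify_byte(b: int) -> str:
    """Classify one byte value (0-255) by its leading bits.

    Hypothetical helper for illustration only.
    """
    if b & 0x80 == 0x00:      # 0xxxxxxx
        return "ascii"
    if b & 0xC0 == 0x80:      # 10xxxxxx
        return "continuation"
    if b & 0xE0 == 0xC0:      # 110xxxxx, header of a 2-byte sequence
        return "lead2"
    if b & 0xF0 == 0xE0:      # 1110xxxx, header of a 3-byte sequence
        return "lead3"
    if b & 0xF8 == 0xF0:      # 11110xxx, header of a 4-byte sequence
        return "lead4"
    return "invalid"          # 11111xxx is never used in UTF-8
```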

Thus it is possible to tell where there is an error,
and which bytes are involved in that error.

More specifically, a character byte-sequence *must*
start with either 0.... (1-byte) or  11.... (2+ bytes).
If the latter, the number of trailing bytes is known,
and each must start with  10.... .
If this does not happen, then you know that there is
an error, and you can tell where a valid sequence
might (!) restart.
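
The check just described can be sketched in Python roughly
as follows. This is only a structural scan under stated
assumptions: expected_length and scan are hypothetical
names, and overlong forms, surrogates and code points
above U+10FFFF are not checked.

```python
def expected_length(b: int) -> int:
    """Sequence length announced by a first byte, or 0 if b cannot start one."""
    if b < 0x80:
        return 1
    if 0xC0 <= b < 0xE0:
        return 2
    if 0xE0 <= b < 0xF0:
        return 3
    if 0xF0 <= b < 0xF8:
        return 4
    return 0  # a 10...... data byte, or 11111xxx


def scan(data: bytes):
    """Yield (offset, length, ok) spans covering every byte of data."""
    i, n = 0, len(data)
    while i < n:
        need = expected_length(data[i])
        if need == 0:
            # Stray data byte or impossible header: one bad byte.
            yield (i, 1, False)
            i += 1
            continue
        j = i + 1
        # Consume trailing bytes only while they start with 10.
        while j < i + need and j < n and data[j] & 0xC0 == 0x80:
            j += 1
        yield (i, j - i, j - i == need)
        i = j  # the next byte is where a valid sequence might restart


spans = list(scan(b"A\xc3\xa9\xe2\x82B\x80"))
```

Here the truncated 3-byte sequence at offset 3 is reported
as a bad span of 2 bytes, and scanning restarts at the "B".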

(In fact the proposal was to restart UTF8 after the
next line-end character, either U+000A or U+000D.
In reality, it could restart at the next ASCII character
or try to restart at any  11...10.. byte.)
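
The line-end variant might look something like this in
Python. It is a sketch, not what XeTeX does: decode_lines
is a made-up name, and showing a bad line via Latin-1 is
just one assumed way to get a "bytes" display.

```python
def decode_lines(data: bytes):
    """Decode line by line, so an error affects only its own line."""
    out = []
    # bytes.splitlines breaks on LF, CR and CRLF.
    for line in data.splitlines(keepends=True):
        try:
            out.append(("utf8", line.decode("utf-8")))
        except UnicodeDecodeError:
            # Assumed "bytes" mode: one character per byte.
            out.append(("bytes", line.decode("latin-1")))
    return out


result = decode_lines(b"name: \xff\naddr: caf\xc3\xa9\n")
```

The malformed first line falls back to "bytes" mode, while
the well-formed second line is still decoded as UTF8.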


Thus all bytes that don't fit validly into a UTF8
sequence can be identified and marked as being bad.
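
For instance, Python's own decoder can do this marking via
its "backslashreplace" error handler (available for
decoding since Python 3.5). The wrapper name
mark_bad_bytes is hypothetical, and this shows only one
possible way of flagging the bad bytes.

```python
def mark_bad_bytes(data: bytes) -> str:
    """Decode as UTF-8, showing each undecodable byte as a \\xNN escape."""
    return data.decode("utf-8", errors="backslashreplace")


print(mark_bad_bytes(b"name: \xff\xfe\naddress: caf\xc3\xa9"))
# name: \xff\xfe
# address: café
```

The two bad bytes in the first field are marked, and the
valid sequence in the second field is decoded normally.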

Of course if the encoding was not intended to be UTF8
then there could be lots of bytes marked as being bad.
So an expert needs to try to identify what encoding
was intended, by trying out different ones to see what
gives a sensible character string for some language.
The context in which the data was obtained should be
sufficient to allow this, in any practical situation.
Feedback from whoever provided that data helps
in being confident that the correct interpretation
has been obtained.
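
Trying out candidate encodings could be sketched like so
in Python. The function name try_encodings and the
candidate list are assumptions chosen for illustration;
the code only produces the strings, and judging which one
is sensible is still left to a person.

```python
def try_encodings(data: bytes,
                  candidates=("utf-8", "cp1252", "latin-1", "mac_roman")):
    """Return {encoding: decoded text, or None if decoding failed}."""
    results = {}
    for name in candidates:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


guesses = try_encodings(b"caf\xe9")
for name, text in guesses.items():
    print(name, repr(text))
```

Note that single-byte encodings such as Latin-1 accept any
byte string, so a successful decode proves nothing by
itself; only the resulting text can be judged.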

This is not mathematical certainty --- but it should
not need to be.

> -- 
>     Mike Maxwell
>     What good is a universe without somebody around to look at it?
>     --Robert Dicke, Princeton physicist


Hope this helps,

	Ross

------------------------------------------------------------------------
Ross Moore                                         ross at maths.mq.edu.au
Mathematics Department                             office: E7A-419
Macquarie University                               tel: +61 +2 9850 8955
Sydney, Australia  2109                            fax: +61 +2 9850 8114
------------------------------------------------------------------------



