[XeTeX] handling malformed UTF-8 input

Ross Moore ross at ics.mq.edu.au
Wed Feb 20 01:16:12 CET 2008


Hi Jonathan,

On 20/02/2008, at 10:18 AM, Jonathan Kew wrote:

> On 19 Feb 2008, at 4:27 pm, Marcin Woliński wrote:
>
>> I think however, that XeTeX could be more careful when reading
>> malformed UTF-8 files.  Since continuation bytes in UTF-8 sequences
>> have to be of the form 10xxxxxx it would be safer to gobble only such
>> bytes or at least not to treat ASCII characters as parts of UTF-8
>> sequences.  That way the endline would be always interpreted as an
>> endline and comments would always end where they should.
>> Is that a change worth introducing?
>
> OK, motivated by this I have just committed a patch to the xetex
> repository that checks for valid UTF-8 sequences (when reading a file
> as UTF-8, of course). If an invalid sequence is encountered, it will
> give a warning (in the log, unless \tracingonline is positive), and
> read the remainder of the file as "bytes".

Isn't this going a bit far?
Couldn't you still recognise the ends of lines, and treat the input
as "bytes" only until the end of the line?  Then reset to UTF-8
starting with the next line.
At the end, report the number of the lines which had problems.

If there are too many such lines, then give up and switch
to "bytes" for the rest of the file.
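
To make the idea concrete, here is a rough sketch in Python. It is not
XeTeX's actual code: the function name decode_lines and the
max_bad_lines threshold are made up for illustration, and it assumes
that reading as "bytes" means mapping each byte to the character with
the same code (which is what Python's latin-1 codec does). Splitting on
the newline byte first is safe, since 0x0A never occurs inside a valid
multi-byte UTF-8 sequence.

```python
def decode_lines(data: bytes, max_bad_lines: int = 10):
    """Decode `data` line by line as UTF-8, with a per-line "bytes" fallback.

    Returns (lines, bad_line_numbers). Hypothetical helper, for
    illustration only; the threshold default is arbitrary.
    """
    lines = []
    bad = []          # 1-based numbers of lines that were not valid UTF-8
    give_up = False   # set once more than max_bad_lines lines have failed

    for number, raw in enumerate(data.split(b"\n"), start=1):
        if not give_up:
            try:
                lines.append(raw.decode("utf-8"))
                continue
            except UnicodeDecodeError:
                bad.append(number)
                if len(bad) > max_bad_lines:
                    # Too many problem lines: stop trying UTF-8 at all.
                    give_up = True
        # "bytes" reading: each byte becomes the character with that code.
        lines.append(raw.decode("latin-1"))

    return lines, bad


if __name__ == "__main__":
    sample = b"caf\xc3\xa9\nbad \xe9 line\nna\xc3\xafve"
    text, problems = decode_lines(sample)
    print(text)
    print("lines with invalid UTF-8:", problems)
```

With the sample above, only the second line falls back to "bytes";
the lines on either side of it are still decoded as UTF-8.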


> This will often be wrong,
> of course, but it's as good a guess as any if the user hasn't
> specified the proper encoding.

Surely it is most likely that someone has copy/pasted
something with the wrong encoding into a portion of
an otherwise valid file, so the damage is likely to be
quite localised.

>
> (Depending on the exact nature of the first "bad" byte or sequence,
> you may get one occurrence of U+FFFD REPLACEMENT CHARACTER when XeTeX
> encounters a problem; then the remainder of the text will be read
> byte by byte.)
>
> For those who are happy rebuilding xetex from source, I'd appreciate
> knowing of any problems with your (real) UTF-8 files after this patch
> is applied. As far as I know, valid files should be processed
> unchanged.
>
> JK

Just a thought.

Cheers,

	Ross

------------------------------------------------------------------------
Ross Moore                                         ross at maths.mq.edu.au
Mathematics Department                             office: E7A-419
Macquarie University                               tel: +61 +2 9850 8955
Sydney, Australia  2109                            fax: +61 +2 9850 8114
------------------------------------------------------------------------



