[XeTeX] handling malformed UTF-8 input

Fri Feb 22 04:24:04 CET 2008

Hi Jonathan,

On 22/02/2008, at 5:37 AM, Jonathan Kew wrote:

> On 21 Feb 2008, at 10:50 am, Taco Hoekwater wrote:
>
>>
>> Will Robertson wrote:
>>> On 21/02/2008, at 8:42 PM, Jonathan Kew wrote:
>>>
>>>> What do others think about this -- should "invalid UTF-8 byte
>>>> sequence" be an error rather than a warning and fallback?
>>
>> In such cases, luatex gives a "... contains an invalid utf-8
>> sequence" error, replaces the culprit with U+FFFD, and continues
>> hoping to find proper utf-8 from then on.
>
> I've just modified the implementation in XeTeX so that it no longer
> switches to "bytes" mode; it merely generates a warning message and
> reads the invalid byte(s) as U+FFFD. This means that you may get lots
> of warnings rather than just one, but it eliminates the problem of
> "garbage" in comments affecting how the rest of the "real data" in the
> file is interpreted (as seen in that ConTeXt hyphenation file).

I fully support this approach.
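
As a rough illustration of the warn-and-replace behaviour described
above, here is a minimal Python sketch. It is not XeTeX's actual
implementation; the function name decode_lenient and the format of
the warning message are assumptions made for this example only.

```python
import sys


def decode_lenient(data: bytes) -> tuple[str, list[int]]:
    """Decode UTF-8, replacing each invalid byte run with U+FFFD.

    Returns the decoded text and the byte offsets at which a
    replacement was made. Hypothetical helper, for illustration only.
    """
    pieces: list[str] = []
    offsets: list[int] = []
    pos = 0
    while pos < len(data):
        try:
            # Strict decode of whatever is left; succeeds if it is all valid.
            pieces.append(data[pos:].decode("utf-8"))
            break
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to the slice data[pos:].
            pieces.append(data[pos:pos + err.start].decode("utf-8"))
            offsets.append(pos + err.start)
            pieces.append("\ufffd")
            # Skip the offending byte(s) and stay in UTF-8 mode.
            pos += err.end
    return "".join(pieces), offsets


if __name__ == "__main__":
    # A Latin-1 e-acute (0xE9) inside a comment, followed by valid data.
    sample = b"% caf\xe9 in a comment\n\\hyphenation{ex-am-ple}\n"
    text, offsets = decode_lenient(sample)
    for off in offsets:
        print(f"warning: invalid UTF-8 byte sequence at offset {off}",
              file=sys.stderr)
    print(text, end="")
```

The point of the sketch is that one stray byte costs one warning and
one U+FFFD, and everything after it is still read as UTF-8. The
repeated slicing is wasteful on large inputs with many errors; it is
kept only to make the offsets easy to follow.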

The reason is that I work with data supplied via web forms;
e.g. conference abstracts and personal details.
Prospective delegates Copy/Paste information from various
sources; it is then stored in database fields and combined
to produce a PDF (or whatever), so that they can preview
how accurately what they entered has been interpreted.

It is an absolute certainty that people will input bad data;
but even more of a given that they will input data that
was good in the context from which it was copied, but may
not be so good in the new context in which it will be used.

It is imperative that any TeX job that is automated to use
such user-input must be able to run to completion and
produce a sensible PDF, even if it contains bad data.
The user, who input the data, must be able to see what
is wrong and report it, even if they cannot be expected
to do anything about it.
These users cannot be expected to have any knowledge
whatsoever about what encodings are being used, at any
point in such an automated process. Furthermore, viewing
the result of the TeX job is likely to be the only
opportunity such users will have to provide feedback
about whether the data they have entered is correct.
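
To make the kind of automated process described above a little more
concrete, here is a small Python sketch of a pre-processing step. It
is only an assumption about how such a pipeline might look: the
function name decode_fields and the field names are hypothetical, and
nothing here is taken from an actual production system.

```python
def decode_fields(fields: dict[str, bytes]) -> tuple[dict[str, str], list[str]]:
    """Decode every form field as UTF-8 without ever failing.

    Fields containing invalid bytes are decoded with U+FFFD in place
    of the bad bytes, and their names are collected so that the job
    can still run while the problem is reported. Hypothetical helper.
    """
    decoded: dict[str, str] = {}
    bad: list[str] = []
    for name, raw in fields.items():
        try:
            decoded[name] = raw.decode("utf-8")
        except UnicodeDecodeError:
            # Keep going: substitute U+FFFD and remember the field.
            decoded[name] = raw.decode("utf-8", errors="replace")
            bad.append(name)
    return decoded, bad


if __name__ == "__main__":
    # Hypothetical submission: one field pasted from a Latin-1 source.
    submission = {
        "name": "Zo\u00eb".encode("utf-8"),
        "abstract": "caf\u00e9".encode("latin-1"),
    }
    decoded, bad = decode_fields(submission)
    print(decoded)
    print("fields with invalid UTF-8:", bad)
```

The decoded text can then be typeset as usual. The U+FFFD characters
show up in the preview PDF where the bad bytes were, and the list of
affected fields can be logged for whoever maintains the system.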

>
> I can see some attraction to making this an error rather than a
> warning, but have not done this for the time being. Maybe at some
> point, however.

If it becomes an error, then it will be extremely difficult
to use XeTeX in automated processes of the kind that I have
tried to describe above. That would be a very great pity.

>
> JK

Hope this helps,

	Ross

------------------------------------------------------------------------
Ross Moore                                         ross at maths.mq.edu.au
Mathematics Department                             office: E7A-419
Macquarie University                               tel: +61 +2 9850 8955
Sydney, Australia  2109                            fax: +61 +2 9850 8114
------------------------------------------------------------------------