[XeTeX] handling malformed UTF-8 input

Mojca Miklavec mojca.miklavec.lists at gmail.com
Thu Feb 21 09:26:43 CET 2008


On Thu, Feb 21, 2008 at 1:48 AM, Akira Kakuto wrote:
>  > For those who are happy rebuilding xetex from source, I'd appreciate
>  > knowing of any problems with your (real) UTF-8 files after this patch
>  > is applied. As far as I know, valid files should be processed unchanged.
>
>  The new one fails to create ConTeXt format:
>  It stops when it is reading 'lang-cz.pat' with an
>  error message '!Nonletter'. Probably 'lang-cz.pat'
>  is not a utf-8 file.

The content is valid UTF-8, but there are a few Latin-2 (I guess)
characters in the comments at the beginning of the file.

That file is autogenerated (the comments are taken from some other
non-UTF-8 file). The content/patterns should be OK, but thanks for the
warning - that can/should be fixed.
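
One way such a fix could look, as a rough sketch only: keep every line
that already decodes as UTF-8 and re-encode the others from a legacy
encoding. The function name reencode_legacy_lines is hypothetical, and
ISO-8859-2 is only the guess made above, not a confirmed fact about
lang-cz.pat.

```python
def reencode_legacy_lines(data: bytes, legacy: str = "iso8859_2") -> bytes:
    """Re-encode lines that are not valid UTF-8 from a legacy encoding.

    Assumption: any line that fails to decode as UTF-8 is written in
    `legacy` (ISO-8859-2 here, which is only a guess).
    """
    out = []
    # Splitting on b"\n" is safe: byte 0x0A never occurs inside a
    # multi-byte UTF-8 sequence.
    for line in data.split(b"\n"):
        try:
            line.decode("utf-8")
            out.append(line)  # already valid, keep the bytes unchanged
        except UnicodeDecodeError:
            out.append(line.decode(legacy).encode("utf-8"))
    return b"\n".join(out)
```

Valid lines pass through byte for byte, so the patterns themselves are
not touched; only the offending comment lines are rewritten.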

(Ulrike's suggestion to recognise ends of lines seems OK to me, as that
would tolerate problems in comments while the rest would stay intact, but
in any case: garbage in, garbage out. Stopping processing of the file
with an error if it's not valid UTF-8 would be just as OK to me,
even though it might sound a bit radical. There are *lots* of warnings
when processing TeX files, and one can easily overlook that one and never
notice that some characters are broken somewhere in the last pages of a
book because of some problematic comments in the middle.)
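
The two policies discussed above can be sketched as a standalone
checker. This is not how XeTeX handles its input; it is only an
illustration, and the name check_utf8_lines and its parameters are
made up for it. It also treats the first "%" on a line as the start of
a comment, which ignores escaped "\%" and catcode changes.

```python
def check_utf8_lines(data: bytes, comment_char: bytes = b"%",
                     strict: bool = False) -> list:
    """Return the numbers of lines whose only invalid UTF-8 is in a comment.

    Raises ValueError for invalid UTF-8 outside a comment, or for any
    invalid UTF-8 at all when strict=True.
    """
    tolerated = []
    # Line ends stay recognisable even in malformed input, because
    # byte 0x0A never occurs inside a multi-byte UTF-8 sequence.
    for lineno, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError as err:
            # err.start is the offset of the first invalid byte; if it
            # lies after the comment character, so do all later ones.
            pos = line.find(comment_char)
            in_comment = pos != -1 and pos < err.start
            if strict or not in_comment:
                raise ValueError(f"line {lineno}: invalid UTF-8") from err
            tolerated.append(lineno)
    return tolerated
```

With the default strict=False, a file whose only bad bytes sit in
comments yields the list of affected line numbers, which could be
reported once as a summary instead of as warnings that get lost in the
log. With strict=True the first bad line stops the check with an error.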

Mojca

