[XeTeX] handling malformed UTF-8 input
Mojca Miklavec
mojca.miklavec.lists at gmail.com
Thu Feb 21 09:26:43 CET 2008
On Thu, Feb 21, 2008 at 1:48 AM, Akira Kakuto wrote:
> > For those who are happy rebuilding xetex from source, I'd appreciate
> > knowing of any problems with your (real) UTF-8 files after this patch
> > is applied. As far as I know, valid files should be processed unchanged.
>
> The new one fails to create ConTeXt format:
> It stops when it is reading 'lang-cz.pat' with an
> error message '!Nonletter'. Probably 'lang-cz.pat'
> is not a utf-8 file.
The content is valid UTF-8, but there are a few Latin-2 (I guess)
characters in the comments at the beginning of the file.
That file is autogenerated (the comments are taken from some other
non-UTF-8 file). The content/patterns should be OK, but thanks for the
warning - that can and should be fixed.
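
To check which lines of such a file are affected, something like the
following could be used. This is only a rough sketch in Python, not part
of XeTeX or ConTeXt; the function name and the sample bytes are made up
for illustration, and it assumes that comments start with '%'.

```python
def find_invalid_utf8_lines(data: bytes):
    """Return (line_number, is_comment) for every line that is not valid UTF-8."""
    bad = []
    # Splitting on the newline byte is safe: 0x0A never occurs inside a
    # valid multi-byte UTF-8 sequence.
    for number, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError:
            # Assumption: a line is a comment if it starts with '%'.
            bad.append((number, line.lstrip().startswith(b"%")))
    return bad


# Hypothetical sample: a Latin-2 byte (0xE8) in a comment, followed by
# patterns that are valid UTF-8.
sample = b"% \xe8esky\n\\patterns{\n.a2\n" + "č1".encode("utf-8") + b"\n}\n"
print(find_invalid_utf8_lines(sample))
```

If every reported line is flagged as a comment, the patterns themselves
are fine and only the comments need to be converted.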
(Ulrike's suggestion to recognise ends of lines seems OK to me, as that
would tolerate problems in comments while the rest would stay intact, but
in any case: garbage in -> garbage out. Stopping the processing of a file
with an error if it's not valid UTF-8 would be just as OK to me,
even though it might sound a bit radical. There are *lots* of warnings
in TeX files, and one can easily miss that one and miss the fact that
there are some broken characters somewhere in the last pages of a book
because of some problematic comments in the middle.)
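
The two behaviours could be compared roughly like this. It is a
simplified Python sketch and not how XeTeX reads its input; the name
decode_lines and the strict flag are hypothetical. In tolerant mode the
damage never extends past the end of the line, in strict mode the first
invalid line aborts with an error.

```python
def decode_lines(data: bytes, strict: bool):
    """Decode data line by line; return (lines, numbers_of_bad_lines)."""
    lines, bad = [], []
    for number, raw in enumerate(data.split(b"\n"), start=1):
        try:
            lines.append(raw.decode("utf-8"))
        except UnicodeDecodeError as err:
            if strict:
                # "Radical" variant: stop at the first invalid line.
                raise ValueError(f"line {number}: invalid UTF-8") from err
            # Tolerant variant: replace the broken bytes with U+FFFD and
            # continue with the next line, which is unaffected.
            lines.append(raw.decode("utf-8", errors="replace"))
            bad.append(number)
    return lines, bad


# Hypothetical input: one broken comment line, then a valid line.
data = b"% \xe8\nabc\n"
lines, bad = decode_lines(data, strict=False)
print(bad)
```

With strict=True the same input raises ValueError, which cannot be
overlooked the way a single warning can.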
Mojca