[XeTeX] handling malformed UTF-8 input
Jonathan Kew
jonathan_kew at sil.org
Wed Feb 20 00:18:11 CET 2008
On 19 Feb 2008, at 4:27 pm, Marcin Woliński wrote:
> I think however, that XeTeX could be more careful when reading malformed
> UTF-8 files. Since continuation bytes in UTF-8 sequences have to be of
> the form 10xxxxxx it would be safer to gobble only such bytes or at
> least not to treat ASCII characters as parts of UTF-8 sequences. That
> way the endline would be always interpreted as an endline and comments
> would always end where they should.
> Is that a change worth introducing?
OK, motivated by this I have just committed a patch to the xetex
repository that checks for valid UTF-8 sequences (when reading a file
as UTF-8, of course). If an invalid sequence is encountered, it will
give a warning (in the log only, unless \tracingonline is positive, in
which case it also goes to the terminal), and read the remainder of the
file as "bytes". This will often be wrong,
of course, but it's as good a guess as any if the user hasn't
specified the proper encoding.
(Depending on the exact nature of the first "bad" byte or sequence,
you may get one occurrence of U+FFFD REPLACEMENT CHARACTER when XeTeX
encounters a problem; then the remainder of the text will be read
byte by byte.)
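To make the behaviour described above concrete, here is a rough sketch in Python. It is not the code from the patch, only an illustration of the idea: decode strictly as UTF-8, and at the first invalid sequence emit one U+FFFD and treat every remaining byte as its own character. The function name read_utf8_with_fallback is made up for this example, the returned flag stands in for the warning, and always emitting exactly one U+FFFD is a simplification (as noted, whether you get one depends on the nature of the bad sequence).

```python
def read_utf8_with_fallback(data: bytes) -> tuple[str, bool]:
    """Decode as UTF-8; on the first invalid sequence, switch to bytes.

    Returns (text, fell_back). Hypothetical helper for illustration only;
    it does not reproduce XeTeX's actual input routine.
    """
    try:
        # Strict decoding: rejects bad lead bytes, missing continuation
        # bytes (anything not of the form 10xxxxxx), overlong forms, etc.
        return data.decode("utf-8"), False
    except UnicodeDecodeError as err:
        # Everything before err.start was valid UTF-8.
        good = data[:err.start].decode("utf-8")
        # Assumption: mark the bad sequence with a single U+FFFD.
        marker = "\ufffd"
        # Read the rest "as bytes": each byte becomes code point 0..255.
        rest = data[err.end:].decode("latin-1")
        return good + marker + rest, True


if __name__ == "__main__":
    # 0xE9 announces a three-byte sequence, but a newline follows.
    text, fell_back = read_utf8_with_fallback(b"abc\xe9\ndef")
    print(repr(text), fell_back)
```

Note that in this sketch the newline after the stray 0xE9 is not swallowed as part of the broken sequence, which is the point Marcin raised: the line (and any comment on it) still ends where it should.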
For those who are happy rebuilding xetex from source, I'd appreciate
knowing of any problems with your (real) UTF-8 files after this patch
is applied. As far as I know, valid files should be processed unchanged.
JK