[XeTeX] xetex doesn't recognize/replace all invalid utf8 bytes

Ross Moore ross at ics.mq.edu.au
Tue Dec 29 23:49:17 CET 2009

Hi Ulrike,

On 30/12/2009, at 4:05 AM, Ulrike Fischer wrote:

> On Tue, 29 Dec 2009 08:58:50 -0800 (PST), Apostolos
> Syropoulos wrote:
>> Here is what I get while the log file contents follow
> Well I would say you did save the file in utf8. My file is in
> ansinew: "... when I compile the following example encoded as
> ansinew (cp1252)..."

I get the same as you, with TeXShop opening your file as

>> Invalid UTF-8 byte or sequence at line 8 replaced by U+FFFD.
>> Missing character: There is no � in font [lmroman10-regular]!
>> Invalid UTF-8 byte or sequence at line 11 replaced by U+FFFD.
>> Missing character: There is no � in font [lmroman10-regular]!

But beware: if you edit anything and resave within TeXShop,
the encoding can be converted to something else.
The messages then go away because the bytes in the file have changed,
which of course destroys the very information the example was meant to carry.
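
One way to check whether a resave has silently changed the bytes is to scan
the file for lines that are not valid UTF-8, before and after saving. The
following is only a minimal Python sketch; the function name and the sample
text are made up for illustration, and it uses Python's own UTF-8 decoder,
which need not agree with XeTeX in every case.

```python
def invalid_utf8_lines(data: bytes) -> list[int]:
    """Return the 1-based numbers of the lines that are not valid UTF-8."""
    bad = []
    # Splitting on 0x0A is safe: that byte never occurs inside a
    # multi-byte UTF-8 sequence.
    for number, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError:
            bad.append(number)
    return bad


# Hypothetical sample: a section sign saved as cp1252 is the single byte 0xA7.
sample = "ok\n[\u00a7]\nok\n".encode("cp1252")
print(invalid_utf8_lines(sample))    # [2]

# After conversion to UTF-8 (what an editor may do on resave) nothing is flagged.
resaved = sample.decode("cp1252").encode("utf-8")
print(invalid_utf8_lines(resaved))   # []
```

Running it on the raw bytes of the test file (read with open(path, "rb"))
before and after editing shows whether the editor has re-encoded it.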

Within the PDF I get:

  q 1 0 0 1 72 769.82 cm 0 G 0 g BT /F1 9.963 Tf
  76.71 -62.76 Td[<005b0086005b>]TJ
   0 -11.96 Td[<005b0a47005b>]TJ
   0 -11.95 Td[<005b0a47005b>]TJ

so that your xA7 byte has become x86, which is meant to be
unallocated in Unicode, yet still displays as a character in
the output, both with LMroman10 and with Charis SIL.

The behaviour would seem to be that a "lonely continuation byte"
is accepted as if it were a complete Latin-1 (or similar) character.
But when the first byte is a UTF-8 lead byte (matching 11??????)
and too few continuation bytes follow it, XeTeX emits
the "Invalid UTF-8 byte or sequence" warning in the .log file.
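
The distinction drawn above can be sketched in Python. This is not XeTeX's
code, only an illustration of the two cases as observed; the function names
are invented, and the sketch deliberately ignores overlong forms and the
invalid lead bytes 0xF8-0xFF.

```python
def is_continuation(b: int) -> bool:
    """True for bytes of the form 10xxxxxx."""
    return (b & 0xC0) == 0x80


def continuation_count(lead: int) -> int:
    """How many continuation bytes a lead byte (11xxxxxx) announces."""
    if lead >= 0xF0:
        return 3
    if lead >= 0xE0:
        return 2
    return 1


def classify(data: bytes) -> list[tuple[int, str]]:
    """Report (offset, kind) for every non-ASCII unit in data."""
    result = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            i += 1                      # plain ASCII, nothing to report
        elif is_continuation(b):
            # Observed: passed through silently, as if Latin-1.
            result.append((i, "lone continuation byte"))
            i += 1
        else:
            need = continuation_count(b)
            tail = data[i + 1 : i + 1 + need]
            if len(tail) == need and all(is_continuation(t) for t in tail):
                result.append((i, "complete sequence"))
                i += 1 + need
            else:
                # Observed: warning in the log, replaced by U+FFFD.
                result.append((i, "incomplete sequence"))
                i += 1
    return result


print(classify(b"[\xa7]"))   # [(1, 'lone continuation byte')]
print(classify(b"[\xe4]"))   # [(1, 'incomplete sequence')]
```

Here 0xA7 is the byte from the example file, while 0xE4 (an a-umlaut in
cp1252) is a hypothetical stand-in for a byte that looks like a lead byte.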

Is this the best behaviour?
I think we need an explanation from JK.

> -- 
> Ulrike Fischer



Ross Moore                                       ross at maths.mq.edu.au
Mathematics Department                           office: E7A-419
Macquarie University                             tel: +61 (0)2 9850 8955
Sydney, Australia  2109                          fax: +61 (0)2 9850 8114
