[XeTeX] xetex doesn't recognize/replace all invalid utf8 bytes
Ross Moore
ross at ics.mq.edu.au
Tue Dec 29 23:49:17 CET 2009
Hi Ulrike,
On 30/12/2009, at 4:05 AM, Ulrike Fischer wrote:
> Am Tue, 29 Dec 2009 08:58:50 -0800 (PST) schrieb Apostolos
> Syropoulos:
>
>
>> Here is what I get while the log file contents follow
>
> Well I would say you did save the file in utf8. My file is in
> ansinew: "... when I compile the following example encoded as
> ansinew (cp1252)..."
I get the same as you when TeXShop opens your file as
Latin-1 encoded, viz.
>> Invalid UTF-8 byte or sequence at line 8 replaced by U+FFFD.
>> Missing character: There is no � in font [lmroman10-regular]!
>> Invalid UTF-8 byte or sequence at line 11 replaced by U+FFFD.
>> Missing character: There is no � in font [lmroman10-regular]!
But beware: if you edit anything and resave from within TeXShop,
the file can be converted to a different encoding.
The errors then go away because the bytes in the file have changed,
which, of course, is a disaster for the information the file was
meant to carry.
Within the PDF I get:
stream
q 1 0 0 1 72 769.82 cm 0 G 0 g BT /F1 9.963 Tf
76.71 -62.76 Td[<005b0086005b>]TJ
0 -11.96 Td[<005b0a47005b>]TJ
0 -11.95 Td[<005b0a47005b>]TJ
...
so that your 0xA7 byte has become 0x86, which is a C1 control
code in Unicode rather than a printable character, yet it still
displays as a character in the output, with LMroman10 and also
with Charis SIL.
The behaviour seems to be that a "lonely continuation byte"
(matching 10??????) is accepted as if it were a full Latin-1 (or
similar) character. But when the first byte is a UTF-8 lead byte
(matching 11??????) and it is not followed by enough continuation
bytes, XeTeX writes the "Invalid UTF-8 byte or sequence" warning
to the .log file.
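To make the distinction above concrete, here is a minimal Python
sketch that labels each byte as part of a valid sequence, a lonely
continuation byte, or a header (lead) byte without enough data bytes.
It only illustrates the bit patterns; it is not XeTeX's actual code,
the function names are made up for this example, and it does not
check for overlong encodings or surrogates.

```python
def expected_length(byte):
    """Sequence length implied by a UTF-8 lead byte, or 0 if not a lead byte."""
    if byte & 0b10000000 == 0:
        return 1                      # 0xxxxxxx: plain ASCII
    if byte & 0b11100000 == 0b11000000:
        return 2                      # 110xxxxx
    if byte & 0b11110000 == 0b11100000:
        return 3                      # 1110xxxx
    if byte & 0b11111000 == 0b11110000:
        return 4                      # 11110xxx
    return 0                          # 10xxxxxx or 11111xxx


def is_continuation(byte):
    """True for bytes matching 10xxxxxx."""
    return byte & 0b11000000 == 0b10000000


def classify(data):
    """Label each unit in `data` (a bytes object) with its offset and kind."""
    events = []
    i = 0
    while i < len(data):
        b = data[i]
        if is_continuation(b):
            # No lead byte came before it: the "lonely" case.
            events.append((i, "lonely continuation byte"))
            i += 1
            continue
        n = expected_length(b)
        if n == 0:
            events.append((i, "invalid lead byte"))
            i += 1
            continue
        tail = data[i + 1:i + n]
        if len(tail) == n - 1 and all(is_continuation(t) for t in tail):
            events.append((i, "valid sequence"))
            i += n
        else:
            # Lead byte without enough data bytes: the case that
            # produces the warning in the .log, per the observation above.
            events.append((i, "truncated sequence"))
            i += 1
    return events


print(classify(b"[\xa7["))   # 0xA7 = 10100111
print(classify(b"[\xe4["))   # 0xE4 = 11100100
```

For the bytes b"[\xa7[" this labels offset 1 as a lonely continuation
byte, whereas for b"[\xe4[" it reports a truncated sequence at
offset 1, since 0xE4 announces a three-byte sequence that never arrives.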
Is this the best behaviour?
I think we need an explanation from JK.
>
> --
> Ulrike Fischer
Cheers,
Ross
------------------------------------------------------------------------
Ross Moore ross at maths.mq.edu.au
Mathematics Department office: E7A-419
Macquarie University tel: +61 (0)2 9850 8955
Sydney, Australia 2109 fax: +61 (0)2 9850 8114
------------------------------------------------------------------------