[XeTeX] latin-1 encoded characters in commented out parts trigger log warnings

Ross Moore ross.moore at mq.edu.au
Sun Feb 21 21:26:04 CET 2021

Hi David,

On 21 Feb 2021, at 11:02 pm, David Carlisle <d.p.carlisle at gmail.com> wrote:

I don't think there is any reasonable way to say you can comment out parts of a file in a different encoding.

I’m not convinced that this ought to be correct for TeX-based software.

TeX (not necessarily XeTeX) has always operated as a finite-state machine.
It *should* be possible to say that this part is encoded as such-and-such,
and a later part encoded differently.

I fully understand that editor software external to TeX might well have difficulties
with files that mix encodings this way, but TeX itself has always been byte-based
and should remain that way.

A comment character is meant to be viewed as saying that:
 *everything else on this line is to be ignored*
– that’s the impression given by TeX documentation.

But you only know it is a comment character if you can interpret the incoming byte stream.
If there are encoding errors in that byte stream then everything else is guesswork.

Who said anything about errors in the byte stream?
Once you have encountered the (correct) comment character,
what follows on the rest of the line is going to be discarded,
so its encoding is surely irrelevant.

Why should the whole line need to be fully tokenised,
before the decision is taken as to what part of it is retained?

In the case of a package file, rather than author input for typesetting,
the intention of the coding is completely unknown,
is probably all ASCII anyway, except (as in this case) for comments intended
for human eyes only, following a properly declared comment-character.

In this particular case, with mostly ASCII text and a few Latin-1 characters, it may be that you can guess that
the invalid UTF-8 is in fact valid Latin-1 and interpret it that way,

You don’t need to interpret it as anything; that part is to be discarded.

and the guess would be right for this file
but what if the non-UTF-8 file were UTF-16 or Latin-2 or

Surely the line-end characters are already known, and the bits&bytes
have been read up to that point *before* tokenisation.
So provided the tokenisation of the comment character has occurred before
tackling what comes after it, why would there be a problem?
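
As a rough illustration of this argument (a Python sketch only, not how XeTeX's reader is actually implemented): assuming the comment character is the ASCII byte `%` and that `\` escapes it, as under TeX's default catcodes, a line can be cut at the comment at the byte level before anything is decoded. The function name `strip_comment_bytes` is hypothetical, and catcode changes are ignored.

```python
def strip_comment_bytes(line: bytes) -> bytes:
    """Return the part of a raw line before the first unescaped '%'.

    Assumes '%' (0x25) is the comment character and '\\' (0x5C) the escape
    character, as under TeX's default catcodes; catcode changes are ignored.
    """
    i = 0
    while i < len(line):
        byte = line[i]
        if byte == 0x5C:      # backslash: skip the byte it escapes (e.g. \%)
            i += 2
            continue
        if byte == 0x25:      # unescaped percent: the rest is comment
            return line[:i]
        i += 1
    return line


# A line whose code is ASCII but whose comment holds a Latin-1 byte (0xE9).
raw = b"\\def\\x{50\\%} % caf\xe9 in Latin-1"
kept = strip_comment_bytes(raw)
text = kept.decode("utf-8")   # strict decoding succeeds: only ASCII remains
```

This works for UTF-8 and Latin-1 input alike, because in UTF-8 the bytes 0x25 and 0x5C never occur inside a multi-byte sequence. It says nothing about encodings that are not ASCII-compatible.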

... just guessing the encoding (which means guessing where the line and so the comment ends)
is just guesswork.
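
This concern can be made concrete with a small Python check (an illustration added here, not taken from the thread): in UTF-16 the bytes 0x0A and 0x25 can occur inside the encoding of an unrelated character, so finding line ends and comment characters at the byte level is only safe for ASCII-compatible encodings.

```python
# U+250A (a box-drawing character) in UTF-16-LE is exactly the two bytes
# 0x0A 0x25, which an ASCII-based reader would see as newline followed by '%'.
encoded = "\u250a".encode("utf-16-le")
looks_like_newline_percent = (encoded == b"\n%")

# In UTF-8 the same character uses only bytes >= 0x80, so no confusion arises.
utf8_bytes = "\u250a".encode("utf-8")
all_high = all(b >= 0x80 for b in utf8_bytes)
```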

No guesswork intended.

The file encoding specifies the byte stream interpretation before any TeX tokenization.
If the file cannot be interpreted as UTF-8 then it can't be interpreted at all.

Why not?
Why can you not have a macro — presumably best on a single line by itself –

There is a XeTeX primitive that switches the encoding, as Jonathan showed, but guessing a different encoding
if a file fails to decode properly against a specified encoding is a dangerous game to play.

I don’t think anyone is asking for that.

I can imagine situations where code for packages that used to work well
without UTF-8 carries comments containing non-UTF-8 characters.
(Indeed, there could even be binary bit-mapped images within comment sections;
having bytes not intended to represent any characters at all, in any encoding.)

If such files are now subjected to constraints that formerly did not exist,
then this is surely not a good thing.

Besides, not all the information required to build PDFs need be related to
putting characters onscreen, through the typesetting engine.

For example, when building fully-tagged PDFs, there can easily be more information
overall within the tagging (both structure and content) than in the visual content itself.
Thank goodness for Heiko’s packages that allow for re-encoding strings between
different formats that are valid for inclusion within parts of a PDF.

I’m thinking here about how a section-title appears in:
 bookmarks, ToC entries, tag-titles, /Alt strings, annotation text for hyperlinking, etc.
as well as visually typeset for on-screen.
These different representations need to be either derivable from a common source,
or passed in as extra information, encoded appropriately (and not necessarily UTF-8).

So I don't think such a switch should be automatic to avoid reporting encoding errors.

I reported the issue at xstring here


that says what follows next is to be interpreted in a different way to what came previously?
Until the next switch that returns to UTF-8 or whatever?

If XeTeX is based on eTeX, then this should be possible in that setting.

Even replacing by U+FFFD
is being lenient.

Why has the mouth not realised that this information is to be discarded?
Then no replacement is required at all.
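
For readers unfamiliar with the replacement behaviour, it can be mimicked in Python (a sketch under the assumption that the whole line is decoded before the comment is dropped; XeTeX's own decoder may differ in detail). The replacement character ends up only in the part of the line that is thrown away.

```python
# ASCII code followed by a comment containing one Latin-1 byte (0xE9).
line = b"\\relax % caf\xe9"

# Decode first, substituting U+FFFD for the byte that is not valid UTF-8.
decoded = line.decode("utf-8", errors="replace")

# Only afterwards drop the comment (naive split; ignores escaped \%).
code_part, _, comment_part = decoded.partition("%")
replacement_in_code = "\ufffd" in code_part
replacement_in_comment = "\ufffd" in comment_part
```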


Dr Ross Moore
Department of Mathematics and Statistics
12 Wally’s Walk, Level 7, Room 734
Macquarie University, NSW 2109, Australia
T: +61 2 9850 8955  |  F: +61 2 9850 8114
M: +61 407 288 255  |  E: ross.moore at mq.edu.au
