[XeTeX] latin-1 encoded characters in commented out parts trigger log warnings

Sun Feb 21 23:55:17 CET 2021

Hi David,

On 22 Feb 2021, at 8:43 am, David Carlisle <d.p.carlisle at gmail.com<mailto:d.p.carlisle at gmail.com>> wrote:

Surely the line-end characters are already known, and the bits&bytes
have been read up to that point *before* tokenisation.

This is not a pdflatex inputenc style utf-8 error failing to map a stream of tokens.

It is at the file reading stage and if you have the file encoding wrong you do not know reliably what are the ends of lines and you haven't interpreted it as tex at all, so the comment character really can't have an effect here.

Ummm. Is that really how XeTeX does it?
How then does Jonathan’s
   \XeTeXdefaultencoding "iso-8859-1”
ever work ?
Just a rhetorical question; don’t bother answering.   :-)

This mapping is invisible to the tex macro layer just as you can change the internal character code mapping in classic tex to take an ebcdic stream, if you do that then read an ascii file you get rubbish with no hope to recover.

So I don't think such a switch should be automatic to avoid reporting encoding errors.

I reported the issue at xstring here
https://framagit.org/unbonpetit/xstring/-/issues/4<https://framagit.org/unbonpetit/xstring/-/issues/4>

I looked at what you said here, and some of it doesn’t seem to be in accord with
my TeXLive installations.

viz.

/usr/local/texlive/2016/.../xstring.tex:\expandafter\ifx\csname @latexerr\endcsname\relax% on n'utilise pas LaTeX ?
/usr/local/texlive/2016/.../xstring.tex:\fi% fin des d\'efinitions LaTeX
/usr/local/texlive/2016/.../xstring.tex:%   - Le package ne n\'ecessite plus LaTeX et est d\'esormais utilisable sous
/usr/local/texlive/2016/.../xstring.tex:%     Plain eTeX.
/usr/local/texlive/2017/.../xstring.tex:% conditions of the LaTeX Project Public License, either version 1.3
/usr/local/texlive/2017/.../xstring.tex:% and version 1.3 or later is part of all distributions of LaTeX
/usr/local/texlive/2017/.../xstring.tex:\expandafter\ifx\csname @latexerr\endcsname\relax% on n'utilise pas LaTeX ?
/usr/local/texlive/2017/.../xstring.tex:\fi% fin des d\'efinitions LaTeX
/usr/local/texlive/2017/.../xstring.tex:%   - Le package ne n\'ecessite plus LaTeX et est d\'esormais utilisable sous
/usr/local/texlive/2017/.../xstring.tex:%     Plain eTeX.
/usr/local/texlive/2018/.../xstring.tex:% !TeX encoding = ISO-8859-1
/usr/local/texlive/2018/.../xstring.tex:% Licence    : Released under the LaTeX Project Public License v1.3c %
/usr/local/texlive/2018/.../xstring.tex:%     Plain eTeX.
/usr/local/texlive/2019/.../xstring.tex:% !TeX encoding = ISO-8859-1
/usr/local/texlive/2019/.../xstring.tex:% Licence    : Released under the LaTeX Project Public License v1.3c %
/usr/local/texlive/2019/.../xstring.tex:     Plain eTeX.

prior to 2018, the accents in comments used ASCII, so UTF-8, but not intentionally so.

In 2017, the accents in comments became  latin-1 chars.
A 1st line was added:  % !TeX encoding = ISO-8859-1
to indicate this.

Such directive comments are useless, except at the beginning of the main document source.
They are for Front-End software, not TeX processing, right?

Jonathan, David,
so far as I can tell, it was *never* in UTF-8 with preformed accents.

David

that says what follows next is to be interpreted in a different way to what came previously?
Until the next switch that returns to UTF-8 or whatever?

If XeTeX is based on eTeX, then this should be possible in that setting.

Even replacing by U+FFFD
is being lenient.

Why has the mouth not realised that this information is to be discarded?
Then no replacement is required at all.

The file reading has failed  before any tex accessible processing has happened (see the ebcdic example in the TeXBook)

OK.
But that’s changing the meaning of bit-order, yes?
Surely we can be past that.

\danger \TeX\ always uses the internal character code of Appendix~C
for the standard ASCII characters,
regardless of what external coding scheme actually appears in the files
being read.  Thus, |b| is 98 inside of \TeX\ even when your computer
normally deals with ^{EBCDIC} or some other non-ASCII scheme; the \TeX\
software has been set up to convert text files to internal code, and to
convert back to the external code when writing text files.

the file encoding is failing at the  "convert text files to internal code" stage which is before the line buffer of characters is consulted to produce the stream of tokens based on catcodes.

Yes, OK; so my model isn’t up to it, as Bruno said.
 … And Jonathan has commented.

Also pdfTeX has no trouble with an xstring example.
It just seems pretty crazy that the comments need to be altered
for that package to be used with XeTeX.

David

Cheers, and thanks for this discussion.

Ross

Dr Ross Moore
Department of Mathematics and Statistics
12 Wally’s Walk, Level 7, Room 734
Macquarie University, NSW 2109, Australia
T: +61 2 9850 8955  |  F: +61 2 9850 8114
M:+61 407 288 255  |  E: ross.moore at mq.edu.au<mailto:ross.moore at mq.edu.au>
http://www.maths.mq.edu.au
[cid:image001.png at 01D030BE.D37A46F0]
CRICOS Provider Number 00002J. Think before you print.
Please consider the environment before printing this email.

This message is intended for the addressee named and may
contain confidential information. If you are not the intended
recipient, please delete it and notify the sender. Views expressed
in this message are those of the individual sender, and are not
necessarily the views of Macquarie University. <http://mq.edu.au/>
<http://mq.edu.au/>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://tug.org/pipermail/xetex/attachments/20210221/91cc00b5/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image001.png
Type: image/png
Size: 4605 bytes
Desc: image001.png
URL: <https://tug.org/pipermail/xetex/attachments/20210221/91cc00b5/attachment-0001.png>