[XeTeX] latin-1 encoded characters in commented out parts trigger log warnings
Jonathan Kew
jfkthame at gmail.com
Mon Feb 22 00:27:37 CET 2021
On 21/02/2021 22:55, Ross Moore wrote:
> Hi David,
>
>> On 22 Feb 2021, at 8:43 am, David Carlisle <d.p.carlisle at gmail.com> wrote:
>
>> Surely the line-end characters are already known, and the bits&bytes
>> have been read up to that point *before* tokenisation.
>>
>>
>> This is not a pdflatex inputenc style utf-8 error failing to map a
>> stream of tokens.
>>
>> It is at the file reading stage and if you have the file encoding
>> wrong you do not know reliably what are the ends of lines and you
>> haven't interpreted it as tex at all, so the comment character really
>> can't have an effect here.
>
> Ummm. Is that really how XeTeX does it?
> How then does Jonathan’s
> \XeTeXdefaultencoding "iso-8859-1"
> ever work ?
> Just a rhetorical question; don’t bother answering. :-)
>
>> This mapping is invisible to the tex macro layer just as you can
>> change the internal character code mapping in classic tex to take an
>> ebcdic stream, if you do that then read an ascii file you get rubbish
>> with no hope to recover.
>>
>
>
>>> So I don't think such a switch should be automatic to avoid
>>> reporting encoding errors.
>>>
>>> I reported the issue at xstring here
>>> https://framagit.org/unbonpetit/xstring/-/issues/4
>>>
>
> I looked at what you said here, and some of it doesn’t seem to be in
> accord with my TeXLive installations.
>
> viz.
>
> /usr/local/texlive/2016/.../xstring.tex:\expandafter\ifx\csname @latexerr\endcsname\relax% on n'utilise pas LaTeX ?
> /usr/local/texlive/2016/.../xstring.tex:\fi% fin des d\'efinitions LaTeX
> /usr/local/texlive/2016/.../xstring.tex:% - Le package ne n\'ecessite plus LaTeX et est d\'esormais utilisable sous
> /usr/local/texlive/2016/.../xstring.tex:% Plain eTeX.
> /usr/local/texlive/2017/.../xstring.tex:% conditions of the LaTeX Project Public License, either version 1.3
> /usr/local/texlive/2017/.../xstring.tex:% and version 1.3 or later is part of all distributions of LaTeX
> /usr/local/texlive/2017/.../xstring.tex:\expandafter\ifx\csname @latexerr\endcsname\relax% on n'utilise pas LaTeX ?
> /usr/local/texlive/2017/.../xstring.tex:\fi% fin des d\'efinitions LaTeX
> /usr/local/texlive/2017/.../xstring.tex:% - Le package ne n\'ecessite plus LaTeX et est d\'esormais utilisable sous
> /usr/local/texlive/2017/.../xstring.tex:% Plain eTeX.
> /usr/local/texlive/2018/.../xstring.tex:% !TeX encoding = ISO-8859-1
> /usr/local/texlive/2018/.../xstring.tex:% Licence : Released under the LaTeX Project Public License v1.3c %
> /usr/local/texlive/2018/.../xstring.tex:% Plain eTeX.
> /usr/local/texlive/2019/.../xstring.tex:% !TeX encoding = ISO-8859-1
> /usr/local/texlive/2019/.../xstring.tex:% Licence : Released under the LaTeX Project Public License v1.3c %
> /usr/local/texlive/2019/.../xstring.tex: Plain eTeX.
>
> Prior to 2018, the accents in comments were written TeX-style in plain
> ASCII, so the file was also valid UTF-8, though not intentionally so.
>
> In 2017, the accents in comments became latin-1 chars.
> A 1st line was added: % !TeX encoding = ISO-8859-1
> to indicate this.
>
> Such directive comments are useless, except at the beginning of the main
> document source.
> They are for Front-End software, not TeX processing, right?
They're for front-end software, but not only for the main document
source; any file could have an encoding directive to tell the editor how
to load/save it.
>
> Jonathan, David,
> so far as I can tell, it was *never* in UTF-8 with preformed accents.
>
I have a copy of xstring.tex here (in an old TeX Live tree) that is dated
\def\xstringversion {1.7c}
\def\xstringdate {2013/10/13}
where many of the accents (in comments) are encoded "TeX-style" with
control sequences, but there are also some that are literal accented
letters -- and they're in UTF-8. If I load this file as Latin-1 in my
editor, those letters are garbled.
(They're even mixed with the TeX-style sequences within a single line,
sometimes:
% 2) Ensuite, on d\'etokenize ce d\'eveloppement de faÃ§on n'avoir plus que
Notice what happened to "façon" there when read as Latin-1...)
It does sound like they later did a deliberate conversion to Latin-1
(contrary to what I was guessing); this is unfortunate, in that it means
the file will be mis-read by software that expects UTF-8, which is the
de facto default encoding for text these days.
So I think switching to UTF-8 would be a better choice; if they don't
want to do that, adding a \XeTeXinputencoding line would be helpful.
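
A minimal sketch of what such a line might look like near the top of
xstring.tex. The guard and the placement are illustrative assumptions,
not taken from the package: the guard lets engines without the primitive
skip the line, and the line needs to precede the first non-ASCII byte
because the setting only affects text read after it.

```tex
% Illustrative sketch only; not the package's actual code.
% Place this before any line that contains Latin-1 bytes: the new
% input encoding applies only to text read after this point.
\ifdefined\XeTeXinputencoding
  \XeTeXinputencoding "iso-8859-1"
\fi
```

Until the file itself carries such a line, a document could instead set
\XeTeXdefaultencoding "iso-8859-1" before loading the package and switch
back to "utf-8" afterwards.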
JK