[XeTeX] latin-1 encoded characters in commented out parts trigger log warnings

Jonathan Kew jfkthame at gmail.com
Mon Feb 22 00:27:37 CET 2021


On 21/02/2021 22:55, Ross Moore wrote:
> Hi David,
> 
>> On 22 Feb 2021, at 8:43 am, David Carlisle <d.p.carlisle at gmail.com> wrote:
> 
>>     Surely the line-end characters are already known, and the bits&bytes
>>     have been read up to that point *before* tokenisation.
>>
>>
>> This is not a pdflatex inputenc style utf-8 error failing to map a 
>> stream of tokens.
>>
>> It is at the file reading stage and if you have the file encoding 
>> wrong you do not know reliably what are the ends of lines and you 
>> haven't interpreted it as tex at all, so the comment character really 
>> can't have an effect here.
> 
> Ummm. Is that really how XeTeX does it?
> How then does Jonathan’s
>     \XeTeXdefaultencoding "iso-8859-1"
> ever work ?
> Just a rhetorical question; don’t bother answering.   :-)
> 
>> This mapping is invisible to the tex macro layer just as you can 
>> change the internal character code mapping in classic tex to take an 
>> ebcdic stream, if you do that then read an ascii file you get rubbish 
>> with no hope to recover.
>>
> 
> 
>>>     So I don't think such a switch should be automatic to avoid
>>>     reporting encoding errors.
>>>
>>>     I reported the issue at xstring here
>>>     https://framagit.org/unbonpetit/xstring/-/issues/4
>>>
> 
> I looked at what you said here, and some of it doesn’t seem to be in
> accord with my TeX Live installations.
> 
> viz.
> 
> /usr/local/texlive/2016/.../xstring.tex:\expandafter\ifx\csname 
> @latexerr\endcsname\relax% on n'utilise pas LaTeX ?
> /usr/local/texlive/2016/.../xstring.tex:\fi% fin des d\'efinitions LaTeX
> /usr/local/texlive/2016/.../xstring.tex:%   - Le package ne n\'ecessite 
> plus LaTeX et est d\'esormais utilisable sous
> /usr/local/texlive/2016/.../xstring.tex:%     Plain eTeX.
> /usr/local/texlive/2017/.../xstring.tex:% conditions of the LaTeX 
> Project Public License, either version 1.3
> /usr/local/texlive/2017/.../xstring.tex:% and version 1.3 or later is 
> part of all distributions of LaTeX
> /usr/local/texlive/2017/.../xstring.tex:\expandafter\ifx\csname 
> @latexerr\endcsname\relax% on n'utilise pas LaTeX ?
> /usr/local/texlive/2017/.../xstring.tex:\fi% fin des d\'efinitions LaTeX
> /usr/local/texlive/2017/.../xstring.tex:%   - Le package ne n\'ecessite 
> plus LaTeX et est d\'esormais utilisable sous
> /usr/local/texlive/2017/.../xstring.tex:%     Plain eTeX.
> /usr/local/texlive/2018/.../xstring.tex:% !TeX encoding = ISO-8859-1
> /usr/local/texlive/2018/.../xstring.tex:% Licence    : Released under 
> the LaTeX Project Public License v1.3c %
> /usr/local/texlive/2018/.../xstring.tex:%     Plain eTeX.
> /usr/local/texlive/2019/.../xstring.tex:% !TeX encoding = ISO-8859-1
> /usr/local/texlive/2019/.../xstring.tex:% Licence    : Released under 
> the LaTeX Project Public License v1.3c %
> /usr/local/texlive/2019/.../xstring.tex:%     Plain eTeX.
> 
> Prior to 2018, the accents in comments were written in ASCII, so the
> file was valid UTF-8, but not intentionally so.
> 
> In 2017, the accents in comments became Latin-1 characters.
> A first line was added: % !TeX encoding = ISO-8859-1
> to indicate this.
> 
> Such directive comments are useless, except at the beginning of the main 
> document source.
> They are for Front-End software, not TeX processing, right?

They're for front-end software, but not only for the main document 
source; any file could have an encoding directive to tell the editor how 
to load/save it.

> 
> Jonathan, David,
> so far as I can tell, it was *never* in UTF-8 with preformed accents.
> 


I have a copy of xstring.tex here (in an old TeX Live tree) that is dated

   \def\xstringversion     {1.7c}
   \def\xstringdate        {2013/10/13}

where many of the accents (in comments) are encoded "TeX-style" with 
control sequences, but there are also some that are literal accented 
letters, and those are in UTF-8. If I load this file as Latin-1 in my 
editor, those letters are garbled.

(They're even mixed with the TeX-style sequences within a single line, 
sometimes:

% 2) Ensuite, on d\'etokenize ce d\'eveloppement de faÃ§on n'avoir plus que

Notice what happened to "façon" there when read as Latin-1...)

It does sound like they later did a deliberate conversion to Latin-1 
(contrary to what I was guessing); this is unfortunate, in that it means 
the file will be mis-read by software that expects UTF-8, which is the 
de facto default encoding for text these days.

So I think switching to UTF-8 would be a better choice; if they don't 
want to do that, adding a \XeTeXinputencoding line would be helpful.
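
As a rough sketch of that second option (not taken from the actual xstring sources), the top of a Latin-1 file might look something like the following. The encoding name is the one from Ross's \XeTeXdefaultencoding line quoted above. The \ifdefined guard is an assumption on my part, added so that engines lacking the primitive skip the line rather than stop with an undefined control sequence. As far as I recall, the new encoding takes effect from the line after the declaration.

```latex
% !TeX encoding = ISO-8859-1
% Hypothetical header for a Latin-1 encoded file such as xstring.tex.
% Everything up to and including the declaration should stay pure ASCII,
% since it is still read with whatever encoding was in force before.
\ifdefined\XeTeXinputencoding
  % Only XeTeX defines this primitive; other engines skip this branch.
  \XeTeXinputencoding "iso-8859-1"
\fi
% From here on, XeTeX should read the rest of this file as Latin-1.
```

The editor directive on the first line and the XeTeX declaration are independent: the former only tells front-end software how to load and save the file, the latter only affects how XeTeX reads it.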

JK

