[XeTeX] latin-1 encoded characters in commented out parts trigger log warnings

David Carlisle d.p.carlisle at gmail.com
Sun Feb 21 13:02:05 CET 2021

On Sun, 21 Feb 2021 at 11:47, Ross Moore <ross.moore at mq.edu.au> wrote:

> Hi David.
>
> On 21 Feb 2021, at 10:12 pm, David Carlisle <d.p.carlisle at gmail.com>
> wrote:
>
> I think that should be taken up with the xstring maintainers.
>
>
> Is  xstring  intended for use with XeTeX ?
> I suspect not.
> But anyway, there are still issues with this.
>
> (BTW, I wrote this before Jonathan Kew’s response.)
>
>
> I don't think there is any reasonable way to say you can comment out parts
> of a file in a different encoding.
>
>
> I’m not convinced that this ought to be correct for TeX-based software.
>
> TeX (not necessarily XeTeX) has always operated as a finite-state machine.
> It *should* be possible to say that this part is encoded as such-and-such,
> and a later part encoded differently.
>
> I fully understand that editor software external to TeX might well have
> difficulties
> with files that mix encodings this way, but TeX itself has always been
> byte-based
> and should remain that way.
>
> A comment character is meant to be viewed as saying that:
>  *everything else on this line is to be ignored*
> – that’s the impression given by TeX documentation.
>

But you only know it is a comment character if you can interpret the
incoming byte stream. If there are encoding errors in that byte stream,
then everything else is guesswork.
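
As a rough illustration (a Python sketch, not XeTeX's actual file reader; the
sample line and the helper name comment_offset are made up for this example),
the position of '%' and of the line end can only be determined once the bytes
have been decoded with a known encoding:

```python
# Hypothetical sample line: some text, then a comment with an accented letter.
LINE = "abc % café\n"

def comment_offset(data: bytes, encoding: str):
    """Return the character index of the first '%' after decoding,
    or None if the bytes are not valid in the given encoding."""
    try:
        text = data.decode(encoding)
    except UnicodeDecodeError:
        return None
    return text.find("%")

latin1_bytes = LINE.encode("latin-1")
utf16_bytes = LINE.encode("utf-16-le")

# The latin-1 byte 0xE9 makes the line invalid as UTF-8, so the strict
# decoder cannot even say where the comment starts.
print(comment_offset(latin1_bytes, "utf-8"))
print(comment_offset(latin1_bytes, "latin-1"))

# In UTF-16 both '%' and the newline are two-byte units, so their byte
# offsets differ from the single-byte encodings.
print(utf16_bytes.find(b"%"), utf16_bytes.find(b"\n"))
```

A strict decoder stops at the first bad byte; anything after that point,
including where the comment ends, is a matter of recovery policy.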

In this particular case, with mostly ASCII text and a few latin-1 characters,
you may be able to guess that the invalid UTF-8 is in fact valid latin-1 and
interpret it that way, and the guess would be right for this file. But what
if the non-UTF-8 file were UTF-16 or latin-2 or ...? Guessing the encoding
(which also means guessing where the line, and so the comment, ends) is just
guesswork.
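
To make the point concrete, here is a small Python sketch (the byte string
and the helper try_decodings are invented for illustration and say nothing
about how XeTeX is implemented). The same bytes fail as UTF-8 but decode
without error as both latin-1 and latin-2, giving different characters, so a
successful decode does not confirm the guess:

```python
# Hypothetical input: ASCII text plus one high byte (0xF5) inside a comment.
data = b"x % \xf5\n"

def try_decodings(data: bytes, encodings):
    """Map each candidate encoding to the decoded text, or None on failure."""
    results = {}
    for enc in encodings:
        try:
            results[enc] = data.decode(enc)
        except UnicodeDecodeError:
            results[enc] = None
    return results

results = try_decodings(data, ["utf-8", "latin-1", "iso8859-2"])
for enc, text in results.items():
    print(enc, repr(text))

# Lenient decoding, similar in spirit to the U+FFFD replacement reported
# in the log: the bad byte is replaced rather than reinterpreted.
print(repr(data.decode("utf-8", errors="replace")))
```

Both single-byte decodings "succeed", yet only one of them can be what the
author of the file meant, and nothing in the bytes says which.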

> If it is the documentation that is incorrect, then it should certainly be
> clarified.
>
> For XeTeX and this particular example, it’s probably just a matter of
> checking
> that the non-UTF8 characters occur *after* a UTF-8  ‘%' , and not issuing
> an error message under these conditions.
> A warning, maybe, but not an error.
>

>
> The file encoding specifies the byte stream interpretation before any tex
> tokenization
> If the file can not be interpreted as utf-8 then it can't be interpreted
> at all.
>
>
> Why not?
> Why can you not have a macro — presumably best on a single line by itself –
>

There is a XeTeX primitive that switches the encoding, as Jonathan
showed, but guessing a different encoding when a file fails to decode
properly against the specified encoding is a dangerous game to play.
So I don't think such a switch should happen automatically just to avoid
reporting encoding errors.

I reported the issue to the xstring maintainers here:
https://framagit.org/unbonpetit/xstring/-/issues/4

David

> that says what follows next is to be interpreted in a different way to what
> came previously?
> Until the next switch that returns to UTF-8 or whatever?
>
>
> If XeTeX is based on eTeX, then this should be possible in that setting.
>
>
> Even replacing by U+FFFD
> is being lenient.
>
> David
>
>
>
>
> On Sun, 21 Feb 2021 at 11:04, jfbu <jfbu at free.fr> wrote:
>
>> Hi,
>>
>> consider this
>>
>> \documentclass{article}
>> \usepackage{xstring}
>> \begin{document}
>> \end{document}
>>
>> and call it xexstring.tex
>>
>> Then xelatex xexstring triggers 136 warnings of the type
>>
>> Invalid UTF-8 byte or sequence at line 35 replaced by U+FFFD.
>>
>> Looking at file
>>
>> /usr/local/texlive/2020/texmf-dist/tex/generic/xstring/xstring.tex
>>
>> I see that this matches with use of latin-1 encoded characters in
>>
>> Notice that it is not a user decision here to use a latin-1
>> encoded file.
>>
>> In fact I encountered this in a file I was given where
>> xstring package was loaded by another package.
>>
>> Regards,
>>
>> Jean-François
>>
>
>
> Cheers.
>
> Ross
>
>
>
>
