[XeTeX] latin-1 encoded characters in commented out parts trigger log warnings

David Carlisle d.p.carlisle at gmail.com
Sun Feb 21 22:43:56 CET 2021


On Sun, 21 Feb 2021 at 20:27, Ross Moore <ross.moore at mq.edu.au> wrote:

> Hi David,
>
> Surely the line-end characters are already known, and the bits&bytes
> have been read up to that point *before* tokenisation.
>

This is not a pdflatex inputenc-style UTF-8 error, where a stream of tokens
fails to map.

It happens at the file-reading stage: if you have the file encoding wrong
you do not know reliably where the lines end, and nothing has been
interpreted as TeX at all, so the comment character really can't have any
effect here. This mapping is invisible to the TeX macro layer, just as you
can change the internal character code mapping in classic TeX to take an
EBCDIC stream; if you do that and then read an ASCII file you get rubbish,
with no hope of recovery.
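
To make the ordering concrete, here is a minimal Python sketch (an
illustration only, not XeTeX code; the file contents are made up) showing
that a stray latin-1 byte breaks UTF-8 decoding even when it sits after a
% on the line: the decoding works on raw bytes, before anything knows what
a comment character is.

    # one source line as raw bytes; 0xE9 is e-acute in latin-1,
    # but on its own it is not a valid UTF-8 sequence
    line = b"% caf\xe9 -- commented out, but the decoder never sees the %\n"

    try:
        line.decode("utf-8")          # the step that fails (or warns)
    except UnicodeDecodeError as err:
        print("undecodable byte at offset", err.start)

    print(line.decode("latin-1"))     # fine, but only if you know it is latin-1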


> So provided the tokenisation of the comment character has occurred before
> tackling what comes after it, why would there be a problem?
>
> ... just guessing the encoding (which means guessing where the line and so
> the comment ends)
> is just guesswork.
>
>
> No guesswork intended.
>
>
>> The file encoding specifies the byte stream interpretation before any tex
>> tokenization
>> If the file can not be interpreted as utf-8 then it can't be interpreted
>> at all.
>>
>>
>> Why not?
>> Why can you not have a macro — presumably best on a single line by itself
>
> there is an xetex primitive that switches the encoding as Jonathan
> showed, but guessing a different encoding
> if a file fails to decode properly against a specified encoding is a
> dangerous game to play.
>
>
> I don’t think anyone is asking for that.
>
> I can imagine situations where code for packages that used to work well
> without UTF-8 may well carry comments involving non-UTF-8 characters.
> (Indeed, there could even be binary bit-mapped images within comment
> sections;
> having bytes not intended to represent any characters at all, in any
> encoding.)
>

That really isn't possible. You are decoding a byte stream as UTF-8; once
you get to a section that does not decode you could delete it, or replace
it byte by byte with the Unicode replacement character, but after that
everything is guesswork and heuristics: just because some later section
happens to decode without error doesn't mean it was correctly decoded as
intended. Imagine the section had been in UTF-16 rather than latin-1: it
is quite possible to have a stream of bytes that is valid UTF-8 and valid
UTF-16, and there is no way to step over a commented-out UTF-16 section
and know when to switch back to UTF-8.
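
A small Python illustration of that last point (a sketch only, nothing
XeTeX actually runs): a clean decode proves nothing about which encoding
was intended, because the same bytes can be valid under more than one
encoding.

    data = "\u00e9".encode("utf-8")   # e-acute, stored as b'\xc3\xa9'
    print(data.decode("utf-8"))       # e-acute -- the intended reading
    print(data.decode("latin-1"))     # two other characters -- also no error

    # ASCII text stored as UTF-16LE is, byte for byte, also valid UTF-8:
    u16 = "hi".encode("utf-16-le")    # b'h\x00i\x00'
    print(u16.decode("utf-8"))        # decodes cleanly, but as h, NUL, i, NUL

Both decodes succeed without raising anything, so "it decoded" cannot tell
you where a latin-1 or UTF-16 section ends and UTF-8 resumes.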



> If such files are now subjected to constraints that formerly did not exist,
> then this is surely not a good thing.
>

That is not what happened here: the constraints always existed. It is not
that the processing changed; the file, which used to be distributed in
UTF-8, is now distributed in latin-1, so it gives warnings if read as
UTF-8.



>
> Besides, not all the information required to build PDFs need be related to
> putting characters onscreen, through the typesetting engine.
>
> For example, when building fully-tagged PDFs, there can easily be more
> information
> overall within the tagging (both structure and content) than in the visual
> content itself.
> Thank goodness for Heiko’s packages that allow for re-encoding strings
> between
> different formats that are valid for inclusion within parts of a PDF.
>

But the packages require the files to be read correctly, and that is what
is not happening.


> I’m thinking here about how a section-title appears in:
>  bookmarks, ToC entries, tag-titles, /Alt strings, annotation text for
> hyperlinking, etc.
> as well as visually typeset for on-screen.
> These different representations need to be either derivable from a common
> source,
> or passed in as extra information, encoded appropriately (and not
> necessarily UTF-8).
>
Sure, but that is not related to the problem here, which is that the
source file cannot be read, or rather that it is being incorrectly read as
UTF-8 when it is latin-1.

So I don't think such a switch should be made automatic just to avoid
reporting encoding errors.
>
> I reported the issue at xstring here
> https://framagit.org/unbonpetit/xstring/-/issues/4
>
>
> David
>
>
>> that says what follows next is to be interpreted in a different way to
>> what came previously?
>> Until the next switch that returns to UTF-8 or whatever?
>>
>>
>> If XeTeX is based on eTeX, then this should be possible in that setting.
>>
>>
>> Even replacing by U+FFFD
>> is being lenient.
>>
>>
> Why has the mouth not realised that this information is to be discarded?
> Then no replacement is required at all.
>

The file reading has failed before any TeX-accessible processing has
happened (see the EBCDIC example in the TeXBook):

\danger \TeX\ always uses the internal character code of Appendix~C
for the standard ASCII characters,
regardless of what external coding scheme actually appears in the files
being read.  Thus, |b| is 98 inside of \TeX\ even when your computer
normally deals with ^{EBCDIC} or some other non-ASCII scheme; the \TeX\
software has been set up to convert text files to internal code, and to
convert back to the external code when writing text files.


The decoding of the file is failing at the "convert text files to internal
code" stage, which is before the line buffer of characters is consulted to
produce the stream of tokens based on catcodes.
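
As a rough model of that pipeline (a Python toy with made-up function
names, not how XeTeX is actually implemented): stage one turns bytes into
characters, stage two applies catcodes, and the failure -- or the U+FFFD
replacement -- happens entirely in stage one.

    def to_internal(raw: bytes, encoding: str = "utf-8") -> str:
        # "convert text files to internal code"; this is where it warns,
        # replacing undecodable bytes with U+FFFD
        return raw.decode(encoding, errors="replace")

    def strip_comment(line: str) -> str:
        # catcode handling: '%' only means anything at this later stage
        return line.split("%", 1)[0]

    raw = b"text % caf\xe9\n"
    line = to_internal(raw)        # warning/replacement happens here,
    print(strip_comment(line))     # before the '%' can discard anything
    print(strip_comment(to_internal(raw, "latin-1")))   # right, once told latin-1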



David