<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sun, 21 Feb 2021 at 20:27, Ross Moore <<a href="mailto:ross.moore@mq.edu.au">ross.moore@mq.edu.au</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<div style="overflow-wrap: break-word;">

Hi David,<br>

<div><br><div>Surely the line-end characters are already known, and the bits&bytes </div>

<div>have been read up to that point *before* tokenisation.</div></div></div></blockquote><div><br></div><div>This is not a pdflatex inputenc-style UTF-8 error, where a stream of tokens fails to map to characters.</div><div><br></div><div>The failure is at the file-reading stage. If you have the file encoding wrong, you do not reliably know where the lines end, and you have not interpreted the file as TeX at all, so the comment character really can't have an effect here. This mapping is invisible to the TeX macro layer, just as in classic TeX you can change the internal character code mapping to accept an EBCDIC stream: if you do that and then read an ASCII file, you get rubbish with no hope of recovery.</div><div><br></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div style="overflow-wrap: break-word;"><div>

<div>So provided the tokenisation of the comment character has occurred before</div>

<div>tackling what comes after it, why would there be a problem?</div>

<br>

<blockquote type="cite">

<div>

<div dir="ltr">

<div class="gmail_quote">

<div>... just guessing the encoding (which means guessing where the line and so the comment ends)</div>

<div>is just guesswork.<br>

</div>

</div>

</div>

</div>

</blockquote>

<div><br>

</div>

<div>No guesswork intended.</div>

<br>

<blockquote type="cite">

<div>

<div dir="ltr">

<div class="gmail_quote">

<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<div>

<div>

<blockquote type="cite">

<div>

<div dir="ltr">

<div><br>

</div>

<div>The file encoding specifies the byte stream interpretation before any tex tokenization</div>

<div>If the file can not be interpreted as utf-8 then it can't be interpreted at all.

</div>

</div>

</div>

</blockquote>

<div><br>

</div>

<div>Why not?</div>

<div>Why can you not have a macro — presumably best on a single line by itself –</div>

</div>

</div>

</blockquote>

<div> </div>

<div>there is an xetex   primitive that switches the encoding as Jonathan showed, but  guessing a different encoding</div>

<div>if a file fails to decode properly against a specified encoding is a dangerous game to play.</div>

</div>

</div>

</div>

</blockquote>

<div><br>

</div>

<div>I don’t think anyone is asking for that.</div>

<div><br>

</div>

<div>I can imagine situations where coding for packages that used to work well </div>

<div>without UTF-8 may well be commented involving non-UTF-8 characters.</div>

<div>(Indeed, there could even be binary bit-mapped images within comment sections;</div>

<div>having bytes not intended to represent any characters at all, in any encoding.)</div></div></div></blockquote><div><br></div><div>That really isn't possible. You are decoding a byte stream as UTF-8. Once you reach a section that does not decode, you could delete it or replace it byte by byte with the Unicode replacement character, but after that everything is guesswork and heuristics: just because some later section happens to decode without error doesn't mean it was decoded as intended. Imagine the section had been in UTF-16 rather than latin-1. It is quite possible for a stream of bytes to be both valid UTF-8 and valid UTF-16, so there is no way to step over a commented-out UTF-16 section and know when to switch back to UTF-8.<br></div><div><br></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div style="overflow-wrap: break-word;"><div>

<div><br>

</div>

<div>If such files are now subjected to constraints that formerly did not exist,</div>

<div>then this is surely not a good thing.</div></div></div></blockquote><div><br></div><div>That is not what happened here. The constraints always existed. The processing did not change: the file, which used to be distributed in UTF-8, is now distributed in latin-1, so it gives warnings if read as UTF-8.</div><div><br></div><div> <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div style="overflow-wrap: break-word;"><div>

<div><br>

</div>

<div><br>

</div>

<div>Besides, not all the information required to build PDFs need be related to</div>

<div>putting characters onscreen, through the typesetting engine.</div>

<div><br>

</div>

<div>For example, when building fully-tagged PDFs, there can easily be more information</div>

<div>overall within the tagging (both structure and content) than in the visual content itself. </div>

<div>Thank goodness for Heiko’s packages that allow for re-encoding strings between</div>

<div>different formats that are valid for inclusion within parts of a PDF.</div></div></div></blockquote><div><br></div><div>But the packages require the files to be read correctly, and that is what is not happening.</div><div> <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div style="overflow-wrap: break-word;"><div>

<div><br>

</div>

<div>I’m thinking here about how a section-title appears in:</div>

<div> bookmarks, ToC entries, tag-titles, /Alt strings, annotation text for hyperlinking, etc.</div>

<div>as well as visually typeset for on-screen.</div>

<div>These different representations need to be either derivable from a common source,</div>

<div>or passed in as extra information, encoded appropriately (and not necessarily UTF-8).</div>

<div><br>

</div></div></div></blockquote><div>Sure, but that is not related to the problem here, which is that the source file cannot be read, or rather that it is being incorrectly read as UTF-8 when it is latin-1.<br></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div style="overflow-wrap: break-word;"><div>

<blockquote type="cite">

<div>

<div dir="ltr">

<div class="gmail_quote">

<div>So I don't think such a switch should be automatic to avoid reporting encoding errors.<br>

</div>

<div><br>

</div>

<div>I reported the issue at xstring here</div>

<div><a href="https://framagit.org/unbonpetit/xstring/-/issues/4" target="_blank">https://framagit.org/unbonpetit/xstring/-/issues/4</a></div>

<div><br>

</div>

<div><br>

</div>

<div>David</div>

<div><br>

</div>

<div><br>

</div>

<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<div>

<div>

<div>that says what follows next is to be interpreted in a different way to what came previously?</div>

<div>Until the next switch that returns to UTF-8 or whatever?</div>

<div><br>

</div>

<div><br>

</div>

<div>If XeTeX is based on eTeX, then this should be possible in that setting.</div>

<div><br>

</div>

<div><br>

</div>

<blockquote type="cite">

<div>

<div dir="ltr">

<div>Even replacing by U+FFFD <br>

</div>

<div>is being lenient.</div>

</div>

</div>

</blockquote>

</div>

</div>

</blockquote>

</div>

</div>

</div>

</blockquote>

<div><br>

</div>

<div>Why has the mouth not realised that this information is to be discarded?</div>

<div>Then no replacement is required at all.</div></div></div></blockquote><div><br></div><div>The file reading has failed before any TeX-accessible processing has happened (see the EBCDIC example in The TeXbook):</div><div><br></div><div>\danger \TeX\ always uses the internal character code of Appendix~C<br>for the standard ASCII characters,<br>regardless of what external coding scheme actually appears in the files<br>being read.  Thus, |b| is 98 inside of \TeX\ even when your computer<br>normally deals with ^{EBCDIC} or some other non-ASCII scheme; the \TeX\<br>software has been set up to convert text files to internal code, and to<br>convert back to the external code when writing text files.</div><div><br></div><div><br></div><div>The file reading is failing at the "convert text files to internal code" stage, which comes before the line buffer of characters is consulted to produce the stream of tokens based on catcodes.</div><div><br></div><div><br></div><div><br></div></div><div class="gmail_quote">David</div><div class="gmail_quote"><br></div></div>
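<div>A minimal sketch of the decoding points made in this thread, assuming Python's built-in codecs as a stand-in for the engine's file-reading layer. This is not how XeTeX is implemented; the helper name try_decode and the sample byte strings are made up for illustration.</div>
<pre>
```python
# Illustrative only: Python codecs stand in for the "convert text files to
# internal code" stage. try_decode is a hypothetical helper, not a TeX or
# XeTeX interface.

def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

# A comment line containing e-acute, saved as latin-1. The byte 0xE9 followed
# by a newline is not a valid UTF-8 sequence, so decoding fails before any
# notion of a "comment" exists.
latin1_line = "% caf\u00e9\n".encode("latin-1")
as_utf8 = try_decode(latin1_line, "utf-8")
as_latin1 = try_decode(latin1_line, "latin-1")

# The lenient option: replace the undecodable byte with U+FFFD and carry on.
lenient = latin1_line.decode("utf-8", errors="replace")

# Four bytes that are valid in both encodings. Read as UTF-8 they start with
# the comment character; read as UTF-16 (little-endian) they are two unrelated
# characters and contain no percent sign at all.
ambiguous = b"%abc"
read_as_utf8 = try_decode(ambiguous, "utf-8")
read_as_utf16 = try_decode(ambiguous, "utf-16-le")

print(repr(as_utf8), repr(as_latin1), repr(lenient))
print(repr(read_as_utf8), repr(read_as_utf16))
```
</pre>
<div>The last case is the awkward one: both decodings succeed, so nothing in the bytes says which was intended, and a successful decode is no evidence that the text was read as the author meant.</div>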