[XeTeX] latin-1 encoded characters in commented out parts trigger log warnings

Bruno Le Floch blflatex at gmail.com
Sun Feb 21 22:48:34 CET 2021

Hi Ross,

On 2/21/21 10:42 PM, Ross Moore wrote:
> Hi Ulrike,
>> On 22 Feb 2021, at 7:52 am, Ulrike Fischer wrote:
>> Am Sun, 21 Feb 2021 20:26:04 +0000 schrieb Ross Moore:
>> > Once you have encountered the (correct) comment character,
>> > what follows on the rest of the line is going to be discarded,
>> > so its encoding is surely irrelevant.
>> >
>> > Why should the whole line need to be fully tokenised,
>> > before the decision is taken as to what part of it is retained?
>> Well you need to find the end of the line to know where to stop with
>> the discarding don't you? So you need to inspect the part after the
>> comment char until you find something that says "newline".
> My understanding is that this *is* done first.
> Similarly to TeX's  \read  to  <csname>  which grabs a line of input from a file, 
> before doing the tokenisation and storing the result in the <csname>.
>    page 217 of The TeXbook
> If I’m wrong about this, for high-speed input, then yes you need to know where
> to stop.
> But that’s just as easy, since you stop when a byte is to be tokenised
> as an end-of-line character, and these are known. 
> You need this anyway, even when you have tokenised every byte.
> So all we are saying is that when handling the bytes between
> a comment and its end-of-line, just be a bit more careful.
> It’s not necessary for each byte to be tokenised as valid for UTF-8.
> Maybe change the (Warning) message when you know that you are within
> such a comment, to say so.  That would be more meaningful to a package-writer, 
> and to an author who uses the package, looks in the .log file, and sees the message.
> None of this is changing how the file is ultimately processed;
> it’s just about being friendlier in the human interface.

I think your model of what XeTeX is doing is missing a step.  It's important to
distinguish two steps, which are a bit mixed up in some of the comments here.
I'm not 100% sure either, so perhaps more knowledgeable people can chime in.

- The file is read line by line; this step requires finding the end of lines,
hence must depend on some encoding (possibly XeTeX allows changing the encoding
for lines that are not yet read).  This puts *characters* (not bytes) in a
buffer.  This is also the step where the \endlinechar is inserted, so any change
to \endlinechar on a given line can only affect the next line.

- The characters are then turned into tokens, one token at a time.  Catcodes can
be changed within a line, and they affect what characters will combine into
tokens, even within the same line.
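The timing of the first step can be seen with \endlinechar. A minimal plain-TeX sketch (my own illustration, not from the original mail) of the point that the endline character is inserted when the line is read, so an assignment on the current line can only affect later lines:

```tex
% The \endlinechar assignment below executes while the current line is
% already sitting in the buffer, so the endline character at the end of
% THIS line was inserted before the assignment took effect; only lines
% read afterwards are influenced.
\endlinechar=-1
% From here on, no endline character is appended to lines as they are
% read, so a blank line no longer produces \par and a line break adds
% no space token.
\endlinechar=13  % restore the default (ASCII carriage return)
\bye
```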
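By contrast, the second step reacts immediately. A plain-TeX sketch (again my own illustration) showing that a catcode change takes effect for later characters on the same line, because tokenization proceeds one token at a time:

```tex
% After the assignment, the very next character on the SAME line is
% already tokenized under the new catcode table:
\catcode`\|=0 |message{the bar now acts like a backslash}\bye
```

Here `|message` is read as the control word \message, even though the catcode of `|` was changed only moments earlier on the same input line.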

The problem here is at the first step, where XeTeX cannot decode a valid line of
characters in the given encoding.  It might be possible to use package hooks to
change the encoding state for that particular package, but I haven't followed
these new LaTeX developments carefully.
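For reference, XeTeX does expose a primitive for the parenthetical above: \XeTeXinputencoding changes how lines not yet read from the current file are decoded. A sketch (the filename is hypothetical) of how a package might wrap legacy input:

```tex
% \XeTeXinputencoding takes effect from the next line read, so it can
% bracket material stored in a legacy 8-bit encoding:
\XeTeXinputencoding "latin-1"  % decode subsequent lines as latin-1
\input legacy-file             % hypothetical file with latin-1 bytes
\XeTeXinputencoding "utf-8"    % back to the default
```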

