[XeTeX] latin-1 encoded characters in commented out parts trigger log warnings
Bruno Le Floch
blflatex at gmail.com
Sun Feb 21 22:48:34 CET 2021
Hi Ross,
On 2/21/21 10:42 PM, Ross Moore wrote:
> Hi Ulrike,
>
>> On 22 Feb 2021, at 7:52 am, Ulrike Fischer wrote:
>>
>> On Sun, 21 Feb 2021 20:26:04 +0000, Ross Moore wrote:
>>
>> > Once you have encountered the (correct) comment character,
>> > what follows on the rest of the line is going to be discarded,
>> > so its encoding is surely irrelevant.
>> >
>> > Why should the whole line need to be fully tokenised,
>> > before the decision is taken as to what part of it is retained?
>>
>> Well you need to find the end of the line to know where to stop with
>> the discarding don't you? So you need to inspect the part after the
>> comment char until you find something that says "newline".
>
> My understanding is that this *is* done first.
> Similar to TeX's \read to <csname>, which grabs a line of input from a file
> before tokenising it and storing the result in the <csname>
> (see page 217 of The TeXbook).
>
> If I’m wrong about this, then yes, for high-speed input you need to know where
> to stop.
> But that’s just as easy, since you stop when a byte is to be tokenised
> as an end-of-line character, and these are known.
> You need this anyway, even when you have tokenised every byte.
>
>
> So all we are saying is that when handling the bytes between
> a comment and its end-of-line, just be a bit more careful.
>
> It’s not necessary for each byte to be tokenised as valid for UTF-8.
> Maybe change the (Warning) message when you know that you are within
> such a comment, to say so. That would be more meaningful to a package-writer,
> and to an author who uses the package, looks in the .log file, and sees the message.
>
> None of this is changing how the file is ultimately processed;
> it’s just about being friendlier in the human interface.
I think your model of what XeTeX is doing is missing a step. It's important to
distinguish two steps, which are a bit mixed up in some of the comments here.
I'm not 100% sure either, so perhaps more knowledgeable people can chime in.
- The file is read line by line; this step requires finding the end of lines,
hence must depend on some encoding (possibly XeTeX allows changing the encoding
for lines that are not yet read). This puts *characters* (not bytes) in a
buffer. This is also the step where the \endlinechar is inserted, so any change
to \endlinechar on a given line can only affect the next line.
- The characters are then turned into tokens, one token at a time. Catcodes can
be changed within a line, and they affect what characters will combine into
tokens, even within the same line.
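To illustrate the difference between the two steps, here is a minimal sketch
(standard plain TeX / LaTeX behaviour; the choice of | as an escape character
is just for the demonstration):

```tex
% Step 2 is per-token: a catcode change takes effect immediately,
% even later on the same line.
\catcode`\|=0 |message{the catcode change took effect on this very line}

% Step 1 is per-line: \endlinechar is consulted when a line enters the
% buffer, so this assignment cannot affect the current line's own
% end-of-line; only lines read afterwards are affected.
\endlinechar=-1
```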
The problem here is at the first step, where XeTeX cannot find a valid line of
characters in the given encoding. It might be possible to use package hooks to
change the encoding state for that particular package, but I haven't followed
these new LaTeX developments carefully.
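For what it's worth, a sketch of the hook idea, assuming a LaTeX kernel recent
enough to have file hooks (2020-10-01 or later); \XeTeXdefaultencoding is the
XeTeX primitive that sets the encoding used for files opened subsequently, and
the package name demo.sty and the encoding names are placeholders:

```tex
% Hypothetical: read demo.sty as Latin-1, then restore UTF-8 for
% files opened afterwards. Untested sketch, not a recommendation.
\AddToHook{file/demo.sty/before}{\XeTeXdefaultencoding "iso-8859-1"}
\AddToHook{file/demo.sty/after}{\XeTeXdefaultencoding "utf8"}
```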
Best,
Bruno