[XeTeX] latin-1 encoded characters in commented out parts trigger log warnings
jfkthame at gmail.com
Sun Feb 21 23:49:59 CET 2021
On 21/02/2021 21:48, Bruno Le Floch wrote:
> I think your model of what XeTeX is doing is missing a step. It's important to
> distinguish two steps, which are a bit mixed up in some of the comments here.
> I'm not 100% sure either, so perhaps more knowledgeable people can chime in.
> - The file is read line by line; this step requires finding the end of lines,
> hence must depend on some encoding (possibly XeTeX allows changing the encoding
> for lines that are not yet read). This puts *characters* (not bytes) in a
> buffer. This is also the step where the \endlinechar is inserted, so any change
> to \endlinechar on a given line can only affect the next line.
> - The characters are then turned into tokens, one token at a time. Catcodes can
> be changed within a line, and they affect what characters will combine into
> tokens, even within the same line.
> The problem here is at the first step, where XeTeX cannot find a valid line of
> characters in the given encoding. It might be possible to use package hooks to
> change the encoding state for that particular package, but I haven't followed
> carefully these new LaTeX developments.
Thanks for this explanation, Bruno -- you're quite right, this is an
issue at the initial step of reading the external file into the input
buffer (of characters, not bytes), one line at a time. For this, the
encoding must be known, and at this stage nothing TeX-ish such as
\catcode values is yet in play.
Each input file has an encoding associated with it at the time it is
opened. By default this will be UTF-8, but a different default can be
set using \XeTeXdefaultencoding; so a workaround for this specific
problem is to change the default before loading the package, and then
reset it afterwards.
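A minimal sketch of that workaround in a user's preamble, assuming the
document is compiled with XeTeX and that the charset names
"iso-8859-1" and "utf-8" are accepted as encoding names (my
understanding is that these are ICU-style converter names, but please
check the XeTeX reference for the exact spelling):

```latex
% Sketch only: change the default encoding for files opened from here on.
% The current file is unaffected, because it is already open.
\XeTeXdefaultencoding "iso-8859-1"
\usepackage{xstring}  % files opened while loading are decoded as Latin-1
% Assumed reset value; restore whatever default you were using before.
\XeTeXdefaultencoding "utf-8"
```

Note that any other file opened between the two \XeTeXdefaultencoding
lines would also be read as Latin-1, which is harmless for pure-ASCII
files but not for UTF-8 files with non-ASCII content.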
The encoding used to interpret the *current* input file can also be
changed on the fly, using \XeTeXinputencoding. This will take effect for
the *next* line after the line on which it occurs (which has, after all,
already been decoded from bytes to characters on its way in to the
buffer, before the \XeTeXinputencoding command could be recognized at all).
This means that if the xstring package maintainers *really* want to keep
their file in Latin-1 (which I doubt), they could avoid the issue here
by putting an \XeTeXinputencoding declaration naming Latin-1
at the top of the file, before any non-ASCII characters occur. But I
suspect the change of encoding was inadvertent and they should just
change it back to UTF-8, and the problem will go away.
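As a rough sketch of what such a declaration could look like (this is
an illustration, not text taken from xstring itself): the \ifdefined
guard is my own addition so that other engines skip the XeTeX-only
primitive, and it assumes an e-TeX-capable engine; "iso-8859-1" is
again an assumed charset name.

```latex
% Hypothetical first lines of a Latin-1 encoded package file.
% Everything up to and including the declaration must be pure ASCII,
% since these lines are still decoded under the file's initial encoding.
\ifdefined\XeTeXinputencoding
  \XeTeXinputencoding "iso-8859-1" % takes effect from the next line on
\fi
% From here on, XeTeX decodes the bytes of this file as Latin-1.
```

Since \XeTeXinputencoding applies to the current input file only, no
reset should be needed at the end of the file.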