[XeTeX] latin-1 encoded characters in commented out parts trigger log warnings

Sun Feb 21 23:49:59 CET 2021

On 21/02/2021 21:48, Bruno Le Floch wrote:
> I think your model of what XeTeX is doing is missing a step.  It's important to
> distinguish two steps, which are a bit mixed up in some of the comments here.
> I'm not 100% sure either, so perhaps more knowledgeable people can chime in.
> 
> - The file is read line by line; this step requires finding the end of lines,
> hence must depend on some encoding (possibly XeTeX allows changing the encoding
> for lines that are not yet read).  This puts *characters* (not bytes) in a
> buffer.  This is also the step where the \endlinechar is inserted, so any change
> to \endlinechar on a given line can only affect the next line.
> 
> - The characters are then turned into tokens, one token at a time.  Catcodes can
> be changed within a line, and they affect what characters will combine into
> tokens, even within the same line.
> 
> The problem here is at the first step, where XeTeX cannot find a valid line of
> characters in the given encoding.  It might be possible to use package hooks to
> change the encoding state for that particular package, but I haven't followed
> carefully these new LaTeX developments.

Thanks for this explanation, Bruno -- you're quite right, this is an 
issue at the initial step of reading the external file into the input 
buffer (of characters, not bytes), one line at a time. For this, the 
encoding must be known, and at this stage nothing TeX-ish such as 
\catcode values is yet in play.

Each input file has an encoding associated with it at the time it is 
opened. By default this will be UTF-8, but a different default can be 
set using \XeTeXdefaultencoding; so a workaround for this specific 
problem is to change the default before loading the package, and then 
reset it afterwards.
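
A minimal sketch of that workaround, assuming a LaTeX document compiled 
with XeTeX and that the package file really is in Latin-1:

   \XeTeXdefaultencoding "iso-8859-1"
   \usepackage{xstring}
   \XeTeXdefaultencoding "utf-8"

The default only applies to files opened after it is set, so the main 
document itself carries on being read as UTF-8.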

The encoding used to interpret the *current* input file can also be 
changed on the fly, using \XeTeXinputencoding. This will take effect for 
the *next* line after the line on which it occurs (which has, after all, 
already been decoded from bytes to characters on its way in to the 
buffer, before the \XeTeXinputencoding command could be recognized at all).

This means that if the xstring package maintainers *really* want to keep 
their file in Latin-1 (which I doubt), they could avoid the issue here 
by putting something like

   \ifXeTeX
     \XeTeXinputencoding "iso-8859-1"
   \fi

at the top of the file, before any non-ASCII characters occur. But I 
suspect the change of encoding was inadvertent, in which case they 
should just change it back to UTF-8 and the problem will go away.
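
One caveat: \ifXeTeX is not a primitive; it comes from the iftex (or 
older ifxetex) package, so the snippet above assumes that has been 
loaded. If the file might be read without it, e.g. under plain TeX, a 
test on the primitive itself should work in any engine. An untested 
sketch:

   \expandafter\ifx\csname XeTeXinputencoding\endcsname\relax
     % primitive not available: not XeTeX, nothing to do
   \else
     \XeTeXinputencoding "iso-8859-1"
   \fi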

JK