[tex-live] User names too longs or with diacritics on Windows

Sun Apr 14 16:42:40 CEST 2013

On 14/4/13 15:17, Khaled Hosny wrote:
> On Sun, Apr 14, 2013 at 02:59:07PM +0100, Philip TAYLOR wrote:
>> I wonder whether it might be worth bringing Jonathan Kew,
>> and/or the current maintainer(s) of XeTeX, into this
>> discussion; they must surely have a reasonable knowledge
>> of what additional steps are needed to ensure that
>> Windows utilities are Unicode-aware.
>
> The Windows port of XeTeX was done by Akira, so he is the expert.
>

Yes, as Khaled says. I've never even attempted to build xetex on 
windows; Akira has always done that (for which I am -extremely- grateful).

Having said that, I can take a guess at the problem illustrated in your 
example. XeTeX defaults to interpreting its input as utf-8 (unless it 
detects utf-16, by "sniffing" the first couple of bytes). However, I 
think the shell on windows is passing the command line to xetex using 
the system default codepage (CP1252 in your case, probably). So the "é" 
character in your filename, which in utf-8 would be encoded as the bytes 
<0xC3 0xA9>, is not received in that form but as the single byte <0xE9> 
("é" in CP1252). As far as xetex is concerned, that looks like the first 
byte of a three-byte utf-8 sequence, so it tries to interpret it as 
such; the bytes that follow are normally not valid continuation bytes, 
and the result is "garbage".
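
To make the byte-level picture concrete, here is a small Python sketch 
(not xetex's code; the filename "é.tex" and the helper name are made up 
for illustration). It shows the two encodings of "é" and what a utf-8 
decoder does when handed the CP1252 byte. How xetex itself reacts to the 
invalid sequence may well differ from Python's replacement character.

```python
# The same character encoded two ways.
name = "é"
utf8_bytes = name.encode("utf-8")      # b'\xc3\xa9'
cp1252_bytes = name.encode("cp1252")   # b'\xe9'


def decode_as_utf8(raw: bytes) -> str:
    """Decode raw bytes as utf-8, substituting U+FFFD for invalid sequences."""
    return raw.decode("utf-8", errors="replace")


# A hypothetical command-line argument as delivered in CP1252.
received = "é.tex".encode("cp1252")    # b'\xe9.tex'
garbled = decode_as_utf8(received)     # 0xE9 is a lead byte with no continuation
```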

To fix this, xetex would need to ask the system what the current 
codepage is, and convert the command-line from that codepage to unicode 
for its internal use.
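
In outline, that conversion could look like the following Python sketch. 
The function name decode_argument is hypothetical, and using 
locale.getpreferredencoding() as a stand-in for "ask the system what the 
current codepage is" is an assumption; a real Windows build would 
presumably query the system through its own APIs.

```python
import locale
from typing import Optional


def decode_argument(raw: bytes, codepage: Optional[str] = None) -> str:
    """Convert a command-line argument from the system codepage to unicode.

    If no codepage is given, fall back to the locale's preferred encoding
    (an assumption standing in for a real system query).
    """
    if codepage is None:
        codepage = locale.getpreferredencoding(False)
    return raw.decode(codepage)


# The single byte 0xE9 becomes "é" once the codepage is taken into account.
fixed = decode_argument(b"r\xe9sum\xe9.tex", "cp1252")
```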

Moreover, for messages written to the terminal to appear correctly, it 
would also need to convert those messages back from unicode to the 
system codepage - or avoid the issue by the use of ^^xx escapes, so that 
the terminal output is pure ascii.
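
A rough Python sketch of the escape approach follows. The function name 
caret_escape is made up, and the choice of ^^xx for code points up to 
0xFF and ^^^^xxxx for the rest of the BMP is a simplifying assumption 
about the notation, not a description of what xetex actually prints.

```python
def caret_escape(text: str) -> str:
    """Return text with every non-printable-ascii character written as an escape.

    Code points up to 0xFF become ^^xx, other BMP code points become
    ^^^^xxxx (lowercase hex). Anything beyond the BMP is rejected to keep
    this sketch small.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if 0x20 <= cp <= 0x7E:
            out.append(ch)
        elif cp <= 0xFF:
            out.append("^^%02x" % cp)
        elif cp <= 0xFFFF:
            out.append("^^^^%04x" % cp)
        else:
            raise ValueError("code point beyond the BMP: U+%X" % cp)
    return "".join(out)


escaped = caret_escape("résumé.tex")   # pure ascii, safe in any codepage
```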

That should deal with decoding the command line correctly, I think. I'm 
not sure whether the file-access APIs that xetex uses can actually 
accept unicode filenames, or whether it would also need to convert 
unicode names (whether from the command line or from names used within 
documents) back to the system codepage in order to actually open the 
file. That may depend on whether it's using the posix APIs (which 
probably depend on the system codepage) or windows-specific APIs that 
handle unicode natively. Akira would know more about this, I'm sure.

(In theory, I suspect similar encoding issues apply on *nix platforms, 
but the use of utf-8 as the default codepage is pretty widespread these 
days, so most people won't run into these problems.)

Oh, and as for (pdf)tex: it doesn't run into these problems because it 
treats the command line, like all input, simply as a string of bytes, 
without regard to encoding. Whatever bytes it receives there will 
presumably be passed unchanged to the file-access APIs. So things should 
normally work, although it may be unable to access files whose (unicode) 
name cannot be represented in the current system codepage.
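
The last caveat can be checked mechanically. In this Python sketch the 
helper name representable and the sample filenames are made up; the 
point is only that a name containing "ě" has no byte representation in 
CP1252, so no byte string on the command line could ever name that file.

```python
def representable(name: str, codepage: str) -> bool:
    """Report whether every character of name exists in the given codepage."""
    try:
        name.encode(codepage)
    except UnicodeEncodeError:
        return False
    return True


western_ok = representable("résumé.tex", "cp1252")   # "é" exists in CP1252
czech_in_western = representable("Zdeněk.tex", "cp1252")   # "ě" does not
czech_in_central = representable("Zdeněk.tex", "cp1250")   # but CP1250 has it
```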

Moreover, if I understood Zdeněk's message correctly, it sounded like 
there may sometimes be a mismatch between the codepage that the shell 
(cmd.exe) is using (and hence the byte sequence passed to the *tex 
process on the command line) and the codepage assumed by the APIs used 
to access files within the binary. If that's the case, things will 
indeed break.
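
To illustrate what such a mismatch would do, here is a Python sketch. 
The pairing of CP850 for the shell and CP1252 for the file APIs is an 
assumption picked for the example (it is not taken from Zdeněk's 
message), and the function name reinterpret is hypothetical.

```python
def reinterpret(name: str, shell_codepage: str, api_codepage: str) -> str:
    """Show the name the file APIs would see if the shell encoded it differently.

    The shell turns the name into bytes using its own codepage; the APIs
    then read those same bytes under a different codepage.
    """
    raw = name.encode(shell_codepage)
    return raw.decode(api_codepage, errors="replace")


# "é" is byte 0x82 in CP850, and 0x82 is U+201A (a low quotation mark) in CP1252.
seen_by_api = reinterpret("résumé.tex", "cp850", "cp1252")
same_codepage = reinterpret("résumé.tex", "cp1252", "cp1252")
```

With matching codepages the name survives the round trip; with 
mismatched ones the process ends up looking for a file that does not 
exist.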

JK