[luatex] [lltx] [tex-live] Location of recorder file

Philipp Stephani st_philipp at yahoo.de
Sat May 14 02:43:31 CEST 2011


On 14.05.2011 at 01:42, Reinhard Kotucha wrote:

> On 2011-05-13 at 13:41:32 +0200, Philipp Stephani wrote:
> 
>> On 13.05.2011 at 13:15, Heiko Oberdiek wrote:
>> 
>>> I think, the file name interfaces should be transparent
>>> in the sense, that all characters are supported and
>>> the user is only hit by the restrictions of the operating
>>> system or the file system, but not by artificial restrictions
>>> by the software in between.
>> 
>> I agree, but this will never happen: try to use a Unicode file name
>> on Windows in any engine ... even though they are perfectly legal
>> as far as the operating system is concerned, no TeX processor ever
>> will accept them because fopen doesn't accept Unicode file names on
>> Windows. Either all engines switch to a nonstandard C runtime that
>> interprets file names as UTF-8, or all engines are rewritten to
>> avoid fopen on Windows. Both are extremely unlikely to ever happen.
> 
> I must admit that I don't understand.  First of all, when talking
> about character encodings, I don't know what "the operating system"
> means.  AFAIK, filenames are stored as UTF-8 in NTFS (don't know
> whether FAT supports UTF-8).  

It uses UTF-16 (on both NTFS and VFAT), and the operating system functions expect UTF-16 as well.
The difference between Windows and Unix-like systems in this respect is that the latter expect an opaque byte string that is simply passed through to the file system, while Windows expects a UTF-16 string.
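
To make the difference concrete, here is a small Python sketch (Python is used only for illustration; the helper utf16_units and the sample name are made up for this example). It shows the 16-bit code units that a UTF-16 interface would see for a name, next to two different byte strings that a Unix-like system would simply treat as two unrelated names.

```python
import struct

def utf16_units(name):
    """Return the UTF-16 code units of name (hypothetical helper)."""
    raw = name.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))

name = "Ω.tex"

# What a UTF-16 interface sees: one 16-bit unit per character here.
units = utf16_units(name)

# What a byte-oriented interface sees depends on the encoding the
# application happened to choose; the system does not interpret it.
utf8_bytes = name.encode("utf-8")
greek_bytes = name.encode("iso8859_7")

print([hex(u) for u in units])
print(utf8_bytes, greek_bytes)
```

The two byte strings differ although they are meant to be the same name, which is exactly what "opaque" means here.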

> 
> My question is how and where this is implemented.
> 
> The user interfaces are using different encodings, in a German
> Windows, the Exploder uses CP1252 and cmd.exe is using CP850.  I would
> expect that they translate filenames to UTF-8 internally.

Everything in Windows is UTF-16; no UTF-8 or other encodings are involved. The console does sometimes use CP850 (or a similar code page), but only to support legacy programs; internally it uses UTF-16 as well. Encodings other than UTF-16 exist only in application-level wrappers; everything from the Native API (NTDLL.DLL) down to the kernel and file system uses UTF-16.
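
As a rough illustration of why the legacy code pages cause confusion, the following Python sketch encodes the same character in CP1252, CP850 and UTF-16. It only uses Python's built-in codecs and does not call any Windows API; the variable names are chosen for this example.

```python
char = "ä"

as_cp1252 = char.encode("cp1252")    # GUI ("ANSI") code page on a German system
as_cp850 = char.encode("cp850")      # console ("OEM") code page
as_utf16 = char.encode("utf-16-le")  # what the native API works with

# The same byte means something else when read with the wrong code page.
misread = as_cp850.decode("cp1252")

print(as_cp1252, as_cp850, as_utf16, misread)
```

A byte string is therefore only meaningful together with its code page, while the UTF-16 form is unambiguous.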

> 
> When you say
> 
>> even though they [Unicode file names] are perfectly legal
>> as far as the operating system is concerned [...]
> 
> I suppose that you have the C API in mind, and I suppose that the
> fopen() you mention is that from MSVCRT.
> 
> Which character encoding does fopen() expect?

The C standard does not specify this. In MSVCRT, fopen wraps CreateFileA, which assumes that its arguments are in Windows-1252 (or whichever legacy code page is active), converts them to UTF-16, and then calls CreateFileW, which does the actual work.
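
The conversion step can be modelled in a few lines of Python. This is only a sketch under the assumption that the active code page is Windows-1252; the real conversion happens inside the operating system, and the names ansi_to_wide and representable are hypothetical helpers, not Windows functions.

```python
def ansi_to_wide(name_bytes, codepage="cp1252"):
    """Model of the A-to-W step: decode legacy bytes, return UTF-16LE bytes."""
    return name_bytes.decode(codepage).encode("utf-16-le")

def representable(name, codepage="cp1252"):
    """Can this file name be passed through a byte-oriented, code-page API?"""
    try:
        name.encode(codepage)
        return True
    except UnicodeEncodeError:
        return False

# A name inside the code page survives the round trip.
ok = ansi_to_wide(b"\xe4.tex") == "ä.tex".encode("utf-16-le")

# A name outside the code page cannot even be expressed as an argument.
greek_ok = representable("Ω.tex")

# UTF-8 bytes handed to the same step are silently misinterpreted.
mojibake = ansi_to_wide("ä".encode("utf-8")).decode("utf-16-le")

print(ok, greek_ok, mojibake)
```

This is why passing UTF-8 to fopen does not help: the bytes are decoded with the legacy code page before CreateFileW ever sees them.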

> 
> Does the Exploder use fopen() from MSVCRT?

No, never; otherwise it wouldn't support Unicode. It uses CreateFileW, either directly or via a wrapper (e.g. _wfopen).

>  I ask because I've seen so
> many differences between the Exploder and cmd.exe, especially
> regarding file permissions and UNC paths.

This has nothing to do with encodings; it is handled by the kernel. Both Explorer and the console use the native UTF-16 API internally.

> 
> Is it possible to open a file and avoid MSVCRT?

Of course. Every file open must go through CreateFileW anyway; it does not matter whether it is wrapped or not. (This is just like using open(2) instead of fopen(3).)

>  If yes, with which
> versions of Windows is it compatible?

All NT-based systems, including Windows 2000, XP, Vista and 7 (that is, all Windows systems that are relevant today).

> 
> I ask because you said
> 
>> Either all engines switch to a nonstandard C runtime that
>> interprets file names as UTF-8, [...]
> 
> and I'm wondering whether it's *our* mistake to rely on MSVCRT (which
> actually supports MS-DOG only), even though current versions of
> Windows provide system calls which support UTF-8 natively.  What do
> you mean with "nonstandard"?  Not shipped with recent versions of
> Windows or has to be installed explicitly?

I've only heard that there exist versions of the C runtime that are supposed to accept UTF-8 file names. But now I can't find any reference to it, sorry; the library doesn't seem to be that popular, since all discussions I've found ended in the suggestion to give the operating system what it wants (UTF-16). See for example the top answer to http://stackoverflow.com/questions/402283 for a nice explanation.

> 
> There is a system call execvpe() in MSVCRT,

Not really, there is no way to replace a running process on Windows. execvpe simply creates a new process and exits the old one.

> but some people mentioned
> CreateProcess().  Where does the latter come from?

It is the operating system function to start a new process (comparable to fork + exec). The "spawn" and "system" C runtime functions are wrappers around this function.
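
For comparison, a portable way to start a child process from a script is sketched below in Python. It assumes that the standard subprocess module is layered on CreateProcess on Windows and on fork/exec (or an equivalent) on Unix-like systems, which is the case for CPython as far as I know; the child command is just a placeholder.

```python
import subprocess
import sys

# Start the current Python interpreter as a child process and wait for it.
result = subprocess.run(
    [sys.executable, "-c", "print('child process')"],
    capture_output=True,
    text=True,
    check=True,
)

child_output = result.stdout.strip()
print(child_output)
```

The parent keeps running and collects the child's output, which matches the "create a new process" model rather than the Unix "replace the current process" model of exec.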

>  Obviously not from
> MSVCRT.  
> 
> Is there another runtime lib beyond MSVCRT?  If yes, is it still
> appropriate to rely on the old stuff?

The Microsoft Visual C runtime is the current standard C runtime and not at all old or outdated.

