[XeTeX] [EXT] A LaTeX Unicode initialization desire/question/suggestion

Joseph Wright joseph.wright at morningstar2.co.uk
Mon Jan 13 09:21:20 CET 2020


On 13/01/2020 03:41, Doug McKenna wrote:
> Phil Taylor wrote:
> 
>> | So because JSBox is required/designed to incorporate all of XeTeX's
>> | features, it must (by definition) implement/provide \Umathcode.
> 
> Just to be clear, JSBox can eventually incorporate all of XeTeX's features (primitives), but does not do so now. It doesn't even incorporate pdfTeX's features, but it is set up to. I'm merely adding XeTeX features as necessary to get the LaTeX macro library installed and then typeset a LaTeX document containing no Unicode at all. The problem is that somewhere in the LaTeX format initialization the ability to recognize a Unicode character (as opposed to a UTF-8 byte sequence) is equated with the assumption that it's being run under XeTeX, and that therefore at least some of XeTeX's features are there and can be relied upon at format initialization time.

At present, there are two engines 'in the wild' that implement 
\Umathcode and friends, XeTeX and LuaTeX, and over time they have come 
to an agreed position on what core features are available at the macro 
level. (For example, XeTeX originally called its new primitives 
\XeTeX..., but they were renamed to \U... to match LuaTeX.)

They have quite a lot of differences too, but a core subset of features 
is available in both, and that subset comes along with \Umathcode. 
Almost all of the tests in LaTeX look for the relevant primitive 
directly, so for example when we want \Uchar we look for \Uchar. 
However, there are, as you note, a few places where finding \Umathcode 
is by far the easiest marker.
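To illustrate the pattern (a sketch only, not the exact code LaTeX 
uses):

  % Sketch: branch on whether a Unicode engine's primitive exists
  \ifdefined\Umathcode
    % XeTeX or LuaTeX: Unicode-aware primitives can be assumed
  \else
    % Classic 8-bit engine such as pdfTeX
  \fi

(\ifdefined is an e-TeX primitive, but all the engines in question 
provide e-TeX.)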

It's quite possible to add additional tests to the core code, provided 
there is a spec or at least some notes on what's available. (For 
example, (u)pTeX for a long time had no documentation in English, so 
things were tricky. There is now a basic manual, which allows those of 
us who do not know Japanese to offer at least some basic support.)

>> | But could not JSbox perform (or simulate) the following :
> 
>> | \let \Umathschar = \Umathchar % use British spelling as synonym
>> | \let \Umathchar = \undefined % inhibit "load-unicode-data.tex"'s special treatment of engines that implement \Umathchar
>> | \input load-unicode-data % since it would seem that you cannot simply skip this step
>> | \let \Umathchar = \Umathschar % restore canonical meaning of \Umathchar
> 
> It could, but it's not my code that's issuing "\input load-unicode-data". The reading of "load-unicode-data.tex" is embedded within my version of LaTeX's own initialization code, and there's no guarantee that elsewhere in that code there isn't some dependence on \Umathchar that such a re-definition might interfere with. LaTeX's code has several tests that rely on whether |\Umathchar| is defined or not, and even in the latest versions, it is declared that \Umathchar existence is the official way to test. Indeed, the latest official comments, as David Carlisle brought to my attention in this thread, declare that \Umathchar existence testing is the current way to go in all sorts of places.

I think you mean \Umathcode :)

Each place that uses Unicode features does test for this primitive. If 
it exists, we have to date been able to assume that a few additional 
primitives are also available (e-TeX, \Uchar, \Umathchardef), but 
mainly it tells us that we can set \lccode and \uccode values beyond 255.
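For example (again just a sketch, not the actual format code):

  % Only meaningful under a Unicode engine: case mappings above 255
  \ifdefined\Umathcode
    \lccode"100="101 % U+0100 LATIN CAPITAL LETTER A WITH MACRON -> U+0101
    \uccode"101="100 % and the uppercase mapping in the other direction
  \fi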

> Here is perhaps a slightly better hack:
> 
> If it's acceptable as the very first executable line in latex.ltx (or other format source files) to test the catcode value of `{ to determine whether a format has already been loaded or not, then it should be acceptable within "load-unicode-data.tex" (or the like) to include a similar test to determine whether to proceed with the TeX parse of the Unicode data, or to bail because it's presumable that the tables are already initialized. For example, the first non-8-bit Unicode character is:
> 
> 0100;LATIN CAPITAL LETTER A WITH MACRON;Lu;0;L;0041 0304;;;;N;LATIN CAPITAL LETTER A MACRON;;;0101;
> 
> It is safe, I think, to assume that this Unicode character will forever be classified as an uppercase letter (with a lowercase mapping value of U+0101).

The test at the start of latex.ltx is about making sure we are in IniTeX 
mode: I'm not sure I'd choose to do that today, but the test is 
long-standing. For load-unicode-data, the idea was partly that there was 
really no issue with checking: unlike formats, which might have hidden 
state, here all we are trying to do is get to a known position. That 
links to the second reason I'm slightly wary of a test. As written, 
load-unicode-data ensures that the \lccode, \uccode and \catcode tables 
are in a state *known to the macro layer*. I know it may seem slightly 
strange to you, but as a macro programmer I can't 'know' what different 
engine developers might do or change, and I certainly don't know exactly 
what version of UnicodeData.txt you are working from. By doing the 
initialisation without checking, I can be sure we are on a known 
Unicode version.

To be honest, that's all a minor concern: it's much more that there was 
no need to worry about a test. It would be trivial to add one, not 
least since the Unicode Consortium have a clear position on stability.
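A guard along the lines you suggest might look like this (purely 
hypothetical; no such test exists in load-unicode-data.tex at present), 
using the stability of the U+0100 -> U+0101 lowercase mapping as the 
marker:

  % Hypothetical guard: if U+0100 already lowercases to U+0101, assume
  % the engine has pre-initialised the tables and stop reading this file.
  \ifnum\lccode"100="101 \endinput\fi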

> I'm trying to avoid initializing these character mapping tables twice, especially when the second time (reading this file) rather inefficiently takes 30 times longer than the first, and accomplishes nothing new.

As I said, from a macro programmer's point of view it accomplishes 'the 
codes are in a known state I control', though practically that's not a 
major thing. (If you were using a Unicode version different from the 
one XeTeX/LuaTeX use, it would presumably affect only a rather limited 
subset of characters.)

Joseph


