[XeTeX] [EXT] A LaTeX Unicode initialization desire/question/suggestion

Mon Jan 13 04:41:26 CET 2020

Phil Taylor wrote: 

>| So because JSBox is required/designed to incorporate all of XeTeX's 
>| features, it must (by definition) implement/provide \Umathcode. 

Just to be clear, JSBox can eventually incorporate all of XeTeX's features (primitives), but it does not do so now. It doesn't even incorporate pdfTeX's features, though it is set up to. I'm merely adding XeTeX features as necessary to get the LaTeX macro library installed and then typeset a LaTeX document containing no Unicode at all. The problem is that somewhere in the LaTeX format initialization, the ability to recognize a Unicode character (as opposed to a UTF-8 byte sequence) is taken to mean that the engine is XeTeX, and therefore that at least some of XeTeX's features are present and can be relied upon at format initialization time.

>| But could not JSbox perform (or simulate) the following : 

>| \let \Umathschar = \Umathchar % use British spelling as synonym 
>| \let \Umathchar = \undefined % inhibit "load-unicode-data.tex"'s special treatment of engines that implement \Umathchar 
>| \input load-unicode-data % since it would seem that you cannot simply skip this step 
>| \let \Umathchar = \Umathschar % restore canonical meaning of \Umathchar 

It could, but it's not my code that's issuing "\input load-unicode-data". The reading of "load-unicode-data.tex" is embedded within my version of LaTeX's own initialization code, and there's no guarantee that something elsewhere in that code doesn't depend on \Umathchar in a way that such a redefinition would interfere with. LaTeX's code has several tests that rely on whether |\Umathchar| is defined. Indeed, the latest official comments, as David Carlisle brought to my attention in this thread, declare that testing for the existence of \Umathchar is the current way to go in all sorts of places.

Such negative "let's fool some other code to get something done" hacks are fragile because they make the other, affected TeX code impossible to understand on reading. Far better and safer is an affirmative addition to the various checks already being made, one that facially means what it says: if Unicode character mapping data has already been loaded, don't bother loading it again.
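
As a rough sketch only, such an affirmative check might look like the following. The flag name \UnicodeDataPreloaded is hypothetical (an engine that preloads its tables would have to define it), and this is not LaTeX's actual initialization code:

```tex
% Sketch only: \UnicodeDataPreloaded is a made-up name, not an existing macro.
\ifx\Umathchar\undefined
  % 8-bit engine: there is no Unicode data to load.
\else
  \ifx\UnicodeDataPreloaded\undefined
    % Unicode engine whose tables are still in their default state.
    \input load-unicode-data
  \fi
\fi
```

The existing \Umathchar test keeps its meaning, and the extra condition says in plain terms why the file is being skipped.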

Here is perhaps a slightly better hack: 

If it's acceptable as the very first executable line in latex.ltx (or other format source files) to test the catcode value of `{ to determine whether a format has already been loaded or not, then it should be acceptable within "load-unicode-data.tex" (or the like) to include a similar test to determine whether to proceed with the TeX parse of the Unicode data, or to bail because it's presumable that the tables are already initialized. For example, the first non-8-bit Unicode character is: 

0100;LATIN CAPITAL LETTER A WITH MACRON;Lu;0;L;0041 0304;;;;N;LATIN CAPITAL LETTER A MACRON;;;0101; 

It is safe, I think, to assume that this Unicode character will forever be classified as an uppercase letter (with a lowercase mapping value of U+0101). 

When the XeTeX engine begins running, before any TeX source code is interpreted, the engine initializes its internal |cat_code| array (all 1,114,112 slots) with the value |other_char| (12). It then does the usual classic TeX initialization to declare ASCII letters as such, etc. Later, during the LaTeX format's reading of "load-unicode-data.tex", a simple test to determine whether to continue reading the file could be made based on whether the catcode value of U+0100 is 11 (letter) or 12 (other). If it's already known as a letter, then the catcode table is not in its initial default state, and a second initialization is unnecessary. If it's still an |other_char| (12), then things need initializing for letter characters and the rest of "load-unicode-data.tex" should be executed. 
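
A minimal sketch of such a test, assuming it is placed near the top of "load-unicode-data.tex" and that the file is only ever read by an engine whose \catcode accepts code points above 255 (this guard is an illustration, not part of the real file):

```tex
% Hypothetical guard; the real load-unicode-data.tex contains no such test.
\ifnum\catcode"0100=11 % U+0100 is already a letter, so the tables were preloaded
  \expandafter\endinput\fi % close the conditional, then stop reading this file
% Otherwise U+0100 is still "other" (12) and the normal parse of the
% Unicode data continues from here.
```

The \expandafter disposes of the \fi before \endinput takes effect, so the conditional is not left open when the file is abandoned.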

>>| Furthermore, the purpose of executing "load-unicode-data.tex" is precisely to 
>>| populate the \Umathchar table, as well as other Unicode character tables. 
>>| So these tables have to exist prior to executing the file. 

>| Well, do they, in the case of JSBox? From what you wrote in your original 
>| query, I thought that that [1] was the very thing that you were trying to avoid ... 
>| [1] "executing "load-unicode-data.tex" [in order] to populate the \Umathchar table". 
>| So specifically, does the \Umathchar table have to exist, in JSBox, at the point 
>| that "load-unicode-data.tex" is loaded ? 

I'm trying to avoid initializing these character mapping tables twice, especially when the second time (reading this file) rather inefficiently takes 30 times longer than the first and accomplishes nothing new.

Thanks for thinking about my questions; I appreciate it.

Doug McKenna 
