[XeTeX] A LaTeX Unicode initialization desire/question/suggestion
doug at mathemaesthetics.com
Sat Jan 11 01:31:58 CET 2020
What is the possibility of making a slight enhancement to how the Unicode LaTeX format is created with respect to "Unicode-aware" engines (an unfortunately somewhat ill-defined term)? Here's the situation:
My TeX/e-TeX language interpreter, currently called JSBox, is implemented as a simple C library, so it can be incorporated into any other software (for instance, I've recently created a Java class wrapper around it, hoping to soon use it in an Android eBook/app).
JSBox is entirely Unicode-based internally; every TeX algorithm and data structure has been enhanced to treat a "character" as a 21-bit quantity, rather than an 8-bit byte. Unlike XeTeX, JSBox does not use TeX language machinery to decode incoming UTF-8 byte sequences. That happens in JSBox at a lower ("transport") level where all the possible UTF streams (or older 8-bit encodings) are converted to 21-bit Unicode characters before the language scanner ever sees anything.
One of my goals, in the service of simplicity, is not to rely on dumped format files. This means that prior to typesetting any document, JSBox must initialize itself by reading in the source code for whatever format is desired.
I've published an eBook/app for iOS (called "Hilbert Curves") that uses the JSBox library to typeset its simulated pages. At app launch, it reads in the macros of "plain.tex" and of the "opmac.tex" markup macro library, and other files, before executing the TeX source code for the 160-page book. All of this takes a negligible amount of time from the user's perspective. One of my goals now is to do something similar for a LaTeX document.
But LaTeX's source code is of course several orders of magnitude longer and more complex than that of plain and opmac. I've been working on initializing a LaTeX typesetting job simply by reading in "latex.ini". For what it's worth, JSBox (configured to record statistics) reports that this parse of "latex.ini" results in:
7863 macro definitions or re-definitions
(basically a count of all calls to \def, \edef, etc.).
According to Joseph Wright, who recently answered a question of mine posed here, it takes somewhere between 2 and 3 seconds on his computer to initialize the LaTeX format for the Unicode-aware XeTeX engine. In the TeX world, it doesn't really matter how long it takes, since it is the post-parse memory state that is saved into the binary format file that is distributed with TeX engines, to be read in later, presumably much faster, when a user starts a typesetting job.
JSBox doesn't rely on any of that. On my 2.2GHz MacBook Pro laptop, JSBox takes about 1.25 seconds to read "latex.ini" and all its subordinate files (including some 85 different language hyphenation database files). But it turns out that 60% of that time (about 0.75 seconds) is spent executing the file "load-unicode-data.tex". That file uses TeX macros to read and parse several large Unicode Consortium files so as to set up various character mapping tables (catcodes, upper- and lowercase characters, math characters, etc.). The TeX macros that do this parsing are clever and concise, but they are far from efficient. I've traced them, and the situation might be described as "algorithmic churn": lines are parsed, re-parsed, and re-re-parsed.
In contradistinction, JSBox depends on separately preprocessing the various Unicode Consortium data files (about 2MB of total text data) into a 100-times-smaller (20K) binary file that can be used at interpreter initialization time. Parsing this binary file using the interpreter's own C code takes only about 25 ms (1/40th of a second) to initialize JSBox's various internal Unicode character mapping tables to their non-default values for all the Unicode characters (code points). That's about 30 times faster than what happens in "load-unicode-data.tex".
So ... It would be really great if there were a way to make the reading of "load-unicode-data.tex" conditional in some way, so that it works exactly the same way for XeTeX when building the Unicode LaTeX format, but allows other TeX language interpreters (such as JSBox) to bypass this inefficient parse of Unicode character files in favor of whatever the interpreter has otherwise already done.
The solution, I think, is pretty easy.
"load-unicode-data.tex" already tests for certain compatibility conditions and short-circuits itself accordingly. Its first executable lines are:
% The data can only be loaded by Unicode engines. Currently this is limited to
% XeTeX and LuaTeX, both of which define \Umathcode.
% Just in case, check for the e-TeX extensions.
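For reference, existence tests of this kind are conventionally written with \ifx and \endinput. The following is only a sketch of that idiom, not a verbatim quote of the file, and it assumes the usual convention that \undefined is itself an undefined control sequence.

```tex
% Sketch of the conventional idiom (an assumption, not copied from the file).
% Stop reading this file unless the engine defines \Umathcode.
\ifx\Umathcode\undefined
  \expandafter\endinput
\fi

% Stop reading this file unless the e-TeX extensions are present.
\ifx\eTeXversion\undefined
  \expandafter\endinput
\fi
```

The \expandafter closes the conditional before \endinput takes effect, so the file ends cleanly with no \if left open.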
But the first of these tests is no longer sufficient: JSBox is a Unicode/e-TeX engine that does implement \Umathcode, yet it has neither need nor desire to execute this file. JSBox's mapping tables have *already* been initialized before any TeX code is ever pushed onto its execution stack, just as classic TeX initializes its tables for simple one-byte characters.
A solution is a dedicated, read-only "last_item" integer value, called, e.g., \Unicodedataloaded, whose existence or value prevents "load-unicode-data.tex" (or a similar file) from being executed any further. The primitive doesn't even need to have a value; the fact that it exists is sufficient to test against. So adding the following lines after the e-TeX test at the start of "load-unicode-data.tex" would solve the problem, not just for JSBox, but for any other future Unicode TeX engine faced with a similar situation.
% Give any Unicode engine the ability to initialize its mapping
% tables in its own way instead of relying on this file, as long
% as it implements a primitive named \Unicodedataloaded.
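A test matching the comment above could look like the following. This is a minimal sketch: \Unicodedataloaded is the primitive proposed in this message, not something existing engines are assumed to provide, and the test relies on the usual convention that \undefined is itself undefined.

```tex
% Sketch only: \Unicodedataloaded is the primitive proposed in this message.
\ifx\Unicodedataloaded\undefined
  % Not defined (e.g. current XeTeX or LuaTeX): fall through and load
  % the Unicode data exactly as before.
\else
  % Defined: the engine has already set up its own mapping tables,
  % so stop reading this file here.
  \expandafter\endinput
\fi
```

Because the test only checks whether the control sequence exists, an engine that lacks the primitive takes the first branch and the rest of the file runs unchanged.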
For current XeTeX LaTeX format initialization, there should be no change to how things are built.
I implemented this primitive today in JSBox (as a read-only value of 1), and made the above change in my local copy of "load-unicode-data.tex". Executing "latex.ini" now takes about 0.5 seconds, a considerable improvement over 1.25 seconds, and certainly within the bounds of an acceptable user experience when typesetting a Unicode LaTeX document after reading the format's source code.
Are there any downsides to this minor change that I'm missing? Is there a better name for the primitive? What can I do to get the above test officially added to "load-unicode-data.tex"?