[XeTeX] A LaTeX Unicode initialization desire/question/suggestion

Sun Jan 12 17:41:49 CET 2020

On Sat, 11 Jan 2020 at 00:33, Doug McKenna <doug at mathemaesthetics.com> wrote:
>
> What is the possibility of making a slight enhancement to how the Unicode LaTeX format is created with respect to "Unicode-aware" engines (an unfortunately somewhat ill-defined term)?  Here's the situation:

Note that this list is for XeTeX, the extended TeX engine, but the issues
you raise concern the LaTeX format initialisation rather than XeTeX, so
this is not really the right list.

>
> My TeX/e-TeX language interpreter, currently called JSBox,

I'm happy to see that you imply the name might change. I think the JS
prefix would prove very confusing should you distribute it under that
name, as it isn't JavaScript based, especially as there are JavaScript
ports of the TeX Live stack these days, not to mention existing
JavaScript apps called JSBox:
https://apps.apple.com/sg/app/jsbox-learn-to-code/id1312014438

> is implemented as a simple C library, so it can be incorporated into any other software (for instance, I've recently created a Java class wrapper around it, hoping to soon use it in an Android eBook/app).
>
> JSBox is entirely Unicode-based internally; every TeX algorithm and data structure has been enhanced to treat a "character" as a 21-bit quantity, rather than an 8-bit byte.  Unlike XeTeX, JSBox does not use TeX language machinery to decode incoming UTF-8 byte sequences.  That happens in JSBox at a lower ("transport") level where all the possible UTF streams (or older 8-bit encodings) are converted to 21-bit Unicode characters before the language scanner ever sees anything.

I'm not quite sure why you say that is unlike XeTeX. XeTeX and LuaTeX
expose Unicode characters as single characters before TeX tokenisation
happens. Perhaps you are thinking of the TeX macro-based UTF-8 decoder
we use with (pdf)TeX?

>
> One of my goals, in the service of simplicity, is not to rely on dumped format files.  This means that prior to typesetting any document, JSBox must initialize itself by reading in the source code for whatever format is desired.
>
> I've published an eBook/app for iOS (called "Hilbert Curves") that uses the JSBox library to typeset its simulated pages.  At app launch, it reads in the macros of "plain.tex" and of the "opmac.tex" markup macro library, and other files, before executing the TeX source code for the 160-page book.  All of this takes a negligible amount of time from the user's perspective.  One of my goals now is to do something similar for a LaTeX document.
>
> But LaTeX's source code is of course several orders of magnitude more complex and longer than is plain's and opmac's.  I've been working on initializing a LaTeX typesetting job, simply by reading in "latex.ini".  For what it's worth, JSBox (configured to record statistics) reports that this parse of "latex.ini" results in:
>
> 7863 macro definitions or re-definitions
>
> (basically a count of all calls to \def, \edef, etc.).

Note that you are looking at the older releases. For testing it would
be better to use the latex-dev releases, which preload more into the
format (again saving time, as the Unicode data no longer has to be read
when the expl3 package is loaded, e.g. by fontspec).

xelatex-dev is currently at pre-release 2; pre-release 3 is expected
in a few days, and the first full release with expl3 preloaded in the
format is due in the first week of February.

>
> According to Joseph Wright, who recently answered a question of mine posed here, it takes somewhere between 2 and 3 seconds on his computer to initialize the LaTeX format for the Unicode-aware XeTeX engine.  In the TeX world, it doesn't really matter how long it takes, since it is the post-parse memory state that is saved into the binary format file that is distributed with TeX engines, to be read in later, presumably much faster, when a user starts a typesetting job.
>
> JSBox doesn't rely on any of that.  On my 2.2GHz MacBook Pro laptop, JSBox takes about 1.25 seconds to read "latex.ini" and all its subordinate files (including some 85 different language hyphenation database files).  But it turns out that 60% of that time is due to executing the file "load-unicode-data.tex".  That file uses TeX macros to read and parse several large Unicode Consortium files so as to set up various character mapping tables (catcodes, upper- and lowercase characters, math characters, etc.).  The TeX macros that do this parsing are clever and concise, but they are way not efficient.  I've traced them, and the situation might be described as "algorithmic churn," parsing and re-parsing and re-re-parsing lines.
>
> In contradistinction, JSBox depends on separately preprocessing the various Unicode Consortium data files (about 2MB of total text data) into a 100-times-smaller (20K) binary file that can be used at interpreter initialization time.  Parsing this binary file using the interpreter's own C code takes only about 25 ms (1/40th of a second) to initialize JSBox's various internal Unicode character mapping tables to their non-default values for all the Unicode characters (code points).  That's about 30 times faster than what happens in "load-unicode-data.tex".

Note that the LaTeX macros will still need to read the Unicode data
files to initialise structures held in TeX macros, even if the lccode
and uccode tables were pre-initialised, so it would not be possible to
avoid all reading of the Unicode files.

>
> So ... It would be really great if there were a way to make the reading of "load-unicode-data.tex" conditional in some way, so that it works exactly the same way for XeTeX when building the Unicode LaTeX format, but allows other TeX language interpreters (such as JSBox) to bypass this inefficient parse of Unicode character files in favor of whatever the interpreter has otherwise already done.

load-unicode-data handles some of the reading, but there is additional
reading (see l3unicode.dtx) that is done in expl3.sty in current xelatex
formats. That code will be preloaded in future releases, as it already
is in the current xelatex-dev release noted above.

>
> The solution, I think, is pretty easy.
>
> "load-unicode-data.tex" already tests for certain compatibility conditions and short-circuits itself accordingly.  Its first executable lines are:
>
> % The data can only be loaded by Unicode engines. Currently this is limited to
> % XeTeX and LuaTeX, both of which define \Umathcode.
> \ifx\Umathcode\undefined
>   \expandafter\endinput
> \fi
> % Just in case, check for the e-TeX extensions.
> \ifx\eTeXversion\undefined
>   \expandafter\endinput
> \fi
>
> But the first of these tests is no longer a good test, because JSBox is a Unicode/eTeX engine that does implement \Umathcode but has no need nor desire to execute this file because JSBox's mapping tables have *already* been initialized before any TeX code is ever pushed onto its execution stack, the same as classic TeX does for simple one-byte characters.
>
> A solution is a dedicated, read-only "last_item" integer value, called, e.g., \Unicodedataloaded, whose existence or value prevents "load-unicode-data.tex" (or similar) from being executed (further).  The primitive doesn't even have to have a value, the fact that it exists can be sufficient to test against.  So adding the following lines after the eTeX test at the start of "load-unicode-data.tex" would solve the problem, not just for JSBox, but for any other future Unicode TeX engine faced with a similar situation.

It wouldn't be appropriate to have a primitive just to control the
behaviour of one higher-level macro system, but there could be tests
there for additional cases, as you say.
While JSBox is just in your private development code you could also
simply arrange for modified or empty TeX files to be included at this
point.
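
As a minimal sketch of the "empty file" approach: assuming the engine's
file search can be pointed at a private directory ahead of the standard
one (an assumption about the JSBox setup, not something stated in this
thread), a stand-in file of the same name could be as small as:

```tex
% load-unicode-data.tex (private stand-in, NOT the distributed file)
% Assumption: the engine has already initialised its catcode, lccode,
% uccode and math code tables itself, so there is nothing left to do.
\endinput
```

Anything else the distributed file sets up would then have to be
provided by the engine itself.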

>
> % Give any Unicode engine the ability to initialize its mapping
> % tables in its own way instead of relying on this file, as long
> % as it implements a primitive named \Unicodedataloaded.
> \ifdefined\Unicodedataloaded
>   \expandafter\endinput
> \fi
>
> For current XeTeX LaTeX format initialization, there should be no change to how things are built.
>
> I implemented this primitive today in JSBox (as a read-only value of 1), and made the above change in my local copy of "load-unicode-data.tex".

A TeX primitive that controls a macro set seems to reverse the
natural layering. You could test for \jsboxversion (or whatever you
have), or test that the lccode of some character is already non-zero,
or use one of several other possibilities, without introducing a
primitive here.
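
A rough sketch of two such tests, written in the same style as the
existing checks at the top of the file. The name \jsboxversion is purely
hypothetical, and the choice of U+00C0 as the probe character is an
assumption for illustration; neither is part of any released file.

```tex
% Option 1: test for an engine-specific primitive.
% \jsboxversion is a hypothetical name; use whatever the engine defines.
\ifx\jsboxversion\undefined
\else
  \expandafter\endinput
\fi
% Option 2: test whether the case mappings are already in place.
% Assumption: an engine without preloaded tables starts with \lccode 0
% for U+00C0, while an engine with preloaded tables has a non-zero value.
\ifnum\lccode"C0>0
  \expandafter\endinput
\fi
```

Either test leaves the behaviour unchanged for an engine that does not
define the control sequence or preset the tables.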

>  Executing "latex.ini" now takes about .5 second, which is a considerable improvement over 1.25 seconds, certainly now within the bounds of what might be an acceptable user experience typesetting a Unicode LaTeX document after reading the format's source code.

It would be interesting to see the timings that you get with the -dev
formats. Are there architectural reasons preventing you from having a
format file, or is it simply that you hope to make loading quick
enough that you do not need one? The pressure to load more into a
format is likely to increase rather than decrease: people routinely
make custom formats preloading large packages such as tikz or
pstricks, for example.

As noted above, with the latex-dev releases you are still going to need
the Unicode data files to be read using TeX macros. Before making any
changes to the TeX macros you may want to do timings with these
versions. You may then choose to reconsider not making (the
equivalent of) format files, as the time saved by not setting the
lccodes may be a less significant proportion of the startup time.

>
> Are there any downsides to this minor change that I'm missing?  Is there a better name for the primitive?  What can I do to encourage that the above test be officially added to "load-unicode-data.tex"?

To be in the core TeX macros we would need to have the engine
incorporated into TeX Live so that it could be tested as part of our
test suite and continuous integration tests.
However, as already discussed in this thread, there are several
possibilities for you to build something along those lines without
requiring any changes to the core macro files, so lack of change here
shouldn't be seen as a discouragement, and in any case it gives you
more flexibility to change names etc. while JSBox is being developed.

Returning to your original question as to what constitutes a "Unicode"
TeX for LaTeX: we have put some data on the requirements for extended
TeX features in the draft ltnews31, which will be part of next week's
latex-dev release, but you can see the sources now at

Primitive Requirements:
https://github.com/latex3/latex2e/blob/develop/base/doc/ltnews31.tex#L596

see also

Improved load-times for expl3:
https://github.com/latex3/latex2e/blob/develop/base/doc/ltnews31.tex#L169

on the additional items preloaded in the format.

>
>
> Doug McKenna
> Mathemaesthetics, Inc.

David Carlisle
for the LaTeX3 Project