[XeTeX] A LaTeX Unicode initialization desire/question/suggestion

Doug McKenna doug at mathemaesthetics.com
Mon Jan 13 04:41:43 CET 2020

```David Carlisle wrote:

>| Note this list is for the xetex extended tex,
>| but the issues you raise are unrelated to xetex
>| but to the latex format initialisation so this
>| is not really the right list.

I checked all the lists before posting, at

<https://tug.org/mailman/listinfo>

and wasn't sure which would be best, because there is no mailing list listed there re LaTeX that seemed appropriate.  I thought that because it involved Unicode and XeTeX's primitives, it was on-topic for here.  Apologies.  What is the appropriate list?

>| I'm happy to see that you imply that the name might change,
>| I think the JS prefix would prove very confusing should you
>| distribute it under that name, as it isn't JavaScript based,
>| especially as there are JavaScript ports of the texlive stack
>| these days. Not to mention existing JavaScript apps called
>| jsbox https://apps.apple.com/sg/app/jsbox-learn-to-code/id1312014438

Agreed.  For now, I just need a word to refer to the thing.

>| Not quite sure why you say that is unlike xetex? xetex and luatex
>| expose unicode characters as single characters before TeX tokenisation
>| happens, perhaps you are thinking of the TeX macro based UTF-8 decoder
>| we use with (pdf)teX ?

Quite possible.  I've only focused on some of XeTeX's source code.

>| Note that you are looking at the older releases.
>| For testing it would be better to test against the
>| latex-dev releases, which preload more into the format
>| (again saving time by not having to read the Unicode
>| data when the expl3 package is loaded eg by fontspec).
>|
>| xelatex-dev is currently Pre release 2, but Pre release 3 is expected
>| to be released in a few days, or the first full release with expl3
>| preloaded in the format is due in the first week of February.

>| Note the latex macros will still need to read the Unicode Data files
>| to initialise structures held in TeX macros even if the lccode and
>| uccode tables were pre-initialised, so it would not be possible to
>| avoid all reading of the Unicode files.

I don't really understand this ("structures held in TeX macros").  But see below.

>| reading  (see l3unicode.dtx) that is in expl3.sty (in current xelatex
>| formats) but will be preloaded in future releases and in the current
>| xelatex-dev release as noted above.

I tried looking at, e.g., l3unicode.dtx, and it's still using TeX (or the impenetrable LaTeX3 kernel language built on top of it) to parse the official Unicode data files.  It's hard for me to imagine how any of that isn't at least an order of magnitude slower than scanning through a mere 20K block of bytes with a machine pointer in C, and installing into all pertinent character mapping tables every piece of information that XeTeX says it's interested in on a per-character or per-character-range basis.  When I use the term "preloaded", I'm not talking about parsing anything inside TeX's virtual machine using the TeX language (or whatever's built on top of it).

>| It wouldn't be appropriate to have a primitive just to control the
>| behaviour of one higher level macro system but there could be tests
>| there for additional cases, as you say.

See my response regarding a \catcode test to Phil Taylor also in this thread.

>| While jsbox is just in your private development code you could also
>| simply arrange modified or empty tex files get included at this point.

I'm trying to avoid hacks that won't last.

>| A tex primitive that controls a macro set seems to be reversing the
>| natural layering, you could test for \jsboxversion (or whatever you
>| have) or test that the lccode of some character is already non zero
>| or... several other possibilities without introducing a primitive
>| here.

The point is that it *isn't* a TeX primitive.  The idea is that it would be a primitive specific only to those engines that initialize their character mapping tables (\catcode, \lccode, \uccode, etc.) when the interpreter is created/launched/whatever, before it ever executes any TeX source code as a virtual machine.  My point is that testing for the existence of \Umathcode is an inappropriate test for that condition.

>| It would be interesting to see the timings that you get with the -dev
>| formats.

Yes, I agree.  Unfortunately, my brain is complexity-bound already.

>| Are there architectural reasons preventing you from having a
>| format file, or is it simply that you hope to make loading quick
>| enough that you do not need it?

Yes, and yes.  JSBox does not depend on an internal array of integers (mem[] or whatever).  Relying on such an array makes it essentially impossible to use a modern-day debugger to examine data structures.  Every data structure is allocated (indirectly) via malloc() or whatever the equivalent might be on a given system.  This makes it harder to create a \dump format file, though not impossible.  But such a file wouldn't be (or need to be) compatible with anything in the official TeX world.  Regardless, my goal is to see how far one can get without needing format files.  Also, see below.

>| The pressure to load more into a
>| format is likely to increase rather than decrease, people often
>| pstricks for example.

True, but there is a fundamental difference between what I'm working toward, and what the TeX infrastructure does.  In the TeX world, every job is a single process.  Every time a TeX job is done, a process is launched, the job gets done, and the program ends.  It's the Unix/command-line way.  So the format has to be loaded (fast) on every job.  Makes perfect sense.

But when your engine is just a library linked into another program that lives for a long time, perhaps measured in days, and when the user is running multiple jobs from the same program, then there ought to be a way to load the format from its source code >once<, and have it live in the engine's memory even while job after job is executing on top, with a clean-up after each job ends.  This is, after all, completely consistent with everyday use of TeX (edit...run job...edit...run job...), not to mention every other computer language.  I'm pretty sure that I've architected my code to allow this, although it's untested for now.  One step at a time.

>| As noted above, with latex-dev releases you are still going to need
>| the unicode data files to be read using tex macros.

Are these files read more than once, and if so, why?  If not, I don't understand why I'm still going to need to read them.

>| Before making any
>| changes to the tex macros you may want to do timings with these
>| versions. It may be that you choose to reconsider not making (the
>| equivalent of) format files, as just saving the time for setting the
>| lccodes may be a less significant proportion of the startup time.

Agreed.

>| To be in the core tex macros we would need to have the engine
>| incorporated into texlive so that it could be tested as part of our
>| test suite and continuous integration tests.

That doesn't make sense to me.  Adding a couple of lines of code to "load-unicode-data.tex" and then determining with regression tests that absolutely nothing has changed doesn't involve any third party at all.

>| possibilities for you to build something along those lines without
>| requiring any changes to the core macro files, so lack of change here
>| shouldn't be seen as a discouragement and anyway gives you more
>| flexibility with changing names etc while jsbox is being developed.

Duly noted.

>| Returning to your original question as to what constitutes a "Unicode"
>| TeX for LaTeX, we have put some data on the requirements  for extended
>| TeX features in the draft ltnews31 which will be part of next week's
>| latex-dev release, but you can see the sources now at
>|
>| Primitive Requirements:
>| https://github.com/latex3/latex2e/blob/develop/base/doc/ltnews31.tex#L596
>|
>|