[XeTeX] A LaTeX Unicode initialization desire/question/suggestion

Mon Jan 13 09:11:43 CET 2020

On Mon, 13 Jan 2020 at 03:41, Doug McKenna <doug at mathemaesthetics.com> wrote:
>
> David Carlisle wrote:
>
> >| Note this list is for the xetex extended tex,
> >| but the issues you raise are unrelated to xetex
> >| but to the latex format initialisation so this
> >| is not really the right list.
>
> I checked all the lists before posting, at
>
>   <https://tug.org/mailman/listinfo>
>
> and wasn't sure which would be best, because there is no mailing list listed there re LaTeX that seemed appropriate.  I thought that because it involved Unicode and XeTeX's primitives, it was on-topic for here.  Apologies.  What is the appropriate list?

I doubt the list owners will mind the occasional non-xetex thread;
mainly it was a marker in case there were going to be more jsbox
discussions.
Probably texhax (on this server) or latex-l
(https://listserv.uni-heidelberg.de/cgi-bin/wa?A0=LATEX-L) would be
more suitable. Or, to actually suggest code changes, an issue raised at
https://github.com/latex3/unicode-data/issues
would be good, as it can then be automatically referenced in any
code commit logs.

>
> >| I'm happy to see that you imply that the name might change,
> >| I think the JS prefix would prove very confusing should you
> >| distribute it under that name, as it isn't JavaScript based,
> >| especially as there are JavaScript ports of the texlive stack
> >| these days. Not to mention existing JavaScript apps called
> >| jsbox https://apps.apple.com/sg/app/jsbox-learn-to-code/id1312014438
>
> Agreed.  For now, I just need a word to refer to the thing.
>
> >| Not quite sure why you say that is unlike xetex? xetex and luatex
> >| expose unicode characters as single characters before TeX tokenisation
> >| happens, perhaps you are thinking of the TeX macro based UTF-8 decoder
> >| we use with (pdf)teX ?
>
> Quite possible.  I've only focused on some of XeTeX's source code.
>
> >| Note that you are looking at the older releases.
> >| For testing it would be better to test against the
> >| latex-dev releases, which preload more into the format
> >| (again saving time by not having to read the Unicode
> >| data when the expl3 package is loaded eg by fontspec).
> >|
> >| xelatex-dev is currently Pre release 2, but Pre release 3 is expected
> >| to be released in a few days, or the first full release with expl3
> >| preloaded in the format is due in the first week of February.
>
> >| Note the latex macros will still need to read the Unicode Data files
> >| to initialise structures held in TeX macros even if the lccode and
> >| uccode tables were pre-initialised, so it would not be possible to
> >| avoid all reading of the Unicode files.
>
> I don't really understand this ("structures held in TeX macros").  But see below.
>
> >| load-unicode-data handles some of the reading, but there is additional
> >| reading  (see l3unicode.dtx) that is in expl3.sty (in current xelatex
> >| formats) but will be preloaded in future releases and in the current
> >| xelatex-dev release as noted above.
>
> I tried looking at, e.g., l3unicode.dtx, and it's still using TeX (or impenetrable LaTeX3 kernel language built on top) to parse the official Unicode data files.  It's hard for me to imagine how any of that isn't at least an order of magnitude slower than scanning through a mere 20K block of bytes with a machine pointer in C, and installing into all pertinent character mapping tables every piece of information that XeTeX says it's interested in on a per character or per character range basis.  When I use the term "preloaded" I'm not talking about parsing anything inside TeX's virtual machine using the TeX language (or whatever's built on top of it).

Obviously in C you can parse it quicker, but the final structure isn't
a simple array of integers like the lccode table, so at the engine
level you have nowhere to put the data. The Unicode files have lots of
additional information (notably from CaseFolding.txt, SpecialCasing.txt
and similar files). The information needs to be stored in
latex-specified macros, so the engine can't pre-populate that other
than by the traditional way to initialise a state of macro definitions,
which is a format file (or something functionally equivalent to that).
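
To make the "nowhere to put the data" point concrete, here is a minimal
sketch in C. The names lccode_table, special_case and upper_expand are
made up for illustration and are not from XeTeX or expl3; the two
entries are the uppercase mappings that SpecialCasing.txt lists for
U+00DF and U+FB01. A one-to-one mapping fits a flat table indexed by
code point, while a one-to-many mapping needs a small record per
character:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Flat table: one integer per code point, like an engine-level \lccode. */
enum { TABLE_SIZE = 0x110000 };
static uint32_t lccode_table[TABLE_SIZE];

static void set_lccode(uint32_t cp, uint32_t lower) {
    assert(cp < (uint32_t)TABLE_SIZE);
    lccode_table[cp] = lower;
}

/* Hypothetical record for a one-to-many uppercase mapping. */
typedef struct {
    uint32_t code;       /* source code point */
    uint32_t upper[3];   /* uppercase expansion */
    size_t   upper_len;  /* number of code points used in upper[] */
} special_case;

static const special_case special_cases[] = {
    { 0x00DF, { 0x0053, 0x0053 }, 2 },  /* sharp s uppercases to "SS" */
    { 0xFB01, { 0x0046, 0x0049 }, 2 },  /* fi ligature uppercases to "FI" */
};

/* Writes the uppercase expansion of cp into out and returns its length.
   Code points without a special entry are returned unchanged here; a
   real implementation would consult the simple uppercase mapping. */
static size_t upper_expand(uint32_t cp, uint32_t out[3]) {
    size_t n = sizeof special_cases / sizeof special_cases[0];
    for (size_t i = 0; i < n; i++) {
        if (special_cases[i].code == cp) {
            for (size_t j = 0; j < special_cases[i].upper_len; j++)
                out[j] = special_cases[i].upper[j];
            return special_cases[i].upper_len;
        }
    }
    out[0] = cp;
    return 1;
}
```

A call such as set_lccode('A', 'a') is all a simple mapping needs,
whereas upper_expand(0x00DF, out) yields two code points, which no
single slot of an integer table can hold.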

>
> >| It wouldn't be appropriate to have a primitive just to control the
> >| behaviour of one higher level macro system but there could be tests
> >| there for additional cases, as you say.
>
> See my response regarding a \catcode test to Phil Taylor also in this thread.
>
> >| While jsbox is just in your private development code you could also
> >| simply arrange modified or empty tex files get included at this point.
>
> I'm trying to avoid hacks that won't last.
>
> >| A tex primitive that controls a macro set seems to be reversing the
> >| natural layering, you could test for \jsboxversion (or whatever you
> >| have) or test that the lccode of some character is already non zero
> >| or... several other possibilities without introducing a primitive
> >| here.
>
> The point is that it *isn't* a TeX primitive.  The idea is that it would be a primitive specific only to those engines that initialize their character mapping tables (\catcode, \lccode, \uccode, etc.) when the interpreter is created/launched/whatever, before it ever executes any TeX source code as a virtual machine.  My point is that testing for the existence of \Umathcode is an inappropriate test for that condition.

That is what I meant: "engine primitive", if you prefer, rather than
"tex primitive".
>
> >| It would be interesting to see the timings that you get with the -dev
> >| formats.
>
> Yes, I agree.  Unfortunately, my brain is complexity-bound already.
>
> >| Are there architectural reasons preventing you from having a
> >| format file, or is it simply that you hope to make loading quick
> >| enough that you do not need it?
>
> Yes, and yes.  JSBox does not depend on an internal array of integers (mem[] or whatever).  Doing so makes it essentially impossible to use a modern-day debugger to examine data structures.  Every data structure that is allocated is done so (indirectly) via malloc() or whatever the equivalent might be on some system.  This makes it harder to create a \dump format file, though not impossible.  But it wouldn't be (or need to be) compatible with anything in the official TeX world.  Regardless, my goal is to see how far one can get without needing format files.  Also, see below.

Yes, the format of the actual data in any "format file" needn't be the
same as classic tex, but some way of initialising the state of the
macro definitions is going to be useful, I suspect.
>
> >| The pressure to load more into a
> >| format is likely to increase rather than decrease, people often
> >| routinely make custom formats preloading large packages like tikz or
> >| pstricks for example.
>
> True, but there is a fundamental difference between what I'm working toward, and what the TeX infrastructure does.  In the TeX world, every job is a single process.  Every time a TeX job is done, a process is launched, the job gets done, and the program ends.  It's the Unix/command-line way.  So the format has to be loaded (fast) on every job.  Makes perfect sense.
>
> But when your engine is just a library linked into another program that lives for a long time, perhaps measured in days, and when the user is running multiple jobs from the same program, then there ought to be a way to load the format from its source code >once<, and have it live in the engine's memory even while job after job is executing on top, with a clean-up after each job ends.  This is, after all, completely conformant with everyday use of TeX (edit...run job...edit...run job...), not to mention every other computer language.  I'm pretty sure that I've architected my code to allow this, although it's untested for now.  One step at a time.
>
> >| As noted above, with latex-dev releases you are still going to need
> >| the unicode data files to be read using tex macros.
>
> Are these files read more than once, and if so, why?  If not, I don't understand why I'm still going to need to read them.
Read once, but as noted above the information needs to be structured,
and the structures are defined in tex macros as that's all there is in
tex...
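
For the long-running case described above, one possible arrangement is
sketched below in C. Everything here (engine, tables, engine_init,
engine_run_job) is a hypothetical illustration, not jsbox or TeX code:
the expensive initialisation runs once, each job works on a private
copy of the state, and the copy is thrown away when the job ends.

```c
#include <assert.h>
#include <string.h>

enum { NCODES = 128 };

/* Stand-in for whatever state the format initialisation builds. */
typedef struct {
    int lccode[NCODES];
} tables;

typedef struct {
    tables base;   /* built once, never modified by jobs */
    int    loads;  /* how many times the expensive load has run */
} engine;

/* The "load the format once" step: fill the base tables. */
static void engine_init(engine *e) {
    memset(e, 0, sizeof *e);
    for (int c = 'A'; c <= 'Z'; c++)
        e->base.lccode[c] = c + ('a' - 'A');
    for (int c = 'a'; c <= 'z'; c++)
        e->base.lccode[c] = c;
    e->loads++;
}

/* Runs one job on a private copy of the base state. The job changes
   the lccode of ch and returns the value it then sees; the copy is
   discarded on return, so the base state is left untouched. */
static int engine_run_job(const engine *e, int ch, int override_code) {
    assert(ch >= 0 && ch < NCODES);
    tables job = e->base;            /* struct copy: job-local state */
    job.lccode[ch] = override_code;  /* the job's own redefinition */
    return job.lccode[ch];
}
```

Whether a plain copy is cheap enough depends on how large the real
state is; an overlay that is unwound after each job would be another
option.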

>
> >| Before making any
> >| changes to the tex macros you may want to do timings with these
> >| versions. It may be that you choose to reconsider not making (the
> >| equivalent of) format files, as just saving the time for setting the
> >| lccodes may be a less significant proportion of the startup time.
>
> Agreed.
>
> >| To be in the core tex macros we would need to have the engine
> >| incorporated into texlive so that it could be tested as part of our
> >| test suite and continuous integration tests.
>
> That doesn't make sense to me.  Adding a couple of lines of code to "load-unicode-data.tex" and then determining with regression tests that absolutely nothing has changed doesn't involve any third party at all.

I can see that from your side it's a minor irritant, sorry:-)
But look at it from ours: there are in fact dozens of tex variants
under development at various places, and we need some objective
criterion for which get "blessed" with core latex support rather than
being required to patch stuff to cover over differences. Requiring a
system that integrates with our build and test infrastructure is fair,
and actually useful to us, not only acting as a filter.

As you noticed, there is code in the latex base sources to test for
luatex and xetex, but that was only added in 2015, something like a
decade after the first releases of xetex appeared; we tend to be
cautious about these things. Prior to that, xelatex.ini and
lualatex.ini had more or less custom patches and redefinitions
supplied by third parties, along the lines Phil suggested earlier.

>
> >| However as already discussed in this thread there are several
> >| possibilities for you to build something along those lines without
> >| requiring any changes to the core macro files, so lack of change here
> >| shouldn't be seen as a discouragement and anyway gives you more
> >| flexibility with changing names etc while jsbox is being developed.
>
> Duly noted.
>
> >| Returning to your original question as to what constitutes a "Unicode"
> >| TeX for LaTeX, we have put some data on the requirements  for extended
> >| TeX features in the draft ltnews31 which will be part of next week's
> >| latex-dev release, but you can see the sources now at
> >|
> >| Primitive Requirements:
> >| https://github.com/latex3/latex2e/blob/develop/base/doc/ltnews31.tex#L596
> >|
> >| see also
> >|
> >| Improved load-times for expl3:
> >| https://github.com/latex3/latex2e/blob/develop/base/doc/ltnews31.tex#L169
> >|
> >| on the additional items preloaded in the format.
>
> Many thanks!  This is very helpful.
>
>
> Doug McKenna
> Mathemaesthetics, Inc.

David