[XeTeX] A LaTeX Unicode initialization desire/question/suggestion

Mon Jan 13 09:42:40 CET 2020

On 13/01/2020 03:41, Doug McKenna wrote:
>> | load-unicode-data handles some of the reading, but there is additional
>> | reading  (see l3unicode.dtx) that is in expl3.sty (in current xelatex
>> | formats) but will be preloaded in future releases and in the current
>> | xelatex-dev release as noted above.
> 
> I tried looking at, e.g., l3unicode.dtx, and it's still using TeX (or impenetrable LaTeX3 kernel language built on top) to parse the official Unicode data files.

For performance reasons, we had to make that part a bit more complex 
than it was originally: at present, it's run during every LuaTeX/XeTeX 
run, and that is a bit of an issue. It's one of the reasons we want to 
pre-load expl3 and dump it into the format.

> It's hard for me to imagine how any of that isn't at least an order of magnitude slower than scanning through a mere 20K block of bytes with a machine pointer in C, and installing into all pertinent character mapping tables every piece of information that XeTeX says it's interested in on a per character or per character range basis.  When I use the term "preloaded" I'm not talking about parsing anything inside TeX's virtual machine using the TeX language (or whatever's built on top of it).

It's not absolutely as fast as it can be in TeX, but it's close. (For 
LuaTeX, a Lua reader would of course be possible and likely faster, but 
then we'd have two code paths to worry about.)

David's point was that the Unicode data is needed not only for the TeX 
internal tables (\uccode, \lccode, \catcode and possibly others). It's 
also needed to cover other Unicode concepts that TeX doesn't know about, 
and which therefore have to be coded at the macro level. For example, 
Unicode case changing is not a one-to-one operation. For the majority of 
codepoints, one can use the TeX \lccode/\uccode values (and avoid 
needing to hold them in TeX macros). Most of the exceptions are in the 
relatively small file SpecialCasing.txt, but there is also the 
information one needs from UnicodeData.txt to cover titlecasing. We did 
consider 'pre-extracting' that data, but it made relatively little 
difference during a normal TeX run, and leaves open the risk of 
mismatched files. A 'bigger' data set that is required is the NFD 
mappings: they are needed to handle, for example, Greek case changing. 
TeX doesn't know about NFD, so again one needs some data, which again 
comes from UnicodeData.txt, and again needs to be stored somewhere 
that's not 'pre-defined'.
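
To illustrate why a single \lccode/\uccode value per codepoint is not 
enough, here is a minimal sketch in Python (chosen only because its 
standard library ships the Unicode tables; it is not how l3unicode works). 
The helper greek_upper_sketch is hypothetical and deliberately simplified: 
it only strips the tonos (U+0301) after NFD decomposition, which is far 
from the full set of Greek uppercasing rules.

```python
import unicodedata

# One codepoint can uppercase to two: U+00DF (sharp s) becomes "SS",
# which a single \uccode slot cannot express.
sharp_s_upper = "\u00df".upper()

# Titlecase can differ from uppercase: U+01C6 (dz with caron) has
# uppercase U+01C4 but titlecase U+01C5.
dz_upper = "\u01c6".upper()
dz_title = "\u01c6".title()


def greek_upper_sketch(text):
    """Hypothetical helper, simplified for illustration.

    Decompose to NFD so accents become separate combining marks,
    drop the combining acute accent (tonos, U+0301), then uppercase
    and recompose to NFC.
    """
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if ch != "\u0301")
    return unicodedata.normalize("NFC", stripped.upper())
```

The NFD step is what needs the decomposition data: without it, there is 
no way to see that U+03AC is alpha plus a tonos that has to be removed.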

>> | A tex primitive that controls a macro set seems to be reversing the
>> | natural layering, you could test for \jsboxversion (or whatever you
>> | have) or test that the lccode of some character is already non zero
>> | or... several other possibilities without introducing a primitive
>> | here.
> 
> The point is that it *isn't* a TeX primitive.  The idea is that it would be a primitive specific only to those engines that initialize their character mapping tables (\catcode, \lccode, \uccode, etc.) when the interpreter is created/launched/whatever, before it ever executes any TeX source code as a virtual machine.  My point is that testing for the existence of \Umathcode is an inappropriate test for that condition.

Er, it's a primitive, no? Or would it be set up as a macro that was 
pre-defined by the engine?

> But when your engine is just a library linked into another program that lives for a long time, perhaps measured in days, and when the user is running multiple jobs from the same program, then there ought to be a way to load the format from its source code >once<, and have it live in the engine's memory even while job after job is executing on top, with a clean-up after each job ends.  This is, after all, completely conformant with everyday use of TeX (edit...run job...edit...run job...), not to mention every other computer language.  I'm pretty sure that I've architected my code to allow this, although it's untested for now.  One step at a time.

Years ago, Jonathan Fine wrote a TeX daemon that could stay running, 
relying on the fact that DVI files don't need to be closed (unlike PDF 
ones). That requires avoiding \end, and he could only support plain TeX 
as that means disabling \csname, so no environments. I assume you are 
not thinking of a 'permanently running TeX job' in that sense?

>> | As noted above, with latex-dev releases you are still going to need
>> | the unicode data files to be read using tex macros.
> 
> Are these files read more than once, and if so, why?  If not, I don't understand why I'm still going to need to read them.

l3unicode reads each one once, as noted above, to populate macro data 
storage. Presumably you are not worried about LuaTeX, so don't have to 
think about font loaders (which also need Unicode info, and which is 
handled by LuaTeX in Lua code).
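
As a rough picture of the kind of data that ends up in macro storage 
rather than in the engine tables, here is a sketch in Python of a single 
pass over UnicodeData.txt. It is an illustration under assumptions, not a 
transcription of l3unicode: the function name and the two dictionaries 
are made up, the three sample lines are given in the UnicodeData.txt 
format (semicolon-separated, decomposition in field 5, simple 
upper/lower/title mappings in fields 12 to 14), and decompositions are 
not expanded recursively as full NFD would require.

```python
# Sample lines in the UnicodeData.txt format (an excerpt, not the file).
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
01C6;LATIN SMALL LETTER DZ WITH CARON;Ll;0;L;<compat> 0064 017E;;;;N;LATIN SMALL LETTER D Z HACEK;;01C4;;01C5
03AC;GREEK SMALL LETTER ALPHA WITH TONOS;Ll;0;L;03B1 0301;;;;N;GREEK SMALL LETTER ALPHA TONOS;;0386;;0386
"""


def parse_unicode_data(text):
    """Hypothetical one-pass reader: collect what \\uccode cannot hold."""
    titlecase = {}  # codepoints whose titlecase differs from uppercase
    nfd = {}        # canonical (untagged) decomposition mappings
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split(";")
        codepoint = int(fields[0], 16)
        decomposition, upper, title = fields[5], fields[12], fields[14]
        # A leading <tag> marks a compatibility mapping; NFD uses only
        # the untagged canonical ones.
        if decomposition and not decomposition.startswith("<"):
            nfd[codepoint] = [int(part, 16) for part in decomposition.split()]
        # Only store titlecase where it is not simply the uppercase value.
        if title and title != upper:
            titlecase[codepoint] = int(title, 16)
    return titlecase, nfd


titlecase_map, nfd_map = parse_unicode_data(SAMPLE)
```

Each line is visited once; the simple upper/lower values would go to the 
engine tables, and only the leftovers above need separate storage.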

>> | To be in the core tex macros we would need to have the engine
>> | incorporated into texlive so that it could be tested as part of our
>> | test suite and continuous integration tests.
> 
> That doesn't make sense to me.  Adding a couple of lines of code to "load-unicode-data.tex" and then determining with regression tests that absolutely nothing has changed doesn't involve any third party at all.

David was, I think, talking about support by LaTeX as a whole, not the 
rather restricted load-unicode-data file. The LaTeX kernel has about 350 
test files, and those are run for pdfTeX, XeTeX and LuaTeX. Adding pTeX 
and upTeX would be good, but there are issues there as those engines are 
used with additional macro code out-of-the-box. To be properly 
supported, we'd need to test JSBox too. (For LaTeX3 code, we do test 
pTeX and upTeX as well as the core western engines: it's easier for 
various reasons there.)

Joseph