[XeTeX] not enough \XeTeXcharclass registers

Mon Feb 1 11:53:50 CET 2016

On 1/2/16 10:25, David Carlisle wrote:
> Thanks for the test sources,
>
> It all seems to work for me (texlive 2015/cygwin 64 build), but...
>
> I do wonder if this change is going in the right direction.
>
> The main problem with the char classes is not the overall number. In
> fact, since what matters when specifying code is the boundary between
> two classes rather than the classes themselves, there are now around
> 300 million such boundaries that could be specified, which seems more
> than enough!
>
> The main problem is that each character can only be in one class, which
> means that it is very hard to use these for any generic code. If you
> have already classified characters by (say) line breaking properties and
> then another package wants to classify by unicode block, or by default
> writing direction, then the only way to handle that is to enumerate all
> the intersecting properties and assign a unique character class to
> each intersection. This leads to a combinatorial explosion in the number
> of boundary tokens that need to be specified. Where you may have had a
> single specification for the boundary between LTR and RTL, if you also
> want to classify each unicode block you need separate classes for LTR
> and RTL characters in each block, and then need to specify the same
> boundary tokens for all the possible changes of LTR in one block
> followed by RTL in another.
>
> That limitation of course has always been there, but increasing the
> number of classes available highlights it more strongly.

You're right, of course; this is a limitation of the concept as 
currently implemented.
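
To put a rough number on that blow-up, here is a small Python sketch. It is only an illustration: the four block names are hypothetical placeholders, and the model simply assumes every direction/block intersection needs its own class, as described in the quoted text.

```python
from itertools import product

def combined_classes(*classifications):
    """Each intersection of independent classifications needs its own class."""
    return list(product(*classifications))

directions = ["LTR", "RTL"]
blocks = ["B0", "B1", "B2", "B3"]   # hypothetical Unicode blocks

# One class per (direction, block) pair.
classes = combined_classes(directions, blocks)

# What used to be a single LTR -> RTL boundary specification now has to be
# repeated for every (LTR class, RTL class) pair.
ltr_to_rtl = [(a, b) for a in classes for b in classes
              if a[0] == "LTR" and b[0] == "RTL"]

print(len(classes))      # 2 * 4 = 8 classes
print(len(ltr_to_rtl))   # 4 * 4 = 16 boundary specifications
```

With B blocks, the one LTR/RTL specification becomes B * B identical ones, and each further independent classification multiplies the count again.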

In practice, I suppose I don't expect there to be all that many "generic 
purposes" for which intercharclass is really a useful tool. For example, 
it's hard to see how it could work well for bidi issues, because of the 
problem of resolving neutral characters -- especially run-initial neutrals.

>
> Would it be impossibly difficult to extend the concept so that a
> character takes a list of character classes so that you can classify
> characters in more than one way without needing impossibly many
> character classes to do that?

There would be two aspects to this: first, extending the character class 
storage so as to allow a list rather than a single number. Currently, 
it's stashed in the upper part of the word where sfcode already lives, 
making the implementation very simple and cheap.
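
As a rough illustration of why the single-class storage is so cheap, here is a Python model of a packed word. This is not the XeTeX source: the 16-bit width for the sfcode field and all the names are assumptions made for the sketch.

```python
SF_BITS = 16                      # assumed width of the sfcode field
SF_MASK = (1 << SF_BITS) - 1

def pack(sfcode, charclass):
    """Store the class in the upper bits of the word that holds sfcode."""
    return (charclass << SF_BITS) | (sfcode & SF_MASK)

def get_sfcode(word):
    return word & SF_MASK

def get_charclass(word):
    return word >> SF_BITS

def set_charclass(word, charclass):
    """Change the class without disturbing the sfcode in the low bits."""
    return pack(get_sfcode(word), charclass)

word = pack(1000, 0)              # sfcode 1000, class 0
word = set_charclass(word, 7)     # reclassify the character as class 7

print(get_sfcode(word), get_charclass(word))
```

A fixed-width field like this has room for exactly one class number, which is why a list of classes per character would need a different storage scheme.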

And second, checking for the existence of a token list for the current 
boundary would become significantly more expensive. Currently, we just 
combine the two classes at the boundary to get a single 32-bit number, 
and do a simple lookup (in a sparse array) to see if there's anything 
defined. With class lists, we'd need to do this for each of the classes 
in the two lists -- i.e. m * n sparse-array lookups. Or perhaps go at it 
from the other direction: iterate over a list of defined transitions, 
and check whether each of them applies.
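
The difference in cost can be sketched in Python as follows. A dict stands in for the sparse array, the 16-bit class width is an assumption, and the function names are invented for the sketch rather than taken from XeTeX.

```python
CLASS_BITS = 16                       # assumed width of one class number
CLASS_MASK = (1 << CLASS_BITS) - 1

interchartoks = {}                    # stands in for the sparse array

def boundary_key(left, right):
    """Combine the two classes at a boundary into a single number."""
    return (left << CLASS_BITS) | right

def set_toks(left, right, toks):
    interchartoks[boundary_key(left, right)] = toks

def lookup_single(left, right):
    """Current scheme: one probe per boundary."""
    return interchartoks.get(boundary_key(left, right))

def lookup_lists(left_classes, right_classes):
    """Class lists: m * n probes, one per pair of classes."""
    found = []
    for l in left_classes:
        for r in right_classes:
            toks = interchartoks.get(boundary_key(l, r))
            if toks is not None:
                found.append(((l, r), toks))
    return found

def lookup_by_transitions(left_classes, right_classes):
    """Other direction: walk the defined transitions and test each one."""
    ls, rs = set(left_classes), set(right_classes)
    return [((k >> CLASS_BITS, k & CLASS_MASK), toks)
            for k, toks in interchartoks.items()
            if (k >> CLASS_BITS) in ls and (k & CLASS_MASK) in rs]

set_toks(1, 3, "foo")
set_toks(2, 4, "plugh")
print(lookup_single(1, 3))
print(lookup_lists([1, 2], [3, 4]))
```

Which variant is cheaper would depend on whether the class lists or the table of defined transitions tends to be shorter.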

Oh, and if there are multiple matches at a given boundary, what happens? 
Using an imaginary extension to support lists:

   \XeTeXintercharclasses `A = { 1, 2 }
   \XeTeXintercharclasses `B = { 3, 4 }

   \XeTeXinterchartoks 1 3 = { foo }
   \XeTeXinterchartoks 1 4 = { bar }
   \XeTeXinterchartoks 2 3 = { xyzzy }
   \XeTeXinterchartoks 2 4 = { plugh }

What happens at the boundary in "AB"? Should it depend on the numerical 
values of the classes, or the order in which the transitions were 
specified, or what?
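
To make the ambiguity concrete, here is a Python sketch of three candidate rules applied to the "AB" example. None of them is proposed as the right answer, and the function names are hypothetical.

```python
def matches(transitions, left, right):
    """All defined transitions that apply at this boundary, in definition order."""
    return [(l, r, toks) for (l, r, toks) in transitions
            if l in left and r in right]

def by_class_number(transitions, left, right):
    """Rule 1: the match with the numerically lowest (left, right) pair wins."""
    m = matches(transitions, left, right)
    return min(m, key=lambda t: (t[0], t[1]))[2] if m else None

def by_definition_order(transitions, left, right):
    """Rule 2: the first transition that was defined wins."""
    m = matches(transitions, left, right)
    return m[0][2] if m else None

def insert_all(transitions, left, right):
    """Rule 3: insert every matching token list, in definition order."""
    return [toks for (_, _, toks) in matches(transitions, left, right)]

A, B = {1, 2}, {3, 4}

# The transitions in the order given in the example.
as_posted = [(1, 3, "foo"), (1, 4, "bar"), (2, 3, "xyzzy"), (2, 4, "plugh")]
# The same transitions, but with "2 4" defined first.
reordered = [(2, 4, "plugh"), (1, 3, "foo"), (1, 4, "bar"), (2, 3, "xyzzy")]

print(by_class_number(as_posted, A, B), by_definition_order(as_posted, A, B))
print(by_class_number(reordered, A, B), by_definition_order(reordered, A, B))
print(insert_all(as_posted, A, B))
```

In the order given, rules 1 and 2 happen to agree on `foo`; reordering the definitions makes rule 2 pick `plugh` instead, while rule 3 inserts all four token lists.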

(I'm not saying the idea is a bad one; I can imagine it might be quite 
useful. But I can also imagine it getting a bit hairy...)

JK