[lltx] pattern loading in luatex

Stephan Hennig mailing_list at arcor.de
Mon Mar 7 13:56:45 CET 2011

schrieb Taco Hoekwater:
> On 03/06/2011 02:24 PM, Stephan Hennig wrote:
>> Let me put it this way:  In LuaTeX, the top-level data structure to
>> organize the patterns of a language is still a trie, with additional
>> data attached (fall-back transitions) that make the trie behave like a
>> finite state automaton.  Would that be wrong?
> It depends on how you see the word 'trie': there is nowhere in luatex
> a data structure that looks like an actual trie. The algorithm of course
> uses a prefix test as the first step, but it checks for a key in a
> hash instead of walking a tree.

OK, I was assuming word decomposition still worked by walking a tree and
the hashed keys are the (multi-byte) Unicode characters triggering a
transition at a certain state.

> I am not too good in explaining this, but you can look at
> luatexdir/lang/hyphen.w for the actual details.

I have already looked at the libhnj sources, but honestly, I'm not good
at reading code and even worse in languages I rarely use myself.  But
nevertheless, I'm interested in knowing a bit more about the actual
algorithm used in LuaTeX.  (I have coded a trie based hyphenation
algorithm in Lua.)

By "instead of walking a trie", do you mean LuaTeX uses a trivial
approach where the existence of all patterns potentially matching a word
are checked individually?  As an example, for a word "abc" iterations
are as follows:

  iteration  new character  existing patterns to check

    1          a              a

    2          b              b

    3          c              c

That would be a quadratic order word decomposition algorithm instead of
the linear trie based algorithm.  I don't think word decomposition were
a bottleneck even with such trivial approach, but still ...

After another thought:  in the trivial approach there's no need for
fall-back transitions.  So I wonder what fall-back transitions are good
for, if you're not walking a tree?

>> BTW, am I correctly assuming that the reason for switching from a packed
>> array data structure to a hash is that the former only works as long as
>> the keys (characters) are from a finite range.  Which is the case for a
>> 7 or 8 bit character encoding.  But with Unicode characters this is
>> (practically) not the case?
> Mostly. A secondary reason is that the original routines converted the
> trie into a 'compressed trie', after which point it became immutable.
> We did not like that, it is quite handy to be able to add new \patterns
> (for other languages) on the fly, and changing the existing code would
> have been quite a lot of work.


Best regards,
Stephan Hennig

More information about the lualatex-dev mailing list