[lltx] pattern loading in luatex
mailing_list at arcor.de
Mon Mar 7 13:56:45 CET 2011
schrieb Taco Hoekwater:
> On 03/06/2011 02:24 PM, Stephan Hennig wrote:
>> Let me put it this way: In LuaTeX, the top-level data structure to
>> organize the patterns of a language is still a trie, with additional
>> data attached (fall-back transitions) that make the trie behave like a
>> finite state automaton. Would that be wrong?
> It depends on how you see the word 'trie': there is nowhere in luatex
> a data structure that looks like an actual trie. The algorithm of course
> uses a prefix test as the first step, but it checks for a key in a
> hash instead of walking a tree.
OK, I was assuming word decomposition still worked by walking a tree and
the hashed keys are the (multi-byte) Unicode characters triggering a
transition at a certain state.
> I am not too good in explaining this, but you can look at
> luatexdir/lang/hyphen.w for the actual details.
I have already looked at the libhnj sources, but honestly, I'm not good
at reading code and even worse in languages I rarely use myself. But
nevertheless, I'm interested in knowing a bit more about the actual
algorithm used in LuaTeX. (I have coded a trie based hyphenation
algorithm in Lua.)
By "instead of walking a trie", do you mean LuaTeX uses a trivial
approach where the existence of all patterns potentially matching a word
are checked individually? As an example, for a word "abc" iterations
are as follows:
iteration new character existing patterns to check
1 a a
2 b b
3 c c
That would be a quadratic order word decomposition algorithm instead of
the linear trie based algorithm. I don't think word decomposition were
a bottleneck even with such trivial approach, but still ...
After another thought: in the trivial approach there's no need for
fall-back transitions. So I wonder what fall-back transitions are good
for, if you're not walking a tree?
>> BTW, am I correctly assuming that the reason for switching from a packed
>> array data structure to a hash is that the former only works as long as
>> the keys (characters) are from a finite range. Which is the case for a
>> 7 or 8 bit character encoding. But with Unicode characters this is
>> (practically) not the case?
> Mostly. A secondary reason is that the original routines converted the
> trie into a 'compressed trie', after which point it became immutable.
> We did not like that, it is quite handy to be able to add new \patterns
> (for other languages) on the fly, and changing the existing code would
> have been quite a lot of work.
More information about the lualatex-dev