[tex-hyphen] weighting hyphenation points (was: hyphenation (what else ;-))

Tue May 18 21:12:26 CEST 2010

On Mon, May 17, 2010 at 13:01, Stephan Hennig wrote:
> On 17.05.2010 00:55, Mojca Miklavec wrote:
>
>>> From a readability point of view 'lava-bo' is better for me since one
>>> can guess the rest of the word (whereas you can't guess the rest of la-)
>>
>> <not-to-be-taken-seriously>
>> Oh, and yes ... I was already wondering when somebody would come up
>> with the idea to extend TeX with tolerances for preferable breaking
>> points in addition to the allowed ones :) :) :)
>> </not-to-be-taken-seriously>
>
> Incidentally, I had a mail conversation about this with Taco and Werner a
> couple of weeks ago.  The good news is, I think Taco has this on his list.
> Here's a sketch of the approach as I understand it (ignoring libhnj for
> now).
>
> Hyphenation points can be weighted by applying multiple pattern sets in
> parallel that have different weights attached.  That is, if a match exists
> in, e.g., a compound word pattern set, then that hyphenation point will be
> weighted higher than a regular hyphenation point.  If competing pattern
> sets match at the same position, the highest weight wins.
>
> Consider these pattern sets
>
>  * regular pattern set with an attached weight of 10:
>
>      n1n a1d
>
>  * compound word pattern set with an attached weight of 20:
>
>      en1nad
>
> and the compound word "Tannennadel" (fir needle).  The regular pattern set
> has matches
>
>  Tan-nen-na-del
>
> weighting each hyphenation point equally (10 or whatever).  Compound word
> patterns find the match
>
>  Tannen-nadel
>
> weighting that match 20.  Finally, during paragraph breaking, hyphenation
> weights will be
>
>  Tan-nen-na-del
>     10  20 10
>
> Therefore breaking the word at the compound boundary Tannen-nadel will be
> (slightly) preferred.
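
The weighting scheme quoted above can be sketched in a few lines of Python.
This is only an illustration of the idea as described in the mail, not
LuaTeX's implementation or API: the function names (parse_pattern,
break_points, weighted_breaks) and the data layout are made up for this
example, and the matcher is a simplified Liang-style one (digits between
letters, odd value means "break allowed", "." marks the word edges).

```python
import re


def parse_pattern(pattern):
    """Split a Liang-style pattern such as 'en1nad' into its letters and
    the digit attached to each inter-letter gap (0 where none is given)."""
    letters = re.sub(r"\d", "", pattern)
    digits = [0] * (len(letters) + 1)
    gap = 0
    for ch in pattern:
        if ch.isdigit():
            digits[gap] = int(ch)
        else:
            gap += 1
    return letters, digits


def break_points(word, patterns):
    """Return the positions p where one pattern set allows a break
    between word[p-1] and word[p]."""
    text = "." + word.lower() + "."
    values = [0] * (len(text) + 1)  # values[j] is the gap before text[j]
    for pattern in patterns:
        letters, digits = parse_pattern(pattern)
        start = text.find(letters)
        while start != -1:
            for k, d in enumerate(digits):
                values[start + k] = max(values[start + k], d)
            start = text.find(letters, start + 1)
    # Odd values mark allowed breaks; word[p] sits at text[p + 1].
    return {p for p in range(1, len(word)) if values[p + 1] % 2 == 1}


def weighted_breaks(word, pattern_sets):
    """Apply every (weight, patterns) set independently; where several
    sets agree on a position, the highest weight wins."""
    weights = {}
    for weight, patterns in pattern_sets:
        for p in break_points(word, patterns):
            weights[p] = max(weights.get(p, 0), weight)
    return weights


# The two pattern sets from the example (weights 10 and 20 are arbitrary).
PATTERN_SETS = [
    (10, ["n1n", "a1d"]),   # regular patterns
    (20, ["en1nad"]),       # compound word patterns
]

if __name__ == "__main__":
    print(weighted_breaks("Tannennadel", PATTERN_SETS))
```

For "Tannennadel" this yields breaks after "Tan" and "na" with weight 10 and
after "Tannen" with weight 20.  How a paragraph builder would turn such
weights into penalties is left open here, as it is in the mail.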

Thanks for the really nice outline. I didn't mean it too seriously,
but now I have a "problem": I'll have to find a list of preferred
hyphenation points somewhere, while I don't even understand
our exact rules :)

It seems that at some point we'll have to start separating
LuaTeX-specific patterns (with advanced features) from the regular ones
(which might already be the case: I have a feeling that Hungarian
might have an improved set of patterns that could be used in LuaTeX
and only in LuaTeX).

Mojca