[tex-hyphen] weighting hyphenation points

Stephan Hennig mailing_list at arcor.de
Thu May 20 11:34:17 CEST 2010


On 19.05.2010 21:55, Stephan Hennig wrote:

>     * a regular pattern set containing all valid hyphenations,
>     * a compound-word pattern set, that matches only word compounds,
>     * an undesirable pattern set, that recognizes valid, but undesirable
>       hyphenations

Realistically, I can think of five different pattern sets for 
hyphenation of the German language (in the order of decreasing weight):

   A) word compounds
        - hyphenation is much preferred
        - weight 20
        - examples: Text-illustration, Tal-entwässerung

   B) affix hyphenation
        - still appreciated; Knuth had experimented with this before
          Liang developed the dictionary algorithm
        - weight 15
        - examples: Textillustra-tion, Talent-wässe-rung

   C) all valid hyphenations
        - these correspond to the current patterns
        - weight 10
        - examples: Text-il-lus-tra-ti-on, Tal-ent-wäs-se-rung

   D) undesirable hyphenations
        - hyphenations near a word compound or word boundary
        - weight 5
        - examples: Textil-lustrati-on

   E) sense distorting
        - to be suppressed by all means
        - weight 1 or 0
        - examples: Talent-wässerung, Textil-lustration

I hope the examples in case E are understandable even for non-German
speakers.  Talentwässerung (valley drainage) has nothing to do with
"talent" and Textillustration (text illustration) has nothing to do with
"textiles".

Note how Talent-wässerung is matched by pattern sets B, C and E.
Similarly, Textil-lustration is matched by pattern sets C, D and E.  Both
hyphenations are sense distorting and have to be suppressed by all means.

A sane ranking of the pattern sets would be (in the order of decreasing 
priority):

   1. sense distorting        (E)
        - suppress

   2. word compounds          (A)
        - prefer

   3. undesirable             (D)
        - avoid

   4. affix                   (B)
        - prefer

   5. regular                 (C)
        - if nothing else fits

That results in the following hyphenation weights:

   Text -20- il -0- lus -10- tra -15- ti -5- on
   Tal -20- ent -0- wäs -10- se -15- rung
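
The ranking above can be sketched in a few lines of Python.  This is
only an illustration under stated assumptions, not part of the proposal
itself: the names WEIGHTS, PRIORITY, resolve and render are made up, set
E is given weight 0, and the pattern sets matching each break point are
read off the examples (only the two E cases are spelled out in the
text).

```python
# Weight per pattern set, as listed above (E assumed to be 0, not 1).
WEIGHTS = {"A": 20, "B": 15, "C": 10, "D": 5, "E": 0}

# Ranking of the pattern sets: the first set that matches decides.
PRIORITY = ["E", "A", "D", "B", "C"]


def resolve(matches):
    """Return the weight of the highest-priority set among `matches`."""
    for name in PRIORITY:
        if name in matches:
            return WEIGHTS[name]
    return None  # no set matched: not a hyphenation point at all


def render(syllables, matches_per_break):
    """Join syllables, showing the resolved weight at every break."""
    parts = [syllables[0]]
    for syllable, matches in zip(syllables[1:], matches_per_break):
        parts.append(f"-{resolve(matches)}-")
        parts.append(syllable)
    return " ".join(parts)


# Set memberships per break point are assumptions taken from the examples.
TEXTILLUSTRATION = render(
    ["Text", "il", "lus", "tra", "ti", "on"],
    [{"A", "C"}, {"C", "D", "E"}, {"C"}, {"B", "C"}, {"C", "D"}],
)
TALENTWAESSERUNG = render(
    ["Tal", "ent", "wäs", "se", "rung"],
    [{"A", "C"}, {"B", "C", "E"}, {"C"}, {"B", "C"}],
)

print(TEXTILLUSTRATION)
print(TALENTWAESSERUNG)
```

With these assumed memberships the two printed lines reproduce the
weights shown above.  The point of the priority list is that a match in
E overrides everything else, even when B or C also match at the same
position.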

For the German language, that level of granularity in hyphenation
control would be great.  Even so, finding a good set of weights
(demerits) for the paragraph-breaking algorithm won't be easy.  The
current demerits are already awkward enough, and I wouldn't give up much
grey value for more legible hyphenations.  But if, say, one C-type
hyphenation per page turns into an A-type hyphenation, or one D-type
hyphenation turns into a B-type hyphenation, it pays off, IMO.

Best regards,
Stephan Hennig

