[tex-hyphen] weighting hyphenation points
Stephan Hennig
mailing_list at arcor.de
Thu May 20 11:34:17 CEST 2010
On 19.05.2010 21:55, Stephan Hennig wrote:
> * a regular pattern set containing all valid hyphenations,
> * a compound-word pattern set, that matches only word compounds,
> * an undesirable pattern set, that recognizes valid, but undesirable
> hyphenations
Realistically, I can think of five different pattern sets for
hyphenation of the German language (in the order of decreasing weight):
A) word compounds
- hyphenation is much preferred
- weight 20
- examples: Text-illustration, Tal-entwässerung
B) affix hyphenation
- still appreciated, Knuth had experimented with this before
Liang developed the dictionary algorithm
- weight 15
- examples: Textillustra-tion, Talent-wässe-rung
C) all valid hyphenations
- these correspond to the current patterns
- weight 10
- examples: Text-il-lus-tra-ti-on, Tal-ent-wäs-se-rung
D) undesirable hyphenations
- hyphenations near a word compound or word boundary
- weight 5
- examples: Textil-lustrati-on
E) sense distorting
- to be suppressed by all means
- weight 1 or 0
- examples: Talent-wässerung, Textil-lustration
I hope the examples in case E are understandable even for non-Germans.
Talentwässerung (valley drainage) has nothing to do with "talent" and
Textillustration (text illustration) has nothing to do with "textiles".
Note how Talent-wässerung is matched by pattern sets B, C and E.
Similarly, Textil-lustration is matched by pattern sets C, D and E. Both
hyphenations are sense distorting and have to be suppressed by all means.
A sane ranking of the pattern sets would be (in the order of decreasing
priority):
1. sense distorting (E)
- suppress
2. word compounds (A)
- prefer
3. undesirable (D)
- avoid
4. affix (B)
- prefer
5. regular (C)
- if nothing else fits
That results in the following hyphenation weights:
Text -20- il -0- lus -10- tra -15- ti -5- on
Tal -20- ent -0- wäs -10- se -15- rung
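As a rough sketch of how the ranking above could be applied (this is not
an existing implementation, and the names WEIGHTS, PRIORITY, resolve and
annotate are made up for illustration): each break position carries the
letters of all pattern sets that match it, assigned by hand here from the
examples above, and the highest-ranked set decides the weight. Case E is
given weight 0, which is one of the two values suggested above.

```python
# Weights per pattern set as listed above (E: "1 or zero", 0 assumed here).
WEIGHTS = {"A": 20, "B": 15, "C": 10, "D": 5, "E": 0}

# Ranking in order of decreasing priority: E, A, D, B, C.
PRIORITY = ["E", "A", "D", "B", "C"]


def resolve(matches):
    """Return the weight of the highest-priority pattern set in `matches`."""
    for pattern_set in PRIORITY:
        if pattern_set in matches:
            return WEIGHTS[pattern_set]
    return None  # no pattern set matched: not a valid break at all


def annotate(syllables, matches_per_break):
    """Join syllables, writing the resolved weight at every break."""
    parts = [syllables[0]]
    for syllable, matches in zip(syllables[1:], matches_per_break):
        parts.append(f"-{resolve(matches)}-")
        parts.append(syllable)
    return " ".join(parts)


# Hand-assigned matches, read off the examples for sets A to E.
textillustration = annotate(
    ["Text", "il", "lus", "tra", "ti", "on"],
    [{"A", "C"}, {"C", "D", "E"}, {"C"}, {"B", "C"}, {"C", "D"}],
)
talentwaesserung = annotate(
    ["Tal", "ent", "wäs", "se", "rung"],
    [{"A", "C"}, {"B", "C", "E"}, {"C"}, {"B", "C"}],
)

print(textillustration)
print(talentwaesserung)
```

In a real system the match sets would of course come from running the
five pattern sets over the word, not from a hand-written list.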
For the German language, that level of granularity in hyphenation
control would be great. That said, finding a good set of weights
(demerits) for the paragraph breaking algorithm won't be easy. The
current demerits are already awkward enough, and I wouldn't give up much
grey value for more legible hyphenations. But if, say, one C-type
hyphenation per page turns into an A-type hyphenation, or a D-type
hyphenation turns into a B-type hyphenation, it pays off, IMO.
Best regards,
Stephan Hennig