[tex-hyphen] Newbie: Question about pattern structure

Taylor, P P.Taylor at rhul.ac.uk
Thu Aug 22 15:59:28 CEST 2019


Arthur Reutenauer wrote:


In order to hyphenate a word in a given language, you need a list of
patterns for that language.  Let’s say the word is “hyphenation” and the
patterns are Knuth and Liang’s file hyphen.tex (available from CTAN:
http://mirror.ctan.org/systems/knuth/dist/lib/hyphen.tex).


I think that what Arthur has written is very helpful, but it will surely leave the intelligent reader asking "but how were those patterns generated, and what do the numbers mean?"  The introduction to Patgen.web sheds some light on this:

Introduction. This program takes a list of hyphenated words and generates a set of patterns that
can be used by the TeX82 hyphenation algorithm.

The patterns consist of strings of letters and digits, where a digit indicates a 'hyphenation value' for some
intercharacter position. For example, the pattern "3t2ion" specifies that if the string "tion" occurs in a word,
we should assign a hyphenation value of 3 to the position immediately before the "t", and a value of 2 to the
position between the "t" and the "i".
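
To make the reading of "3t2ion" concrete, here is a minimal Python sketch of how such a pattern can be split into its letters and its per-position values. The function name parse_pattern is my own invention for illustration and is not taken from patgen or TeX; positions without a digit are assumed to carry the value 0.

```python
def parse_pattern(pattern):
    """Split a pattern such as "3t2ion" into (letters, values).

    values[i] is the hyphenation value of the position just before
    letters[i]; the final entry is the position after the last letter.
    Positions with no digit get the value 0 (an assumption of this sketch).
    """
    letters = []
    values = [0]
    for ch in pattern:
        if ch.isdigit():
            # A digit sets the value of the current intercharacter position.
            values[-1] = int(ch)
        else:
            # A letter closes the current position and opens the next one.
            letters.append(ch)
            values.append(0)
    return "".join(letters), values


letters, values = parse_pattern("3t2ion")
print(letters, values)
```

Note that the four letters of "tion" yield five positions, one more than the number of letters.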

The patterns are generated in a series of sequential passes through the dictionary. In each pass, we
collect count statistics for a particular type of pattern, taking into account the effect of patterns chosen in
previous passes. At the end of a pass, the counts are examined and new patterns are selected.
Patterns are chosen one level at a time, in order of increasing hyphenation value. In the sample run
shown below, the parameters "hyph_start" and "hyph_finish" specify the first and last levels respectively to be
generated.

Patterns at each level are chosen in order of increasing pattern length (usually starting with length 2).
This is controlled by the parameters "pat_start" and "pat_finish" specified at the beginning of each level.
Furthermore patterns of the same length applying to different intercharacter positions are chosen in
separate passes through the dictionary.  Since patterns of length n may apply to n + 1 different positions,
choosing a set of patterns of lengths 2 through n for a given level requires (n+1)(n+2)/2 − 3 passes through
the word list.
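
The pass count can be checked by simple arithmetic: a pattern of length k may apply to k + 1 positions, so lengths 2 through n need 3 + 4 + ... + (n + 1) passes, which equals (n+1)(n+2)/2 − 3. The following Python sketch compares the explicit sum with that closed form; the function names are hypothetical and only the counting argument comes from the text above.

```python
def passes_for_level(pat_start, pat_finish):
    """Count passes for one level: one pass per (length, position) pair,
    where a pattern of a given length has length + 1 positions."""
    return sum(length + 1 for length in range(pat_start, pat_finish + 1))


def closed_form(n):
    """Closed form for lengths 2 through n: (n+1)(n+2)/2 - 3."""
    return (n + 1) * (n + 2) // 2 - 3


for n in range(2, 9):
    # Both columns should agree for every n.
    print(n, passes_for_level(2, n), closed_form(n))
```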

At each level, the selection of patterns is controlled by the three parameters "good_wt", "bad_wt" and "thresh".
A hyphenating pattern will be selected if good * good_wt - bad * bad_wt ≥ thresh, where "good" and "bad" are
the number of times the pattern could and could not be hyphenated respectively at a particular point.
For inhibiting patterns, "good" is the number of errors inhibited, and "bad" is the number of previously found
hyphens inhibited.
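
As an illustration of the selection rule, here is a small Python sketch with the three parameters written as Python identifiers. The function name and the sample counts and weights are made up for the example; they are not values from any real patgen run.

```python
def is_selected(good, bad, good_wt, bad_wt, thresh):
    """Return True if a candidate pattern passes the selection test
    good * good_wt - bad * bad_wt >= thresh."""
    return good * good_wt - bad * bad_wt >= thresh


# Hypothetical parameters for one level: good_wt = 1, bad_wt = 3, thresh = 5.
# 20 good and 4 bad occurrences: 20*1 - 4*3 = 8, which reaches the threshold.
print(is_selected(good=20, bad=4, good_wt=1, bad_wt=3, thresh=5))
# 10 good and 2 bad occurrences: 10*1 - 2*3 = 4, which falls short.
print(is_selected(good=10, bad=2, good_wt=1, bad_wt=3, thresh=5))
```

Raising bad_wt relative to good_wt makes the test stricter about patterns that are sometimes wrong, while thresh sets how much net evidence a pattern needs before it is accepted.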

The interested reader is referred to (e.g.,) http://readytext.co.uk/files/patgen.pdf
Philip Taylor
