[tex-hyphen] Newbie: Question about pattern structure

Taylor, P P.Taylor at rhul.ac.uk
Thu Aug 22 16:17:33 CEST 2019


Philip Taylor wrote (some poorly converted PDF to text).  Herewith a hopefully better version:

Introduction. This program takes a list of hyphenated words and generates a set of patterns that
can be used by the TeX82 hyphenation algorithm.

The patterns consist of strings of letters and digits, where a digit indicates a 'hyphenation value' for some
intercharacter position.  For example, the pattern "3t2ion" specifies that if the string "tion" occurs in a word,
we should assign a hyphenation value of 3 to the position immediately before the "t", and a value of 2 to the
position between the "t" and the "i".
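
As an illustration of the paragraph above, here is a minimal Python sketch (not taken from the patgen source) that splits a pattern into its letters and its per-position values. The function name parse_pattern is hypothetical, and the sketch assumes that a position with no digit carries the value 0.

```python
def parse_pattern(pattern):
    """Split a pattern such as '3t2ion' into letters and per-position values.

    values[i] is the hyphenation value for the position immediately
    before letters[i]; the last entry is the position after the final
    letter. Positions without a digit are assumed to have value 0.
    """
    letters = []
    values = [0]
    for ch in pattern:
        if ch.isdigit():
            # A digit sets the value of the current inter-character position.
            values[-1] = int(ch)
        else:
            # A letter closes the current position and opens the next one.
            letters.append(ch)
            values.append(0)
    return "".join(letters), values


letters, values = parse_pattern("3t2ion")
print(letters, values)  # tion [3, 2, 0, 0, 0]
```

Note that the four letters of "tion" yield five positions, which is the "n + 1 positions for a pattern of length n" count used further on.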

The patterns are generated in a series of sequential passes through the dictionary.  In each pass, we
collect count statistics for a particular type of pattern, taking into account the effect of patterns chosen in
previous passes.  At the end of a pass, the counts are examined and new patterns are selected.
Patterns are chosen one level at a time, in order of increasing hyphenation value.  In the sample run
shown below, the parameters "hyph_start" and "hyph_finish" specify the first and last levels to be
generated respectively.

Patterns at each level are chosen in order of increasing pattern length (usually starting with length 2).
This is controlled by the parameters "pat_start" and "pat_finish", specified at the beginning of each level.
Furthermore, patterns of the same length applying to different inter-character positions are chosen in
separate passes through the dictionary.  Since patterns of length n may apply to n + 1 different positions,
choosing a set of patterns of lengths 2 through n for a given level requires (n+1)(n+2)/2 - 3 passes through
the word list.
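
The pass count can be checked with a short Python sketch. It simply enumerates one pass per (pattern length, position) pair for a single level and compares the total with the closed form quoted above. The function names are hypothetical, and the order in which positions are listed here is arbitrary: it is not claimed to be the order in which patgen visits them.

```python
def passes_for_level(first_length, last_length):
    """List one (length, position) pair per dictionary pass for one level."""
    return [
        (length, position)
        for length in range(first_length, last_length + 1)
        for position in range(length + 1)  # n + 1 positions for length n
    ]


def closed_form(n):
    """Number of passes for lengths 2 through n: (n+1)(n+2)/2 - 3."""
    return (n + 1) * (n + 2) // 2 - 3


n = 5
passes = passes_for_level(2, n)
print(len(passes), closed_form(n))  # 18 18
```

For n = 5 the lengths 2, 3, 4 and 5 contribute 3 + 4 + 5 + 6 = 18 passes, which agrees with (6 * 7) / 2 - 3 = 18.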

At each level, the selection of patterns is controlled by the three parameters "good_wt", "bad_wt" and "thresh".
A hyphenating pattern will be selected if good * good_wt - bad * bad_wt >= thresh, where "good" and "bad" are
the number of times the pattern could and could not be hyphenated at a particular point, respectively.
For inhibiting patterns, "good" is the number of errors inhibited and "bad" is the number of previously found
hyphens inhibited.
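
The selection rule is simple enough to state as a one-line test. The Python sketch below is only an illustration of the inequality, not code from patgen; the function name is hypothetical, the parameter names are written with underscores so that they are valid identifiers, and the weights and counts in the example are made up.

```python
def is_selected(good, bad, good_wt, bad_wt, thresh):
    """Return True if a candidate pattern meets the selection criterion.

    good, bad: counts collected for the candidate during a pass.
    good_wt, bad_wt, thresh: the three per-level selection parameters.
    """
    return good * good_wt - bad * bad_wt >= thresh


# Made-up parameter values, chosen only to show both outcomes.
good_wt, bad_wt, thresh = 1, 3, 5

print(is_selected(10, 1, good_wt, bad_wt, thresh))  # 10*1 - 1*3 = 7 >= 5 -> True
print(is_selected(10, 2, good_wt, bad_wt, thresh))  # 10*1 - 2*3 = 4 <  5 -> False
```

With a larger "bad" weight, each wrong position costs more, so the second candidate is rejected even though it has the same "good" count as the first.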

The interested reader is referred to (e.g.,) http://readytext.co.uk/files/patgen.pdf
Philip Taylor

