[texhax] TeX hyphenation -- why do so many words get no hyphens

Barbara Beeton bnb at ams.org
Thu Aug 5 15:49:29 CEST 2004

pierre mackay writes,

    Steve Tolkin's list is rather a challenge, and suggests that even the
    patterns should be looked at again.  I do not remember what hyphenation
    list was used to start the generation of hyphen.tex, but it seems to
    have missed some real possibilities.

         I run the risk of being shown up again as careless, but here is
    my suggestion of patterns that might be safe (I have tried \showhyphens{}
    on many of them to confirm that they are in truth not hyphenated.  Could
    it be that processor speed had something to do with a conservatism that
    left many of the rarer possibilities out?)

if i remember correctly, the word list used to detect
patterns was the american heritage dictionary.  that
was pretty comprehensive, although i know of at least
one pattern missed -- it appears in place names like
worces-ter, leices-ter, etc.  but place names aren't
usually in dictionaries, and it's usually considered
better practice to avoid hyphenating proper names.


some of the words in the list wouldn't be hyphenated in
"good" english typesetting because there wouldn't be
at least two letters before a hyphen at the beginning
of a word (achieved) or three after a hyphen at the end
of a word.
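
for anyone who wants to check this, here is a minimal sketch
in plain tex.  the two parameters are the ones tex 3 uses for
these limits, and 2 and 3 are the plain.tex defaults; the
choice of test word is just taken from the example above.
the result is written to the terminal and the log file, not
to the output page.

```tex
% plain TeX; these are the plain.tex defaults, restated for clarity
\lefthyphenmin=2   % at least two letters before the first hyphen
\righthyphenmin=3  % at least three letters after the last hyphen

% report the hyphenation points TeX finds under these limits
\showhyphens{achieved}
\bye
```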

    It ought to be pretty safe to make hypo, para, and even epi and apo
    into pre-hyphen groups tied to start-of-word.

at least epi and apo might have been left out of the
patterns because of contradictory evidence for epis-co-pal
and apos-tro-phe.  (epi-cen-ter, epi-der-mis and epis-tle
are all hyphenated correctly, as is apos-tle, but
apoc-ry[-]phal is only done halfway.)  there are quite
a few nouns of greek origin that would be hyphenated
after the initial a- if that were permitted, and the
consonants after a-po included in the second syllable.
i'm not surprised that there isn't a simple pattern.
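
the words mentioned above can be checked in one pass.  this
is only a sketch, assuming plain tex with the standard
hyphen.tex patterns loaded; other formats or pattern files
may report different break points.

```tex
% list the break points TeX finds for the epi-/apo- words discussed;
% the report appears on the terminal and in the log file
\showhyphens{episcopal apostrophe epicenter epidermis
  epistle apostle apocryphal}
\bye
```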

    There are at least some consonant clusters here that could probably be
    included in the patterns.  I have tried, for instance, to think of any
    word that would be mishandled by making the "scr" grouping follow a
    hyphen after any consonant.  Perhaps there is one. In any case,
    all these words ought to be at least in the exceptions list.  At
    present-day processor speeds, a larger exceptions list should
    not be an impediment.

werner lemberg has created a script, hyphenex.sh, that
will convert the tugboat hyphenation exceptions article
into an actual exceptions list.  although this was
highlighted in tugboat 23#3/4, pp.247-248, i find that
i never posted it to ctan; i will try to do this in the
next few days, along with that version of the article.
(it will be announced via ctan-ann and comp.text.tex.)
and the list in the first message in this thread will
provide more entries for the tugboat list.
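
for readers who haven't used an exceptions list, a minimal
sketch of what an entry looks like in plain tex follows.
the single word is taken from the discussion earlier in this
message; it is an illustration, not an excerpt from the
tugboat list or from the output of hyphenex.sh.

```tex
% an exception overrides whatever the patterns would give for this word
\hyphenation{apoc-ry-phal}

% confirm that the exception is now in effect (see the log file)
\showhyphens{apocryphal}
\bye
```

entries are given in lowercase and apply to the language
that is current when \hyphenation is read.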

anyone finding hyphenation exceptions not mentioned in
the tugboat list is invited to send them to me for
inclusion in a future edition.

    I find it interesting that I have run into so few problems of this sort
    in ten years of professional typesetting.  But I still think that
    we might propose improvements in hyphen.tex to Don Knuth.

sorry, pierre.  that's already been proposed and rejected.
the ctan file ushyph.tex was suggested (and has been tested)
as a replacement for the original hyphen.tex, but don has
stated that the original hyphen.tex, although it could be
improved, does not constitute a bug, so is frozen.
the exception list route appears to be our best choice.
							-- bb

