[tex-hyphen] String preparation
emuller at amazon.com
Fri May 20 18:21:33 CEST 2016
A few questions:
1. The hyphenation patterns are meant to work on text that has been
"normalized" in some way; I know that at least all uppercase letters
should be converted to lowercase. Looking at the French patterns, I see
that they cover the apostrophe as U+0027 but not as U+2019, so I
suppose that U+2019 should be folded to U+0027. It also seems that
something should be done to fold combining sequences to precomposed
characters. I could not find any documentation of what the normalization
2. In a layout engine, the most likely organization is to use Unicode
UAX#14 (maybe with tailorings for the locales) to determine linebreak
opportunities, and then perhaps to try to hyphenate the pieces between
two linebreak opportunities. Those fragments can contain pretty much
arbitrary characters. I suspect that the text between linebreak
opportunities should be broken into subruns, corresponding to some
notion of word. For example, with the string "foo<NBSP>…<NBSP>bar" (… is
U+2026), it seems that hyphenating that whole string returns a
hyphenation opportunity after the second <NBSP>. I suspect that "foo"
and "bar" should be isolated and presented independently to the
hyphenation engine. But what are the rules for that tokenization?
3. I suspect that different languages may want different
4. All that suggests that these normalization/tokenization rules should
be captured with the hyphenation patterns, preferably in a way that can
be exploited by code.
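
To make points 1 and 2 concrete, here is a rough sketch in Python of the
kind of preparation I have in mind. Everything in it is my own guess and
not documented behaviour of the patterns: the function names
normalize_for_patterns and word_runs are made up, the choice of NFC and
of folding only U+2019 is an assumption, and "word" is arbitrarily taken
to mean a run of letters, combining marks and U+0027.

```python
import unicodedata


def normalize_for_patterns(text):
    """Guess at a normalization: NFC, fold U+2019 to U+0027, lowercase."""
    # NFC composes combining sequences into precomposed characters
    # wherever Unicode defines one.
    text = unicodedata.normalize("NFC", text)
    # Assumption: the typographic apostrophe is the only character to fold.
    text = text.replace("\u2019", "\u0027")
    return text.lower()


def is_word_char(ch):
    """Assumed notion of a word character: letter, mark, or U+0027."""
    return ch == "\u0027" or unicodedata.category(ch)[0] in ("L", "M")


def word_runs(fragment):
    """Return (offset, run) pairs for maximal runs of word characters."""
    runs = []
    start = None
    for i, ch in enumerate(fragment):
        if is_word_char(ch):
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, fragment[start:i]))
            start = None
    if start is not None:
        runs.append((start, fragment[start:]))
    return runs


if __name__ == "__main__":
    # The example from point 2: "foo<NBSP>...<NBSP>bar" with U+2026.
    fragment = normalize_for_patterns("Foo\u00a0\u2026\u00a0Bar")
    for offset, run in word_runs(fragment):
        # Each run would be handed to the hyphenation engine on its own.
        print(offset, run)
```

With this, "foo" and "bar" are presented separately and the NBSP and
U+2026 characters never reach the patterns. Note that the offsets are
relative to the normalized string; mapping them back to the original
text (lowercasing and NFC can both change the length) is not handled
here.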
Are my assumptions correct? Has all this already been discussed? Resolved?
Incidentally, I found
which seems to deal with the same problem.