[tex-hyphen] String preparation

Fri May 20 18:21:33 CEST 2016

A few questions:

1. The hyphenation patterns are meant to work on text that has been 
"normalized" in some way; I know that at least all uppercase letters 
should be converted to lowercase. Looking at the French patterns, I see 
that they account for the apostrophe as U+0027 but not as U+2019, so I 
suppose that U+2019 should be folded to U+0027. It also seems that 
something should be done to fold combining sequences to precomposed 
characters. I could not find any documentation of what the normalization 
should be. Does any exist?

2. In a layout engine, the most likely organization is to use Unicode 
UAX#14 (maybe with tailorings for the locales) to determine linebreak 
opportunities, and then perhaps to try to hyphenate the pieces between 
two linebreak opportunities. Those fragments can contain pretty much 
arbitrary characters. I suspect that the text between linebreak 
opportunities should be broken into subruns, corresponding to some 
notion of word. For example, with the string "foo<NBSP>…<NBSP>bar" (… is 
U+2026), it seems that hyphenating that whole string returns a 
hyphenation opportunity after the second <NBSP>. I suspect that "foo" 
and "bar" should be isolated and presented independently to the 
hyphenation engine. But what are the rules for that tokenization?

3. I suspect that different languages may want different 
normalization/tokenization?

4. All that suggests that the normalization/tokenization rules should 
be captured with the hyphenation patterns, preferably in a way that can 
be exploited by code.
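
To make questions 1 and 2 concrete, here is a minimal Python sketch of 
the kind of preparation I have in mind. Everything in it is an 
assumption on my part, not a documented rule: the function names are 
made up, the only apostrophe fold is the U+2019 to U+0027 one suggested 
by the French patterns, NFC stands in for "fold combining sequences to 
precomposed characters", and a "word" is guessed to be a run of letters, 
combining marks and U+0027.

```python
import unicodedata

# Assumption: only U+2019 is folded, as suggested by the French patterns.
# Other languages may need a different table (or none at all).
APOSTROPHE_FOLDS = {"\u2019": "\u0027"}


def normalize_for_hyphenation(text):
    """Hypothetical normalization: compose (NFC), fold apostrophes, lowercase."""
    text = unicodedata.normalize("NFC", text)
    text = "".join(APOSTROPHE_FOLDS.get(ch, ch) for ch in text)
    return text.lower()


def is_word_char(ch):
    # Assumption: word characters are letters (L*), marks (M*) and U+0027.
    # NBSP (Zs) and U+2026 (Po) therefore end a run.
    return ch == "\u0027" or unicodedata.category(ch)[0] in ("L", "M")


def word_runs(fragment):
    """Return (start, end) offsets of word-like runs.

    Offsets index into the string passed in, i.e. the normalized text,
    which may differ in length from the original text.
    """
    runs = []
    start = None
    for i, ch in enumerate(fragment):
        if is_word_char(ch):
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(fragment)))
    return runs


if __name__ == "__main__":
    fragment = normalize_for_hyphenation("Foo\u00a0\u2026\u00a0Bar")
    for start, end in word_runs(fragment):
        # Each run would be handed to the hyphenation engine on its own.
        print(start, end, fragment[start:end])
```

With this tokenization, "foo" and "bar" are presented separately and the 
<NBSP>…<NBSP> in between is never seen by the patterns. A known weakness 
of the guess: a leading or trailing U+0027 used as a quotation mark is 
kept as part of the word.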

Are my assumptions correct? Has all this already been discussed, or even resolved?

Incidentally, I found 
<https://wiki.openoffice.org/wiki/Documentation/SL/Using_TeX_hyphenation_patterns_in_OpenOffice.org#4._Add_hyphenation_rules_for_special_characters>, 
which seems to deal with the same problem.

Thanks,
Eric.