[tex-hyphen] String preparation

Wed May 25 10:00:16 CEST 2016

Dear Eric,

On 20 May 2016 at 18:21, Muller, Eric wrote:
> A few questions:
>
> 1. the hyphenation patterns are meant to work on text that has been
> "normalized" in some way;

In the early days of TeX it was sufficient for the patterns to work
with 8-bit fonts and with whatever special treatment the macro package
(such as Babel) provided to set the catcodes of characters.

> I know that at least all uppercase letters should
> be converted to lowercase.

True.

> Looking at the French patterns, I see that they
> account for apostrophe by U+0027 but not for U+2019, so I suppose that
> U+2019 should be folded to U+0027.

This needs a bit of explanation and perhaps a bit of further discussion.

If you take the patterns from

https://github.com/hyphenation/tex-hyphen/blob/master/hyph-utf8/tex/generic/hyph-utf8/patterns/txt/hyph-fr.pat.txt

you'll see that both are present.

If you were looking at hyph-fr.tex then in fact we only have U+0027
there, but that's because 8-bit TeX automatically "converts" U+0027
into U+2019 (or rather has the glyph U+2019 on slot 0x27).

In any case I believe that you should take the patterns from the plain
text file, not from "*.tex".

However, a bit of further discussion might be in order. I believe that
we should start supporting equivalence classes at some point. At least
in my mother tongue many characters are completely equivalent (for
example o = ó = ò = ô; they don't even change the meaning of the
word). Hyphenation patterns for quite a few languages, such as
Turkish, effectively define equivalence classes and then repeat the
same pattern for all pairs of characters. It would be a lot "saner" if
the patterns defined the equivalence classes (with lowercase and
uppercase letters in the same class, and likewise the apostrophes) and
the engine then interpreted them properly.
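
To make the idea a bit more concrete, here is a rough Python sketch of
what an engine could do with such classes before looking up patterns.
The class list and the names (EQUIVALENCE_CLASSES, build_fold_table,
fold) are made up for illustration; nothing like this exists in the
pattern files today.

```python
# Hypothetical equivalence classes; a real pattern file would have to
# declare these itself. The first member acts as the representative.
EQUIVALENCE_CLASSES = [
    "o\u00f3\u00f2\u00f4",  # o, ó, ò, ô treated as the same letter
    "\u0027\u2019",         # ASCII apostrophe and right single quotation mark
]


def build_fold_table(classes):
    """Map every member of a class to the first member of that class."""
    table = {}
    for members in classes:
        representative = members[0]
        for ch in members:
            table[ord(ch)] = representative
    return table


FOLD_TABLE = build_fold_table(EQUIVALENCE_CLASSES)


def fold(word):
    """Lowercase the word, then replace each character by its representative."""
    return word.lower().translate(FOLD_TABLE)


print(fold("L\u2019\u00d4O"))
```

The patterns would then only need to be written for the
representatives. Break positions found in the folded word carry over
to the original word only as long as folding keeps the number of
characters unchanged, which lowercasing does not guarantee for every
character.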

With Ethiopic the only rule is "feel free to hyphenate anywhere"
(except just before commas etc.). So we made hyphenation patterns
saying just:

    for each letter <l> from the alphabet, add:
        1<l>1

That could in fact be a single pattern if we had support for
equivalence classes.
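
The generation step above amounts to something like the following
Python sketch. The function name anywhere_patterns is made up, and the
three syllables are only sample input, not the full Ethiopic alphabet
and not the script we actually used.

```python
def anywhere_patterns(alphabet):
    """Return one '1<l>1' pattern per letter: a break is allowed on both sides."""
    return ["1{}1".format(letter) for letter in alphabet]


# Three Ethiopic syllables (HA, LA, MA) used purely as sample input.
sample = ["\u1200", "\u1208", "\u1218"]
for pattern in anywhere_patterns(sample):
    print(pattern)
```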

> It also seems that something should be
> done to fold combining sequences to precomposed characters. I could not find
> any documentation of what the normalization should be?

The old TeX did not support combining characters in any way. XeTeX
does some "black magic" in the background (I believe it does some
Unicode normalization, but I don't know the details). I'm not sure
what (if anything) LuaTeX does.

We currently have "œ + combining acute" in one of the patterns and
that one is also "a bit problematic" because it should be treated as a
single glyph (and probably isn't). So we also added "do not hyphenate
before combining acute" which is a bit of a strange rule.
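
To illustrate why normalization alone does not solve this case, here is
a small Python sketch using the standard unicodedata module. It only
shows what Unicode NFC does; I am not claiming that this is what XeTeX
or LuaTeX do internally. The helper name to_nfc is made up.

```python
import unicodedata


def to_nfc(text):
    """Compose combining sequences into precomposed characters where Unicode has one."""
    return unicodedata.normalize("NFC", text)


e_acute = to_nfc("e\u0301")        # e + combining acute
oe_acute = to_nfc("\u0153\u0301")  # œ + combining acute

# "e + acute" composes to the single character U+00E9, while
# "œ + acute" has no precomposed form and stays two code points.
print(len(e_acute), len(oe_acute))
```

So even with NFC applied to the input, the patterns (or the engine)
still have to deal with sequences that have no precomposed form.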

Thai and Lao are also a bit "weird" in a way, with hyphenation
patterns actually trying to prevent "combining characters" from being
split from the rest.

And some hyphenation patterns (mostly for Indic languages) include
rules for non-breaking space etc.

> 2. In a layout engine, the most likely organization is to use Unicode UAX#14
> (may be with tailorings for the locales) to determine linebreak
> opportunities, and then may be to try to hyphenate the pieces between two
> linebreak opportunities. Those fragments can contain pretty much arbitrary
> characters. I suspect that the text between linebreak opportunities should
> be broken into subruns, corresponding to some notion of word. For example,
> with the string "foo<NBSP>…<NBSP>bar" (… is U+2026), it seems that
> hyphenating that whole string returns an hyphenation opportunity after the
> second <NBSP>. I suspect that "foo" and "bar" should be isolated and
> presented independently to the hyphenation engine. But what are the rules
> for that tokenization?

I hope that someone else will answer that question.
(I just wanted to say that TeX has issues with compound words and
situations like that. You probably shouldn't take TeX as your role
model.)

> 3. I suspect that different languages may want different
> normalization/tokenization?
>
> 4. all that suggests that there normalization/tokenization rules should be
> captured with the hyphenation patterns, preferably in a way that can be
> exploited by code.
>
> Are my assumptions correct? has all this already been discussed? resolved?
>
> Incidentally, I found
> <https://wiki.openoffice.org/wiki/Documentation/SL/Using_TeX_hyphenation_patterns_in_OpenOffice.org#4._Add_hyphenation_rules_for_special_characters>,
> which seems to deal with the same problem.

I want to leave answering those questions to someone else.

Mojca