[tex-hyphen] String preparation

Hans Hagen pragma at wxs.nl
Wed May 25 10:19:49 CEST 2016

On 5/25/2016 10:00 AM, Mojca Miklavec wrote:
> Dear Eric,
> On 20 May 2016 at 18:21, Muller, Eric wrote:
>> A few questions:
>> 1. the hyphenation patterns are meant to work on text that has been
>> "normalized" in some way;
> In the early days of TeX it was sufficient if it worked with 8-bit
> fonts and whatever special treatment the macro package (like Babel)
> provided to set the catcodes of characters.
>> I know that at least all uppercase letters should
>> be converted to lowercase.
> True.
>> Looking at the French patterns, I see that they
>> account for apostrophe by U+0027 but not for U+2019, so I suppose that
>> U+2019 should be folded to U+0027.
> This needs a bit of explanation and perhaps a bit of further discussion.
> If you take the patterns from
> https://github.com/hyphenation/tex-hyphen/blob/master/hyph-utf8/tex/generic/hyph-utf8/patterns/txt/hyph-fr.pat.txt
> you'll see that both are present.
> If you were looking at hyph-fr.tex then in fact we only have U+0027
> there, but that's because 8-bit TeX automatically "converts" U+0027
> into U+2019 (or rather has the glyph U+2019 on slot 0x27).
> In any case I believe that you should take the patterns from the plain
> text file, not from "*.tex".
> However a bit of further discussion might be in place. I believe that
> we should start supporting equivalence classes at some point. At least
> for my mother tongue many characters are absolutely equivalent (for
> example o = ó = ò = ô; they don't even change the meaning of the
> word). And hyphenation patterns for quite some languages like Turkish
> just define equivalence classes and then write the same pattern
> repeated for all pairs of characters. It would be a lot "saner" if
> patterns would define equivalence classes (including lowercase and
> uppercase letters being in the same class; or apostrophes) and then
> the engine should support proper interpretation of that.

it would be much slower to consult classes instead of characters, so in 
the end an engine would create hashes (which would internally consume 
memory as well)

so, the approach of expanding patterns once they are made using these 
alternative characters (as done now by some people as you mention) makes 
a lot of sense
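
A minimal sketch of such a one-time expansion, in Python rather than in any engine. The class table and the sample patterns are made up for illustration and are not taken from a real pattern file; `expand_pattern` and `expand_patterns` are hypothetical helper names.

```python
from itertools import product

# Hypothetical equivalence classes: each key is the character used when
# the patterns were made, each value lists its interchangeable forms.
EQUIVALENTS = {
    "o": ["o", "\u00f3", "\u00f2", "\u00f4"],  # o, o-acute, o-grave, o-circumflex
    "'": ["'", "\u2019"],                      # ASCII and typographic apostrophe
}

def expand_pattern(pattern, equivalents=EQUIVALENTS):
    """Return every variant of one pattern, substituting class members."""
    choices = [equivalents.get(ch, [ch]) for ch in pattern]
    return ["".join(combo) for combo in product(*choices)]

def expand_patterns(patterns, equivalents=EQUIVALENTS):
    """Expand a list of patterns once, up front, dropping duplicates."""
    seen = {}
    for pattern in patterns:
        for variant in expand_pattern(pattern, equivalents):
            seen.setdefault(variant, None)  # dict keeps first-seen order
    return list(seen)
```

The expanded list is what gets loaded, so the lookup at hyphenation time stays a plain character match; the cost is a larger pattern set, which grows multiplicatively when one pattern contains several class members.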

also, adding more and more trickery for the sake of a few languages 
does not look worth it (after all, hyphenation is only valid for a 
subset of languages and the quality is quite acceptable too)

> With Ethiopic the only rule is "feel free to hyphenate anywhere"
> (except just before commas etc.). So we made hyphenation patterns
> saying just:
>     for each letter <l> from the alphabet, add:
>         1<l>1
> Which could in fact be just a single pattern if we had support for
> equivalence classes.

or kick in some specialized hyphenator (in lua) which makes more sense 
than adding a disc node at every character
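
A rough sketch of what such a specialized hyphenator could compute, written in Python for illustration only (it is not LuaTeX callback code). It assumes that "a letter" means any character in a Unicode L* category and that a break is allowed only between two letters, which also keeps a break from landing just before punctuation; `break_positions` is a hypothetical name.

```python
import unicodedata

def is_letter(ch):
    """True for characters in a Unicode letter category (Lu, Ll, Lo, ...)."""
    return unicodedata.category(ch).startswith("L")

def break_positions(word):
    """Return each index i where a break may go between word[i-1] and word[i].

    No patterns are consulted: every boundary between two letters qualifies.
    """
    return [
        i
        for i in range(1, len(word))
        if is_letter(word[i - 1]) and is_letter(word[i])
    ]
```

For a three-syllable Ethiopic word followed by the Ethiopic comma (U+1363) this yields the two positions between the syllables and none before the comma.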

>> It also seems that something should be
>> done to fold combining sequences to precomposed characters. I could not find
>> any documentation of what the normalization should be?
> The old TeX did not support combining characters in any way. XeTeX
> does some "black magic" in the background (I believe it does some
> Unicode normalization, but I don't know the details). I'm not sure
> what (if anything) LuaTeX does.

nothing, as the principle is: "what goes in travels through" .. one can 
kick in a preprocessor (or file read callback) and all depends on what 
one wants to achieve (in verbatim explaining these matters one might not 
want to combine)
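
As an illustration of such a preprocessor, here is a small Python sketch (not a LuaTeX callback) that folds combining sequences to precomposed characters with Unicode NFC and can be switched off for the verbatim case. The function name `preprocess` and the `compose` flag are assumptions made for this example.

```python
import unicodedata

def preprocess(line, compose=True):
    """Optionally fold combining sequences to precomposed characters (NFC).

    With compose=False the line travels through untouched.
    """
    return unicodedata.normalize("NFC", line) if compose else line
```

NFC only helps where a precomposed character exists: "e" plus combining acute becomes U+00E9, but "œ" plus combining acute has no precomposed form and stays two code points.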

> We currently have "œ + combining acute" in one of the patterns and
> that one is also "a bit problematic" because it should be treated as a
> single glyph (and probably isn't). So we also added "do not hyphenate
> before combining acute" which is a bit of a strange rule.

can be made a virtual character in luatex

> Thai and Lao are also a bit "weird" in a way, with hyphenation
> patterns actually trying to prevent "combining characters" to be split
> from the rest.
> And some hyphenation patterns (mostly for Indic languages) include
> rules for non-breaking space etc.
>> 2. In a layout engine, the most likely organization is to use Unicode UAX#14
>> (maybe with tailorings for the locales) to determine linebreak
>> opportunities, and then maybe to try to hyphenate the pieces between two
>> linebreak opportunities. Those fragments can contain pretty much arbitrary
>> characters. I suspect that the text between linebreak opportunities should
>> be broken into subruns, corresponding to some notion of word. For example,
>> with the string "foo<NBSP>…<NBSP>bar" (… is U+2026), it seems that
>> hyphenating that whole string returns a hyphenation opportunity after the
>> second <NBSP>. I suspect that "foo" and "bar" should be isolated and
>> presented independently to the hyphenation engine. But what are the rules
>> for that tokenization?
> I hope that someone else will answer that question.
> (I just wanted to say that TeX has issues with compound words and
> situations like that. You probably shouldn't take TeX as your role
> model.)

there are too many variants and solutions possible but with lua juggling 
one can do a lot
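
One of the many possible variants, sketched in Python for illustration: split the fragment into maximal runs of letters (plus combining marks that follow a letter) and hand only those runs to the hyphenator. This rule is an assumption made for the example, not an agreed tokenization, and it would be wrong for patterns that deliberately include non-letter characters; `letter_runs` is a hypothetical name.

```python
import unicodedata

def letter_runs(fragment):
    """Return (start, text) for each maximal run of letters in fragment.

    Combining marks (category M*) are kept when they follow a letter;
    everything else (spaces, NBSP, punctuation, digits) ends the run.
    """
    runs = []
    start = None
    for i, ch in enumerate(fragment):
        cat = unicodedata.category(ch)
        in_word = cat.startswith("L") or (cat.startswith("M") and start is not None)
        if in_word:
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, fragment[start:i]))
            start = None
    if start is not None:
        runs.append((start, fragment[start:]))
    return runs
```

With the string "foo<NBSP>…<NBSP>bar" this isolates "foo" at offset 0 and "bar" at offset 6, so no hyphenation opportunity can appear next to the NBSP or the ellipsis.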

(in ConTeXt, for instance, we have some and more will follow)

>> 3. I suspect that different languages may want different
>> normalization/tokenization?
>> 4. all that suggests that these normalization/tokenization rules should be
>> captured with the hyphenation patterns, preferably in a way that can be
>> exploited by code.
>> Are my assumptions correct? has all this already been discussed? resolved?
>> Incidentally, I found
>> <https://wiki.openoffice.org/wiki/Documentation/SL/Using_TeX_hyphenation_patterns_in_OpenOffice.org#4._Add_hyphenation_rules_for_special_characters>,
>> which seems to deal with the same problem.
> I want to leave answering those questions to someone else.


                                           Hans Hagen | PRAGMA ADE
               Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
       tel: 038 477 53 69 | www.pragma-ade.com | www.pragma-pod.nl
