[tex-hyphen] request about equivalent characters

Mojca Miklavec mojca.miklavec.lists at gmail.com
Wed Mar 14 19:38:29 CET 2012


Dear Pander,

The issue below doesn't have much influence on the input for the new
patgen, but it has a lot of influence on the interpretation of data
(= hyphenation).

In Slovenian many letters are considered equivalent.
    ta-bla is equal to tá-bla

The general rule is that a letter carrying any accent is treated as
equivalent to the letter itself (with a few exceptions; č is different
from c, for example). Generating all possible variants in the input is
not possible. The user should be able to specify those groups. In most
cases this only makes a difference for hyphenating words, not so much
for generating patterns, but if the input consists of a rule that "a"
is equivalent to "e" and then a long list of words composed of a
mixture of both letters, this should be taken care of automatically
(even if only by a preprocessor).
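
A minimal sketch of such a preprocessor, in Python. Everything here is
an assumption made for illustration: the EQUIVALENTS table (which is far
from a complete Slovenian list), the function names, and the idea of
passing break positions as a list of indices. It only shows that a
user-specified table of groups is enough to fold variants onto one
canonical letter and to carry the breaks back to the original spelling.

```python
import unicodedata

# Hypothetical, user-specified equivalence groups: canonical letter -> variants.
# "č" is deliberately absent, since it is NOT equivalent to "c".
EQUIVALENTS = {
    "a": "áàä",
    "e": "éèê",
    "'": "\u2019",  # typographic apostrophe folded onto U+0027
}


def build_table(groups):
    """Turn the groups into a translation table usable by str.translate."""
    table = {}
    for canonical, variants in groups.items():
        for ch in variants:
            table[ord(ch)] = canonical
    return table


TABLE = build_table(EQUIVALENTS)


def canonicalize(word):
    """Fold every letter onto the canonical member of its group."""
    # NFC first, so that "a" + combining acute becomes the single letter "á".
    return unicodedata.normalize("NFC", word).translate(TABLE)


def hyphenate_like(word, canonical_breaks):
    """Apply break positions found for the canonical form to the original word."""
    nfc = unicodedata.normalize("NFC", word)
    pieces, prev = [], 0
    for pos in canonical_breaks:
        pieces.append(nfc[prev:pos])
        prev = pos
    pieces.append(nfc[prev:])
    return "-".join(pieces)
```

With this table, canonicalize("tábla") gives "tabla", so the breaks
found for "tabla" can be reused for "tábla" through hyphenate_like,
while "č" passes through unchanged.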

Many languages have auto-generated patterns, and those patterns look like
    <consonant><vowel>1
where the two placeholders loop through all the existing consonants and
vowels in that language, meaning that all vowels are exactly
equivalent. In a proper implementation that could be replaced by a
single pattern "ab1".
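
To illustrate the difference in size, here is a rough Python sketch. The
letter sets are small illustrative subsets, and the class labels "C" and
"V" as well as the function names are hypothetical; they play the role
of the two letters in the single pattern mentioned above. This is not
how patgen or TeX store patterns, only a toy comparison.

```python
from itertools import product

CONSONANTS = "bcdfg"  # illustrative subset, not a real alphabet
VOWELS = "aeiou"

# Expanded form: one pattern per (consonant, vowel) pair.
expanded = {c + v + "1" for c, v in product(CONSONANTS, VOWELS)}

# Class-based form: every letter is folded onto a class label,
# and a single pattern "CV1" covers all pairs.
CLASS_OF = {**{c: "C" for c in CONSONANTS}, **{v: "V" for v in VOWELS}}


def fold(word):
    """Replace each known letter by its class label."""
    return "".join(CLASS_OF.get(ch, ch) for ch in word)


def break_points(word, pattern_letters="CV"):
    """Positions after each match of the class pattern, excluding the word end."""
    folded = fold(word)
    n = len(pattern_letters)
    return [
        i + n
        for i in range(len(folded) - n + 1)
        if folded[i:i + n] == pattern_letters and i + n < len(word)
    ]
```

Here the expanded set already holds 25 patterns for 5 consonants and 5
vowels, while the class-based lookup needs one pattern plus the class
table.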

Another example is the apostrophe (U+0027 and U+2019).

May I request adding explicit requirements to the specification:
1.) The algorithm should be able to treat equivalent letters as
equivalent, without consuming any extra space (that is: if there are
1000 equivalent consonants and 1000 equivalent vowels, one should not
end up with a million patterns, but with a single pattern), in the same
way as lowercase and uppercase are treated equally.
2.) Please take special care of the Turkish dotless i, where uppercase
and lowercase pair up differently (meaning that the equivalent letters
are not simply i and I).
3.) It would help to devote a special chapter to composed characters
like "c + combining caron", or, as an example for Slovenian: "a + any
combining accent from some set" is equivalent to "a", which in turn is
equivalent to {á, à, ä, ...}. It might help if the hyphenation
algorithm is aware that "a + combining accent" is always a single
letter. In particular this applies to Greek. Again, it is almost
impossible to feed all combinations as input into patgen. I cannot
create all combinations of combined and non-combined extended Latin
characters when preparing input for patgen. But the hyphenation
algorithm should be able to deal with those.
3a.) In some cases like Lao it is probably easier to treat what we
would call "accents" (= vowels placed above a consonant; combining
vowels :) as separate letters.
4.) Make it possible for languages without spaces like Lao or ancient
Ethiopic to hyphenate words (I can provide more input/explanation
about that, but this is probably unrelated to everything else that you
plan to do and it might not even fall into the category of
hyphenation).
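
Points 2 and 3 can be sketched in Python as follows. The function names
are hypothetical and this is only an illustration under stated
assumptions, not a proposal for the actual implementation. In
particular, clusters relies on the canonical combining class reported by
unicodedata.combining, which is zero for some nonspacing marks, so it
does not cover every script.

```python
import unicodedata


def clusters(text):
    """Split text into base letters, each with its trailing combining marks."""
    out = []
    for ch in text:
        # A nonzero combining class means the character attaches to the
        # preceding base letter, so it is not a letter of its own.
        if out and unicodedata.combining(ch):
            out[-1] += ch
        else:
            out.append(ch)
    return out


def lower_turkish(text):
    """Lowercase with the Turkish pairs I/ı (U+0131) and İ (U+0130)/i."""
    # str.lower() is locale-independent and maps "I" to "i", so the two
    # Turkish pairs are handled explicitly before the generic lowering.
    return text.replace("I", "ı").replace("İ", "i").lower()
```

For example, clusters("c\u030cas") yields three units, the first being
"c" plus the combining caron, and lower_turkish("ISIK") yields "ısık",
whereas the plain "ISIK".lower() yields "isik".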

Mojca


