[tex-hyphen] Reviving discussion about Serbian hyphenation patterns

Mon Jul 26 20:01:45 CEST 2010

2010/7/25 Nikola Lečić wrote:
> Hello Mojca, Arthur, François,
>
> First of all, many thanks to François who kindly forwarded to Mojca and
> Arthur my email that was initially written with polyglossia in mind. It
> seems like a nice opportunity to revive the discussion about Serbian
> hyphenation patterns...

I forwarded the mail to the mailing list (it probably belongs there)
and added Dejan to CC.

> I am aware of some of your efforts regarding this subject in the past,
> so I'm posting this off-lists and to the responsible persons only, to
> avoid duplication of existing contents. If you want, feel free to move
> it online.
>
> (1) Cyrillic & Latin patterns in parallel.
>
>    The mail you received from François tells the story. I'd just add
>    that some difficulties might occur here; at least in Xe(La)TeX
>    world, it's still better to keep Serbian Cyrillic and Latin texts
>    separate as if they were two different languages. Loading two pattern
>    sets at once works as described, but with some fonts, having Latin
>    characters among Cyrillic ones can produce bad effects on their
>    shapes (unwanted alternates etc).

This is a separate topic from the rest. I'm waiting for François to answer ...

> (2) Non-ekavian dialects of Serbian (oversimplification: Serbian spoken
>    in Bosnia, Montenegro and by Serbs that live in Croatia).
>
>    I encounter these variants pretty often in the texts. If they are
>    written in Latin script, I sometimes use Croatian patterns and that
>    practice gives very good results. However, they are not applicable
>    in Cyrillic script.
>
>    Are you aware of any existing effort to create these?

I'm not aware of any. I don't have enough expertise (to me even
Croatian and Serbian look the same ...).
To make it clear: the patterns that we declared for Serbian are
actually "Serbo-Croatian" in origin. We had long (almost endless)
discussions in 2008 before we agreed on what to call them
(Serbian or Serbo-Croatian).

If you want my opinion: by far the best option would be for all groups
(Serbs, Croats, Bosnians, ...) to cooperate and separate the common
rules from the differences in the former Serbo-Croatian language.
Something similar (though not quite the same) to what the Germans did
(they set up a big project, collecting German words and then
differentiating Swiss German from proper German; but the source was
exactly the same). It's quite possible that one could come up with
common hyphenation patterns + some extra patterns to cover the slightly
different rules between the countries/regions/languages.

The Latin-Cyrillic conversion is almost trivial (there are some
details with dž and dz as far as I remember, but that's still trivial)
and should work for either Serbian or Croatian or any other "dialect"
:) :) of Serbo-Croatian.
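
To illustrate why the conversion is close to trivial, here is a minimal
Python sketch that transliterates lowercase Latin patterns into
Cyrillic. The mapping table and the function name latin_to_cyrillic are
assumptions made for illustration; nothing here is taken from the
existing pattern files. The only subtlety it handles is that the
digraphs lj, nj and dž map to single Cyrillic letters, so they are
matched before single letters. Digits and dots in patterns pass through
unchanged.

```python
# Sketch only: Serbian Latin -> Cyrillic for lowercase hyphenation patterns.
# Digraphs are tried first so that "lj", "nj", "dž" become one letter each.
DIGRAPHS = {"lj": "љ", "nj": "њ", "dž": "џ"}
SINGLES = {
    "a": "а", "b": "б", "v": "в", "g": "г", "d": "д", "đ": "ђ", "e": "е",
    "ž": "ж", "z": "з", "i": "и", "j": "ј", "k": "к", "l": "л", "m": "м",
    "n": "н", "o": "о", "p": "п", "r": "р", "s": "с", "t": "т", "ć": "ћ",
    "u": "у", "f": "ф", "h": "х", "c": "ц", "č": "ч", "š": "ш",
}


def latin_to_cyrillic(text):
    """Transliterate lowercase Latin text; unknown characters are kept."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in DIGRAPHS:
            out.append(DIGRAPHS[pair])
            i += 2
        else:
            # Single letter if known, otherwise copy (digits, dots, ...).
            out.append(SINGLES.get(text[i], text[i]))
            i += 1
    return "".join(out)
```

A pattern such as n1j, which has a digit between the two letters, is
deliberately left as two letters (н1ј) by this sketch; whether that is
right for every pattern would still need checking by hand.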

At least I would probably be able to convert Croatian patterns into
Cyrillic, but I'm not able to come up with a list of patterns ...

> (3) Making new Serbian hyphenation patterns.
>
>    François told me about Mojca's plans regarding this.

Not really ... I had no such plans, but if you are willing to work on it ...

> To be clear,
>    the existing patterns are very good. They have some obviously
>    systematic errors, mainly around consonants "s" and "š" ("с" and
>    "ш"). [Or, for example, in the last book I typeset I had to
>    introduce ~180 manual hyphenations on ~300 pages text (180 with
>    inflections), which means ~1 unique word on 3 pages.]
>
>    The package serhyplist announced by Zoran Filipović last week on
>    the CTAN can be helpful.

The list might be extremely helpful for spotting errors, but just
attaching the list without carefully examining what could be improved
in the patterns wouldn't make as much sense as a complete revision.

>    I am, however, interested to hear your opinion about the following
>    idea. For Serbian language, I am aware of only one professionally
>    written software that produces perfect Serbian hyphenation, the
>    programme RAS:
>
>      http://www.rasprog.com/ (Serbian only)
>
>    It is written by Milorad Simić and the members of his team at the
>    Institute for Serbian Language (Serbian Academy of Science and
>    Arts). This closed software, however, is distributed (or sold) to
>    the end-users only in the form of MS Word plugin. The RAS software
>    is accepted in domestic academic institutions as de facto standard
>    for editing Serbian text.
>
>    I know that their dictionary/hyphenation base contains actually all
>    words that exist in Serbian. We can ask them to generate hyphenation
>    patterns without "opening source" (and I gather from some recent
>    discussions on TL mailing list that O.U.P. did something like this
>    for British English -- correct?).

Not even Knuth has released the word list that he was using for
generating his original patterns and most languages do not publish
those lists. If you are able to get that list, that would be perfect.

> I am pretty sure that every
>    parallel effort would be an unnecessary duplication of their
>    more-than-a-decade long work. If you confirm that this approach is
>    technically possible, I would be very happy to meet Mr. Simić in
>    person and discuss this.

Why wouldn't it be? Once you have a complete list, creating the
patterns out of it is almost trivial (not entirely trivial, since you
need to figure out how to deal with patgen, but it's definitely doable
with much much much less effort than what's needed to assemble
properly hyphenated word lists). Once you get patterns out of the
list, the list may remain secret (though it makes sense that at least
someone has access to it in case something needs to be fixed or
improved).

If you get access to that list, please do test the whole set of words
with the old patterns just to compare the differences (and maybe write
an article about that, or at least store those differences if there
are not too many).
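
Such a comparison could be sketched roughly as follows in Python. This
is a small re-implementation of Liang-style pattern matching, not the
code TeX itself uses. The function names, the input format (one
hyphenated word such as "ba-na-na" per entry) and the minimum of two
letters on each side of a break are all assumptions made for
illustration.

```python
import re


def parse_pattern(pat):
    """Split e.g. 'a1b2c' into ('abc', [0, 1, 2, 0])."""
    letters = re.sub(r"\d", "", pat)
    values = [0] * (len(letters) + 1)
    pos = 0
    for ch in pat:
        if ch.isdigit():
            values[pos] = int(ch)   # value sits before letters[pos]
        else:
            pos += 1
    return letters, values


def build_table(patterns):
    """Map the letters of each pattern to its inter-letter values."""
    return dict(parse_pattern(p) for p in patterns)


def hyphenate(word, table, left=2, right=2):
    """Return word with '-' at every break the patterns allow."""
    padded = "." + word + "."
    scores = [0] * (len(padded) + 1)   # scores[i]: value before padded[i]
    for start in range(len(padded)):
        for end in range(start + 1, len(padded) + 1):
            values = table.get(padded[start:end])
            if values:
                for k, v in enumerate(values):
                    scores[start + k] = max(scores[start + k], v)
    parts, last = [], 0
    for j in range(left, len(word) - right + 1):
        if scores[j + 1] % 2 == 1:     # odd value allows a break
            parts.append(word[last:j])
            last = j
    parts.append(word[last:])
    return "-".join(parts)


def compare(reference, table):
    """List (expected, got) pairs where the patterns disagree."""
    diffs = []
    for expected in reference:
        got = hyphenate(expected.replace("-", ""), table)
        if got != expected:
            diffs.append((expected, got))
    return diffs
```

With the real pattern file loaded into build_table and the reference
list fed to compare, the returned pairs are exactly the differences
worth storing or writing up.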

--------

Summary: Nikola, I would be very grateful if you (or somebody else,
like Dejan) would be ready to coordinate any work done on Serbian
hyphenation patterns, which includes trying to combine Dejan's
original patterns and Zoran's patches, or maybe even trying to
convince commercial providers to give you the complete list of
hyphenated words.

> (4) Diacritical marks.
>
>    The following diacritical marks are usually used in Serbian:
>
>      U+0300 short ascending accent
>      U+0301 long ascending accent
>      U+0302 long descending accent
>      U+030F short descending accent
>      U+0306 as usual, short vowel
>      U+0304 as usual, long vowel
>
>    U+0302 and U+0304 are sometimes used as "genitive sign". "R" can be
>    treated as a vowel sometimes, etc.
>
>    (Not many Unicode fonts support these, especially with proper
>    positioning; Alexey Kryukov's Old Standard does; cf. Old Standard
>    Manual for examples in Serbian.)
>
>    In Cyrillic variant, a few combinations are available precomposed,
>    in Latin script much more is available, but not all, as far as I
>    recall. Hyphenation patterns should ignore these signs; but maybe
>    there are some weird cases where different accentuation implies
>    different hyphenation.
>
>    Also, I don't know if implementing this would break 1-to-1
>    compatibility of babel-polyglossia versions.

This is a slightly more complicated issue. Let's make something clear
first. When speaking about pdfTeX: if some character is not present
in the T1/T2A encoding, there's no way to get the hyphenation "right".
If that character is present, it should be possible to create some
"equivalence classes" (I'm not sure whether this still holds).
Nobody has ever tried that, but I admit that it would be useful in
many languages, including mine. (A with acute is not a new letter, but
merely a stress mark used to differentiate the meaning of some words,
or used in poetry ...)
It might be that everyone considered it too low a priority to implement.

It might help to write down *all* the possible characters and
equivalence classes and then discuss further ... In the worst case you
could still generate all the possible combinations in hyphenation
patterns from the original ones (if size doesn't bother you).
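
A rough Python sketch of that worst-case approach: for every vowel in a
pattern, emit one variant per precomposed accented form. The vowel set
(Latin a, e, i, o, u only), the list of marks (the six listed in the
quoted mail) and the function names are assumptions for illustration. A
real run would also have to drop characters that the target encoding
does not contain.

```python
import itertools
import unicodedata

# Combining marks from the list above: grave, acute, circumflex,
# double grave, breve, macron.
MARKS = ["\u0300", "\u0301", "\u0302", "\u030F", "\u0306", "\u0304"]
VOWELS = "aeiou"   # assumption: Latin script only, "r" not treated as vowel


def accented_forms(ch):
    """Return ch plus every single-code-point accented form of it."""
    forms = [ch]
    if ch in VOWELS:
        for mark in MARKS:
            composed = unicodedata.normalize("NFC", ch + mark)
            if len(composed) == 1:   # keep precomposed characters only
                forms.append(composed)
    return forms


def expand_pattern(pattern):
    """All variants of a pattern with accented vowels substituted."""
    choices = [accented_forms(ch) for ch in pattern]
    return ["".join(combo) for combo in itertools.product(*choices)]
```

The size concern is real: each vowel multiplies the number of variants,
so a pattern with two vowels already expands to several dozen.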

In XeTeX it's different: if a Unicode code point for such a
combination exists, it will only work if the font also has that glyph.
If not, you need to fake the accent over some glyph and hyphenation
gets broken. If no Unicode code point exists at all, then you need to
resort to dirty tricks (you need to implement [letter]8[accent] rules
to express that a word may not be hyphenated between a letter and the
accent that follows it, but then you have to use OpenType accent
placement instead of TeX faking it etc.). I may be wrong on some
points, but I hope you see that it may get ugly.
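
For illustration only, the [letter]8[accent] rules mentioned above
could be generated mechanically, as in this short Python sketch. The
function name and the choice of letters and marks are hypothetical, and
it assumes the engine treats the combining marks as letters for
hyphenation purposes, which would have to be verified.

```python
def accent_guard_patterns(letters, marks):
    """Build patterns like 'a8<mark>' that forbid a break before the mark."""
    # 8 is an even value, so it inhibits a break at that position.
    return [f"{letter}8{mark}" for letter in letters for mark in marks]
```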

With LuaTeX you should get much much more freedom, but you need to
work for it (you need to implement it yourself or give Hans a good
reason to implement it). At least you have a chance that it will work
properly.

Mojca