[tex-hyphen] Reviving discussion about Serbian hyphenation patterns

Nikola Lečić nikola.lecic at anthesphoria.net
Mon Jul 26 22:22:59 CEST 2010

On Mon, 26 Jul 2010 20:01:45 +0200
Mojca Miklavec <mojca.miklavec.lists at gmail.com> wrote:
> To make it clear: the patterns that we declared for Serbian are
> actually "Serbo-Croatian" in origin. We had long (almost endless)
> discussions in 2008 before we made an agreement how to call them
> (Serbian or Serbo-Croatian).
> [...]
> (for me ever Croatian and Serbian are both equal ...).
> [...]

I sincerely hope that no similar discussion will emerge following these
lines :-) I am sure that nobody would like to read again and again all
those arguments and "arguments"...

(Btw, I tend to agree with you. Creating a unique ex-Yu pattern pool
would be a very, very useful thing, whatever its name may be.)

> The Latin-Cyrillic conversion is almost trivial (there are some
> details with dž and dz as far as I remember, but that's still trivial)
> and should work for either Serbian or Croatian or any other "dialect"
> :) :) of Serbo-Croatian.

(Only Cyrillic -> Latin is strictly 1-1. The opposite is not: "nj" can
be "нј" or "њ", etc.)
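
A minimal sketch of that asymmetry, in Python. The table covers only the
30 lowercase Serbian Cyrillic letters, and the function name cyr_to_lat is
made up for this illustration:

```python
# Lowercase Serbian Cyrillic -> Latin; three letters map to digraphs.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ",
    "е": "e", "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k",
    "л": "l", "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "ć", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "č", "џ": "dž", "ш": "š",
}


def cyr_to_lat(text):
    """Transliterate letter by letter; unknown characters pass through."""
    return "".join(CYR_TO_LAT.get(ch, ch) for ch in text)


# Both words end up containing "nj", from "њ" and from "нј" respectively,
# so the Latin form alone does not say which Cyrillic spelling to restore.
print(cyr_to_lat("коњ"))        # konj
print(cyr_to_lat("инјекција"))  # injekcija
```

Going the other way would need a word list or some other hint to choose
between "њ" and "нј" (and likewise for "lj" and "dž").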

> > (3) Making new Serbian hyphenation patterns.
> Why wouldn't it be? Once you have a complete list, creating the
> patterns out of it is almost-trivial (not entirely trivial since you
> need to figure out how to deal with patgen, but it's definitely doable
> with much much much less effort than what's needed to assemble
> properly hyphenated word lists). Once you get patterns out of the
> list, the list may remain in secret (though it makes sense at least
> someone has access to it in case that something needs to be fixed or
> improved).
> In case that you get access to that list, please do test the whole set
> of words with the old patterns just to compare the differences (and
> maybe write an article about that or at least store those differences
> if there are not too many).
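
A rough sketch of the comparison suggested above, in Python rather than
TeX. It assumes a toy Liang-style pattern matcher; the pattern "a1b", the
word list, and all function names are invented for illustration and are
not taken from the existing Serbian patterns or from any real list:

```python
def parse_pattern(pattern):
    """Split a pattern like 'a1b' into its letters and inter-letter values."""
    letters, values = "", [0]
    for ch in pattern:
        if ch.isdigit():
            values[-1] = int(ch)
        else:
            letters += ch
            values.append(0)
    return letters, values


def build_table(patterns):
    return dict(parse_pattern(p) for p in patterns)


def hyphenate(word, table, left_min=2, right_min=2):
    """Apply patterns Liang-style: take the max value per position, break on odd."""
    padded = "." + word.lower() + "."
    scores = [0] * (len(padded) + 1)
    for start in range(len(padded)):
        for end in range(start + 1, len(padded) + 1):
            values = table.get(padded[start:end])
            if values:
                for k, value in enumerate(values):
                    scores[start + k] = max(scores[start + k], value)
    parts, last = [], 0
    for pos in range(left_min, len(word) - right_min + 1):
        # scores[pos + 1] is the value just before word[pos].
        if scores[pos + 1] % 2 == 1:
            parts.append(word[last:pos])
            last = pos
    parts.append(word[last:])
    return "-".join(parts)


def compare(expected_words, patterns):
    """Return (expected, got) pairs where the patterns disagree with the list."""
    table = build_table(patterns)
    diffs = []
    for expected in expected_words:
        got = hyphenate(expected.replace("-", ""), table)
        if got != expected:
            diffs.append((expected, got))
    return diffs


print(compare(["ba-ba", "ab-ba"], ["a1b"]))
```

The returned list of differences is the part that could be stored or
written up, without the full word list ever being published.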

Let me clarify this a bit: RAS authors certainly won't publish their
list. The question is if the following is possible:

(1) the RAS project has (I assume) the ultimate hyphenation database for
    the Serbian language;
(2) they export it to a text file and
(3) use it to create new hyphenation patterns following your
(4) we get brand new patterns (hopefully without licencing restrictions)
    without making their database "open" in any way.

> Summary: Nikola, I would be very grateful if you (or somebody else,
> like Dejan) would be ready to cooperate any work done on Serbian
> hyphenation patterns, which includes trying to combine Dejan's
> original patterns, Zoran's patches or maybe even trying to convince
> commercial providers to give you the complete list of hyphenated
> words.

Of course, that's why we are here. :-)

> > (4) Diacritical marks.
> > [...]
> This is a slightly more complicated issue. Let's make something clear
> first. When speaking about pdfTeX - if some character is not present
> in T1/T2A encoding, there's no way to get the hyphenation "right". If
> that character is present, it should be possible to create some
> "equivalence classes" (I'm not sure if this still holds or not).
> Nobody has ever tried that, but I admit that it would be useful in
> many languages, including mine.

Yes, I know; it might be the case that you use the same diacritics,

(Ancient Greek also comes to mind: would it be difficult to make the existing
hyphenation patterns work with combining diacritical marks? I don't
think that the situation with Serbian is much different: vowels have a large
number of precomposed combinations with diacritics. However, you can
legally compose them with combining diacritical marks. Additionally,
sometimes some "unexpected" combination may occur, such as epsilon with

> (A with acute is not a new letter, but merely an additional stress to
> differentiate meaning of some words or used in poetry ...)
> It might be that everyone considered it of too low priority to
> implement it.

Poetry, yes, but not just poetry. Academic philology and dictionaries
should be flooded with diacritical marks.

> It might help to write *all* the possible characters and equivalent
> classes and then discuss further ... In worst case you could still
> generate all the possible combinations in hyphenation patterns from
> original ones (if size doesn't bother you).
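
A small sketch of that worst-case approach, in Python. The variant table
is a made-up two-vowel example and not a complete list of Serbian accented
letters; expand_pattern is a hypothetical helper name:

```python
from itertools import product

# Hypothetical equivalence classes: each base vowel with two accented forms.
VARIANTS = {
    "a": ["a", "á", "à"],
    "e": ["e", "é", "è"],
}


def expand_pattern(pattern, variants=VARIANTS):
    """Return every spelling of a pattern with base letters swapped for variants."""
    choices = [variants.get(ch, [ch]) for ch in pattern]
    return ["".join(combo) for combo in product(*choices)]


print(expand_pattern("1ba"))          # one vowel: 3 patterns
print(len(expand_pattern("a1be")))    # two vowels: 3 * 3 patterns
```

The count grows as the product of the class sizes for each letter in the
pattern, which is where the size concern comes from.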

Thank you for your very informative reply. If I understand correctly,
what you say is that it's not possible to create all-TeX rules that
say the following:

(a) ignore all combining diacritics;
(b) respect all document-level instructions that "these {x,y,z}
    characters should be treated as if they were, say, 'a'".
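
To make (a) and (b) concrete, here is a sketch of the same two rules as a
preprocessing step in Python, outside TeX. It assumes that only the five
accentuation marks listed are to be ignored; the caron is deliberately
kept, since stripping it would turn č, š, ž into c, s, z. The names
ACCENT_MARKS, strip_accents and apply_classes are invented here:

```python
import unicodedata

# Assumed set of accentuation marks: grave, acute, macron,
# double grave, inverted breve.
ACCENT_MARKS = {"\u0300", "\u0301", "\u0304", "\u030F", "\u0311"}


def strip_accents(word, marks=ACCENT_MARKS):
    """Rule (a): decompose, drop the listed combining marks, recompose."""
    decomposed = unicodedata.normalize("NFD", word)
    kept = "".join(ch for ch in decomposed if ch not in marks)
    return unicodedata.normalize("NFC", kept)


def apply_classes(word, classes):
    """Rule (b): replace each character by its class representative."""
    return "".join(classes.get(ch, ch) for ch in word)


print(strip_accents("čá"))                           # ča
print(apply_classes("xyz", {"x": "a", "y": "a"}))    # aaz
```

Hyphenation points found on the stripped word would then have to be
mapped back onto the original spelling.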

> In XeTeX it's different: if unicode point for such a combination
> exists, it will only work if also the font has that glyph. If not, you
> need to fake the accent over some glyph and hyphenation gets broken.
> [...]

Very interesting, I didn't know this.
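
For what it's worth, whether a single code point exists for a given
combination can be checked with Unicode normalization. A quick sketch in
Python (has_precomposed is just a name made up here):

```python
import unicodedata


def has_precomposed(base, mark):
    """True if NFC composes base + combining mark into one code point."""
    return len(unicodedata.normalize("NFC", base + mark)) == 1


# Latin a + combining double grave composes to U+0201.
print(has_precomposed("a", "\u030F"))
# Cyrillic а + combining acute stays as two code points.
print(has_precomposed("\u0430", "\u0301"))
```

This says nothing about whether a particular font carries the glyph; it
only shows which combinations have a code point in the first place.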

Best regards,
-- 
Nikola Lečić = Никола Лечић
fingerprint : FEF3 66AF C90E EDC3 D878  7CDC 956D F4AB A377 1C9B