[tex-hyphen] Reviving discussion about Serbian hyphenation patterns

Mon Jul 26 22:22:59 CEST 2010

-----BEGIN PGP SIGNED MESSAGE-----
Hash: RIPEMD160

On Mon, 26 Jul 2010 20:01:45 +0200
Mojca Miklavec <mojca.miklavec.lists at gmail.com> wrote:

> To make it clear: the patterns that we declared for Serbian are
> actually "Serbo-Croatian" in origin. We had long (almost endless)
> discussions in 2008 before we made an agreement how to call them
> (Serbian or Serbo-Croatian).
> [...]
> (for me even Croatian and Serbian are both equal ...).
> [...]

I sincerely hope that no similar discussion will emerge following these
lines :-) I am sure that nobody would like to read again and again all
those arguments and "arguments"...

(Btw, I tend to agree with you. Creating a single ex-Yu pattern pool
would be a very, very useful thing, whatever its name.)

> The Latin-Cyrillic conversion is almost trivial (there are some
> details with dž and dz as far as I remember, but that's still trivial)
> and should work for either Serbian or Croatian or any other "dialect"
> :) :) of Serbo-Croatian.

(Only Cyrillic -> Latin is strictly 1-1. The opposite is not: "nj" can
be "нј" or "њ", etc.)
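
A minimal Python sketch of that asymmetry, assuming lowercase input
only. The table is the standard 30-letter Serbian alphabet; the names
CYR_TO_LAT, to_latin and cyrillic_candidates are made up for this
illustration:

```python
# Serbian Cyrillic -> Latin: each Cyrillic letter has exactly one Latin spelling.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k", "л": "l", "љ": "lj",
    "м": "m", "н": "n", "њ": "nj", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "ћ": "ć", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "č",
    "џ": "dž", "ш": "š",
}


def to_latin(word):
    """Transliterate a lowercase Cyrillic word; unknown characters pass through."""
    return "".join(CYR_TO_LAT.get(ch, ch) for ch in word)


def cyrillic_candidates(latin):
    """Return every Cyrillic spelling that transliterates to `latin`."""
    if not latin:
        return [""]
    results = []
    for cyr, lat in CYR_TO_LAT.items():
        if latin.startswith(lat):
            tails = cyrillic_candidates(latin[len(lat):])
            results += [cyr + tail for tail in tails]
    return results


print(to_latin("коњ"), to_latin("конјугација"))
print(cyrillic_candidates("konj"))  # two candidates: with "нј" and with "њ"
```

Going back from Latin therefore needs a word list or some other way
to pick among the candidates; the digraphs "lj" and "dž" behave the
same way as "nj".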

> > (3) Making new Serbian hyphenation patterns.
[...]
> Why wouldn't it be? Once you have a complete list, creating the
> patterns out of it is almost-trivial (not entirely trivial since you
> need to figure out how to deal with patgen, but it's definitely doable
> with much much much less effort than what's needed to assemble
> properly hyphenated word lists). Once you get patterns out of the
> list, the list may remain secret (though it makes sense that at least
> someone has access to it in case something needs to be fixed or
> improved).
> 
> In case that you get access to that list, please do test the whole set
> of words with the old patterns just to compare the differences (and
> maybe write an article about that or at least store those differences
> if there are not too many).

Let me clarify this a bit: the RAS authors certainly won't publish their
list. The question is whether the following is possible:

(1) The RAS project has (as I assume) the ultimate hyphenation database
    for the Serbian language;
(2) they export it to a text file and
(3) use it to create new hyphenation patterns following your
    instructions;
(4) we get brand new patterns (hopefully without licensing restrictions)
    without making their database "open" in any way.
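
The comparison with the old patterns that Mojca suggests could be run
by whoever holds the list, so that only the differences leave their
hands. A rough Python sketch, assuming the export has one word per
entry with "-" at the break points. The hyphenator is a simplified
take on Liang's algorithm, and the patterns and words below are toy
data, not the real Serbian patterns:

```python
import re


def parse_pattern(pattern):
    """Split a TeX pattern such as 'a1b' into its letters and digit weights."""
    letters = re.sub(r"\d", "", pattern)
    weights = [0] * (len(letters) + 1)
    pos = 0
    for ch in pattern:
        if ch.isdigit():
            weights[pos] = int(ch)
        else:
            pos += 1
    return letters, weights


def hyphenate(word, patterns, left_min=2, right_min=2):
    """Apply Liang-style patterns; return the word with '-' at break points."""
    padded = "." + word.lower() + "."
    scores = [0] * (len(padded) + 1)  # scores[j] is the gap before padded[j]
    for pattern in patterns:
        letters, weights = parse_pattern(pattern)
        start = padded.find(letters)
        while start != -1:
            for offset, weight in enumerate(weights):
                scores[start + offset] = max(scores[start + offset], weight)
            start = padded.find(letters, start + 1)
    pieces = []
    for i, ch in enumerate(word):
        # word[i] is padded[i + 1]; an odd score allows a break before it
        if left_min <= i <= len(word) - right_min and scores[i + 1] % 2 == 1:
            pieces.append("-")
        pieces.append(ch)
    return "".join(pieces)


def compare(reference, patterns):
    """Return (word, expected, produced) for each entry the patterns miss."""
    diffs = []
    for expected in reference:
        word = expected.replace("-", "")
        produced = hyphenate(word, patterns)
        if produced != expected:
            diffs.append((word, expected, produced))
    return diffs


toy_patterns = ["a1", "e1", "i1", "o1", "u1"]       # "break after a vowel"
toy_reference = ["ko-nju-ga-ci-ja", "in-jek-ci-ja"]  # illustrative entries
print(compare(toy_reference, toy_patterns))
```

With the toy data only the second word is reported, because the
vowel-only patterns produce "inje-kci-ja".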

> Summary: Nikola, I would be very grateful if you (or somebody else,
> like Dejan) would be ready to cooperate on any work done on Serbian
> hyphenation patterns, which includes trying to combine Dejan's
> original patterns, Zoran's patches or maybe even trying to convince
> commercial providers to give you the complete list of hyphenated
> words.

Of course, that's why we are here. :-)

> > (4) Diacritical marks.
> > [...]
> This is a slightly more complicated issue. Let's make something clear
> first. When speaking about pdfTeX - if some character is not present
> in T1/T2A encoding, there's no way to get the hyphenation "right". If
> that character is present, it should be possible to create some
> "equivalence classes" (I'm not sure if this still holds or not).
> Nobody has ever tried that, but I admit that it would be useful in
> many languages, including mine.

Yes, I know; it might be the case that you use the same diacritics,
right?

(Ancient Greek also comes to mind: would it be difficult to make the
existing hyphenation patterns work with combining diacritical marks? I
don't think the situation with Serbian is much different: vowels have a
large number of precomposed combinations with diacritics. However, you
can legally compose them with combining diacritical marks. Additionally,
an "unexpected" combination may sometimes occur, such as epsilon with
circumflex.)
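
To illustrate the precomposed vs. combining distinction, here is a
small Python check using Unicode NFC normalization from the standard
unicodedata module. The sample strings are my own choice; the point is
only that some base+mark pairs have a precomposed code point and
others do not:

```python
import unicodedata

samples = {
    "Latin a + acute": "a\u0301",
    "Latin a + double grave": "a\u030f",
    "Cyrillic a + double grave": "\u0430\u030f",
    "Greek alpha + perispomeni": "\u03b1\u0342",
    "Greek epsilon + perispomeni": "\u03b5\u0342",
}

# NFC composes a base letter and a mark only if Unicode defines
# a precomposed character for that pair.
composed = {
    label: unicodedata.normalize("NFC", text)
    for label, text in samples.items()
}

for label, text in composed.items():
    kind = "precomposed" if len(text) == 1 else "stays base + combining mark"
    print(f"{label}: {len(text)} code point(s), {kind}")
```

The Cyrillic and epsilon cases stay as two code points after
normalization, so patterns written only for precomposed characters
would never see them as a single letter.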

> (A with acute is not a new letter, but merely an additional stress to
> differentiate meaning of some words or used in poetry ...)
> It might be that everyone considered it of too low priority to
> implement it.

Poetry, yes, but not just poetry. Academic philology and dictionaries
are flooded with diacritical marks.

> It might help to write *all* the possible characters and equivalent
> classes and then discuss further ... In worst case you could still
> generate all the possible combinations in hyphenation patterns from
> original ones (if size doesn't bother you).
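
The "worst case" approach could look roughly like this in Python. The
equivalence classes in EQUIV are a hypothetical, incomplete example
(precomposed characters only), and expand is a name made up for this
sketch:

```python
from itertools import product

# Hypothetical equivalence classes: base letter -> variants to generate.
EQUIV = {
    "a": ["a", "á", "à", "â"],
    "e": ["e", "é", "è", "ê"],
}


def expand(pattern, equiv=EQUIV):
    """Return every variant of `pattern` with letters swapped within a class."""
    choices = [equiv.get(ch, [ch]) for ch in pattern]
    return ["".join(combo) for combo in product(*choices)]


print(expand("1ba"))       # 4 variants
print(len(expand("a1e")))  # 4 * 4 = 16 variants
```

The count multiplies with every classed letter in a pattern, which is
where the concern about size comes from.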

Thank you for your very informative reply. If I understand correctly,
what you say is that it's not possible to write TeX-only rules that
say the following:

(a) ignore all combining diacritics;
(b) respect all document-level instructions that "these {x,y,z}
    characters should be treated as if they were, say, 'a'".
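
Outside TeX, rules (a) and (b) are easy to express as a preprocessing
step. A Python sketch, with base_form, keep and treat_as as made-up
names; the keep list is my assumption about which letters must not
lose their diacritic (note that "ć" is c + acute, so blindly dropping
every combining mark would be wrong):

```python
import unicodedata


def base_form(word, treat_as=None, keep="čćšžđČĆŠŽĐйЙ"):
    """Drop accent marks (rule a), then apply explicit substitutions (rule b)."""
    treat_as = treat_as or {}
    keep = unicodedata.normalize("NFC", keep)
    out = []
    for ch in unicodedata.normalize("NFC", word):
        if ch in keep:
            out.append(ch)  # a real letter of the alphabet, leave it alone
            continue
        # Decompose and keep only the non-combining characters.
        for part in unicodedata.normalize("NFD", ch):
            if not unicodedata.combining(part):
                out.append(part)
    return "".join(treat_as.get(ch, ch) for ch in out)


print(base_form("ra\u0301d"))                 # accent on the vowel is dropped
print(base_form("c\u0301"))                   # stays "ć"
print(base_form("xyz", treat_as={"x": "a"}))  # document-level override
```

Whether something equivalent can be done inside the TeX engines is
exactly the open question here.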

> In XeTeX it's different: if unicode point for such a combination
> exists, it will only work if also the font has that glyph. If not, you
> need to fake the accent over some glyph and hyphenation gets broken.
> [...]

Very interesting, I didn't know this.

Best regards,
- -- 
Nikola Lečić = Никола Лечић
fingerprint : FEF3 66AF C90E EDC3 D878  7CDC 956D F4AB A377 1C9B
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.11 (FreeBSD)

iJwEAQEDAAYFAkxN7qgACgkQ/MM/0rYIoZjVIgP+KmOBcnIGVgvTjchpjpQV3zbM
qnMbyTS/FsSy+37l5JdwtvUTUbiDR0hFi9pCHEdUCbRJC2HsvXvWWSIWt3zTMSnw
fJvzhd0iYYk4krkKdONDn+m9PEEMK/0liUyW2LNxkmqJO17n3c9kEJx/9HQgVdkG
fqUN4ZHz800cEzWfAvQ=
=p4jF
-----END PGP SIGNATURE-----