[tex-hyphen] Reviving discussion about Serbian hyphenation patterns

Arthur Reutenauer arthur.reutenauer at normalesup.org
Tue Jul 27 01:11:57 CEST 2010

> Let me clarify this a bit: RAS authors certainly won't publish their
> list. The question is if the following is possible:
> (1) RAS project has (as I assume) ultimate hyphenation database for
>     Serbian language;
> (2) they export it to a text file and
> (3) use it to create new hyphenation patterns following your
>     instructions;
> (4) we get brand new patterns (hopefully without licensing restrictions)
>     without making their database "open" in any way.

  Yes, it's possible technically and would work.  The RAS developers
would only need to use patgen on their word list; its use is not
completely straightforward, but once we get the patterns back it would
work without any problem.  The only issue is legal.
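
  To give an idea of the preparation step: patgen reads a dictionary file
with one word per line and the hyphenation points marked by hyphens.  A
minimal Python sketch of cleaning up an export could look like this.  The
function name and the sample words are made up for illustration (I do not
know what the RAS export looks like), and the translate file that patgen
also needs is not covered here.

```python
def to_patgen_lines(entries):
    """Return sorted, de-duplicated, lower-cased hyphenated words."""
    cleaned = set()
    for entry in entries:
        word = entry.strip().lower()
        # Skip blank lines and entries with a stray leading/trailing hyphen.
        if not word or word.startswith("-") or word.endswith("-"):
            continue
        cleaned.add(word)
    # Sorting is not required by patgen; it only makes the file reproducible.
    return sorted(cleaned)


# Hypothetical sample entries, not taken from any real database.
sample = ["Ma-te-ma-ti-ka", "ma-te-ma-ti-ka ", "", "re-ka"]
lines = to_patgen_lines(sample)
```

Writing `lines` to a UTF-8 or 8-bit file, one entry per line, would then
depend on the encoding chosen for the patterns.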

>> Nobody has ever tried that, but I admit that it would be useful in
>> many languages, including mine.
> Yes, I know; it might be the case that you use the same diacritics,
> right?

  Slovenian has even more diacritics because it has the dot below in
addition to all the other signs that are used for Serbian in the tonemic
system, and it has another set of accents for the intermediary,
non-tonemic system :-)

> (Ancient Greek also comes to mind: would it be difficult to make existing
> hyphenation patterns work with combining diacritical marks?)

  The issue of diacritics was never raised with Ancient Greek: all
patterns with diacritics are explicitly duplicated in the pattern file;
in addition, diacritics are not transparent to the hyphenation process
in the case of diphthongs, as we have ε2ί (which is an orthographical
diphthong), for example, but έ3ι (where the vowels belong to two
different syllables).
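
  To make the ε2ί / έ3ι difference concrete, here is a small Python
sketch of how Liang-style patterns are applied: digits are inter-letter
values, the maximum wins at each position, and odd values allow a break.
It is a simplified model, not the code any TeX engine uses; it ignores
\lefthyphenmin and \righthyphenmin, and the function names are mine.
The accented letters are written as escapes so that each is one
precomposed code point.

```python
EPSILON, EPSILON_TONOS = "\u03b5", "\u03ad"   # ε, έ
IOTA, IOTA_TONOS = "\u03b9", "\u03af"         # ι, ί


def parse_pattern(pattern):
    """Split 'ε2ί' into the letters 'εί' and the values [0, 2, 0]."""
    letters, values = [], [0]
    for ch in pattern:
        if ch.isdigit():
            values[-1] = int(ch)
        else:
            letters.append(ch)
            values.append(0)
    return "".join(letters), values


def hyphenation_points(word, patterns):
    """Return the indices m such that a break is allowed before word[m]."""
    w = "." + word + "."
    scores = [0] * (len(w) + 1)   # scores[p] is the value before w[p]
    for i in range(len(w)):
        for j in range(i + 1, len(w) + 1):
            values = patterns.get(w[i:j])
            if values:
                for k, v in enumerate(values):
                    scores[i + k] = max(scores[i + k], v)
    # word[m] is w[m + 1]; an odd value marks a permitted break.
    return [m for m in range(1, len(word)) if scores[m + 1] % 2 == 1]


patterns = dict(parse_pattern(p) for p in (
    EPSILON + "2" + IOTA_TONOS,      # ε2ί: diphthong, no break
    EPSILON_TONOS + "3" + IOTA,      # έ3ι: two syllables, break allowed
))
```

With these two patterns, the sequence εί gets no break point, while έι
gets one between the two vowels, which is why the accent cannot simply
be ignored.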

  Anyway, the final pattern file is actually automatically generated by
a Perl script, so this is where the problem is being solved, not at the
TeX level.

  Note that the current patterns for Ancient Greek don't contain any
combining characters, because we didn't know what to do with them in
8-bit TeX engines, but it would be easy to include them as well, by
modifying the aforementioned script.  There has been one request for it,
but I showed the user how to solve the problem at the XeTeX level, by
using Unicode normalization, so the issue became moot.
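
  For readers unfamiliar with normalization, this Python sketch shows
what NFC does to a base letter followed by combining marks: where a
precomposed character exists, the sequence is replaced by it, so patterns
written with precomposed characters apply again.  It only illustrates
the Unicode operation with the standard unicodedata module; it is not
the XeTeX setup itself.

```python
import unicodedata


def to_precomposed(text):
    """Compose base letters and combining marks where Unicode allows (NFC)."""
    return unicodedata.normalize("NFC", text)


# epsilon + combining acute accent -> epsilon with tonos (U+03AD)
decomposed_epsilon = "\u03b5\u0301"
composed_epsilon = to_precomposed(decomposed_epsilon)

# alpha + combining comma above (psili) + combining acute (oxia) -> U+1F04
decomposed_alpha = "\u03b1\u0313\u0301"
composed_alpha = to_precomposed(decomposed_alpha)
```

Sequences with no precomposed equivalent stay decomposed after NFC, so
this only helps for combinations that Unicode encodes as single
characters.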

> Thank you for your very informative reply. If I understand correctly,
> what you say is that it's not possible to create all-TeX rules that
> say the following:
> (a) ignore all combining diacritics;
> (b) respect all document-level instructions that "these {x,y,z}
>     characters should be treated as if they were, say, 'a'".

  I think it should be doable, using \lccode's properly (see
http://tug.org/pipermail/tex-hyphen/2010-May/000541.html and subsequent
messages).  It would need to be carefully specified, though.
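
  As a rough model of what rules (a) and (b) would have to specify, here
is a Python sketch that builds the string used for pattern lookup: it
drops combining marks, maps selected characters to a stand-in letter
(the role \lccode would play in TeX), and remembers where each kept
character came from so that break points can be mapped back.  This is
not TeX code, the names are hypothetical, and how an engine would
actually skip combining marks is exactly the part that needs specifying.

```python
import unicodedata


def lookup_key(word, treat_as):
    """Return (key, positions): the lookup string and source indices."""
    key, positions = [], []
    for index, ch in enumerate(word):
        # Rule (a): combining marks do not take part in pattern lookup.
        if unicodedata.combining(ch):
            continue
        # Rule (b): document-level mapping, falling back to lower case.
        key.append(treat_as.get(ch, ch.lower()))
        positions.append(index)
    return "".join(key), positions


# 'a' followed by a combining acute accent is looked up as plain 'a'.
key, positions = lookup_key("ra\u0301d", {})

# Hypothetical document-level instruction: treat x, y and z as 'a'.
mapped_key, _ = lookup_key("xyz", {"x": "a", "y": "a", "z": "a"})
```

A break found before position m of the key would then be placed before
character positions[m] of the original word.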

