[tex-hyphen] new Catalan hyphenation patterns

Jaume Ortolà i Font jaumeortola at gmail.com
Mon Feb 18 14:34:35 CET 2013


2013/2/18 Mojca Miklavec <mojca.miklavec.lists at gmail.com>

> 1.) Did you also test these patterns with TeX and what other tests did
> you perform on these patterns? I'm asking because I've spotted some
> potential problems in the file (with eyes, not by testing the
> patterns).
>

Unfortunately, no. I tested the patterns in LibreOffice and InDesign.



> 2.) Is the list of hyphenated words public? If so, may we include the
> list of words into our repository to simplify testing of the patterns
> now and in the future? Also: do you have any (possibly publicly
> available) grammatical excerpt explaining hyphenation rules for
> Catalan? (Gonçal also mentioned the use of a dictionary.) Any
> dictionary would be helpful.
>

I extracted the list of words from this online dictionary [1], which shows
the syllabic division of about a third of the words (just the words with
some difficulty). Gonçal mentioned a paper dictionary by the same
publishing house. There is an open source for syllabic division here [2].
There could be a few minor discrepancies between these sources, but not
many.

[1] http://www.diccionari.cat
[2] http://ca.oslin.org/syllables.php

I attach a full dictionary (73000 words) with syllabic division. You can
consider any of the previous links as the source plus some additions and
corrections by me. There are a lot of derived words not present in this
dictionary. For example, conjugated verbs generate hundreds of thousands
of new words with different suffixes. Fortunately, the exceptions to
hyphenation affect mostly prefixes.

There are several pages in Catalan that explain the hyphenation rules, but
none of them is complete enough. For example:
http://www.uoc.edu/serveilinguistic/criteris/ortografia/silabes.html


> 3.) There are some dashes (-) in your patterns. Can you please explain
> a bit its usage in your grammar? I'm asking because dashes are tricky
> in a way and if we want to treat dashes as letters, this might have
> some unexpected consequences, so it needs to be done right and with
> sufficient testing. Office might work differently in that respect (I'm
> not sure though), so the testing would also have to be done in TeX.
> Some languages like Russian include the dashes, but they include *a
> lot* of simple patterns with dashes. In other languages hyphenation of
> composed words can be done in a different way (at least in TeX, I'm
> not sure how office deals with the problem).
>

I remember the problem. In Catalan there are infinitives and gerunds
ending in -ir and -int that would need a diaeresis for the syllables to be
divided correctly. For example: con_tri_bu_ïr, con_tri_bu_ïnt. But,
exceptionally, according to the orthographic rules, this diaeresis is
omitted: con_tri_bu_ir, con_tri_bu_int. So we need patterns like "u1ir."
(but "qu4ir."). Moreover, infinitives and gerunds can be joined to a
pronoun with a hyphen: contribuir-hi, contribuint-hi. As LibreOffice
considers the hyphen a word character, I had to add these patterns with
hyphens ("u1ir-").
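
Why the extra "u1ir-" pattern is needed can be seen with a minimal Python
sketch. It is only an illustration, not code from the pattern file: the
helper name `matches` is made up here, and the check is reduced to "do the
pattern's letters occur in the dot-delimited token?".

```python
import re

def matches(pattern, token):
    """True if the letters of a Liang-style pattern occur in the token.

    The token is wrapped in dots, which stand for the word boundaries.
    """
    letters = re.sub(r"\d", "", pattern)  # drop the digit levels
    return letters in f".{token}."

# A tokenizer that ends the word at the hyphen sees "contribuir":
print(matches("u1ir.", "contribuir"))

# A tokenizer that treats the hyphen as a word character (as described
# above for LibreOffice) sees "contribuir-hi"; "u1ir." no longer applies,
# but "u1ir-" does:
print(matches("u1ir.", "contribuir-hi"))
print(matches("u1ir-", "contribuir-hi"))
```

With the hyphen inside the token, the word-final dot is never adjacent to
"ir", so only the variant with the hyphen can fire.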



> 4.) Lots of patterns like ".de3s4ar." are in essence just "hyphenation
> exceptions". (But since Open/LibreOffice doesn't allow specifying
> hyphenation exceptions, I understand the reasoning behind this form of
> specifying exceptions.)
>

In fact it is an exception to an exception. The general rule is

1sa

As "des-" is a very common prefix which has to be divided des_, I add these
patterns:

.de2s
.des3a

And then the final exceptions for "desar" (and all its forms).
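
How these three layers interact can be sketched in a few lines of Python.
This is a simplified rendering of the usual Liang-style matching (the
highest digit wins at each position, odd digits allow a break), not the
code of any particular tool. The function names are made up for the
example, and only the four patterns quoted in this message are loaded, so
break points that depend on other patterns are missing from the output.

```python
import re

def parse_pattern(pattern):
    """Return the pattern's letters and the level that sits before each letter."""
    letters = re.sub(r"\d", "", pattern)
    levels = [0] * (len(letters) + 1)
    pos = 0
    for ch in pattern:
        if ch.isdigit():
            levels[pos] = int(ch)
        else:
            pos += 1
    return letters, levels

def hyphenate(word, patterns, sep="_"):
    """Mark break points in `word` using only the given patterns."""
    padded = f".{word}."                  # dots mark the word boundaries
    scores = [0] * (len(padded) + 1)      # scores[j]: level before padded[j]
    for pattern in patterns:
        letters, levels = parse_pattern(pattern)
        start = padded.find(letters)
        while start != -1:
            for offset, level in enumerate(levels):
                scores[start + offset] = max(scores[start + offset], level)
            start = padded.find(letters, start + 1)
    pieces = []
    for i, ch in enumerate(word):
        # scores[i + 1] is the level between word[i - 1] and word[i].
        if i > 0 and scores[i + 1] % 2 == 1:
            pieces.append(sep)
        pieces.append(ch)
    return "".join(pieces)

GENERAL = ["1sa"]
PREFIX = [".de2s", ".des3a"]
EXCEPTION = [".de3s4ar."]

print(hyphenate("desar", GENERAL))                       # de_sar
print(hyphenate("desar", GENERAL + PREFIX))              # des_ar
print(hyphenate("desar", GENERAL + PREFIX + EXCEPTION))  # de_sar
print(hyphenate("desarmar", GENERAL + PREFIX + EXCEPTION))  # des_armar
```

The prefix patterns override the general rule (level 2 and 3 beat level 1),
and the whole-word pattern overrides the prefix patterns (level 3 and 4)
only for "desar" itself, because it is anchored at both ends with dots.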

> 5.) Did you try to play with patgen? (It's not necessary, I'm just
> curious.) That one might possibly greatly reduce the number of
> additional patterns.
>
>
No. But I think there is not much room for optimization. From line 231 on,
the patterns are basically exceptions for prefixes.


> > (left,right)-hyphenmin can be now (1,1), which allows valid hyphenations
> > like "l'e-mulació", although "e-mulació" is generally undesirable and
> > avoided.
>
> 6.) I didn't try it yet, but how would your patterns behave on the
> word "emulació" alone? I didn't understand from this sentence how
> exactly such "problems" are dealt with in your patterns when
> lefthyphenmin is set to 1. Would the word be hyphenated as e-mulació,
> with the argument being that this word would never occur alone?
> (Patterns for many languages treat apostrophe as a letter, but then
> again a whole lot of new patterns are needed to properly account for
> that. Having apostrophe in patterns is equally problematic as dashes.)
>

The user should set (left,right)-hyphenmin to (2,2), so that emulació is
not hyphenated as e_mulació. But l'e_mulació should still be possible.
Whether it is depends on the word tokenization. In LibreOffice and
InDesign, the apostrophe is counted as a word character, and so we get
l'e_mulació with (2,2)-hyphenmin.
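
The effect of the minimums can be sketched as a simple filter over break
points. This is a rough Python illustration only: the function name is made
up, the syllable division e_mu_la_ci_ó is assumed as input, and every
character of the token (including the apostrophe) is counted, which is the
behaviour described above for LibreOffice and InDesign.

```python
def apply_hyphenmin(hyphenated, left, right, sep="_"):
    """Drop break points with fewer than `left` characters before them
    or fewer than `right` characters after them."""
    parts = hyphenated.split(sep)
    total = sum(len(part) for part in parts)
    out = parts[0]
    seen = len(parts[0])          # characters already placed before the break
    for part in parts[1:]:
        if seen >= left and total - seen >= right:
            out += sep
        out += part
        seen += len(part)
    return out

# The bare word: the break after the single letter "e" is suppressed.
print(apply_hyphenmin("e_mu_la_ci_ó", 2, 2))    # emu_la_ció

# The apostrophe counts as a character, so "l'e" is long enough.
print(apply_hyphenmin("l'e_mu_la_ci_ó", 2, 2))  # l'e_mu_la_ció
```

If the tokenizer instead split the token at the apostrophe, the second call
would reduce to the first and the break after "l'e" would be lost.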

In Catalan, the hyphen or dash character is always a hyphenation point, and
the apostrophe is never a hyphenation point. Are there potential problems
with either of these points?

I am not a TeX user, but I will try to make some tests.

Regards,
Jaume Ortolà





> Thank you,
>     Mojca
>
> On Wed, Dec 19, 2012 at 9:52 AM, Jaume Ortolà i Font
> <jaumeortola at gmail.com> wrote:
> > Hi,
> >
> > I have created a new Catalan hyphenation file (see attachment).
> >
> > The new patterns have been checked against a dictionary (Gran Diccionari
> > de la Llengua Catalana, Enciclopèdia Catalana, www.diccionari.cat), with
> > 100% valid results. The patterns cover all the exceptions, except for one
> > word that can be hyphenated in two ways depending on its meaning (àcid
> > per-iò-dic, un pe-ri-ò-dic). There remain a dozen rare words that are
> > unclear even for the mentioned dictionary redactors, which I have
> > consulted.
> >
> > (left,right)-hyphenmin can be now (1,1), which allows valid hyphenations
> > like "l'e-mulació", although "e-mulació" is generally undesirable and
> > avoided.
> >
> > The current Catalan hyphenation file contains 895 patterns, and the new
> > one 1499.
> >
> > Please, mention my contribution as "Jaume Ortolà i Font, 2012
> > (www.riuraueditors.cat), jaumeortola at gmail.com".
> >
> > These patterns are already being distributed in other formats
> > (Libre/OpenOffice).
> >
> > Regards,
> > Jaume Ortolà
> > www.riuraueditors.cat
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: catalan_hyphenated_words.zip
Type: application/zip
Size: 528977 bytes
Desc: not available
URL: <http://tug.org/pipermail/tex-hyphen/attachments/20130218/894f9cbf/attachment-0001.zip>

