[Tuglist] help needed for having Tamil hyphenation in LaTeX

Radhakrishnan CV tuglist@tug.org.in
Fri, 10 May 2002 16:47:52 +0530 (IST)


Forwarded is a mail from Prasanna David, who has put considerable
effort into hyphenation patterns for Tamil. This should be of much
interest to those working on Indic scripts and Omega. All are
requested to offer suggestions for David to proceed further.

-- 
Radhakrishnan

---------- Forwarded message ----------
Date: Fri, 10 May 2002 16:00:32 +1000
From: Prasanna David G <prasanna@au-kbc.org>
To: cvr@river-valley.com
Subject: help needed for having Tamil hyphenation in LaTeX

Dear Mr. Radhakrishnan,

I work as a system administrator at the AU-KBC Research Centre. I am
also interested in NLP (I was on the NLP team in the beginning). Out
of my own interest, and encouraged by one of my professors, I
attempted to develop a Tamil hyphenation algorithm. But since I was
given other responsibilities, I wasn't really able to do justice to
this problem. Finally, some four months back, I had an MCA project
student work on it.

At first I didn't concentrate on how to integrate with existing word
processors, etc. I concentrated more on the linguistic side, that is,
how to break a word properly. A proper morphological analyser is not
yet ready (the AU-KBC NLP team will come out with one in a month or
so; it currently gives almost 92% coverage on ordinary, not
domain-specific, text) and it will need more resources, so I decided
to keep morphological analysis as a final option.

Now, this is how our algorithm (roughly) works. It takes a word and
suggests all possible hyphenation positions in that word.

For example : paTittukkoNTirukkiRaan => paTit - tuk - koNTiruk - ki - Raan

Step 0 : (Not implemented yet, but planned as an option for the
user) : Check for root words using a root word dictionary. (As you
know, this dictionary will be huge, with around 2000 verbs and an
open-ended number of nouns.)

Step 1 : Check for suffixes (case suffixes as well as GNP and tense
markers) using a dictionary of suffixes, and hyphenate before them
(recursively). In the above example, "aan" is a GNP (Gender, Number,
Person) marker and "kkiR" is a tense marker. We then adjust the
break positions so that, for instance, the letter after a hyphen is
not a pure consonant.
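
A minimal Python sketch of the suffix check in Step 1, under stated
assumptions: the two-entry suffix list and the function name are
hypothetical, and the later adjustment that moves a pure consonant
across the hyphen is not modelled here.

```python
# Hypothetical, tiny suffix dictionary; a real one would list all case
# suffixes, GNP markers and tense markers.
SUFFIXES = ["aan", "kkiR"]


def split_suffixes(word, suffixes=SUFFIXES):
    """Peel known suffixes off the end of `word`, recursively."""
    # Try longer suffixes first so a short suffix cannot shadow a long one.
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            stem = word[: -len(suffix)]
            return split_suffixes(stem, suffixes) + [suffix]
    return [word]  # no known suffix left


print(split_suffixes("paTittukkoNTirukkiRaan"))
# ['paTittukkoNTiru', 'kkiR', 'aan']
```

The raw split "paTittukkoNTiru - kkiR - aan" still starts two pieces
with a pure consonant, which is why the adjustment described above is
needed before the final result is produced.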

Step 2 : Check for auxiliary verbs. Auxiliary verbs should not be
split further. In the above example, "koNTiru" is an auxiliary verb.

We keep track of the substrings that have already been processed and
should not be split further, so each step works only on what remains.
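
One way to sketch this bookkeeping in Python is to carry each piece
as a (text, frozen) pair and let a step touch only the unfrozen ones.
The one-entry auxiliary list, the pair representation and the
function name are assumptions made for illustration only.

```python
# Hypothetical auxiliary-verb list; a real one would be much longer.
AUX_VERBS = ["koNTiru"]


def protect_aux(segments, aux_verbs=AUX_VERBS):
    """Mark auxiliary verbs inside unfrozen segments as frozen.

    `segments` is a list of (text, frozen) pairs; frozen pieces are
    passed through untouched so later steps never split them.
    """
    result = []
    for text, frozen in segments:
        if frozen:
            result.append((text, True))
            continue
        for aux in aux_verbs:
            pos = text.find(aux)
            if pos != -1:
                before, after = text[:pos], text[pos + len(aux):]
                if before:
                    result.append((before, False))
                result.append((aux, True))  # never split further
                if after:
                    result.append((after, False))
                break
        else:
            result.append((text, False))  # no auxiliary found
    return result


pieces = [("paTittukkoNTiru", False), ("kkiR", True), ("aan", True)]
print(protect_aux(pieces))
# [('paTittuk', False), ('koNTiru', True), ('kkiR', True), ('aan', True)]
```

Only "paTittuk" is left unfrozen here, so it is the only piece a
later step would look at.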

Step 3 : For those substrings that were left out in the above steps,
we apply some syllable analysis and split them. To start with, we
took the rules from the "asai pirithal" technique of Tamil poetry
(Neer, Nirai, etc.), then modified some rules and added more. This
takes care of the remaining parts.
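
The actual syllable rules are not spelled out here, so the following
Python sketch uses a deliberately simplified stand-in rather than the
"asai pirithal" rules: it breaks between two adjacent consonants when
the right-hand piece then starts with consonant + vowel. It assumes
that, in this romanisation, only the lower-case letters a, e, i, o, u
are vowels; the function name is hypothetical.

```python
VOWELS = set("aeiou")  # assumption about the romanisation used above


def split_between_consonants(segment):
    """Break `segment` between adjacent consonants (simplified rule)."""
    parts, start = [], 0
    for i in range(1, len(segment)):
        two_consonants = (segment[i - 1] not in VOWELS
                          and segment[i] not in VOWELS)
        # The right piece must begin with consonant + vowel, so the
        # letter after the hyphen is never a pure consonant.
        vowel_follows = i + 1 < len(segment) and segment[i + 1] in VOWELS
        # The left piece must contain a vowel, i.e. be pronounceable.
        left_has_vowel = any(c in VOWELS for c in segment[start:i])
        if two_consonants and vowel_follows and left_has_vowel:
            parts.append(segment[start:i])
            start = i
    parts.append(segment[start:])
    return parts


print(split_between_consonants("paTittuk"))
# ['paTit', 'tuk']
```

On the leftover piece "paTittuk" this simple rule happens to give the
same break as the example above; a full rule set would have to cover
many more cases.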

Some fine tuning is still needed. Right now, our algorithm's output
is almost up to the standard of Tamil newspapers (column mode, where
more hyphenation spots are needed at a small cost in
beauty/readability). It will take at least one more month to make it
good enough for books or reports.

Now, I want to use it with LaTeX.  By searching the net, I found
that using "patgen", we can generate patterns for LaTeX.  But I
don't know exactly how to use it.  My algorithm gives all possible
hyphenation positions in a word, as in the above example.  In the
same way I can generate a large list of hyphenated words from the
huge corpus we have (4 lakh, i.e. 400,000, unique words).
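
As far as I understand, patgen reads its dictionary as a plain list
of words, one per line, with the permitted break points marked by
hyphens. A minimal Python sketch of producing such a list from the
algorithm's output follows; the function names are hypothetical, and
how the letters of this romanisation (or of a Tamil encoding) are
declared in patgen's translate file is left open.

```python
def to_patgen_word(pieces):
    """Join the pieces of one word with '-' at each break point."""
    return "-".join(pieces)


def build_word_list(hyphenated_words):
    """Return the word list as text: one hyphenated word per line,
    sorted, with duplicates removed."""
    lines = sorted({to_patgen_word(pieces) for pieces in hyphenated_words})
    return "\n".join(lines) + "\n"


words = [
    ["paTit", "tuk", "koNTiruk", "ki", "Raan"],
    ["paTit", "tuk", "koNTiruk", "ki", "Raan"],  # duplicate from corpus
]
print(build_word_list(words), end="")
# paTit-tuk-koNTiruk-ki-Raan
```

The resulting text can be written to a file and used as the
dictionary input when experimenting with patgen.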

I would be happy if you could guide me or point me to what I should
do to have LaTeX use these hyphenation patterns.  I look forward to
your early reply.

Thanking you,
Regards,
Prasanna David

AU-KBC Research Centre,             prasanna@au-kbc.org
MIT Campus of Anna University,      prasannadavid@vsnl.net
Chromepet, Chennai - 600 044,       prasannadavid@yahoo.com
Tamil Nadu, INDIA.
Ph : 91-44-223 2711, 223 4885;      Fax : 91-44-223 1034