[tex-hyphen] Hyphenation in Albanian

Claudio Beccari claudio.beccari at gmail.com
Sun Feb 14 17:14:12 CET 2021


Dear Arthur, dear Mojca
Attached you will find a zip file named AlbanianHyphenation.zip.

This is the result of my efforts, with the substantial help of MoA Sabina
Koliqi, originally an Albanian graduate in Albanian Literature, then an
Italian professor with a degree in Education Teaching.
I do not know the Albanian language, but it is Dr Koliqi's mother tongue
and the subject of her university studies; I know how to build
hyphenation patterns. We combined our competences, and the above .zip
file contains our results; in particular, the hyph-sq.tex file contains
the UTF-8 encoded patterns, with a preamble modeled on the other pattern
files distributed with TeX Live.

We looked for a hyphenated Albanian word list, but we could not find
any. Dr Koliqi extracted a word list from a couple of chapters of an
Albanian book and tried to create a hyphenated Albanian word list from
it. Then I took up the challenge, but I was unsuccessful with the patgen
program that is distributed with the TeX system; its documentation is
very scarce and refers to the Omega program. As a result we abandoned
the patgen solution and moved to another approach that I find very
effective, even if it requires a lot of "elbow grease".

The approach is based on LuaLaTeX and its ability to load a pattern file
on the fly and to hyphenate a list of words given as simple text. This
is provided by the package testhyphens.sty and its checkhyphens
environment. As you can see from the zipped file, the source
abanian-test-lualatex-2.tex also loads the multicol.sty package in order
to typeset the result in four-column mode. Of course the setting can be
changed from four columns to one, and the result may then be used as a
dictionary if patgen is to be used to generate a different pattern set
without any elbow grease. My previous experience with other languages
taught me that this elbow grease, spent by a sufficiently well educated
person, produces better results than patgen. Of course this statement is
not valid for certain languages, English first of all, because patterns
are based on spelling and not on pronunciation; for English in both its
main incarnations, British and US, there are errors that can't be
corrected because there are homographs that are pronounced differently
depending on whether they are nouns or verbs: for example "the record"
and "I record"; "the analyses" and "he analyses".

Therefore we started with a basic list of a dozen patterns (the
single-letter patterns with implied 0 values on both sides were omitted,
and only the Albanian digraphs were considered). After each run of the
LuaLaTeX compilation Dr Koliqi would mark the wrong hyphenation points
on the printed list; I would modify the pattern list; and we would
iterate until all words were correctly hyphenated. Not very
professional, you might think, but very effective.
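
For readers who have never looked inside a pattern file, here is a
minimal Python sketch of how TeX-style patterns decide the break points
(digits between letters, missing digits counting as 0, the highest digit
winning at each position, odd values allowing a break). The function
names and the four toy patterns are assumptions made up for this
illustration; they are not taken from hyph-sq.tex and they are not the
code that LuaLaTeX or testhyphens actually runs.

```python
def parse_pattern(pattern):
    """Split a pattern such as 's2h' into its letters and inter-letter values.

    values[i] is the digit written before letters[i]; the last entry is the
    digit after the final letter. Digits that are not written count as 0.
    """
    letters = ""
    values = [0]
    for ch in pattern:
        if ch.isdigit():
            values[-1] = int(ch)
        else:
            letters += ch
            values.append(0)
    return letters, values


def hyphenate(word, patterns):
    """Return word with '-' at every position where the winning value is odd."""
    padded = "." + word.lower() + "."  # '.' marks the word boundaries
    # scores[j] is the value of the position just before padded[j].
    scores = [0] * (len(padded) + 1)
    for letters, values in (parse_pattern(p) for p in patterns):
        start = padded.find(letters)
        while start != -1:
            for offset, value in enumerate(values):
                scores[start + offset] = max(scores[start + offset], value)
            start = padded.find(letters, start + 1)
    pieces, last = [], 0
    for i in range(1, len(word)):  # breaks are allowed after the first letter
        if scores[i + 1] % 2 == 1:  # position between word[i-1] and word[i]
            pieces.append(word[last:i])
            last = i
    pieces.append(word[last:])
    return "-".join(pieces)


# Toy patterns, invented for this example only.
toy = ["a1", "u1", "1h", "s2h"]
print(hyphenate("dashuri", toy))      # da-shu-ri
print(hyphenate("dashuri", toy[:3]))  # da-s-hu-ri (digraph no longer protected)
```

The second call shows why the digraph patterns matter: without "s2h"
the toy set breaks "sh" in the middle.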

Albanian hyphenation is peculiar; Albanians say they have an alphabet
made up of more than 30 letters. While interacting with Dr Koliqi I
found out that Albanian lacks a word for "letter" in the sense implied
by any computer encoding, from ASCII to UTF-8; therefore "sh", "dh",
"zh", and similar digraphs are called by the same name as "a", "b", "c",
and so on. Eventually we reached a common understanding, and we could
proceed pretty rapidly.
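
The difference between the two notions of "letter" can be made concrete
with a small Python sketch that splits a word into letters of the
Albanian alphabet, counting each digraph as one letter. The list of nine
digraphs is that of the standard 36-letter Albanian alphabet; the
function name is made up for this illustration, and the greedy matching
is a simplifying assumption (it treats every such pair of characters as
a digraph, which may not hold in every word).

```python
# The nine digraphs of the standard 36-letter Albanian alphabet.
DIGRAPHS = ("dh", "gj", "ll", "nj", "rr", "sh", "th", "xh", "zh")


def split_letters(word):
    """Split word into Albanian letters, keeping each digraph together."""
    letters = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair.lower() in DIGRAPHS:
            letters.append(pair)  # one Albanian letter, two characters
            i += 2
        else:
            letters.append(word[i])
            i += 1
    return letters


print(split_letters("Shqipëria"))
print(len("Shqipëria"), "characters,", len(split_letters("Shqipëria")), "letters")
```

In pattern terms, keeping these pairs together is what the digraph
patterns are for.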

We worked on an initial set of a little more than 2600 words; then we
reduced the set to the current one contained in the LuaLaTeX source
file. Unlike with patgen, the pattern set we built does not merely
minimize the probability of hyphenation errors: on this word set, the
number of wrongly hyphenated words is zero.
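
The "zero errors" criterion can be checked mechanically once a
hand-hyphenated reference list exists. The following Python sketch is
only an illustration of that check, not the procedure we used (we
corrected the printed list by hand). The function name, the reference
entries and the stand-in hyphenator are all hypothetical, and the sketch
assumes that the reference words contain no hyphens of their own.

```python
def find_mismatches(hyphenate, reference):
    """Compare a hyphenator against a hand-hyphenated reference list.

    hyphenate is any function mapping a plain word to its hyphenated form;
    reference is a list of strings such as 'da-shu-ri'. Returns a list of
    (word, expected, got) tuples for the entries that differ.
    """
    mismatches = []
    for expected in reference:
        word = expected.replace("-", "")  # assumes no genuine hyphens
        got = hyphenate(word)
        if got != expected:
            mismatches.append((word, expected, got))
    return mismatches


# Stand-in hyphenator for the example: a fixed lookup table (hypothetical).
fake_output = {"dashuri": "da-shu-ri", "shtëpi": "sh-të-pi"}
errors = find_mismatches(fake_output.get, ["da-shu-ri", "shtë-pi"])
for word, expected, got in errors:
    print(f"{word}: expected {expected}, got {got}")
```

An empty result on the whole word list corresponds to the situation
described above.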

Notice: the LuaLaTeX source file sets both the left and right hyphenmin
values to 1; in practice the hyphenation language description file
should set both to the value 2. I always build the pattern sets with the
value 1, because I imagine that in some rare cases of narrow-column
typesetting correct justification may be achieved with this not very
professional typographic setting.
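
The effect of the two hyphenmin values can be sketched in Python as a
filter applied to an already hyphenated word: a break is kept only if at
least left_min characters precede it and at least right_min characters
follow it. The function name and the sample words are made up for this
illustration. The counts are in characters, not in Albanian letters, so
a digraph counts as two.

```python
def apply_hyphenmin(hyphenated, left_min=2, right_min=2):
    """Drop the breaks that lie too close to either end of the word."""
    word = hyphenated.replace("-", "")
    # Recover each break as the number of characters standing before it.
    breaks = []
    count = 0
    for ch in hyphenated:
        if ch == "-":
            breaks.append(count)
        else:
            count += 1
    kept = [b for b in breaks if b >= left_min and len(word) - b >= right_min]
    pieces, last = [], 0
    for b in kept:
        pieces.append(word[last:b])
        last = b
    pieces.append(word[last:])
    return "-".join(pieces)


print(apply_hyphenmin("a-to", 1, 1))  # a-to  (both values set to 1)
print(apply_hyphenmin("a-to", 2, 2))  # ato   (break after one character dropped)
print(apply_hyphenmin("da-shu-ri"))   # da-shu-ri
```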

But the word set we worked on is limited, and it is possible that, once
Albanian users apply this pattern set to their actual documents, some
more patterns or a list of hyphenation exceptions might become
necessary. I might be available to modify the patterns for a short
while; at my age I am not going to live forever, therefore the Albanian
TeX community should take over.

All the best

Claudio

On 16/06/2020 15:22, Arthur Reutenauer wrote:
> 	Dear Claudio,
>
> On Mon, Jun 15, 2020 at 11:57:33PM +0200, Claudio Beccari wrote:
>> I can certainly ask the student to allow distributing her thesis, but I
>> believe it will not be of great utility, because, as I said, the thesis is
>> in Italian, with very few stretches in Albanian, where the needed rare
>> hyphen points were set by hand.
>    I think the list of hyphenated words would be very useful, so if she’s
> ready to publish that, it would be really great.
>
> 	Best,
>
> 		Arthur

-------------- next part --------------
A non-text attachment was scrubbed...
Name: AlbanianHyphenation.zip
Type: application/zip
Size: 10821 bytes
Desc: not available
URL: <https://tug.org/pipermail/tex-hyphen/attachments/20210214/49e3ddeb/attachment.zip>

