[tex-hyphen] Hyphenation in Albanian
Claudio Beccari
claudio.beccari at gmail.com
Sun Feb 14 17:14:12 CET 2021
Dear Arthur, dear Mojca
Attached you will find a zip file named AlbanianHyphenation.zip.
It is the result of my efforts with the substantial help of MoA Sabina
Koliqi, originally an Albanian graduate in Albanian Literature and now
an Italian professor with a degree in Education Teaching.
I do not know the Albanian language, but it is Dr Koliqi's mother
tongue and the subject of her university studies; I know how to build
hyphenation patterns. We joined our competences, and the above .zip file
contains our results; in particular, the hyph-sq.tex file contains the
UTF-8 encoded patterns, with a preamble modeled on the other pattern
files distributed with TeX Live.
We looked for a hyphenated Albanian word list, but we could not find
any. Dr Koliqi extracted a word list from a couple of chapters of an
Albanian book and tried to turn it into a hyphenated Albanian word list.
Then I took up the challenge, but I was unsuccessful with the patgen
program that is distributed with the TeX system; its documentation is
very scarce and refers to the Omega program. As a result we abandoned
the patgen solution and moved to another approach that I find very
effective, even if it requires a lot of "elbow grease".
The approach is based on LuaLaTeX and its ability to load a pattern file
on the fly and to hyphenate a list of words given as simple text. This
is provided by the package testhyphens.sty and its checkhyphens
environment. As you see from the zipped file, the source
abanian-test-lualatex-2.tex also loads the multicol.sty package, in
order to typeset the result in four columns. Of course the setting can
be changed from four columns to 1 (one) column, and the result may then
be used as a dictionary if patgen is to be used to produce another
(different) pattern set created without any elbow grease. My previous
experience with other languages taught me that elbow grease spent by a
sufficiently well educated person produces better results than patgen.
Of course this statement is not valid for certain languages, English
first of all, because patterns are based on spelling and not on
pronunciation. For English in both its main incarnations, British and
US, there are errors that can't be corrected, because there are
homographs that are pronounced differently depending on whether they
are nouns or verbs: for example "the record" and "I record"; "the
analyses" and "he analyses".
Therefore we started with a basic list of a dozen patterns (the single
letter patterns with implied 0 values on both sides were omitted, and
only the Albanian digraphs were considered). After each run of the
LuaLaTeX compilation Dr Koliqi would correct the wrong hyphenation
points on the printed list; I would modify the pattern list; and we
would iterate until all words were correctly hyphenated. Not very
professional, you might think, but very effective.
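
To make the loop just described concrete, here is a minimal sketch in
Python of how a pattern list is applied to a word (Liang's method, the
one TeX uses) and how a manually checked list can be compared against
the result. Everything in it is an assumption made for illustration:
the function names, the four toy patterns and the letter string
"mashat" are made up and are NOT taken from hyph-sq.tex or from our
word list; the real work was done with LuaLaTeX and testhyphens.sty,
not with this code.

```python
def parse_pattern(pattern):
    """Split e.g. '1s2h' into the letters 'sh' and the values [1, 2, 0]."""
    letters, values = [], [0]
    for ch in pattern:
        if ch.isdigit():
            values[-1] = int(ch)   # value sitting before the next letter
        else:
            letters.append(ch)
            values.append(0)       # implied 0 after this letter
    return "".join(letters), values


def hyphenate(word, patterns, left_min=1, right_min=1):
    """Apply Liang-style patterns; odd values allow a break, even forbid it."""
    table = dict(parse_pattern(p) for p in patterns)
    padded = "." + word.lower() + "."          # dots mark the word edges
    scores = [0] * (len(padded) + 1)           # scores[i]: before padded[i]
    for start in range(len(padded)):
        for end in range(start + 1, len(padded) + 1):
            values = table.get(padded[start:end])
            if values is None:
                continue
            for offset, value in enumerate(values):
                # the highest value seen at each position wins
                scores[start + offset] = max(scores[start + offset], value)
    pieces, last = [], 0
    for k in range(left_min, len(word) - right_min + 1):
        if scores[k + 1] % 2 == 1:             # break before word[k]
            pieces.append(word[last:k])
            last = k
    pieces.append(word[last:])
    return "-".join(pieces)


def wrong_words(checked_list, patterns):
    """Return (expected, obtained) pairs for every word hyphenated wrongly."""
    wrong = []
    for expected in checked_list:
        obtained = hyphenate(expected.replace("-", ""), patterns)
        if obtained != expected:
            wrong.append((expected, obtained))
    return wrong


# Hypothetical toy data: a digraph kept together by the even value 2.
TOY_PATTERNS = ["1s2h", "1h", "1m", "1t"]
CHECKED = ["ma-shat"]   # made-up string standing in for a corrected word
```

With these toy patterns the made-up string comes out as "ma-sha-t", so
wrong_words(CHECKED, TOY_PATTERNS) reports it; adding the (equally
hypothetical) pattern "4t." removes the final break and the list of
wrong words becomes empty. That is one turn of the iteration, which we
repeated by hand until no wrong words were left.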
Albanian hyphenation is peculiar. Albanians say they have an alphabet
made up of more than 30 letters; while interacting with Dr Koliqi I
found out that Albanian has no word for "letter" in the sense implied
by any computer encoding, from ASCII to UTF-8. Therefore "sh", "dh",
"zh", and similar digraphs are called by the same name as "a", "b",
"c", and so on. Eventually we reached a mutual understanding, and we
could proceed pretty rapidly.
We worked on an initial set of a little more than 2600 words; then we
reduced the set to the current one contained in the LuaLaTeX source
file. Unlike a patgen-generated set, the pattern set we built does not
merely minimize the probability of hyphenation errors; on our word set
the number of wrongly hyphenated words is zero.
Notice: the LuaLaTeX source file sets both the left and right hyphenmin
values to 1; in practice the hyphenation language description file
should set both to the value 2. I always build the pattern sets with the
value 1, because I imagine that in some rare cases of narrow-column
typesetting, correct justification may be achieved only with this not
very professional typographical setting.
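
The effect of the two hyphenmin values can be illustrated with a small
sketch in Python. It assumes a word whose break points were found with
both values set to 1 and then drops the breaks that a stricter setting
would not allow. The function name and the letter strings are
hypothetical, chosen only for illustration; note that, as in TeX, the
count is in characters, so a digraph counts as two.

```python
def apply_hyphenmin(hyphenated, left_min=2, right_min=2):
    """Keep only breaks leaving >= left_min characters before
    and >= right_min characters after the break point."""
    parts = hyphenated.split("-")
    word = "".join(parts)
    kept, position, last = [], 0, 0
    for part in parts[:-1]:
        position += len(part)                  # break sits before word[position]
        if left_min <= position <= len(word) - right_min:
            kept.append(word[last:position])
            last = position
    kept.append(word[last:])
    return "-".join(kept)
```

For example, apply_hyphenmin("ma-sha-t") gives "ma-shat" with the
default values of 2, while apply_hyphenmin("ma-sha-t", 1, 1) keeps
every break, including the one that strands a single character at the
end of the word.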
But the word set we worked on is limited, and it is possible that when
Albanian users employ this pattern set with their real documents, some
more patterns or a list of hyphenation exceptions might become
necessary. I might be available to modify the patterns for a short
while; at my age I am not going to live for ever; therefore the
Albanian TeX community should take over.
All the best
Claudio
On 16/06/2020 15:22, Arthur Reutenauer wrote:
> Dear Claudio,
>
> On Mon, Jun 15, 2020 at 11:57:33PM +0200, Claudio Beccari wrote:
>> I can certainly ask the student to allow distributing her thesis, but I
>> believe it will not be of great utility, because, as I said, the thesis is
>> in Italian, with very few stretches in Albanian, where the needed rare
>> hyphen points were set by hand.
> I think the list of hyphenated words would be very useful, so if she’s
> ready to publish that, it would be really great.
>
> Best,
>
> Arthur
-------------- next part --------------
A non-text attachment was scrubbed...
Name: AlbanianHyphenation.zip
Type: application/zip
Size: 10821 bytes
Desc: not available
URL: <https://tug.org/pipermail/tex-hyphen/attachments/20210214/49e3ddeb/attachment.zip>