[tex-hyphen] Latin Hyphenation when using utf8

Claudio Beccari claudio.beccari at gmail.com
Wed Jun 23 20:00:21 CEST 2010

Dear Andrew,
you already got good suggestions from Mojca and Arthur; they show the only
path that you can follow if you want to mark with macrons and breves all
the vowels of a Latin text.

Now the five lowercase and uppercase vowels a, e, i, o, and u with
macrons and breves all have hexadecimal UNICODE codes in the second
UNICODE page (this means that the first byte is 01 for all of them and
the second byte ranges from 00 to 6D -- from Ā = 0100 to ŭ = 016D -- I
did not look up ȳ, which appears to me a very rare "Latin" vowel;
yes, I know that it might be used in transliteration from Greek, but it
is not Latin "proper").

Therefore you need utf8x as an option for the inputenc package, and this
may be OK, but the T1 option for the fontenc package does not work,
because this encoding contains mostly selected glyphs useful for western
languages, including the abreve for Romanian, but, if I did not miss any
glyph, it does not contain all the 20 upper and lower case vowels with
macrons and breves.

Therefore you have to use Xe(La)TeX or Lua(La)TeX. The variants
containing the letters La allow you to use the familiar LaTeX markup,
and they still let you use all the bells and whistles those programs
offer; I have no experience with Xe(La)TeX, but I have a minimal
(infinitesimal?) experience with Lua(La)TeX and I am surprised by the
many things that can be done with this typesetting engine.
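
A minimal sketch of such a document follows. It assumes a current TeX
distribution in which the default font selected by fontspec (Latin
Modern) contains the vowels with macrons and breves, and in which the
Latin option of babel loads under these engines; neither assumption
comes from the discussion above, and whether the Latin patterns treat
the marked vowels as letters still has to be checked on your system.

```latex
% Compile with lualatex or xelatex, not with pdflatex.
\documentclass{article}
\usepackage{fontspec}       % Unicode font loading; the default font is
                            % assumed to contain the marked vowels
\usepackage[latin]{babel}   % Latin hyphenation patterns
\begin{document}
% The marked vowels are typed directly as UTF-8 characters.
grātiās plūrimās vōbīs agō
\end{document}
```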

Another solution would be to redefine the original macros \= and \u, for
the macron and the breve respectively, in such a way that before and
after the marked vowel they insert a \allowhyphens declaration, a macro
that is defined in babel and that introduces a zero-width blob of glue
preceded by an infinite penalty; by so doing a word containing a marked
vowel gets split into three pieces: (a) what precedes such vowel, (b)
the marked vowel, and (c) what follows such vowel.  The fragments (a)
and (c) get proper hyphenation, but the word is never split just before
or just after the marked vowel. A possible break point is missed, but at
least the rest is properly hyphenated.
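
The idea can be sketched as follows. This is only an illustration under
some assumptions: instead of redefining \= and \u themselves it wraps
them in two new macros with the hypothetical names \lng and \brv (so
that the meaning of \= inside tabbing is left alone), and instead of
calling babel's \allowhyphens it uses a hypothetical helper \hyphenglue
that spells out the same infinite penalty followed by zero-width glue.

```latex
% Compile with pdflatex (default OT1 encoding, accents built by macros).
\documentclass{article}
\usepackage[latin]{babel}
% Hypothetical helper: infinite penalty followed by zero-width glue,
% i.e. the effect described above for \allowhyphens.
\newcommand*{\hyphenglue}{\leavevmode\nobreak\hskip 0pt\relax}
% Hypothetical wrappers: glue before and after the marked vowel, so the
% fragments on either side are seen by TeX as separate words.
\newcommand*{\lng}[1]{\hyphenglue\={#1}\hyphenglue}  % macron
\newcommand*{\brv}[1]{\hyphenglue\u{#1}\hyphenglue}  % breve
\begin{document}
pl\lng{u}rim\lng{a}s gr\lng{a}ti\lng{a}s
\end{document}
```

With these definitions no break can occur immediately before or after
the marked vowel, and a fragment is hyphenated only if it is long
enough for the patterns to apply.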

But, AAAAARGH!!! catastrophe!!!! if you mark all the vowels of every
word, this procedure fails miserably. This was the trick I used when
only the old TeX 2.x was available; it used only 128 glyph fonts (the
standard CM fonts) and it was almost impossible to typeset in a decent
way texts containing sentences in languages such as Spanish, or
Portuguese, or Catalan, and, most regrettably, French; let's not speak
about Romanian.

At that time we already had foreign European students on Erasmus
mobility, and they had to typeset their theses at our technical
University in their home languages; I succeeded in setting up different
versions of LaTeX (one for each language, since at that time TeX could
handle only one language at a time) using the above trick; it worked
pretty well with the Iberian languages, which use many accents, but not
more than one per word; it worked not so well with French, which may use
several accents in a single word (for example, électricité); it worked
definitely badly with Romanian, where I found words with as many as 5
special national characters. In any case it was better than nothing. It
was necessary to insert several \- commands by hand in order to adjust
line breaking, but at least something more or less automatic was better
than nothing. Fortunately TeX 3 came out at the beginning of the
nineties together with the DC (EC) T1 encoded fonts and all these
problems vanished.

Nevertheless...

The language description files for Italian and Latin contain the
definition of the double quote character " as an active character; when
inserted inside a word between two "normal" letters (letters without
diacritics) it inserts a discretionary break that does not forbid the
hyphenation of both word fragments: it is used for introducing
etymological break points where the patterns would simply produce
phonetic break points. If the potential break point just precedes a
special letter input to TeX with an "accent" macro (such as \= or \u,
for example) then the discretionary break must be introduced with "|.
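
A short sketch of the two forms in use, assuming that the Latin option
of babel makes the " character active as described; the two words are
my own illustrations of an etymological break and are not taken from
the original question.

```latex
\documentclass{article}
\usepackage[latin]{babel}   % assumed to provide the " shorthand
\begin{document}
% Between two plain letters: " marks the etymological break trans-ire.
trans"ire

% Before a letter entered with an accent macro: use "| instead.
ab"|\=actus
\end{document}
```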

Your problem with ȳ might be solved with a simple macro, if the
discretionary break should immediately precede the glyph:

\def\y{"|\=y}
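
A usage sketch for this macro, with one assumption that should be
checked against the babel documentation: as far as I know babel makes
its shorthand characters active only at \begin{document} (unless the
KeepShorthandsActive option is given), so here the definition is placed
in the document body, where " already has its special meaning. The
word is only an illustration.

```latex
\documentclass{article}
\usepackage[latin]{babel}
\begin{document}
% Defined here so that " is read as babel's active shorthand character.
\def\y{"|\=y}
% The space after \y only ends the macro name and is not typeset.
pap\y rus
\end{document}
```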

But, let me ask a naïve question: Why would you spell every Latin word
with the "longa" or "brevis" melodic poetic notation?

Today I believe that very few modern languages have semantic differences
depending on the length of the syllables while missing rhythmic accents;
in western Europe I think that the language they speak in Czechia is one
of such remaining languages, but I might be completely wrong. No modern
person could read Latin prose maintaining the longa and brevis duration
of each syllable as it is marked with \= and \u.

But, you might say, in poetry this is very important; yes, I agree, but I
remember reading Virgil and other poets (and also the Greek tragedies)
when I was in high school (I graduated 52 years ago) and there was no
longa/brevis indication on any vowel; my schoolmates and I had clearly
in mind the rhythmic differences of dactyli and spondaei, the rhythmic
differences between hexametri and pentametri, and rarely missed a
correct metric division. At the final State exam, very difficult and
very selective, at the end of high school, we were supposed to read
Latin and Greek poetry with the rhythm these poems required "at first
sight", that is, without trying each verse two or three times in
advance. The only difficulty for us, and possibly for our instructors,
was to respect the Greek tones, raising or lowering the voice, or, even
worse, doing a vibrato with the long vowels marked with a
perispomeni/circumflex. I don't claim we were infallible, but as 18 or
19 year old teenagers we were all doing pretty well; we knew the rules
and we could apply them at first sight. Then why stress the
typographical capabilities of our beloved typesetting engines for
something that may appear useless to someone as ignorant as myself?
(Remember that although I had classical studies in high school, I am an
engineer and spent most of my working years as a researcher and a
professor of electronics at University.)

Certainly a different approach is necessary in a dictionary or in a
grammar, but in that case single words are marked with their longae and
breves; even more important would be to mark the longae and breves on
single words when philological considerations are being made, especially
for discussing the derivation of modern Romance languages from their
Latin father, taking into account also the substrate and the adstrate of
the populations upon whom Latin was imposed by the conquerors. By
knowing even a few elements of Romance philology, a person with even a
basic Latin education can most easily understand the Latin derived words
of many official languages and their varieties; ... English included,
since its Saxon bases are diluted in a large sea of Latin based words
introduced by the Normans in the XI century.

Forgive my intrusiveness, please :-)
Claudio

Andrew Gollan wrote:
> grātiās plūrimās vōbīs agō
>
> Andrew Gollan
> "bis vincit qui se vincit"
> Latin - Henry Clay HS
>