[tex-hyphen] Latin Hyphenation when using utf8

Claudio Beccari claudio.beccari at gmail.com
Wed Jun 23 20:00:21 CEST 2010

Dear Andrew,
you already got good suggestions by Mojca and Arthur; they show the only 
path that you can follow if you want to mark with macrons and breves all 
the vowels of a Latin text.

Now the five vowels a, e, i, o, and u, lowercase and uppercase, with 
macrons and breves all have hexadecimal Unicode code points in the 
second Unicode page (this means that the first byte is 01 for all of 
them and the second byte ranges from 00 to 6D, from Ā = 0100 to 
ŭ = 016D; I did not look up ȳ, which seems to me a very rare "Latin" 
vowel; yes, I know that it might be used in transliteration from Greek, 
but it is not Latin "proper").

Therefore utf8x as an option for the inputenc package may be OK, but 
the T1 option for the fontenc package does not work: this encoding 
contains mostly selected glyphs useful for western languages, including 
the abreve for Romanian, but, unless I missed some glyph, it does not 
contain all 20 uppercase and lowercase vowels with macrons and breves.

Therefore you have to use Xe(La)TeX or Lua(La)TeX. The variants 
containing the letters La let you use the familiar LaTeX markup while 
still giving you all the bells and whistles those programs offer; I 
have no experience with Xe(La)TeX, but I have a minimal 
(infinitesimal?) experience with Lua(La)TeX and I am surprised by the 
many things that can be done with this typesetting engine.
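
A minimal sketch of this route might look as follows. It assumes that the default font loaded by fontspec contains the macron and breve vowels and that Latin hyphenation patterns are installed; whether those patterns treat the marked vowels as ordinary letters is a further assumption that should be checked on the actual installation.

```latex
% Compile with xelatex or lualatex, not with pdflatex.
\documentclass{article}
\usepackage{fontspec}      % Unicode font loading; the default font is
                           % assumed to cover U+0100 to U+016D
\usepackage[latin]{babel}  % selects the Latin hyphenation patterns
\begin{document}
% Marked vowels are typed directly as UTF-8 characters.
grātiās plūrimās vōbīs agō
\end{document}
```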

Another solution would be to redefine the original macros \= and \u, for 
the macron and the breve respectively, in such a way that before and 
after the marked vowel they insert a \allowhyphens declaration, a macro 
that is defined in babel and that introduces a zero-width blob of glue 
preceded by an infinite penalty; by so doing a word containing a marked 
vowel gets split into three pieces: (a) what precedes that vowel, (b) 
the marked vowel, and (c) what follows that vowel. Fragments (a) and 
(c) get proper hyphenation, but the word is never split just before or 
just after the marked vowel. A possible break point is missed, but at 
least the rest is properly hyphenated.
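
The idea can be sketched as follows. Rather than redefining \= and \u themselves, this sketch wraps them in new macros; the names \markglue, \longa and \brevis are hypothetical, and the glue is written out by hand instead of calling babel's \allowhyphens, so it is an illustration of the technique and not the original implementation.

```latex
\documentclass{article}
\usepackage[latin]{babel}
\makeatletter
% \markglue, \longa and \brevis are hypothetical names for this sketch.
% \markglue: an infinite penalty followed by zero-width glue, so that a
% new hyphenatable fragment starts here without a line break being allowed.
\newcommand*{\markglue}{\leavevmode\nobreak\hskip\z@skip}
\newcommand*{\longa}[1]{\markglue\={#1}\markglue}  % vowel with macron
\newcommand*{\brevis}[1]{\markglue\u{#1}\markglue} % vowel with breve
\makeatother
\begin{document}
imperat\longa{o}ribus, am\longa{a}bam
\end{document}
```

Each fragment is then treated by TeX as a word of its own, so it is subject to \lefthyphenmin and \righthyphenmin; very short fragments get no break points at all.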

But, AAAAARGH!!! catastrophe!!!! If you mark all the vowels of every 
word, this procedure fails miserably, because the fragments left between 
marked vowels are too short to be hyphenated. This was the trick I used 
when the old TeX 2.x was available; it used only 128-glyph fonts (the 
standard CM fonts) and it was almost impossible to typeset decently 
texts containing sentences in languages such as Spanish, Portuguese, 
Catalan, and, most regrettably, French; let's not speak about Romanian.

At that time we already had foreign European students on Erasmus 
mobility, and they had to typeset their theses at our technical 
University in their home languages; I succeeded in setting up different 
versions of LaTeX (one for each language, since at that time TeX could 
handle one language at a time) using the above trick. It worked pretty 
well with the Iberian languages, which use many accents, but not more 
than one per word; it worked not so well with French, which may use 
several accents in a single word (for example, électricité); it worked 
decidedly badly with Romanian, where I found words with as many as 5 
special national characters. It was necessary to insert several \- 
commands by hand in order to adjust line breaking, but at least 
something more or less automatic was better than nothing. Fortunately 
TeX 3 came out at the beginning of the nineties together with the DC 
(EC) T1 encoded fonts and all these problems vanished.


The language description files for Italian and Latin contain the 
definition of the double quote character " as an active character; when 
inserted inside a word between two "normal" letters (letters without 
diacritics) it inserts a discretionary break that does not prevent the 
hyphenation of either word fragment: it is used for introducing 
etymological break points where the patterns would simply produce 
phonetic break points. If the potential break point just precedes a 
special letter input to TeX with an "accent" macro (such as \= or \u, 
for example) then the discretionary break must be introduced with "|.
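
A small sketch of the " shorthand, here with the Italian language file. The set of shorthands differs between languages and between versions of the babel language files, so the behaviour shown is an assumption to verify against the installed documentation.

```latex
\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage[italian]{babel}  % makes " an active shorthand character
\begin{document}
% The " marks an etymological break point (macro-istruzione); both
% halves remain open to the ordinary pattern-based hyphenation.
macro"istruzione
\end{document}
```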

Your problem with ȳ might be solved with a simple macro, if the 
discretionary break should immediately precede the glyph:


But, let me ask a naïve question: Why would you spell every Latin word 
with the "longa" or "brevis" melodic poetic notation?

Today I believe that very few modern languages have semantic differences 
that depend on the length of the syllables while lacking rhythmic 
accents; in western Europe I think that the language spoken in the Czech 
Republic is one of the few remaining ones, but I might be completely 
wrong. No modern 
person could read Latin prose maintaining the longa and brevis duration 
of each syllable as it is marked with \= and \u.

But, you might say, in poetry this is very important; yes I agree, but I 
remember reading Virgil and other poets (and also the Greek tragedies) 
when I was in high school (I graduated 52 years ago) and there was no 
longa/brevis indication on any vowel; my schoolmates and I had clearly 
in mind the rhythmic differences between dactyls and spondees and 
between hexameters and pentameters, and we rarely missed a correct 
metrical division. At the final State exam at the end of high school, 
very difficult and very selective, we were supposed to read Latin and 
Greek poetry with the rhythm these poems required "at first sight", 
that is, without first trying each verse two or three times. The only 
difficulty for us, and possibly for our instructors, was 
to respect the Greek tones, raising or lowering the voice, or, even 
worse, doing a vibrato with the long vowels marked with a 
perispomeni/circumflex. I don't claim we were infallible, but as 18- or 
19-year-old teenagers we were all doing pretty well; we knew the rules 
and we could apply them at first sight. Why, then, strain the 
typographical capabilities of our beloved typesetting engines for 
something that may appear useless to someone as ignorant as myself? 
(Remember that although I pursued classical studies in high school, I 
am an engineer and spent most of my working years as a researcher and 
professor of electronics at the university.)

Certainly a different approach is necessary in a dictionary or in a 
grammar, but in that case single words are marked with their longae and 
breves; even more important would be to mark the longae and breves on 
single words when philological considerations are being made, especially 
for discussing the derivation of modern Romance languages from their 
Latin parent, taking into account also the substrate and the adstrate of 
the populations on which Latin was imposed by the conquerors. By 
knowing even a few elements of Romance philology, a person with even a 
basic Latin education can easily understand the Latin-derived words 
of many official languages and their varieties; ... English included, 
since its Saxon bases are diluted in a large sea of Latin-based words 
introduced by the Normans in the XI century.

Forgive my intrusiveness, please :-)

Andrew Gollan wrote:
> grātiās plūrimās vōbīs agō
> Andrew Gollan
> "bis vincit qui se vincit"
> Latin - Henry Clay HS
