[XeTeX] UTF-8 hyphenation patterns for Greek

Yves Codet ycodet at club-internet.fr
Thu Jun 29 11:43:15 CEST 2006


On 29 June 06, at 01:34, Peter Heslin wrote:

> A few weeks ago, I said that I would write a script to convert Dimitrios
> Filippou's hyphenation patterns for Greek into UTF-8.  I'll attach my
> first draft -- it has been very lightly tested for ancient Greek only.
> In addition to supporting normalization forms C and D, it also supports
> a number of different variants I've seen in the way people use Unicode
> to encode ancient Greek.  If I have missed any scenarios, please let me
> know.

Thanks for that script. Just one question and two remarks about it:

--- Do \lccode commands need to be prefixed with \global (this question
is really for Jonathan)?

--- It would be handy, for people wishing to revise those patterns, if 
Greek words in comments were converted to Unicode too.

>   -e In addition to using the plain old "Apostrophe" (U+0027) to
>         indicate elision, support the use of a variety of other
>         apostrophe-like characters for this purpose.  The supported
>         characters are "Right Single Quotation Mark" (U+2019 -- this is
>         the preferred usage), "Modifier Letter Apostrophe" (U+02BC),
>         "Greek Koronis" (U+1FBD), and "Greek Psili" (U+1FBF).

It's a good thing to allow several usages, but some of them seem to be
semantically wrong, and perhaps they should be "deprecated". Is U+2019
really recommended? It is a punctuation mark, whereas the Greek
apostrophe acts as a letter (it stands in for an elided letter). I would
think the right choice is U+02BC, but I may be mistaken.
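
One way to look at the punctuation-versus-letter point is to ask the Unicode Character Database for the general category of each candidate. The sketch below uses Python's standard unicodedata module; the names CANDIDATES, describe and is_letter are made up for illustration and are not part of Peter's script.

```python
import unicodedata

# The five characters named in the script's -e option.
CANDIDATES = ["\u0027", "\u2019", "\u02BC", "\u1FBD", "\u1FBF"]

def describe(ch):
    """Return (code point, Unicode name, general category) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

def is_letter(ch):
    """True if the general category is one of the letter classes (L*)."""
    return unicodedata.category(ch).startswith("L")

if __name__ == "__main__":
    for ch in CANDIDATES:
        print(*describe(ch), "letter" if is_letter(ch) else "not a letter")
```

Of the five, only U+02BC is classified as a letter (category Lm); U+0027 and U+2019 are punctuation, and U+1FBD and U+1FBF are modifier symbols. The general category is only one piece of evidence, of course, and does not by itself settle which character should be recommended.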

More generally, if XeTeX is packaged with hyphenation patterns some
day, some (or most?) users will want a ready-to-use hyphenation file,
not a script that lets them choose their preferred encoding, and many
might not know what to do with a script. It would be good to provide
such a file for Greek (and other languages) once its content has been
agreed on (probably both decomposed and precomposed characters) and
the patterns have been revised, as I suggest below.
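
To make "decomposed and precomposed" concrete, here is a small Python sketch using the standard unicodedata module. It takes one polytonic letter, alpha with psili and oxia (U+1F04), and shows its NFC and NFD spellings; the helper names forms and codepoints are invented for this example. A pattern file meant to work with both kinds of input would presumably need entries matching both spellings.

```python
import unicodedata

def forms(text):
    """Return the precomposed (NFC) and decomposed (NFD) spellings of text."""
    nfc = unicodedata.normalize("NFC", text)
    nfd = unicodedata.normalize("NFD", text)
    return nfc, nfd

def codepoints(text):
    """Render a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in text)

if __name__ == "__main__":
    nfc, nfd = forms("\u1F04")  # GREEK SMALL LETTER ALPHA WITH PSILI AND OXIA
    print("NFC:", codepoints(nfc))
    print("NFD:", codepoints(nfd))
```

The NFC form is the single code point U+1F04, while the NFD form is the base letter followed by two combining marks (U+03B1 U+0313 U+0301).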

There could also be a decision about standard names for hyphenation
pattern files. Robin Fairbairns suggested that I use ISO three-letter
codes, as in sanhyph.tex, grchyph.tex, to avoid confusion with older
patterns, when they exist. It seems to be a good idea.

Instead of a simple conversion, hyphenation patterns could be revised. 
That's why some time ago I said, in reply to a message sent by Hans, 
that converted patterns would need some editing by hand; not that I 
doubt that ctxtools do a good job, but because older patterns can be 
improved. There would be a few things to discuss in the Unicode version 
of GRAhyph4.tex. I've only read it quickly but I've noticed this:

	2σ1δ 2ϲ1δ   % Liddell-Scott lexicon: sde'ugla = ze'ugla, sd = z ???
	2σ1ζ 2ϲ1ζ
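
For anyone less used to reading TeX patterns: the digits sit between the letters, odd values permit a break at that position, even values inhibit one, and the highest value wins where patterns overlap. The Python sketch below only splits a pattern into its letters and inter-letter values; parse_pattern is a made-up helper for illustration, not something taken from GRAhyph4.tex or from any TeX tool.

```python
def parse_pattern(pattern):
    """Split a Liang-style pattern into (letters, values).

    values[i] is the digit standing before letter i; the last entry is the
    digit after the final letter. Positions without a digit count as 0.
    """
    letters, values = [], [0]
    for ch in pattern:
        if ch in "0123456789":
            values[-1] = int(ch)   # digit applies to the current gap
        else:
            letters.append(ch)
            values.append(0)       # open a new gap after this letter
    return "".join(letters), values

if __name__ == "__main__":
    for pat in ["2σ1δ", "2ϲ1δ", "2σ1ζ", "2ϲ1ζ"]:
        print(pat, parse_pattern(pat))
```

Read this way, 2σ1δ inhibits a break before the sigma (value 2) and permits one between sigma and delta (value 1); the ϲ variants do the same for the lunate sigma.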

Is it desirable to allow a break here, given that σδ and σζ are
equivalents of ζ (incidentally, the comment could be changed, since
there's no doubt about sd = z and there are more examples than the one
quoted)? I have no definite answer; I'm simply wondering. It's only an
example; there may be more points to discuss about those patterns
before a "standard" version is issued.

Best wishes,

