[XeTeX] UTF-8 hyphenation patterns for Greek
ycodet at club-internet.fr
Thu Jun 29 11:43:15 CEST 2006
On 29 June 06, at 01:34, Peter Heslin wrote:
> A few weeks ago, I said that I would write a script to convert Dimitrios
> Filippou's hyphenation patterns for Greek into UTF-8. I'll attach my
> first draft -- it has been very lightly tested for ancient Greek only.
> In addition to supporting normalization forms C and D, it also supports
> a number of different variants I've seen in the way people use Unicode
> to encode ancient Greek. If I have missed any scenarios, please let me
Thanks for that script. I have just one question and two remarks about it:
--- Do \lccode commands need to be prefixed with \global (this question
is really for Jonathan)?
--- It would be handy, for people wishing to revise those patterns, if
Greek words in comments were converted to Unicode too.
-e In addition to using the plain old "Apostrophe" (U+0027) to
indicate elision, support the use of a variety of other
apostrophe-like characters for this purpose. The supported
characters are "Right Single Quotation Mark" (U+2019 -- this is
the preferred usage), "Modifier Letter Apostrophe" (U+02BC),
"Greek Koronis" (U+1FBD), and "Greek Psili" (U+1FBF).
It's good to allow several usages, but some of them seem to be
semantically wrong, and perhaps they should be "deprecated". Is U+2019
really recommended? It's a punctuation mark, whereas the Greek apostrophe
is a letter (it stands in for an elided letter). I would think the right
choice is U+02BC, but I may be mistaken.
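
As a rough illustration of the variants listed in the -e description, the
sketch below folds the five apostrophe-like characters onto a single code
point. It is Python and is not taken from Peter's script; the names
APOSTROPHE_LIKE and fold_apostrophes, and the default target U+02BC, are
assumptions made only for this example.

```python
# Hypothetical helper, not part of the conversion script discussed here.
# Code points are the ones named in the -e option description.
APOSTROPHE_LIKE = {
    "\u0027": "APOSTROPHE",
    "\u2019": "RIGHT SINGLE QUOTATION MARK",
    "\u02BC": "MODIFIER LETTER APOSTROPHE",
    "\u1FBD": "GREEK KORONIS",
    "\u1FBF": "GREEK PSILI",
}


def fold_apostrophes(text, target="\u02BC"):
    """Replace every apostrophe-like character in text with target."""
    table = {ord(ch): target for ch in APOSTROPHE_LIKE}
    return text.translate(table)
```

Whichever character is finally agreed on, only the default value of target
would need to change.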
More generally, if XeTeX is packaged with hyphenation patterns some
day, some (or most?) users will want a ready-to-use hyphenation file,
not a script allowing them to choose their preferred encoding, and many
might not know what to do with a script. It would be good to provide
such a file for Greek (and other languages), after an agreement about
its content, probably both decomposed and precomposed characters, and
after a revision, as I suggest below.
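
To make the "both decomposed and precomposed" point concrete, here is a
small Python sketch using the standard unicodedata module. It only shows
how one word differs between normalization forms C and D; the function
name both_forms is hypothetical, and nothing here reflects how the script
itself generates its output.

```python
import unicodedata


def both_forms(text):
    """Return the (NFC, NFD) spellings of text as a pair of strings."""
    return (
        unicodedata.normalize("NFC", text),
        unicodedata.normalize("NFD", text),
    )


# U+1F00 (alpha with psili), nu, U+03AE (eta with tonos), rho
word = "\u1F00\u03BD\u03AE\u03C1"
nfc, nfd = both_forms(word)
# nfc keeps the 4 precomposed code points; nfd splits the two accented
# letters into base letter + combining mark, giving 6 code points.
```

A file meant to cover both usages would need each pattern in both
spellings wherever the two differ.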
There could also be a decision about standard names for hyphenation
patterns. Robin Fairbairns suggested that I use ISO three-letter
codes, as in sanhyph.tex, grchyph.tex, to avoid confusion with older
patterns, when they exist. It seems to be a good idea.
Instead of a simple conversion, hyphenation patterns could be revised.
That's why some time ago I said, in reply to a message sent by Hans,
that converted patterns would need some editing by hand; not that I
doubt that ctxtools do a good job, but because older patterns can be
improved. There would be a few things to discuss in the Unicode version
of GRAhyph4.tex. I've only read it quickly but I've noticed this:
2σ1δ 2ϲ1δ % Liddell-Scott lexicon: sde'ugla = ze'ugla, sd = z ???
Is it desirable to allow a break here, as σδ, σζ are equivalents of ζ
(incidentally the comment could be changed, since there's no doubt
about sd = z and there are more examples than the one which is quoted)?
I have no definite answer; I'm simply wondering. It's only an example;
there may be more points to discuss about those patterns before a
"standard" version is issued.