[XeTeX] UTF-8 hyphenation patterns for Greek

Peter Heslin pj at heslin.eclipse.co.uk
Thu Jun 29 21:42:08 CEST 2006

Yves Codet <ycodet at club-internet.fr> writes:

> --- It would be handy, for people wishing to revise those patterns, if 
> Greek words in comments were converted to Unicode too.

No.  People should not revise the computer-generated output of the
script.  They should revise the input instead and re-run the script.
Some comments are preserved in the output only in order to aid in
debugging the script.  The whole point of writing a script is to avoid
forking the patterns downstream.  This way upstream bugfixes can easily
be integrated into the utf-8 patterns.
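
To make the "revise the input, re-run the script" workflow concrete,
here is a minimal Python sketch.  It is not the actual conversion
script.  The three-letter table TOY_MAP and the helper convert_line are
hypothetical names invented for illustration; the only thing taken from
this thread is the transliteration s = σ, d = δ, z = ζ.  The sketch
converts the pattern part of a line and passes any trailing comment
through untouched, which is why Greek in comments stays in its original
form.

```python
# Hypothetical, deliberately tiny transliteration table (an assumption
# for illustration; a real table would cover the whole alphabet).
TOY_MAP = {"s": "σ", "d": "δ", "z": "ζ"}


def convert_line(line):
    """Convert the pattern part of one line; leave any % comment as-is."""
    # Everything after the first '%' is a TeX comment and is not converted.
    body, sep, comment = line.partition("%")
    converted = "".join(TOY_MAP.get(ch, ch) for ch in body)
    return converted + sep + comment


if __name__ == "__main__":
    print(convert_line("2s1d % sd = z ???"))
```

Because the output is always regenerated from the input, a fix made
upstream in the input reaches the converted file on the next run.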

> It's a good thing to allow several usages, but some of them seem to be 
> semantically wrong, and perhaps they should be "deprecated". Is U+2019 
> really recommended? It's a punctuation mark and Greek apostrophe is a 
> letter (the substitute of a letter). I would think the right choice is 
> U+02BC, but I may be mistaken.

I have supported a number of usages which are officially deprecated by
Unicode, but which are much more widely used (and for some good reasons)
than the officially endorsed usage.  As for the mark of elision, the
officially endorsed character is U+2019, not U+02BC, but I see no reason
to be prescriptive when current usage is so varied.
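
As a rough illustration of what supporting several usages can mean for
a pattern file, the Python sketch below emits one copy of a pattern per
elision character.  The names ELISION_MARKS and expand_elision, and the
use of an ASCII apostrophe as a placeholder in the input, are
assumptions made for this example and are not taken from the script.

```python
# The two code points discussed above: U+2019 (right single quotation
# mark) and U+02BC (modifier letter apostrophe).
ELISION_MARKS = ("\u2019", "\u02bc")


def expand_elision(pattern, placeholder="'"):
    """Return one variant of the pattern per supported elision mark.

    Patterns without the placeholder are returned unchanged.
    """
    if placeholder not in pattern:
        return [pattern]
    return [pattern.replace(placeholder, mark) for mark in ELISION_MARKS]


if __name__ == "__main__":
    for variant in expand_elision("δ'1"):
        # Show the code points, since the two marks look alike in print.
        print([f"U+{ord(ch):04X}" for ch in variant])
```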

> More generally, if XeTeX is packaged with hyphenation patterns some 
> day, some (or most?) users will want a ready to use hyphenation file, 
> not a script allowing them to choose their preferred encoding, and many 
> might not know what to do with a script. It would be good to provide 
> such a file for Greek (and other languages), after an agreement about 
> its content, probably both decomposed and precomposed characters, and 
> after a revision, as I suggest below.

Of course, the end result to be distributed to users should be a pattern
file, not a script.  But there are important reasons (upstream bugfixes)
why it is desirable to provide this to XeTeX developers as a script,
rather than to do the conversion manually.

There are also other users who might want to use the script for
different purposes.  For example, ConTeXt currently only supports
precomposed characters, and it may not want to use combining diacritics.
This is why I made the behavior configurable.
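
The choice between precomposed characters and combining diacritics can
be sketched with Python's standard unicodedata module.  This is only an
illustration under stated assumptions: the function pattern_variants
and its two flags are hypothetical and do not describe the script's
real options.

```python
import unicodedata


def pattern_variants(pattern, precomposed=True, decomposed=True):
    """Return the requested Unicode spellings of one pattern.

    NFC gives precomposed characters where they exist; NFD gives base
    letters followed by combining diacritics.
    """
    variants = []
    if precomposed:
        variants.append(unicodedata.normalize("NFC", pattern))
    if decomposed:
        variants.append(unicodedata.normalize("NFD", pattern))
    # dict.fromkeys drops duplicates while keeping first-seen order.
    return list(dict.fromkeys(variants))


if __name__ == "__main__":
    for variant in pattern_variants("\u03ac1"):
        print([f"U+{ord(ch):04X}" for ch in variant])
```

A consumer that handles only precomposed characters would call this
with decomposed=False, while a file meant to cover both spellings would
keep both flags set.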

> There could also be a decision about standard names for hyphenation 
> patterns. Robin Fairbairns had suggested that I use ISO three-letter 
> codes, as in sanhyph.tex, grchyph.tex, to avoid confusion with older 
> patterns, when they exist. It seems to be a good idea.

As has been said, the decision on where to put and what to call utf-8
hyphenation patterns for TeX is one that probably requires the input of
a wider audience than this mailing list.

> Instead of a simple conversion, hyphenation patterns could be revised. 

If you want to waste your time "revising" (i.e. making random, erroneous
changes to) the hyphenation patterns that Dimitrios Filippou has put
together with enormous care and precision, then feel free to do so.  Do
not expect others to help.

> That's why some time ago I said, in reply to a message sent by Hans, 
> that converted patterns would need some editing by hand; not that I 
> doubt that ctxtools do a good job, but because older patterns can be 
> improved. 

Filippou's "older patterns" are in fact quite new.  The most recent
revision of elhyphen (version 4) is dated Aug. 16, 2004.  The simplistic
Babel patterns you prefer (grhyph.tex) have a date of 1997 on them on my

> There would be a few things to discuss in the Unicode version 
> of GRAhyph4.tex. I've only read it quickly but I've noticed this:
> 	2σ1δ 2ϲ1δ   % Liddell-Scott lexicon: sde'ugla = ze'ugla, sd = z ???
> 	2σ1ζ 2ϲ1ζ
> Is it desirable to allow a break here, as σδ, σζ are equivalents of ζ 
> (incidentally the comment could be changed, since there's no doubt 
> about sd = z and there are more examples than the one which is quoted)? 
> I have no positive answer, I'm simply wondering. It's only an example; 
> there may be more points to discuss about those patterns before a 
> "standard" version is issued.

You have misinterpreted Filippou's "???" marker, because you have not
seen the notice at the top of his original file:

    % Some doubtful patterns are marked by three question marks "???". 

Thus his question marks do not call into question the occasional
equivalence of σδ and ζ, which is what you imply he means ("there is no
doubt").  Rather they indicate that the pattern itself is doubtful,
precisely on account of that occasional equivalence.  While it is true
that a writer, for example in the Aeolic dialect, might occasionally
write σδ as a variant for ζ, in the *vast* majority of cases where σ is
followed by δ the two letters meet at the junction of elements in a
compound word, where there should be a hyphenation point.
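
For readers unfamiliar with how a pattern such as 2σ1δ is applied, the
following Python sketch implements the general Liang-style rule used by
TeX: digits sit between letters, the highest digit at each position
wins, and an odd value permits a break.  It is a simplified sketch, not
TeX's implementation: it ignores \lefthyphenmin and \righthyphenmin, it
is run here with that single pattern rather than the full set, and the
names parse_pattern and hyphenate are invented for the example.

```python
def parse_pattern(pattern):
    """Split a pattern like '2σ1δ' into its letters and inter-letter digits.

    Returns ('σδ', [2, 1, 0]): digits[k] is the value before letters[k],
    and the final entry is the value after the last letter.
    """
    letters = ""
    digits = [0]
    for ch in pattern:
        if ch in "0123456789":
            digits[-1] = int(ch)
        else:
            letters += ch
            digits.append(0)
    return letters, digits


def hyphenate(word, patterns):
    """Insert '-' wherever the winning pattern value is odd."""
    padded = "." + word + "."          # '.' marks the word boundaries
    scores = [0] * (len(padded) + 1)   # scores[i]: value before padded[i]
    for pattern in patterns:
        letters, digits = parse_pattern(pattern)
        start = padded.find(letters)
        while start != -1:
            for offset, value in enumerate(digits):
                scores[start + offset] = max(scores[start + offset], value)
            start = padded.find(letters, start + 1)
    pieces = [""]
    for i, ch in enumerate(word):
        # word[i] is padded[i + 1]; never break before the first letter.
        if i > 0 and scores[i + 1] % 2 == 1:
            pieces.append("")
        pieces[-1] += ch
    return "-".join(pieces)


if __name__ == "__main__":
    print(hyphenate("προσδοκία", ["2σ1δ"]))
```

With only this pattern loaded, the compound προσδοκία is split between
σ and δ, at the junction of its two elements, and the 2 before σ
suppresses a break in front of the σ.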

In general, if you want to suggest changes to Filippou's patterns, you
should communicate these to him directly, on the basis of his 8-bit
patterns, not the utf-8 conversion.  I don't think this mailing list is
the appropriate place to discuss the minute details of Greek

The question here is whether XeTeX will eventually include patterns
based on Filippou's, which are approximately correct and are based on
the official rules for Greek hyphenation established by the Academy of
Athens in 1939, or whether it will include patterns based on the
erroneous and misconceived Babel patterns.

Peter Heslin (http://www.dur.ac.uk/p.j.heslin)
