[XeTeX] Greek Hyphenation (monotoniko)
jonathan_kew at sil.org
Mon Jan 9 15:19:29 CET 2006
On 9 Jan 2006, at 1:33 pm, Yves Codet wrote:
> On 9 Jan 06, at 13:42, Jonathan Kew wrote:
> I am trying my luck with Jonathan's exercise but I have a few
> questions about it.
>> * use U+02BC MODIFIER LETTER APOSTROPHE for the apostrophe
>> (elision), not ''
> Should U+02BC be used instead of U+2019 in Greek?
Probably, when it is functioning as an apostrophe/elision mark, which
is logically part of a word, as opposed to a punctuation mark.
However, in practice I wouldn't be surprised if people use U+2019;
it's difficult to maintain such distinctions, when the characters
look the same. The fact that 2019 behaves as a punctuation mark,
while 02BC behaves as a letter, will be ignored by most people most
of the time!
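To make the letter/punctuation distinction concrete, here is a small
sketch using Python's standard unicodedata module. It only reports the
Unicode general category of each character; the helper name is_letter
is made up for illustration, and this is not a description of how
XeTeX itself classifies characters.

```python
import unicodedata

def is_letter(ch):
    """True if the Unicode general category is one of the letter classes (L*)."""
    return unicodedata.category(ch).startswith("L")

# Compare the two look-alike characters discussed above.
for ch in ("\u2019", "\u02BC"):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
          f"category {unicodedata.category(ch)}, letter={is_letter(ch)}")
# U+2019 is reported as Pf (final punctuation), U+02BC as Lm (modifier letter).
```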
> Is coronis the same character as smooth breathing?
Yes, I believe so.
>> * use U+2060 WORD JOINER as compound word mark, not the letter "v"
> I thought "v" was for digamma in Claudio Beccari's file :) What is
> the use of a compound word mark in Greek?
Oh! I was going by the comments at the top of the file (as I don't
really know anything about Greek). Also, would digamma typically be
found in modern Greek? I thought it was an archaic letter, so
wouldn't expect it to be included at all in a hyphenation file
intended for modern monotonic text.
>> One further issue to consider would be composed vs. decomposed
>> text; this file uses precomposed letters for the vowels with tonos
>> or dieresis, but these could also be encoded as sequences of vowel
>> + diacritic. So additional rules should be included to recognize
>> those forms as well. This is left as an exercise for the
>> reader.... :-)
> It is probably handier to encode them like that, unless your
> keyboard's width is two meters (for ancient Greek at least). Yes,
> you could use dead keys but it would take a while to create a
> layout. But my question is: if breathings, accents, diaeresis and
> iota subscript are not declared as letters (and they should not be,
> should they?), there is no need of rules prohibiting break before
> them. Am I right?
They'd better be declared as "letters" from the point of view of
TeX's hyphenation routine, which means they need to have catcode 11
or 12, and non-zero \lccode. Otherwise they'll break words up and
hyphenation won't be applied to the proper complete sequences.
So I think the right thing to do is to ensure \lccode<char> = <char>
for each of these diacritics, and include hyphenation rules for both
the precomposed and decomposed representations. (Remember that
regardless of which form you happen to use when you type, with the
particular keyboard layout you like to use, you might also get text
from other sources that uses a different encoding form. Or text that
you originally typed using combining diacritics might go through some
other process that applies NFC normalization.)
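As a rough illustration of the two encoding forms, the following
Python sketch (standard unicodedata module only) converts between
precomposed and decomposed text and lists the combining marks that
would need an \lccode assignment. The character list and the helper
names are assumptions made for this example; they are not taken from
the hyphenation file, and the generated \lccode lines are only a
starting point that would still need checking in a real pattern file.

```python
import unicodedata

# Precomposed monotonic vowels with tonos and/or dialytika
# (an illustrative list, written as escapes to avoid look-alike characters).
PRECOMPOSED = ("\u03AC\u03AD\u03AE\u03AF\u03CC\u03CD\u03CE"  # tonos
               "\u03CA\u03CB"                                # dialytika
               "\u0390\u03B0")                               # dialytika + tonos

def decomposed(text):
    """Vowel + combining diacritic form (NFD)."""
    return unicodedata.normalize("NFD", text)

def composed(text):
    """Precomposed form (NFC), as another process might produce."""
    return unicodedata.normalize("NFC", text)

def combining_marks(chars):
    """Sorted combining marks that appear once chars are decomposed."""
    return sorted({c for c in decomposed(chars) if unicodedata.combining(c)})

def lccode_lines(marks):
    """TeX assignments of the form \\lccode<char> = <char>, one per mark."""
    return [f'\\lccode"{ord(m):04X}="{ord(m):04X}' for m in marks]

if __name__ == "__main__":
    for line in lccode_lines(combining_marks(PRECOMPOSED)):
        print(line)
```

Running a pattern list through decomposed() in the same way would
give the decomposed variants of rules written with precomposed
letters, so both forms can be covered.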