[XeTeX] Hyphenation in Transliterated Sanskrit
zdenek.wagner at gmail.com
Mon Sep 12 10:31:36 CEST 2011
2011/9/12 Mojca Miklavec <mojca.miklavec.lists at gmail.com>:
> On Mon, Sep 12, 2011 at 09:36, Yves Codet wrote:
>> A question to specialists, Arthur and Mojca maybe :) Is it necessary to have two sets of hyphenation rules, one in NFC and one in NFD? Or, if hyphenation patterns are written in NFC, for instance, will they be applied correctly to a document written in NFD?
> That depends on the engine.
> From what I understand, XeTeX does normalize the input, so NFD should
> work fine. But I'm only speaking from memory based on Jonathan's talk
> at BachoTeX. I might be wrong. I'm not sure what LuaTeX does. If one
> doesn't write the code, it might be that no normalization will ever
> take place.
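To make the question concrete, here is a minimal sketch in Python (not TeX) of what goes wrong when patterns are stored in NFC and the text arrives in NFD. The word, the pattern fragment and the helper name contains() are my own assumptions for illustration; they are not taken from any real pattern file or engine.

```python
import unicodedata
from typing import Optional

# Transliterated Sanskrit "atman" with a long first vowel: the long a can be
# one code point (U+0101) or "a" followed by a combining macron (U+0304).
word_nfc = "\u0101tman"
word_nfd = "a\u0304tman"

# Hypothetical pattern fragment, stored in NFC form.
pattern_nfc = "\u0101t"


def contains(pattern: str, text: str, form: Optional[str] = None) -> bool:
    """Substring test, optionally normalizing both sides to `form` first."""
    if form is not None:
        pattern = unicodedata.normalize(form, pattern)
        text = unicodedata.normalize(form, text)
    return pattern in text


print(contains(pattern_nfc, word_nfc))         # True
print(contains(pattern_nfc, word_nfd))         # False
print(contains(pattern_nfc, word_nfd, "NFC"))  # True
```

Without a normalization step somewhere, the NFC pattern simply never sees the NFD word. Whether a given engine performs that step is exactly the open question.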
I am not an expert on Unicode and do not know what XeTeX does and
when. I made a test in Hindi when implementing sort rules in Xindy.
What I am speaking about is sample 4 available from
http://icebearsoft.euweb.cz/xindy-devanagari/ (this is what I
presented last year in Brejlov). Hindi makes use of characters with
nuktas. For instance, za can be entered as U+095B or as ja U+091C
followed by nukta U+093C. The latter can be found in the wordlist used
in aspell. In my sample the first page contains a few words where all
"nukte vale" characters are entered directly as single code points; on
the second page the same words are entered using nukta signs. The first
index shows that the \index macro wrote the input without any change, so
I had to use merge rules in Xindy. I have not looked at what was written
to the xdv file, and now gedit does some strange things...
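The two spellings of za can be compared outside TeX with a few lines of Python. This is only a sketch under my own assumptions: index_key() is a hypothetical helper that plays roughly the role of a merge rule, and it is not how Xindy implements anything.

```python
import unicodedata

za_precomposed = "\u095b"        # DEVANAGARI LETTER ZA
za_with_nukta = "\u091c\u093c"   # DEVANAGARI LETTER JA + SIGN NUKTA

# As raw code point sequences, the two spellings are different strings.
print(za_precomposed == za_with_nukta)  # False


def index_key(entry: str) -> str:
    """Hypothetical index key: normalize so both spellings collate together."""
    return unicodedata.normalize("NFC", entry)


print(index_key(za_precomposed) == index_key(za_with_nukta))  # True

# Show which code points each normalization form produces.
for form in ("NFC", "NFD"):
    normalized = unicodedata.normalize(form, za_precomposed)
    print(form, [f"U+{ord(c):04X}" for c in normalized])
```

Note that U+095B is a composition exclusion in Unicode, so both NFC and NFD yield the sequence U+091C U+093C; normalization never produces the single code point.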
> I can also easily imagine that our patterns don't work with NFD input
> with Hyphenator.js. I'm not sure how patterns in Firefox or OpenOffice
> deal with normalization. I never tested that.
> But in my opinion the engine *should* be capable of doing normalization.
> Otherwise you can easily end up with an exponential problem. A pattern with 3
> accented letters can easily result in 8 or even more duplicated
> patterns to cover all possible combinations of composed-or-decomposed
> characters.
> Arthur had some plans to cover normalization in hyph-utf8, but I
> already hate the idea of a duplicated apostrophe, let alone all
> duplications just for the sake of "stupid engines that don't
> understand Unicode" :).
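The count of 8 for three accented letters can be checked with a short Python sketch. The function name spellings() and the sample string are assumptions of mine; the code only enumerates fully composed versus fully decomposed choices per letter, which is the simplest case.

```python
import itertools
import unicodedata


def spellings(text: str) -> set:
    """All strings obtained by writing each character in NFC or in NFD."""
    per_char = []
    for ch in text:
        forms = {unicodedata.normalize("NFC", ch),
                 unicodedata.normalize("NFD", ch)}
        per_char.append(sorted(forms))
    # Cartesian product: one independent composed/decomposed choice per letter.
    return {"".join(combo) for combo in itertools.product(*per_char)}


# Three long vowels as used in transliterated Sanskrit: a, i, u with macron.
variants = spellings("\u0101\u012b\u016b")
print(len(variants))  # 8
```

Each accented letter doubles the number of spellings, so n such letters give 2^n variants. Letters carrying two diacritics admit partially composed forms as well, which is where the "or even more" comes from.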