[XeTeX] Hyphenation in Transliterated Sanskrit

Mon Sep 12 11:42:00 CEST 2011

On 12 Sep 2011, at 08:59, Mojca Miklavec wrote:

> On Mon, Sep 12, 2011 at 09:36, Yves Codet wrote:
>> Hello.
>> 
>> A question to specialists, Arthur and Mojca maybe :) Is it necessary to have two sets of hyphenation rules, one in NFC and one in NFD? Or, if hyphenation patterns are written in NFC, for instance, will they be applied correctly to a document written in NFD?
> 
> That depends on engine.
> 
> From what I understand, XeTeX does normalize the input, so NFD should
> work fine. But I'm only speaking from memory based on Jonathan's talk
> at BachoTeX.

xetex will normalize text as it is being read from an input file IF the parameter \XeTeXinputnormalization is set to 1 (NFC) or 2 (NFD), but will leave it untouched if it's zero (which is the initial default).
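To make the three settings concrete, here is a minimal sketch in Python (not xetex code). The helper name normalize_input is hypothetical; it only mimics the 0/1/2 behaviour described above, using the standard unicodedata module, on a letter that is common in transliterated Sanskrit.

```python
import unicodedata

# Hypothetical helper, for illustration only: mimics the three settings
# described above (0 = leave untouched, 1 = NFC, 2 = NFD).
def normalize_input(text: str, mode: int = 0) -> str:
    if mode == 1:
        return unicodedata.normalize("NFC", text)
    if mode == 2:
        return unicodedata.normalize("NFD", text)
    return text

composed = "\u0101"     # LATIN SMALL LETTER A WITH MACRON (NFC spelling)
decomposed = "a\u0304"  # "a" followed by COMBINING MACRON (NFD spelling)

# Left untouched, the two spellings are different code point sequences,
# so a pattern written in one form cannot match text in the other.
print(normalize_input(composed, 0) == normalize_input(decomposed, 0))  # False

# With either normalization applied to both, they become identical.
print(normalize_input(composed, 1) == normalize_input(decomposed, 1))  # True
print(normalize_input(composed, 2) == normalize_input(decomposed, 2))  # True
```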

Note that this would not affect character sequences that might be created in other ways than reading text files - e.g. you could still create unnormalized text within xetex via macros, etc.

Forcing "universal normalization" is hazardous because there are fonts that do not render the different normalization forms equally well, so users may have a specific reason for wanting to use a certain form. (This is, of course, a shortcoming of such fonts, but because this is the real world situation, I'm reluctant to switch on normalization by default in the engine.)

In principle, it seems desirable that the engine should deal with normalization "automagically" when using hyphenation patterns, but this is not currently implemented.

Personally, I'd recommend the use of NFC as a "standard" in almost all situations, and suggest that pattern authors should operate on this assumption; support for non-NFC text may then be less-than-perfect, but I'd consider that a feature request for the engine(s) more than for the patterns.

> I might be wrong. I'm not sure what LuaTeX does. If one
> doesn't write the code, it might be that no normalization will ever
> take place.
> 
> I can also easily imagine that our patterns don't work with NFD input
> with Hyphenator.js. I'm not sure how patterns in Firefox or OpenOffice
> deal with normalization. I never tested that.
> 
> But in my opinion the engine *should* be capable of doing normalization.
> Else you can easily end up with an exponential problem. A pattern with 3
> accented letters can easily result in 8 or even more duplicated
> patterns to cover all possible combinations of composed-or-decomposed
> characters.
> 
> Arthur had some plans to cover normalization in hyph-utf8, but I
> already hate the idea of duplicated apostrophe,

That's a bit different, and it's hard to see how we could avoid it except via special-case code somewhere that "knows" to treat U+0027 and U+2019 as equivalent for certain purposes, even though they are NOT canonically equivalent characters and would not be touched by normalization.

IMO, the "duplicated apostrophe" case is something we have to live with because there are, in effect, two different orthographic conventions in use, and we want both to be supported. They're alternate spellings of the word, and so require separate patterns - just like we'd require for "colour" and "color", if we were trying to support both British and American conventions in a single set of patterns.
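A quick way to check that normalization does not help with the two apostrophes is the following Python sketch (standard unicodedata module only). The sample word and the variable names are just illustrative assumptions.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

typewriter = "l'eau"         # U+0027 APOSTROPHE
typographic = "l\u2019eau"   # U+2019 RIGHT SINGLE QUOTATION MARK

# U+2019 has no decomposition, so every normalization form leaves it alone.
unchanged = all(
    unicodedata.normalize(form, typographic) == typographic for form in FORMS
)

# Consequently the two spellings stay distinct under every form.
still_differ = all(
    unicodedata.normalize(form, typewriter)
    != unicodedata.normalize(form, typographic)
    for form in FORMS
)

print(unchanged, still_differ)  # True True
```

So the two spellings really do need either separate patterns or explicit special-case folding; no choice of normalization form merges them.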

> let alone all
> duplications just for the sake of "stupid engines that don't
> understand unicode" :).

Yes, the engine should handle that. But it doesn't (unless you enable input normalization that matches your patterns).
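To put a number on the duplication Mojca mentions, here is a small Python sketch. The function name spellings and the three-letter sample are hypothetical; it assumes each accented letter has exactly one composed and one decomposed spelling, which is the simplest case.

```python
import itertools
import unicodedata

def spellings(word: str) -> set[str]:
    """All mixes of composed and decomposed spellings of each letter."""
    word = unicodedata.normalize("NFC", word)
    choices = [
        {unicodedata.normalize("NFC", ch), unicodedata.normalize("NFD", ch)}
        for ch in word
    ]
    return {"".join(combo) for combo in itertools.product(*choices)}

word = "\u0101\u012b\u016b"  # a, i, u with macron, all precomposed
variants = spellings(word)

# Three accented letters, two spellings each: 2 ** 3 variants.
print(len(variants))  # 8

# Normalizing the input to the form used by the patterns collapses
# all of them back to a single spelling.
print({unicodedata.normalize("NFC", v) for v in variants} == {word})  # True
```

Letters carrying more than one combining mark would have more than two spellings each, hence "8 or even more".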

JK