[XeTeX] Re: New feature request for XeTeX

Jonathan Kew jonathan_kew at sil.org
Tue Jul 27 15:03:48 CEST 2004

On 27 Jul 2004, at 12:10 pm, Ross Moore wrote:

> A  many--1  or  many--many  map is certainly a bit harder.
> It requires:
>   (i) the rules to be ordered  (e.g.  ---  must be tested and applied 
> before  -- )

Longest-match-wins is the usual principle; that's what the TECkit 
mapping engine does.
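
As a rough illustration of longest-match-wins (a Python sketch only, not TECkit's actual implementation; the RULES table and the apply_mapping name are made up for this example), the matcher tries the longest rule first at each position, so the author of the mapping does not have to order the rules by hand:

```python
# Toy rule table (an assumption for illustration): TeX-style dash ligatures.
RULES = {
    "--": "\u2013",   # en dash
    "---": "\u2014",  # em dash
}


def apply_mapping(text, rules=RULES):
    """Greedy left-to-right rewrite; at each position the longest matching key wins."""
    # Sort keys longest first so "---" is never consumed as "--" followed by "-".
    keys = sorted(rules, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(rules[key])
                i += len(key)
                break
        else:
            # No rule matches here: copy one character through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)
```

With this table, apply_mapping("a---b--c") produces an em dash between a and b and an en dash between b and c, whatever order the two rules were written in.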

>  (ii) proper integration with the hyphenation routine.
> For (ii) a definite rule is that there cannot be hyphenation between
> the code-points returned by a  *--many  mapping replacement.

No, I envisage hyphenation being applied to the complete sequence of 
characters that make up the word *after* the mapping has been applied. 
So (to use Somadeva's example), you'd say something like

	\font\transdev="Devanagari MT:mapping=Lat2Dev" at 12pt
	\transdev \language=\hindi
	...hindi text written in (Unicode) latin transliteration...

and this would be mapped via the Lat2Dev mapping into Unicode 
Devanagari characters; and the hyphenation rules for the \hindi 
language (defined in terms of Unicode Devanagari) would be used.
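
The ordering can be sketched in Python as follows. Everything here is a stand-in: LAT2DEV is a three-entry toy table, not the real Lat2Dev mapping, and BREAK_BEFORE is an invented rule, not real Hindi hyphenation patterns. The only point is that the break positions are computed on, and index into, the mapped Devanagari string rather than the Latin input.

```python
# Hypothetical stand-ins, for illustration only.
LAT2DEV = {
    "ka": "\u0915",  # DEVANAGARI LETTER KA
    "ma": "\u092e",  # DEVANAGARI LETTER MA
    "la": "\u0932",  # DEVANAGARI LETTER LA
}

# Toy "hyphenation rule", stated in terms of Devanagari characters:
# a break is allowed immediately before either of these letters.
BREAK_BEFORE = {"\u092e", "\u0932"}


def map_word(word, table=LAT2DEV):
    """Step 1: rewrite the transliterated word, two Latin letters at a time."""
    return "".join(table[word[i:i + 2]] for i in range(0, len(word), 2))


def break_points(mapped):
    """Step 2: find break positions by looking only at the mapped text."""
    return [i for i in range(1, len(mapped)) if mapped[i] in BREAK_BEFORE]


def hyphenate(word):
    """Map first, then hyphenate the result; returns (mapped text, break indices)."""
    mapped = map_word(word)
    return mapped, break_points(mapped)
```

For the six-letter input "kamala", hyphenate returns a three-character Devanagari string with breaks at indices 1 and 2 of that string; the Latin spelling plays no part in the second step.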

Incidentally--and this may catch people somewhat by surprise--the text 
content of such words, when displayed by XeTeX (e.g., in "overfull box" 
messages, or with \showbox) will be the characters *after* the 
application of the mapping.

> Apart from (i) and (ii), I don't see much difficulty in this,
> and ligatures could then be handled very easily.
> Of course I'm not looking from the same view-point as you; so defer
> to your experience in programming this kind of thing for a TeX engine.
>> I'm also toying with ideas of more powerful mappings; it happens that 
>> I have a character-mapping engine, TECkit (see 
>> http://scripts.sil.org/teckit) that we could perhaps press into 
>> service. TECkit supports many-to-many mappings, 
>> contextually-determined mappings, even reordering of code sequences.
> Wow; that's more than I was requesting --- at this stage!
>> It was developed primarily to support complex mappings between legacy 
>> byte encodings and Unicode, but can also operate as a transducer 
>> entirely within Unicode, which is what would be needed here as XeTeX 
>> will already have interpreted the input text as Unicode when it was 
>> initially read from the file.
> Looking at your TECkit overview, you seem to have already addressed
> the kind of problem that I'm trying to solve. (No, I didn't already
> know of this work!)

Yes, it's addressing some of the same issues, but at a different level. 
In principle, I'd be inclined to say that the TeX source documents 
should be updated to use "proper" Unicode character encoding 
throughout. But I realize this is a huge task and in many cases may be 
impractical, at least for the time being, partly because of a need to 
maintain compatibility with non-Unicode TeX (and other) systems. This 
"font mapping" scheme may be a great way to allow legacy documents to 
work better in a Unicode world.

