[XeTeX] Re: New feature request for XeTeX
jonathan_kew at sil.org
Tue Jul 27 15:03:48 CEST 2004
On 27 Jul 2004, at 12:10 pm, Ross Moore wrote:
> A many--1 or many--many map is certainly a bit harder.
> It requires:
> (i) the rules to be ordered (e.g. --- must be tested and applied
> before -- )
Longest-match-wins is the usual principle; that's what the TECkit
mapping engine does.
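
As a rough illustration of longest-match-wins (a minimal sketch only, not TECkit's actual implementation or API), the idea is to try the longest candidate substring first at each position, so that --- is consumed before -- ever gets a chance. The rule table and the function name apply_mapping below are hypothetical, chosen just for this example:

```python
# Hypothetical rule table for illustration; not taken from TECkit.
RULES = {
    "---": "\u2014",  # three hyphens -> em dash
    "--": "\u2013",   # two hyphens -> en dash
    "``": "\u201C",   # two backticks -> left double quote
    "`": "\u2018",    # one backtick -> left single quote
}


def apply_mapping(text, rules=RULES):
    """Rewrite text, preferring the longest matching rule at each position."""
    max_len = max(len(key) for key in rules)
    out = []
    i = 0
    while i < len(text):
        # Try candidates from longest to shortest so rule order is irrelevant.
        for n in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in rules:
                out.append(rules[chunk])
                i += n
                break
        else:
            # No rule matched here: copy one character through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)
```

With this approach the rules need no explicit ordering: the table above could list -- before --- and the result would be the same, because the length of the match decides which rule applies, not its position in the table.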
> (ii) proper integration with the hyphenation routine.
> For (ii) a definite rule is that there cannot be hyphenation between
> code-points returned by a *--many mapping replacement.
No, I envisage hyphenation being applied to the complete sequence of
characters that make up the word *after* the mapping has been applied.
So (to use Somadeva's example), you'd say something like
    \font\transdev="Devanagari MT:mapping=Lat2Dev" at 12pt
    ...hindi text written in (Unicode) latin transliteration...
and this would be mapped via the Lat2Dev mapping into Unicode
Devanagari characters; and the hyphenation rules for the \hindi
language (defined in terms of Unicode Devanagari) would be used.
Incidentally--and this may catch people somewhat by surprise--the text
content of such words, when displayed by XeTeX (e.g., in "overfull box"
messages, or with \showbox) will be the characters *after* the
application of the mapping.
> Apart from (i) and (ii), I don't see much difficulty in this,
> and ligatures could then be handled very easily.
> Of course I'm not looking from the same view-point as you; so defer
> to your experience in programming this kind of thing for a TeX engine.
>> I'm also toying with ideas of more powerful mappings; it happens that
>> I have a character-mapping engine, TECkit (see
>> http://scripts.sil.org/teckit) that we could perhaps press into
>> service. TECkit supports many-to-many mappings,
>> contextually-determined mappings, even reordering of code sequences.
> Wow; that's more than I was requesting --- at this stage!
>> It was developed primarily to support complex mappings between legacy
>> byte encodings and Unicode, but can also operate as a transducer
>> entirely within Unicode, which is what would be needed here as XeTeX
>> will already have interpreted the input text as Unicode when it was
>> initially read from the file.
> Looking at your TECkit overview, you seem to have already addressed
> the kind of problem that I'm trying to solve. (No, I didn't already
> know of this work!)
Yes, it's addressing some of the same issues, but at a different level.
In principle, I'd be inclined to say that the TeX source documents
should be updated to use "proper" Unicode character encoding
throughout. But I realize this is a huge task and in many cases may be
impractical, at least for the time being, partly because of a need to
maintain compatibility with non-Unicode TeX (and other) systems. This
"font mapping" scheme may be a great way to allow legacy documents to
work better in a Unicode world.