[XeTeX] Re: New feature request for XeTeX

Ross Moore ross at ics.mq.edu.au
Mon Jul 26 17:22:23 CEST 2004


Hi Jonathan,

Thanks for the quick reply.


On 26/07/2004, at 9:43 PM, Jonathan Kew wrote:

> Hi Ross,
>
> Interesting idea; I'll think about it a bit. Some initial comments 
> below....

Yes, it takes some thinking about; it's not a small thing.

>
>> What I have in mind could be called a "font mapping",
>> or "encoding mapping".
>> This would be applied on output, whenever a particular font-variant
>> is being used, both when obtaining the size of a sequence of
>> letters/glyphs and in the final output.
>
> (Just an aside: seems to me that this is closer to the input side of 
> things than the output.)

No; it isn't really on the input side...


> In effect, you're asking for the ability to apply a custom mapping 
> from 7-bit codes to Unicode values, specified as part of the \font 
> declaration. Is that roughly the idea?
>
>> What I want to do is to be able to continue to use the old TeX 
>> sources,
>> but to end up with the Unicode code-points in the output
>
> In principle, of course, you can do this by \catcoding the custom 
> input characters as active chars, and then \defining them as macros 
> that generate the appropriate Unicode codepoints. But that gets really 
> cumbersome if you need lots of them, and also need to use the same 
> (input) characters within control sequences, etc.

  ... nor is it really implementable with \catcodes alone. At least,
that isn't going to work with legacy documents, in particular with
user-defined macro definitions.


Consider this example, still using the case of  tipa.sty , where every
uppercase letter has to be redefined.

\def\mypet{CAT}
{\rm \mypet}    % use an ordinary roman font
{\tipafont\tipacatcodes \mypet}  % use the special font, and catcodes ?

This will *not* result in active characters for the C A T, since they
were not active at the time the macro was defined.


On the other hand, if you try:

{\tipacatcodes \gdef\mypet{CAT}}
{\rm \mypet}    % use an ordinary roman font
{\tipafont \mypet}  % use the special font

then \mypet expands into active characters, etc.
for the roman font, *as well as* for the \tipafont .
That's again not what is wanted.


Of course there is also the problem that the names of macros
cannot contain any of the characters that have been made active;
else you
  (i) cannot call the macro while the catcodes are active;
  (ii) have to jump through hoops to even get the macro defined
       at all, with active characters in the expansion.
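
For concreteness, here is a minimal sketch of the active-character
approach (the codepoint is illustrative only; tipa's actual glyph
assignments differ):

```latex
% Make C active, and have it produce a Unicode IPA codepoint directly.
% U+0254 (LATIN SMALL LETTER OPEN O) is just an illustrative target.
\catcode`\C=\active
\def C{\char"0254 }

% Problem (i) above: from here on, C is no longer of catcode "letter",
% so it cannot appear in any control-sequence name; e.g. \Cup would now
% be scanned as the control symbol \C followed by the letters "up".
```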


Another possible approach is to define the  \tipafont  macro
to start a "parser" that examines each token in the environment.
This will work if there are only letters and non-active characters.
However, if there are macros to be expanded, then the parser needs
to be called again on the result of each expansion.
This can be quite tricky to define, especially when the macros can
take arguments and expand into other macros.


I think that the difficulty of programming this via macros means
that it would be much easier to have the mapping occur
*after* macro-expansion, when you know exactly what characters
the (La)TeX job is trying to place onto the page.

>
>> Having a font-mapping available, there is an obvious way forward,
>> that would alleviate completely the need for pre-processors, and
>> be easier for a user to configure for his/her own needs.
>>
>> The \textipa command would cause the mapping to be applied to the
>> result (*after* macro-expansions) of the usual processing of its
>> argument.
>> Currently \textipa sets the font encoding to T3 for its processing,
>> and switches font to have the correct metrics available.
>
> Not having touched LaTeX in many years (I used 2.09 once upon a time), 
> I don't know how input encodings, font encodings, etc. are handled by 
> current LaTeX packages. What does "sets the font encoding to T3" 
> actually do, in TeX terms?

In TeX terms, a font encoding is just a name, as a string of letters.
Conceptually, this name is associated with a listing of the meanings
of the characters in each \char position for one or more fonts.

The font encoding is used by LaTeX to determine the name of a file to
load, which determines how to construct appropriate \font commands.
These \font commands use names for the TFMs that correspond to the
encoding.
(e.g.  Palatino with OT1 7-bit encoding has the name  pplr7t.tfm
     whereas with T1 8-bit encoding it has name  pplr8t.tfm )

Names for the font-switching macros are also constructed for
styles (\rm, \it, \bf, etc.) and sizes (\large, \Large, etc.).
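
For example, the T1 declarations for Palatino live in a font-definition
file,  t1ppl.fd , selected by the encoding and family names; in
simplified form it contains lines like:

```latex
% simplified extract in the style of t1ppl.fd: the encoding name (T1)
% plus the family (ppl) select this file, and each shape maps to a TFM
\DeclareFontFamily{T1}{ppl}{}
\DeclareFontShape{T1}{ppl}{m}{n}{<-> pplr8t}{}   % medium upright -> pplr8t.tfm
\DeclareFontShape{T1}{ppl}{m}{it}{<-> pplri8t}{} % medium italic  -> pplri8t.tfm
```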


Modern LaTeX also uses the encoding name to affect the expansion
of some commands. That is, a macro can give different results
depending on the current font encoding.
Accents are a good example of this.

For example, in the *same* document I could use fonts and encodings,
such as:
   cmr  encoded as OT1   (7-bit ascii, old-style TeX encoding)
   Palatino  encoded T1  (8-bit, newer TeX encoding, similar to latin1 )
   Lucida Grande  encoded U  (unknown, or Unicode)
           (assumed to be using XeLaTeX, of course)

With appropriate selection of the font,
the result of  \'e  could be quite different:

   OT1:  construct a box containing ' placed over e
   T1:   place the 8-bit character  ^^e9
    U:   place the 16-bit character ^^^^00E9

In TeX terms, \' expands in a complicated way that involves
a macro whose name depends on the encoding.
The 3 different cases end up expanding different macros,
so they can give different results.

Indeed the structure is sufficiently complex that under the same
encoding it is possible to specify what happens to different
characters. (That is, \'o  can be processed in a different way
to how \'e  is processed.)
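
The machinery behind this is LaTeX's set of encoding-specific
declarations; schematically (simplified from the real  ot1enc.def
and  t1enc.def ):

```latex
% OT1: \' builds a box, placing \char19 (the accent glyph) over the argument
\DeclareTextAccent{\'}{OT1}{19}
% T1: \'e collapses to the single 8-bit slot 233, i.e. the character ^^e9
\DeclareTextComposite{\'}{T1}{e}{233}
```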


>
>>
>> With XeTeX, the change would be to an AAT font; but there needs
>> to be an appropriate mapping either:
>>
>>   (i) of 7-bit characters directly to Unicode points;
>> or
>>   (ii) of 7-bit characters to (La)TeX macros, and further processing
>>       to get the Unicode points.
>>
>> The latter would be more flexible, I think -- though perhaps harder
>> to integrate seamlessly into TeX's workflow.
>
> So for (i) do you envisage an extension to the \font command, perhaps 
> something like:
>
> 	\font\tipalucida="Lucida Grande:mapping=tipa" at 10pt
>
> which would load an external file such as "tipa.xmp" (the extension 
> .map is used for too many things already!), containing a mapping to be 
> applied to character codes when using this font.

Yes. Something like this would be fine.
Of course the mapping is optional, like other variant specifiers.
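
A legacy source could then be used completely unchanged; something like
(hypothetical font and mapping names, following the suggested syntax):

```latex
% hypothetical usage of the proposed "mapping=" variant specifier
\font\tipalucida="Lucida Grande:mapping=tipa" at 10pt
\def\mypet{CAT}
{\tipalucida \mypet}  % the 7-bit codes C A T are remapped to Unicode
                      % on output, after macro expansion -- no active
                      % characters or catcode changes needed
```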

>
> Or should the mapping be defined entirely at the TeX source level?

No. That would be inflexible.
If the TeX engine didn't have the mapping pre-compiled,
then it couldn't be used.


>
> Your (ii) seems to me to be a major extension, and probably hard to 
> design and do well. After all, we don't have a sequence of character 
> codes in a specific font until *after* macro processing; at that 
> point, you envisage replacing characters with macros and 
> re-processing? Could this process recurse indefinitely?

Well, this option was meant as "pushing the boundaries" for what
could be achieved using this idea.

Certainly there would have to be some limitations on what kind of
macros could be used as replacements for mapped character tokens.
Otherwise there could be infinite recursion --- though that is
currently possible in TeX anyway.


> It sounds like this could become an extension on a similar scale to 
> Omega's OTPs and all that stuff, and I don't want to get into 
> designing something of that scope.

I see what you mean, when taken to full generality.
However, a simple mapping of type (i) alone would suffice for many
things. It would act like a single pass through one OTP, done at
the last possible moment.


>
> If you need something more than a simple character code remapping for 
> certain characters, perhaps those instances could be handled as 
> \active characters in TeX, while the "mapping=...." option would allow 
> you to remap the majority of simple characters without having to 
> \activate just about everything. Reasonable compromise?

Sure.

It'll take some experimenting to determine how best to handle ligatures
and hyphenation.


>
>> As for mathematics, this would make it *much, much easier* to get
>> consistent styles for mathematics in a document using an AAT 
>> text-font.
>> This is because there are already code-points for slanted/italicized
>> math-characters, math symbols, extension-brackets, fraktur, etc.
>> Appropriate font-mappings for cmmi, cmsy, cmex  would be easy to 
>> write.
>> (Even some super/subscripts can be supported without changing 
>> font-size!)
>
> I suspect this will prove much harder than you think.

That's why I formulated the request in terms of text fonts first.
This is despite being a mathematician myself. :-)
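
For instance, a type-(i) mapping for  cmmi  would only need to pair the
7-bit slots (OML encoding positions) with Unicode's Mathematical
Alphanumeric Symbols block; illustrative entries for such a
hypothetical mapping file:

```latex
% slot numbers are OML/cmmi positions; targets are Unicode codepoints
%   slot 11  (\alpha)  ->  U+1D6FC  MATHEMATICAL ITALIC SMALL ALPHA
%   slot 65  ("A")     ->  U+1D434  MATHEMATICAL ITALIC CAPITAL A
%   slot 97  ("a")     ->  U+1D44E  MATHEMATICAL ITALIC SMALL A
```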


>
> TeX relies heavily on special metrics in the TFM file to control math 
> typesetting, and when XeTeX loads and uses an AAT font there is no TFM 
> file involved! It measures runs of text by calling ATSUI, but that 
> only provides the basic width of a character sequence; it doesn't have 
> per-character height, depth, and italic correction.

This is knowable, in principle...

> And there's no place for XeTeX to get the "extensible recipe" used 
> when constructing large delimiters, etc.
>
> So this font-mapping mechanism could give you easier access to simple 
> characters in Unicode fonts (while keeping source text in legacy 
> "hacked ASCII" encodings), but I'm doubtful that it would enable you 
> to replace cmex with a Unicode version. For that, I think we really 
> need a Unicode-based extension of the TFM file--which Omega has done, 
> hasn't it? But XeTeX doesn't currently read OFMs.

   ... once you have a place where the information can be found.


>
> Given the limitations, would this still be a worthwhile extension?

Even the simplest form of mapping would be quite useful.
Once that is in place, and it gets some usage, then we'll be able
to determine whether there's value in making it more sophisticated.



All the best,

	Ross


>
> Jonathan
>
> _______________________________________________
> XeTeX mailing list
> postmaster at tug.org
> http://tug.org/mailman/listinfo/xetex
>
------------------------------------------------------------------------
Ross Moore                                         ross at maths.mq.edu.au
Mathematics Department                             office: E7A-419
Macquarie University                               tel: +61 +2 9850 8955
Sydney, Australia                                  fax: +61 +2 9850 8114
------------------------------------------------------------------------


