[XeTeX] Res: small caps not searchable

Jonathan Kew jfkthame at googlemail.com
Tue Aug 4 19:09:14 CEST 2009

On 4 Aug 2009, at 17:54, Adam Twardoch wrote:

> Jonathan Kew wrote:
>> My position is that what xetex and xdvipdfmx are doing here is correct.
>> XeTeX is determining which glyphs to use, by means of the requested
>> OpenType feature. That's its complete responsibility. In order to
>> enhance the usability of the PDF it creates (which would print fine
>> regardless), xdvipdfmx is creating a CMAP, and it is using the font's
>> encoding as its primary source to do this. (If there are unencoded
>> glyphs -- as the small caps *ought* to be -- it is supposed to fall back
>> on glyph names to try and determine the mapping.)
> I believe this is a rather simplistic approach. I believe at least an
> option, or even the default behavior, would be to back-track both the
> unencoded glyphs _as well as_ the glyphs encoded in the PUA to their
> "parent" codepoints by the means of reversing the OpenType Layout
> lookups. This is something Adobe have been doing in InDesign for a long
> time. It's actually not that hard, either.
> I don't consider PUA mapping of glyphs that are otherwise accessible
> only through user-selectable OpenType Layout features a "bug". But I
> maintain that PUA should be considered the last resort source for
> implying the "text value" of a glyph stream by a PDF authoring
> application. Non-PUA codepoints should be primary, then "parent
> codepoints" obtained by reversing OTL lookups, then perhaps glyphnames
> and PUA codepoints only as the very last ones.

Reversing OTL lookups is a tricky business. Suppose a glyph can be  
reached via several different lookups, with different starting points;  
which character code should be chosen?
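
To make the ambiguity concrete, here is a minimal sketch using toy data. The
tables TOY_GSUB and TOY_CMAP and the function reverse_lookups are hypothetical
stand-ins, not anything read from a real font or taken from xdvipdfmx; only the
feature tags smcp and c2sc are real OpenType tags. It inverts two
single-substitution lookups and reports glyphs with more than one candidate
source character.

```python
from collections import defaultdict

# Hypothetical single-substitution lookups, keyed by feature tag.
# Each maps an input glyph name to the glyph that replaces it.
TOY_GSUB = {
    "smcp": {"a": "a.sc", "b": "b.sc"},  # lowercase -> small caps
    "c2sc": {"A": "a.sc", "B": "b.sc"},  # capitals  -> small caps
}

# Hypothetical cmap, inverted for convenience: glyph name -> codepoint.
TOY_CMAP = {"a": 0x61, "b": 0x62, "A": 0x41, "B": 0x42}


def reverse_lookups(gsub, cmap):
    """Map each substituted glyph to every codepoint that can reach it."""
    sources = defaultdict(set)
    for mapping in gsub.values():
        for src, dst in mapping.items():
            if src in cmap:
                sources[dst].add(cmap[src])
    return {glyph: sorted(cps) for glyph, cps in sources.items()}


reversed_map = reverse_lookups(TOY_GSUB, TOY_CMAP)
# Glyphs with more than one possible "parent" character.
ambiguous = {g: cps for g, cps in reversed_map.items() if len(cps) > 1}
print(ambiguous)
```

With these toy tables, a.sc can be reached from both "a" and "A", and nothing
in the lookups themselves says which one should be reported as the text value.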

I'd be prepared to consider preferring glyph names over cmap  
codepoints when the codepoints are in the PUA, but I'm reluctant to  
get involved in trying to reverse OpenType processing.

With glyph names, there's still the issue that a given glyph may have  
been reached from more than one possible source character, but at  
least in this case the font developer has the opportunity to  
unambiguously choose which character is considered "primary". With  
GSUB-reversal, it's not at all clear how this would be done.
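
As an illustration only, the preference order discussed above (non-PUA cmap
codepoint first, then glyph name, then PUA codepoint as the last resort) could
be sketched as follows. This is not xdvipdfmx's code; text_value,
codepoint_from_name and TOY_GLYPH_NAMES are made-up names, the name parsing is
much simpler than the full glyph-naming conventions, and the glyph names and
PUA values in the usage lines are invented.

```python
import re


def is_pua(cp):
    """True if cp lies in one of Unicode's three Private Use ranges."""
    return (0xE000 <= cp <= 0xF8FF
            or 0xF0000 <= cp <= 0xFFFFD
            or 0x100000 <= cp <= 0x10FFFD)


# Tiny stand-in for a glyph-name list; a real tool would load a full one.
TOY_GLYPH_NAMES = {"a": 0x61, "b": 0x62, "eacute": 0xE9}


def codepoint_from_name(name):
    """Simplified name parsing: drop everything after the first period,
    then accept 'uniXXXX', 'uXXXX' to 'uXXXXXX', or a listed name."""
    base = name.split(".", 1)[0]
    m = (re.fullmatch(r"uni([0-9A-F]{4})", base)
         or re.fullmatch(r"u([0-9A-F]{4,6})", base))
    if m:
        return int(m.group(1), 16)
    return TOY_GLYPH_NAMES.get(base)


def text_value(name, cmap_cp):
    """Pick a codepoint: non-PUA cmap entry, then glyph name, then PUA."""
    if cmap_cp is not None and not is_pua(cmap_cp):
        return cmap_cp
    from_name = codepoint_from_name(name)
    if from_name is not None:
        return from_name
    return cmap_cp  # PUA codepoint (or None) as the last resort


print(hex(text_value("a.sc", 0xF761)))     # PUA cmap entry, name wins
print(hex(text_value("mystery", 0xE000)))  # no usable name, PUA kept
```

The point of the sketch is that the font developer controls the outcome through
the glyph name: a small-cap glyph named a.sc resolves to "a" regardless of
which lookup produced it.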

The other way to approach all this -- and in fact the only really  
reliable way, AFAICT -- would be for the originating application  
(xetex in this case) to annotate the glyph stream with /ActualText  
entries, so that the *real* source text is always available.  
Otherwise, even if we can find the "perfect" character mapping for  
every glyph, there will still be issues of character ordering (e.g.,  
when Indic vowels are moved around in the syllable).
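
For reference, PDF expresses /ActualText through a marked-content sequence
(/Span ... BDC ... EMC) around the glyph-showing operators. The sketch below
only builds such a fragment as a string; actual_text_span is a hypothetical
helper, the glyph IDs are invented, and nothing here reflects how xetex or
xdvipdfmx would actually emit it.

```python
def actual_text_span(source_text, content_ops):
    """Wrap PDF content-stream operators in a marked-content sequence
    whose /ActualText carries the original characters."""
    # PDF text strings may be UTF-16BE with a leading byte-order mark,
    # written here as a hex string.
    hex_text = "FEFF" + source_text.encode("utf-16-be").hex().upper()
    return f"/Span <</ActualText <{hex_text}>>> BDC\n{content_ops}\nEMC"


# Hindi "ki": stored as KA (U+0915) + VOWEL SIGN I (U+093F), but the
# vowel-sign glyph is drawn first. Glyph IDs 0x0203 and 0x0147 are made up.
print(actual_text_span("\u0915\u093F", "<02030147> Tj"))
```

Because the annotation carries the source characters in logical order, a reader
extracting text does not need to undo either the glyph substitution or the
reordering.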

