[XeTeX] Res: small caps not searcheable

Jonathan Kew jfkthame at googlemail.com
Tue Aug 4 18:18:14 CEST 2009


On 4 Aug 2009, at 16:32, David Perry wrote:

> I am very eager to get a handle on what's happening here so I  
> understand how XeLaTeX and fontspec handle small capitals.
>
> I assumed -- wrongly -- that typing \textsc{} activated the true  
> Opentype or AAT small capital feature, if the font contained it and  
> one was using fontspec.

No, you assumed *rightly* .... that's exactly what fontspec does.

>  I just did a quick test, copying some text from a PDF that I made  
> using Linux Libertine (which contains OT smcp features).  When I  
> pasted the text into Word I got the PUA smallcaps, not true caps (or  
> even lowercase letters).  Flavio's experience with Minion Pro (older  
> version that has the PUA assignments) bears this out.  As I think  
> about it now, though, it does make sense to me that XeTeX would  
> behave this way.

XeTeX and fontspec don't know anything about the PUA codes, they  
simply apply the OpenType feature and use the resulting glyphs.

The reason you see the small cap glyphs as PUA codes when you try to
search (or copy) in the PDF is that xdvipdfmx automatically creates a
ToUnicode CMap resource, to provide the mapping from glyphs to Unicode
codepoints (otherwise they wouldn't be searchable/copyable at all).
But to create this, it (quite reasonably) relies primarily on the cmap
table of the font; and the font (quite wrongly) maps PUA codepoints to
these glyphs.

I expect this is a legacy of the days before Adobe had any OpenType  
feature support in their apps; they wanted to make the small caps  
accessible, so they encoded them in the PUA, and users (or a
script/macro/whatever) could use those character codes directly. But in
the world of Unicode and OpenType, that is the wrong approach, as
confirmed by the fact that they've stopped doing this in newer fonts.

My position is that what xetex and xdvipdfmx are doing here is correct.
XeTeX is determining which glyphs to use, by means of the requested  
OpenType feature. That's its complete responsibility. In order to  
enhance the usability of the PDF it creates (which would print fine  
regardless), xdvipdfmx is creating a CMAP, and it is using the font's  
encoding as its primary source to do this. (If there are unencoded  
glyphs -- as the small caps *ought* to be -- it is supposed to fall  
back on glyph names to try and determine the mapping.) The "error" is  
in the font, which claims that the small cap glyphs correspond to PUA  
codepoints.
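
To make the mechanism concrete, here is a minimal sketch of inverting a
font's cmap (codepoint to glyph) into the glyph-to-codepoint direction a
PDF writer needs. It is not xdvipdfmx's actual code: the glyph names, the
sample cmaps, and the "lowest codepoint wins" rule are all assumptions
made for illustration.

```python
# Sketch: invert a font's cmap (codepoint -> glyph name) into
# glyph name -> codepoint, and flag glyphs that end up in the PUA.
# Glyph names and sample cmaps are hypothetical.

PUA_START, PUA_END = 0xE000, 0xF8FF  # Private Use Area in the BMP


def is_pua(codepoint):
    """True if the codepoint lies in the BMP Private Use Area."""
    return PUA_START <= codepoint <= PUA_END


def glyph_to_unicode(cmap):
    """Invert the cmap; on clashes the lowest codepoint wins (an assumption)."""
    reverse = {}
    for codepoint in sorted(cmap):
        reverse.setdefault(cmap[codepoint], codepoint)
    return reverse


def pua_only_glyphs(cmap):
    """Glyphs whose chosen encoding is a PUA codepoint."""
    reverse = glyph_to_unicode(cmap)
    return sorted(name for name, cp in reverse.items() if is_pua(cp))


# Hypothetical older font: the small-cap glyph is encoded in the PUA.
old_font_cmap = {0x0041: "A", 0x0061: "a", 0xF761: "a.sc"}
# Hypothetical newer font: the small cap is unencoded (reachable via smcp only).
new_font_cmap = {0x0041: "A", 0x0061: "a"}

print(pua_only_glyphs(old_font_cmap))  # ['a.sc']
print(pua_only_glyphs(new_font_cmap))  # []
```

In the first case a cmap-driven mapping hands the PDF a PUA code for the
small cap; in the second the glyph has no cmap entry at all, so the
writer has to fall back on something else, such as the glyph name.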

>
> As soon as I can, I will test a XeLaTeX document, using fontspec to  
> specifically call OT small caps.

That won't make any difference.

>  I do not own any of the newer Adobe fonts where they have removed  
> the PUA values from the small caps.  Could anyone try one of those  
> fonts and let us know what happens?  I'm guessing that using  
> \textsc{} won't work at all.

No, provided the font has the OpenType small-caps feature, fontspec  
should configure it to work with \textsc.

An interesting question is whether the resulting PDF is searchable.  
Anyone care to try it and let us know? (The answer may depend on  
whether the glyphs are named in accordance with the Adobe naming  
guidelines. I'd hope this is the case for Adobe fonts, at least!)
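
As a rough illustration of why the glyph names matter, the following
sketch derives text from a glyph name along the lines of Adobe's naming
convention: drop any suffix after the first period, split ligatures on
underscores, then look each part up or decode a "uniXXXX" / "uXXXX"
form. The tiny name table is a stand-in for the full Adobe Glyph List,
and nothing here is taken from xdvipdfmx itself.

```python
import re

# Tiny stand-in for the Adobe Glyph List; a real tool would load the full list.
AGL_SUBSET = {"a": "a", "f": "f", "i": "i", "A": "A", "space": " "}

UNI_RE = re.compile(r"uni((?:[0-9A-F]{4})+)$")  # uni + groups of 4 hex digits
U_RE = re.compile(r"u([0-9A-F]{4,6})$")         # u + 4 to 6 hex digits


def valid_scalar(cp):
    """Accept Unicode scalar values only (no surrogates, nothing past U+10FFFF)."""
    return cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def component_to_text(component):
    """Map one ligature component to text, or '' if it cannot be decoded."""
    if component in AGL_SUBSET:
        return AGL_SUBSET[component]
    m = UNI_RE.match(component)
    if m:
        digits = m.group(1)
        cps = [int(digits[i:i + 4], 16) for i in range(0, len(digits), 4)]
        return "".join(map(chr, cps)) if all(map(valid_scalar, cps)) else ""
    m = U_RE.match(component)
    if m:
        cp = int(m.group(1), 16)
        return chr(cp) if valid_scalar(cp) else ""
    return ""


def glyph_name_to_text(name):
    """Strip the suffix after the first '.', then decode each '_' component."""
    base = name.split(".", 1)[0]
    return "".join(component_to_text(part) for part in base.split("_"))


print(glyph_name_to_text("a.sc"))      # a
print(glyph_name_to_text("f_i"))       # fi
print(glyph_name_to_text("uni0041"))   # A
print(repr(glyph_name_to_text("glyph123")))  # ''
```

A small cap named "a.sc" decodes to "a", so it would be searchable; an
arbitrary name like "glyph123" yields nothing, which is why the naming
of the glyphs may decide the outcome.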

>
> Asking users to search using the "faulty" PUA values is not a  
> realistic option--that's why we need OT/AAT features.  For that reason  
> I wouldn't start messing with cmaps, which are not for the faint of  
> heart.

Providing the "right" CMAP is the only way to solve this. I'm not sure  
whether this can be done via the LaTeX cmap package or something  
similar; note that the issue is *not* that there is no CMAP, but that  
it contains mappings to the PUA codepoints (because that's where the  
font encodes these glyphs). I don't know whether a separately-added  
CMAP will be able to override this or not.
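
For what a "right" mapping might contain, here is a hedged sketch that
rewrites PUA small-cap codes to plain lowercase letters and formats the
result as ToUnicode "bfchar" entries. It assumes the legacy layout with
small caps A..Z at U+F761..U+F77A (a particular font may differ), the
glyph IDs are made up, and it says nothing about whether xdvipdfmx or
the cmap package would let such a mapping take effect.

```python
# Assumed legacy layout: small caps A..Z at U+F761..U+F77A.
# Check the actual font before relying on this table.
PUA_SMALL_CAPS = {0xF761 + i: chr(ord("a") + i) for i in range(26)}


def corrected_text(codepoint):
    """Return searchable text, remapping known PUA small caps to lowercase."""
    return PUA_SMALL_CAPS.get(codepoint, chr(codepoint))


def bfchar_lines(glyph_to_codepoint):
    """Format glyph ID -> text pairs as bfchar entries (UTF-16BE hex)."""
    lines = []
    for gid in sorted(glyph_to_codepoint):
        text = corrected_text(glyph_to_codepoint[gid])
        hex_text = text.encode("utf-16-be").hex().upper()
        lines.append(f"<{gid:04X}> <{hex_text}>")
    return lines


# Hypothetical glyph IDs: 36 is 'a', 900 is the small-cap 'a' at U+F761.
sample = {36: 0x0061, 900: 0xF761}
for line in bfchar_lines(sample):
    print(line)
# <0024> <0061>
# <0384> <0061>
```

Both glyphs now map to U+0061, so a search for "a" would match either
form. Mapping small caps to lowercase rather than uppercase is a design
choice here, on the grounds that \textsc is normally applied to
lowercase input.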

>  (I'm not sure I agree fonts that use the small cap PUA values are  
> faulty, but that's a whole different discussion.)

IMO, such fonts are wrong, or at least reflect a poor implementation  
choice.

JK

>
> David
>
> Flavio Costa wrote:
>> Hi Peter,
>> "In the latter case the letters come from the Basic Latin block of  
>> Unicode, in the former they are taken from the PUA, the Private Use  
>> Area, where Junicode, Cardo, Caslon ... encode (or save) their real  
>> small capitals."
>> Do you know why Computer Modern works as expected?
>> "If you want to be able to search small capitals, then use only the  
>> faulty ones. Or add a CMAP that maps them into the Basic Latin (or  
>> some other appropriate) block."
>> What do you mean by "use only the faulty ones"? From what I was  
>> reading yesterday, newer Adobe fonts do not encode small caps in the  
>> PUA anymore; they make them accessible only via OpenType Layout  
>> features. Since Minion Pro has its small caps in the PUA, adding a  
>> cmap may be a good option. Unfortunately I don't know how to do it...
>> I just found the cmap package:
>> http://tug.ctan.org/tex-archive/macros/latex/contrib/cmap/
>> However, I'm not sure it works with XeLaTeX, maybe it does by  
>> adding a <encoding>.cmap?
>> Thanks for the answer,
>> Flavio Costa
>