[XeTeX] Fontspec question

Jonathan Kew jonathan_kew at sil.org
Thu Sep 7 14:19:21 CEST 2006

On 7 Sep 2006, at 1:01 pm, Ralf Stubner wrote:

> Peter Dyballa <Peter_Dyballa at Web.DE> writes:
>> On a side mark, Will: could you add \usepackage{cmap} to your TeX
>> source of fontspec?
> cmap.sty is specific for pdfTeX in PDF-mode with non-virtual fonts. So I
> doubt it would help here. Recently pdfTeX has acquired an automatic
> generator, which uses the glyph names as basis. Similar things exist in
> (x)dvipdfmx.

Right; it attempts to synthesize CMap resources based on glyph names  
in OpenType/TrueType fonts.
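
As a rough illustration of what "synthesizing from glyph names" can mean, here is a minimal Python sketch. It assumes the common glyph-naming conventions (a "uniXXXX" or "uXXXXX" name encodes a code point, "_" joins ligature components, and anything after "." is a variant suffix). The function name glyph_name_to_text and the tiny KNOWN_NAMES table are hypothetical and exist only for this example; this is not the code used by pdfTeX or (x)dvipdfmx.

```python
import re

# Tiny illustrative subset of name-to-character pairs. This table is an
# assumption for the example; real tools consult far larger name lists.
KNOWN_NAMES = {"A": "A", "a": "a", "f": "f", "i": "i", "space": " "}

def glyph_name_to_text(name):
    """Return the text a glyph name suggests, or None if it cannot be guessed."""
    base = name.split(".", 1)[0]  # drop variant suffixes such as ".alt"
    if not base:
        return None  # e.g. ".notdef" carries no character information
    parts = []
    for component in base.split("_"):  # "f_i" names a ligature of f and i
        if component in KNOWN_NAMES:
            parts.append(KNOWN_NAMES[component])
        elif re.fullmatch(r"uni(?:[0-9A-F]{4})+", component):
            digits = component[3:]
            # Each group of four hex digits is one code point.
            for k in range(0, len(digits), 4):
                parts.append(chr(int(digits[k:k + 4], 16)))
        elif re.fullmatch(r"u[0-9A-F]{4,6}", component):
            code_point = int(component[1:], 16)
            if code_point > 0x10FFFF:
                return None  # outside the Unicode range
            parts.append(chr(code_point))
        else:
            return None  # wrong or missing name: nothing to recover
    return "".join(parts)
```

A name such as "uni0915" or "f_i" yields usable text, while a font whose glyphs are called "glyph123" yields nothing, which is exactly the weakness of relying on glyph names.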

With xdv2pdf (which Peter may have been using), I have no real  
control over what happens, as it's all handled by Apple's Quartz  
(CoreGraphics) framework. It often seems to work pretty well, but  
there may well be cases that aren't handled.

> For XeTeX it would of course be best if a CMap were
> generated based on the Unicode /input/, since that would work even when
> glyph names are wrong or missing.

However, the whole business of extracting text/searching/etc in PDF  
files based on CMap resources is a mess, and my advice would be to  
regard PDF as a medium for viewing and printing, not for text data  
exchange. The stream of glyphs present in the PDF may have very  
complex relationships to the underlying Unicode text -- consider, for  
example, Indic scripts where there is extensive reordering of  
elements within the syllable. As I understand it, to search for  
"hindi" in a PDF with Acrobat, you'd effectively have to type "ihndi"  
as the search string (and that's just a small example; it gets much  

Sure, it's nice (especially for plain English text) when copy/paste  
and text search give you a good approximation of what you'd expect,  
but until there's a (widely-supported) way to "annotate" the glyph  
stream in the PDF with the associated Unicode text, rather than  
attempting to recover Unicode characters from the actual sequence of  
glyphs, it will never really be universal and reliable. The
character-to-glyph process is not fully reversible; there's too much
complexity and potential ambiguity in the mappings and transformations.
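
The reordering problem can be made concrete with a small Python sketch. It models only one rule, under simplifying assumptions: the Devanagari vowel sign I (U+093F) is moved in front of the consonant cluster it follows, and a consonant preceded by a virama is treated as part of the same cluster. The function visual_order is hypothetical; real shaping engines apply many more rules than this.

```python
# Devanagari code points used in the example.
HA, NA, DA = "\u0939", "\u0928", "\u0926"
VIRAMA = "\u094D"    # joins consonants into a cluster
SIGN_I = "\u093F"    # short i, drawn to the LEFT of its cluster
SIGN_II = "\u0940"   # long i, drawn to the right

def is_consonant(ch):
    # Devanagari consonants KA..HA occupy U+0915..U+0939.
    return "\u0915" <= ch <= "\u0939"

def visual_order(text):
    """Reorder logical (typed) text into a simplified glyph order."""
    out = []
    cluster_start = 0  # index in `out` where the current cluster begins
    prev = ""
    for ch in text:
        if ch == SIGN_I:
            # The sign is typed after the cluster but drawn before it.
            out.insert(cluster_start, ch)
        else:
            if is_consonant(ch) and prev != VIRAMA:
                cluster_start = len(out)  # a new cluster starts here
            out.append(ch)
        prev = ch
    return "".join(out)

# "hindi" as typed: h, i, n, virama, d, long i.
logical = HA + SIGN_I + NA + VIRAMA + DA + SIGN_II
glyph_stream = visual_order(logical)

# A naive search for the typed string in the glyph-ordered stream fails.
found = logical in glyph_stream
```

Here glyph_stream begins with the vowel sign (the "ihndi" order), so found is False: a search that compares the typed text against the glyph sequence misses the word even in this one-rule model.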

