[XeTeX] XeTeX for Linux and xdvipdfmx

Tue Jun 6 10:14:40 CEST 2006

Pablo Rodríguez <oinos at web.de> writes:

> PDF documents generated from XeLaTeX have fonts embedded many times
> (with characters duplicated in subset fonts) and some characters cannot
> be copied/extracted with pdftotext or Adobe Reader.

I have encountered a similar, possibly related issue. For text extraction
to work reliably, the PDF file should contain a cmap/ToUnicode table.
When (x)dvipdfmx works on a dvi file, it generates such a table from the
glyph names when using Type1 fonts. In particular, when (x)dvipdfmx
finds a glyph named <base>.<variant> and the Unicode position for a
glyph named <base> is known, text extraction yields <base>. A typical
example would be small caps named 'a.sc' etc., where text extraction
would find 'a'.
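
To illustrate the glyph-name rule described above, here is a minimal
sketch in Python. It is not the actual (x)dvipdfmx code; the table
GLYPH_TO_UNICODE and the function unicode_for_glyph are hypothetical
names, and the table is a tiny made-up excerpt of what a real glyph list
would contain. It assumes the variant suffix starts at the first period
in the glyph name.

```python
# Hypothetical excerpt of a glyph-name-to-Unicode table; a real glyph
# list would contain thousands of entries.
GLYPH_TO_UNICODE = {
    "a": 0x0061,
    "b": 0x0062,
    "eacute": 0x00E9,
}


def unicode_for_glyph(name, table=GLYPH_TO_UNICODE):
    """Return the code point for a glyph name, or None if it is unknown.

    Everything from the first period onwards is treated as a variant
    suffix and dropped, so 'a.sc' is looked up as 'a'.
    """
    base = name.split(".", 1)[0]
    return table.get(base)


if __name__ == "__main__":
    for glyph in ("a", "a.sc", "eacute.sc", "zzz.sc"):
        print(glyph, unicode_for_glyph(glyph))
```

With this rule, 'a.sc' resolves to the same code point as 'a', while a
name whose base is not in the table stays unmapped.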

When using xetex in conjunction with an OpenType font, that no longer
works. If the small caps are encoded in the PUA (e.g. MinionPro),
xdvipdfmx seems to embed these PUA code points in the ToUnicode table.
If the small caps are unencoded (e.g. Palatino Linotype), xdvipdfmx
warns that no Unicode mapping is available for certain glyphs.
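
The two failure cases can be told apart mechanically. The following
Python sketch is only an illustration, not anything xdvipdfmx does; the
names is_private_use and extraction_status are hypothetical. It
classifies the code point a glyph would get in the ToUnicode table,
using the Private Use ranges defined by Unicode.

```python
def is_private_use(cp):
    """True if cp lies in one of the Unicode Private Use ranges."""
    return (
        0xE000 <= cp <= 0xF8FF          # BMP Private Use Area
        or 0xF0000 <= cp <= 0xFFFFD     # Supplementary Private Use Area-A
        or 0x100000 <= cp <= 0x10FFFD   # Supplementary Private Use Area-B
    )


def extraction_status(cp):
    """Classify a glyph's mapping; cp is None for an unencoded glyph."""
    if cp is None:
        return "unmapped"       # no mapping at all (the warning case)
    if is_private_use(cp):
        return "private-use"    # mapped, but not to a meaningful character
    return "ok"


if __name__ == "__main__":
    for cp in (0x0061, 0xE000, None):
        print(cp, extraction_status(cp))
```

Both "unmapped" and "private-use" glyphs are the ones for which text
extraction would not return the intended character.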

I don't know what information is present in the xdv file. I assume it
holds only information about glyphs, not about the characters, which
xetex still knows (after all, a.sc is accessed as 'a + smcp feature').
But maybe the glyph-name method that (x)dvipdfmx uses when working on
dvi files with Type1 fonts could be used here, too.

cheerio
ralf