[XeTeX] PDFs and advanced font features

Thu Oct 28 16:18:33 CEST 2010

Hello!

I’m having a problem with the way the advanced font features of XeTeX
interact with PDF reader programs. I’m not sure exactly where the
culprit lies, so I apologize if this is not the right place to ask for
help; redirections are welcome if that is the case.

I’ve been writing my CV (I think the more correct US term is résumé)
in LaTeX, using xelatex to compile it to PDF. I managed to get it to
look pretty much exactly as I wanted. (I’m not quite a typography
expert, but I’m quite pleased with the result if I may say so.)

The document uses a nice font with many OpenType features like small
and titling capitals, lining and old-style numerals, and superscripts
and the like. (Those are the ones I use; there are others.) Therein
lies the problem: as far as I can tell, “variant” characters, like
small-caps or superscript letters, are represented as additional
(private) code-points within the font, rather than as separate fonts.
For display and printing, this is not a problem: the font is embedded
in the PDF, and it looks as it should everywhere I have tried it.

However, copying and pasting the contents into another program fails
badly. Everything that isn’t displayed in the “normal” variant is
copied to the clipboard as a set of (what I believe to be) private
code-points rather than the “semantic” Unicode code-points it
represents.

This is a big problem for this document, as I expect a potential
employer might try to copy and paste parts of it (e.g., the address)
and unexpectedly get gibberish.
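
For concreteness, here is a minimal sketch of the kind of document I
mean. The font name is only a placeholder (any OpenType font with
small caps, old-style numerals and superiors should do), and the
fontspec options are my best guess at the smallest set that exercises
the same features as my real file; I have not checked that this exact
file reproduces the problem with every font.

```latex
% Minimal sketch; compile with xelatex.
% "Some OpenType Font" is a placeholder, not the font I actually use.
\documentclass{article}
\usepackage{fontspec}
\setmainfont[Numbers=OldStyle]{Some OpenType Font}
\begin{document}

Plain text, which copies and pastes fine.

% Small caps, selected through the font's OpenType small-caps feature.
\textsc{Small Capitals}

% Superscript letters, selected through the font's superiors feature.
1{\addfontfeature{VerticalPosition=Superior}st}

\end{document}
```

Selecting the second and third lines in the resulting PDF and pasting
them into a text editor is where I would expect the private
code-points to show up.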

I’ve tried searching for solutions or workarounds, with little
success. If (as I assume) this is a well-known problem, don’t hesitate
to just point me towards a document that explains it.

I’ve seen PDF documents that seemed to have a kind of “text overlay”:
these were all scanned documents with (I assume) some kind of OCR
processing. For display and printing purposes, only the scanned image
was used (i.e., the OCRed text was invisible). However, when selecting
(and copy/pasting), a text layer was used.

I’ve no idea what PDF feature this used or whether it’s accessible via
LaTeX. I was hoping there was a way to add a “replacement” text for
the affected areas (I searched the hyperref documentation for one,
without success), such that on copy-paste the replacement is used
rather than the private characters. Since it’s a one-page document, it
wouldn’t be a lot of work to add the replacements.

The only alternative I could think of was to take FontForge and
manually split the font into pieces (e.g., one for small caps, one for
superscripts, etc.), such that each variant glyph is encoded in its
“semantic” position. But it’s a big and complex font, so that would
take a lot more work than just “hinting” the document. I also worry
that messing around with it in FontForge will cause me to lose
hinting and other features I (or it) may not be aware of.

I welcome all ideas, and thank you in advance.

--Bogdan Butnaru

PS. What I’m using identifies itself as “XeTeX 3.1415926-2.2-0.9995.2
(TeX Live 2009/Debian)” on Ubuntu. Fontspec reports itself as
“2008/08/09 v1.18”. The problem manifests itself on every PDF viewer I
tried (about one each for Linux, Windows and Mac OS X, and also Google
Docs’ viewer).