[XeTeX] PDFs and advanced font features

Fri Oct 29 17:38:07 CEST 2010

I'm no expert, but I think any attempt to copy and paste special characters from a PDF is
doomed. The searchable image you have seen is an optional output from
Acrobat Professional's OCR processing. It has no chance on earth of
OCR-ing ligatures and swashes, and wouldn't know a small cap if one
jumped up and bit it.

I was playing with Garamond Premier Pro and the Mac's Character Viewer,
trying to find the code point for an ft ligature that XeTeX found and
used, to my delight. As you suspect, it was up in the Private Use Area
of Unicode, a free-for-all jungle where, in another font, an entirely
different glyph will share the same code point.
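
If you want to see how much of a pasted passage ended up in private-use
territory, something like the following Python sketch might help. It
assumes you already have the pasted text as a Python string; the function
name find_private_use and the sample string are made up for illustration.
It relies on the Unicode general category "Co", which marks private-use
code points.

```python
import unicodedata


def find_private_use(text):
    """Return (index, 'U+XXXX') pairs for each private-use character in text."""
    hits = []
    for index, ch in enumerate(text):
        # General category "Co" covers the BMP Private Use Area
        # (U+E000..U+F8FF) and the two supplementary private-use planes.
        if unicodedata.category(ch) == "Co":
            hits.append((index, "U+%04X" % ord(ch)))
    return hits


if __name__ == "__main__":
    # Hypothetical sample: "so", then one private-use glyph, then "ly".
    sample = "so\ue0f1ly"
    print(find_private_use(sample))
```

This only tells you where the private-use characters are. It cannot say
which letters they stand for, since that depends entirely on the font.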

Since you are resigned to recruiting agencies and others mangling your
beautiful work, and since your aim is to get your CV in front of a
competing horde of generous employers, you had better admit your role is
not to teach recruiters typography. Send them your beautiful PDF by
all means, but also send a plain text document without ligatures or
other typographic niceties so that it is as easy as possible for them
to prepare the documents they send to prospects. If you make it hard
for them, your CV will go in the rubbish bin.
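
As a rough aid for preparing that plain text version, here is a minimal
Python sketch, assuming the text has been pasted into a Python string.
The function name to_plain_text is hypothetical. It uses NFKC
normalization, which expands the standard ligature characters (such as
U+FB01, the "fi" ligature) into ordinary letters, and it lists any
private-use characters left over, since those have no standard meaning
and would need fixing by hand.

```python
import unicodedata


def to_plain_text(text):
    """Expand compatibility characters and report what could not be fixed."""
    # NFKC replaces compatibility forms such as the "fi" ligature U+FB01
    # with their plain equivalents ("fi").
    normalized = unicodedata.normalize("NFKC", text)
    # Private-use characters carry no standard meaning, so normalization
    # leaves them alone; collect them for hand correction.
    leftovers = sorted({"U+%04X" % ord(ch) for ch in normalized
                        if unicodedata.category(ch) == "Co"})
    return normalized, leftovers


if __name__ == "__main__":
    # Hypothetical sample containing the U+FB01 ligature.
    cleaned, leftovers = to_plain_text("of\ufb01ce")
    print(cleaned, leftovers)
```

Note that NFKC also flattens other typographic forms, superscript
digits for example, which is usually what you want in a plain text CV
but worth checking by eye.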

Sadly, the best way to hawk a CV to recruiters is to do it in
Word. Default styles, default fonts and no hand formatting beyond an
emphasis character style. They are going to copy then "paste special" to
give it their corporate image whether you like it or not.

It's ugly out there.

On 28 Oct 2010, at 15:18, Bogdan Butnaru wrote:

> Hello!
> 
> I’m having a problem with the way the advanced font features of XeTeX
> interact with PDF reader programs. I’m not sure exactly where the
> culprit is, so I apologize if this is not the right place to ask
> for help; (re-)directions are welcome if such is the case.
> 
> I’ve been writing my CV (I think the more correct US term is resume)
> in LaTeX, using xelatex to compile it to PDF. I managed to get it to
> look pretty much exactly as I wanted. (I’m not quite a typography
> expert, but I’m quite pleased with the result if I may say so.)
> 
> The document uses a nice font with many OpenType features like small
> and titling capitals, lining and old-style numerals, and superscripts
> and the like. (Those are the ones I use, there are others.) Therein
> lies the problem: as far as I can tell “variant” characters, like
> small-caps or superscript letters, are represented as additional
> (private) code-points within the font, rather than as separate fonts.
> For display and printing, this is not a problem: the font is embedded
> in the PDF, and everywhere I tried it it seems to look as it should.
> 
> However, when copying and pasting the contents in another program—big
> failure. Everything that isn’t displayed in the “normal” variant is
> copied to the clipboard as a set of (what I believe to be) private
> codepoints rather than the “semantic” Unicode codepoints it
> represents.
> 
> This is a big problem for this document, as I expect a potential
> employer might try to copy&paste parts of it (e.g., address) and fail
> unexpectedly (getting gibberish).
> 
> I’ve tried searching for solutions or workarounds, with little
> success. If (as I assume) this is a well-known problem, don’t hesitate
> to just point me towards a document that explains it.
> 
> I’ve seen PDF documents that seemed to have a kind of “text overlay”:
> these were all scanned documents with (I assume) some kind of OCR
> processing. For display and printing purposes, only the scanned image
> was used (i.e., the OCRed text was invisible). However, when selecting
> (and copy/pasting), a text layer was used.
> 
> I’ve no idea what PDF feature this used and if it’s accessible via
> LaTeX. I was hoping there was a way to add a “replacement” text for
> affected areas (and I searched the hyperref documentation fruitlessly
> for it), such that on copy-paste the replacement is used rather than
> just private characters. Since it’s a one-page document it wouldn’t be
> a lot of work to add the replacements.
> 
> The only alternative I could think of was to take FontForge and
> manually split the font in pieces (e.g., one for small caps, one for
> superscripts, etc.), such that each variant glyph is encoded in its
> “semantic” position. But it’s a big and complex font, so that would
> take a lot more work than just “hinting” the document. I also worry
> that messing around with it in FontForge will cause me to lose
> hinting and other features I (or it) may not be aware of.
> 
> I welcome all ideas, and thank you in advance.
> 
> --Bogdan Butnaru
> 
> PS. What I’m using identifies itself as “XeTeX 3.1415926-2.2-0.9995.2
> (TeX Live 2009/Debian)” on Ubuntu. Fontspec reports itself as
> “2008/08/09 v1.18”. The problem manifests itself on every PDF viewer I
> tried (about one each for Linux, Windows and Mac OS X, and also Google
> Docs’ viewer).

Elliott Roper
phone: +44 1663 747334
mobile: +44 7796 171018
www.yrl.co.uk