[XeTeX] Fwd: PDFs and advanced font features

Bogdan Butnaru bogdanb+xetex at gmail.com
Thu Oct 28 18:55:40 CEST 2010

A bit more information: as I suspected, copy-paste is not *completely*
broken. When I paste the mangled text into a plain-text editor whose
font contains the same variants (e.g., I just selected the same font
in Geany), the variant glyphs are displayed.

So XeTeX isn’t technically mangling anything: it just uses the
private-area codepoints the font uses to encode the small caps/titling
caps/old-style digits. I just need it to stop doing that, somehow —
or, more specifically, to also remember and note in the PDF what those
characters mean.
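
A minimal test file for reproducing this might look like the sketch
below. The font name is only a stand-in (an assumption, not the font
from my document), and whether the copied text comes out mangled
depends on how the particular font encodes its variant glyphs.

```latex
% Sketch only; compile with xelatex.
\documentclass{article}
\usepackage{fontspec}
% Stand-in font (an assumption); substitute the font that shows the problem.
\setmainfont[Numbers=OldStyle]{TeX Gyre Pagella}
\begin{document}
% Try copying this line out of the PDF and pasting it into a text editor.
Normal text 0123456789, then \textsc{text in small caps}.
\end{document}
```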

I notice that ligatures (“fi”, for instance) are represented in the
PDF by the ligature glyph that exists in the font (I suppose it’s
selected contextually), but when copied they yield the “semantic”
sequence of characters (e.g., an “f” and an “i”).

I did a bit of testing and it seems that a few other features are
affected. For example, searching for “f” *does* work even if a
ligature was used (the entire ligature is highlighted, but the search
succeeds). However, searching for text in small caps doesn’t work, at
least not in Ubuntu’s Evince. This also applies to the XeTeX
reference manual I mentioned before: searching for “Hoefler” does find
& highlight the example in small caps on page 4, but searching for
“Warnock” doesn’t.

* * *

At this point in writing I realized that my fruitless searches have
all been about small-caps and variant glyphs. I tried searching for
ligature issues, and noticed this:

So theoretically the basics exist. I’ve tried accsupp, and it doesn’t
quite seem to work. I only have access to Evince for now, so I can’t
test in other PDF viewers; thus, the following may be its fault rather
than either XeTeX’s or accsupp’s:

1) The package does seem to do *something*. The parts I marked with
ActualText tags are no longer selectable in the document.
2) However, a select-all *does* select pieces of text that correspond
in size to the ActualText I typed. These pieces, however:
      a) are placed in seemingly random parts of the page
      b) (when selected) they display just box characters (a
character-shaped square box with diagonals crossed; not sure if it
comes from Evince or the font; the document font has a “.notdef”
character that has the same shape, but I can’t tell if it’s exactly
the same).
      c) when copied and pasted into a text editor, the result is
partly correct (i.e., the correct, “normal” characters get pasted),
but their text gets inserted randomly between the rest of the text.
(The position seems to correspond to the position of the boxes on the
page.)

I experimented a bit with accsupp’s options, but I didn’t find anything
relevant to these problems.
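
For anyone unfamiliar with accsupp, the kind of ActualText markup
discussed above can be sketched roughly as follows. This is an
illustration, not necessarily the exact markup used in the tests
described here: the helper name \copyablesc is hypothetical, the font
is a stand-in, and the sample text is kept to ASCII so that
method=escape is sufficient.

```latex
% Sketch only; compile with xelatex.
\documentclass{article}
\usepackage{fontspec}
\usepackage{accsupp}
% Stand-in font (an assumption); substitute the font that shows the problem.
\setmainfont{TeX Gyre Pagella}

% Hypothetical helper: typeset #1 in small caps and record #1 as the
% replacement text (PDF /ActualText) for the marked span.
\newcommand*{\copyablesc}[1]{%
  \BeginAccSupp{method=escape,ActualText={#1}}%
  \textsc{#1}%
  \EndAccSupp{}%
}

\begin{document}
Name: \copyablesc{Bogdan Butnaru}
\end{document}
```

Whether a viewer honours the replacement text when selecting, copying
or searching is up to the viewer, so the result should be checked in
more than one of them.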

I welcome any help!

-- Bogdan Butnaru

On Thu, Oct 28, 2010 at 17:05, Bogdan Butnaru <bogdanb+xetex at gmail.com> wrote:
> If you need an example of the problem, see the XeTeX manual at
> http://tug.ctan.org/tex-archive/info/xetexref/XeTeX-reference.pdf
> On page 4 there are two examples of small caps usage. On my computer,
> at least, the first one (Warnock Pro in italic+small caps) cannot be
> copied correctly. The second example (in Hoefler Text, bold+small
> caps) however does work. I suspect Hoefler Text uses a different font
> file for the small caps rather than feature tags in a font with normal
> minuscules.
> --Bogdan Butnaru
> On Thu, Oct 28, 2010 at 16:18, Bogdan Butnaru <bogdanb+xetex at gmail.com> wrote:
>> Hello!
>> I’m having a problem with the way the advanced font features of XeTeX
>> interact with PDF reader programs. I’m not sure exactly where the
>> culprit is, so I apologize if this is not the right place to ask
>> for help; (re-)directions are welcome if such is the case.
>> I’ve been writing my CV (I think the more correct US term is resume)
>> in LaTeX, using xelatex to compile it to PDF. I managed to get it to
>> look pretty much exactly as I wanted. (I’m not quite a typography
>> expert, but I’m quite pleased with the result if I may say so.)
>> The document uses a nice font with many OpenType features like small
>> and titling capitals, lining and old-style numerals, and superscripts
>> and the like. (Those are the ones I use, there are others.) Therein
>> lies the problem: as far as I can tell “variant” characters, like
>> small-caps or superscript letters, are represented as additional
>> (private) code-points within the font, rather than as separate fonts.
>> For display and printing, this is not a problem: the font is embedded
>> in the PDF, and everywhere I tried it it seems to look as it should.
>> However, when copying and pasting the contents in another program—big
>> failure. Everything that isn’t displayed in the “normal” variant is
>> copied to the clipboard as a set of (what I believe to be) private
>> codepoints rather than the “semantic” Unicode codepoints it
>> represents.
>> This is a big problem for this document, as I expect a potential
>> employer might try to copy&paste parts of it (e.g., address) and fail
>> unexpectedly (getting gibberish).
>> I’ve tried searching for solutions or workarounds, with little
>> success. If (as I assume) this is a well-known problem, don’t hesitate
>> to just point me towards a document that explains it.
>> I’ve seen PDF documents that seemed to have a kind of “text overlay”:
>> these were all scanned documents with (I assume) some kind of OCR
>> processing. For display and printing purposes, only the scanned image
>> was used (i.e., the OCRed text was invisible). However, when selecting
>> (and copy/pasting), a text layer was used.
>> I’ve no idea what PDF feature this used and if it’s accessible via
>> LaTeX. I was hoping there was a way to add a “replacement” text for
>> affected areas (and I fruitlessly searched the hyperref documentation
>> for it), such that on copy-paste the replacement is used rather than
>> just private characters. Since it’s a one-page document it wouldn’t be
>> a lot of work to add the replacements.
>> The only alternative I could think of was to take FontForge and
>> manually split the font in pieces (e.g., one for small caps, one for
>> superscripts, etc.), such that each variant glyph is encoded in its
>> “semantic” position. But it’s a big and complex font, so that would
>> take a lot more work than just “hinting” the document. I also worry
>> that messing around with it in FontForge will cause me to lose
>> hinting and other features I (or it) may not be aware of.
>> I welcome all ideas, and thank you in advance.
>> --Bogdan Butnaru
>> PS. What I’m using identifies itself as “XeTeX 3.1415926-2.2-0.9995.2
>> (TeX Live 2009/Debian)” on Ubuntu. Fontspec reports itself as
>> “2008/08/09 v1.18”. The problem manifests itself on every PDF viewer I
>> tried (about one each for Linux, Windows and Mac OS X, and also Google
>> Docs’ viewer).
