[XeTeX] how to do (better) searchable PDFs in xelatex?

Jonathan Kew jfkthame at googlemail.com
Mon Oct 15 16:59:44 CEST 2012

On 15/10/12 15:19, Peter Baker wrote:
> Here's an example file:
> %&program=xelatex
> %&encoding=UTF-8 Unicode
> \documentclass{book}
> \usepackage[silent]{fontspec}
> \usepackage{xltxtra}
> \setromanfont{Junicode}
> \begin{document}
> \noindent You can search for these:
> \noindent first flat office afflict\\
> \noindent But you cannot search for these:
> \noindent after fifty front\\
> \noindent You can search for these words because small caps have been
> moved out of the PUA in recent versions of Junicode:
> \noindent\textsc{first flat office afflict after fifty front}
> \end{document}
> Here's a link to an uncompressed (using pdftk) PDF:
> https://dl.dropbox.com/u/35611549/test_uncompressed.pdf
> I honestly have no idea what I'm looking at when I open that in Emacs.
> Here is info about the Junicode ligatures that can't be searched:
> glyph name f_t, encoding U+EECB
> glyph name f_t_y, encoding U+EED0
> glyph name f_r, encoding U+EECA
That's exactly the problem: these glyphs are encoded at PUA codepoints,
so that's what (most) tools will give you as the corresponding character
data when you search or copy text. If they were unencoded, (some) tools
would use the glyph names (e.g. f_t) to infer the relevant characters
(here f followed by t), which would work better.

> Small caps are named like "a.sc" and they are unencoded.
And as they're unencoded, (some) tools will look at the glyph name and 
map it to the appropriate character.
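
To make the difference concrete, here is a rough Python sketch of the kind
of glyph-name inference described above. It is a simplified take on the
usual naming convention (drop any ".suffix", split ligature names on "_",
accept "uniXXXX" / "uXXXX" forms), not the code of any particular PDF
viewer. The function names are made up for illustration, and a real tool
would look components up in the full Adobe Glyph List rather than accept
only single ASCII letters as this sketch does.

```python
import re
import unicodedata


def is_private_use(ch: str) -> bool:
    """Return True if ch is a Private Use character (general category Co)."""
    return unicodedata.category(ch) == "Co"


def chars_from_glyph_name(name: str) -> str:
    """Guess the text a glyph stands for from its name (simplified sketch)."""
    # Everything after the first period is a variant suffix: "a.sc" -> "a".
    base = name.split(".", 1)[0]
    out = []
    # Underscores separate the components of a ligature: "f_t_y" -> f, t, y.
    for part in base.split("_"):
        if re.fullmatch(r"uni(?:[0-9A-F]{4})+", part):
            # "uni" followed by one or more 4-digit hex code units.
            digits = part[3:]
            out.append("".join(chr(int(digits[i:i + 4], 16))
                               for i in range(0, len(digits), 4)))
        elif re.fullmatch(r"u[0-9A-F]{4,6}", part):
            # "u" followed by a single 4-6 digit hex code point.
            out.append(chr(int(part[1:], 16)))
        elif len(part) == 1 and part.isascii() and part.isalpha():
            # Assumption: stand-in for a real glyph-list lookup.
            out.append(part)
        # Unknown components contribute nothing.
    return "".join(out)


if __name__ == "__main__":
    for glyph in ("f_t", "f_t_y", "f_r", "a.sc"):
        print(glyph, "->", chars_from_glyph_name(glyph))
    # The codepoint Junicode assigns to f_t is in the Private Use Area.
    print(is_private_use("\uEECB"))
```

With the glyphs from the example, f_t, f_t_y and f_r come out as "ft",
"fty" and "fr", and a.sc comes out as "a", while U+EECB is reported as a
private-use character. A tool that trusts the PUA encoding stops at U+EECB
and never gets as far as looking at the name.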

> The font is
> generated by FontForge. The PDF is generated by XeTeX (XeLaTeX
> actually). I don't know if another program (e.g. LuaTeX) would yield
> different results.
> Peter
> On 10/14/12 10:56 PM, Ross Moore wrote:
> > Any chance of providing example PDFs of this? (preferably using
> > uncompressed streams, to more easily examine the raw PDF content) Do
> > the documents also have CMap resources for the fonts, or is the sole
> > means of identifying the meaning of the ligature characters coming
> > from their names only? Have these difficulties been reported to Adobe
> > recently? If not, would you mind me doing so?
