[XeTeX] how to do (better) searchable PDFs in xelatex?

Mon Oct 15 00:32:02 CEST 2012

It's all in the font, really. If an OpenType (OT) substitution inserts a 
character from the font's Private Use Area (PUA) into the character 
stream (except for a few standard ligatures), searches will break. 
Because of this, modern fonts (including those from Adobe) avoid the 
PUA and place the targets of OT substitutions in unencoded slots with 
names that enable searches (like "Q.alt", "Q_u").
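
To illustrate why names like "Q.alt" and "Q_u" help, here is a rough 
Python sketch of how such a name can be turned back into text. It 
assumes the naming convention described in the Adobe Glyph List 
specification (suffix after the first period is dropped, underscores 
separate ligature components, "uniXXXX" and "uXXXX" spell out code 
points). The function name glyph_name_to_text is my own, and the full 
glyph-name lookup table is deliberately left out, so treat this as a 
simplification rather than what any particular PDF reader does.

```python
import re

def glyph_name_to_text(name: str) -> str:
    """Recover the text behind a glyph name such as "Q_u" or "Q.alt".

    Simplified sketch: only single-letter components and the
    "uniXXXX" / "uXXXX" forms are handled; other names (for example
    "germandbls") would need the full glyph list and are skipped.
    """
    # Drop any suffix after the first period: "Q.alt" -> "Q"
    base = name.split(".", 1)[0]
    chars = []
    # Underscores join the components of a ligature: "Q_u" -> "Q", "u"
    for part in base.split("_"):
        if re.fullmatch(r"uni(?:[0-9A-F]{4})+", part):
            digits = part[3:]
            # "uni" may carry several 4-digit code points in a row
            for i in range(0, len(digits), 4):
                chars.append(chr(int(digits[i:i + 4], 16)))
        elif re.fullmatch(r"u[0-9A-F]{4,6}", part):
            chars.append(chr(int(part[1:], 16)))
        elif len(part) == 1:
            chars.append(part)
    return "".join(chars)

print(glyph_name_to_text("Q_u"))    # Qu
print(glyph_name_to_text("Q.alt"))  # Q
```

A glyph that only has a PUA code point and no usable name gives a 
reader nothing comparable to work from.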

My advice is to seek out fonts that avoid the PUA as much as possible 
(at least for standard features such as "liga"), and to lobby the makers 
of fonts such as Libertine to start avoiding it as well. In Libertine's 
"liga" feature, the pair "Qu" produces a ligature named "Q_u" that sits 
at U+E048 in the PUA. I see a number of other ligatures in that section 
of the PUA as well: fb, ffb, ffh, ffj, ffk, fft, fh, fj, fk, ft and so 
on. The result is a great many very nice-looking PDFs that can't be 
searched reliably.
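
If you want to check a PDF for this problem, one option is to extract 
the text (for example with pdftotext) and look for PUA code points in 
the result. The Python sketch below does the second half; the helper 
name find_pua and the sample string are made up for illustration, and 
it only checks the Basic Multilingual Plane range U+E000 to U+F8FF.

```python
def find_pua(text):
    """Return (offset, 'U+XXXX') for each BMP Private Use Area character."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if 0xE000 <= ord(ch) <= 0xF8FF  # BMP Private Use Area
    ]

# Hypothetical extraction result: "Qu" replaced by the glyph at U+E048
sample = "\ue048antitative"
print(find_pua(sample))  # [(0, 'U+E048')]
```

An empty list for a whole document is a reasonable sign that the font's 
substitutions are not leaking PUA characters into the text.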

Peter

On 10/14/2012 06:04 PM, Andrew Cunningham wrote:
>
> This is the nature of the PDF format. It is a preprint format that 
> focuses on glyphs rather than characters.
>
> It partly depends on the font, and the OT features being used.
>
> In theory you can have ActualText in the PDF, but once you move to 
> complex scripts all bets are off. Without a complete rewrite of the 
> PDF standard, fidelity to the text is not really possible. The PDF 
> format wasn't designed for it.
>
> The way we use PDFs is well outside the design parameters of the format.
>
> It is possible to extract text, but even in the best case, 
> post-processing would be needed to reorder characters in some complex 
> scripts.
>
> Andrew
>
> On Oct 15, 2012 7:57 AM, "Peter Dyballa" <Peter_Dyballa at web.de> wrote:
>
>
>     On 14.10.2012 at 16:30, Joe Corneli wrote:
>
>     > However, if I extend the MWE there slightly, I can find "prefix", but
>     > not "quantitative".  (My PDF reader is Evince on Ubuntu 12.04.)
>
>     The capital Q is not what you see... GNU Emacs tells me:
>
>                         character: ? (displayed as ?) (codepoint 57416, #o160110, #xe048)
>                 preferred charset: unicode (Unicode (ISO10646))
>             code point in charset: 0xE048
>
>     The code point is in the PUA, Private Use Area. I used pdftotext
>     version 0.20.4 to extract the text.
>
>     When I use pdftohtml version 0.20.4 to extract the text and create
>     HTML files, I see in OmniWeb the word: î ^antitative...
>
>     --
>     Greetings
>
>       Pete
>
>     Got Mole problems?
>     Call Avogadro 6.02 x 10^23
>
>
>
>
>     --------------------------------------------------
>     Subscriptions, Archive, and List information, etc.:
>     http://tug.org/mailman/listinfo/xetex
