[XeTeX] how to do (better) searchable PDFs in xelatex?

Andrew Cunningham lang.support at gmail.com
Mon Oct 15 00:04:26 CEST 2012


This is the nature of the PDF format. It is a preprint format that focuses
on glyphs rather than characters.

It partly depends on the font, and the OT features being used.

In theory you can have ActualText in the PDF, but once you move to complex
scripts all bets are off. Without a complete rewrite of the PDF standard,
fidelity to the text is not really possible. The PDF format wasn't
designed for it.

The way we use PDFs is well outside the design parameters of the format.

It is possible to extract text, but even in the best case, post-processing
would be needed to reorder characters in some complex scripts.
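
To give a rough idea of what such post-processing could look like for the
simpler case in the quoted message below (a Private Use Area code point in
place of the expected letters), here is a minimal Python sketch. It is only
an illustration: the mapping U+E048 -> "Qu" is an assumption inferred from
the extracted word in this thread, is specific to the font in question, and
the function names are hypothetical. Reordering characters in complex
scripts would need considerably more than this.

```python
import unicodedata

# Hypothetical, font-specific mapping from Private Use Area code points to
# the text they stand for. U+E048 -> "Qu" is an assumption based on the
# extracted word "<U+E048>antitative" reported in this thread.
PUA_REPLACEMENTS = {"\ue048": "Qu"}


def is_private_use(ch):
    """Return True if ch has Unicode general category Co (private use)."""
    return unicodedata.category(ch) == "Co"


def find_private_use(text):
    """Return (offset, 'U+XXXX') pairs for each private-use character."""
    return [
        (i, "U+{:04X}".format(ord(ch)))
        for i, ch in enumerate(text)
        if is_private_use(ch)
    ]


def repair(text, replacements=PUA_REPLACEMENTS):
    """Replace known private-use characters; leave unknown ones untouched."""
    return "".join(replacements.get(ch, ch) for ch in text)


# Stand-in for text returned by an extraction tool such as pdftotext.
extracted = "\ue048antitative"
print(find_private_use(extracted))
print(repair(extracted))
```

Private-use characters that are not in the mapping are left in place, so
they can still be found and reported rather than silently dropped.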

Andrew
On Oct 15, 2012 7:57 AM, "Peter Dyballa" <Peter_Dyballa at web.de> wrote:

>
> Am 14.10.2012 um 16:30 schrieb Joe Corneli:
>
> > However, if I extend the MWE there slightly, I can find "prefix", but
> > not "quantitative".  (My PDF reader is Evince on Ubuntu 12.04.)
>
> The capital Q is not what you see… GNU Emacs tells me:
>
>                     character:  (displayed as ) (codepoint 57416,
> #o160110, #xe048)
>             preferred charset: unicode (Unicode (ISO10646))
>         code point in charset: 0xE048
>
> The code point is in the PUA, Private Use Area. I used pdftotext version
> 0.20.4 to extract the text.
>
> When I use pdftohtml version 0.20.4 to extract the text and create HTML
> files, I see in OmniWeb the word: î ˆantitative…
>
> --
> Greetings
>
>   Pete
>
> Got Mole problems?
> Call Avogadro 6.02 x 10^23