[XeTeX] how to do (better) searchable PDFs in xelatex?

Mon Oct 15 10:13:24 CEST 2012

On Mon, Oct 15, 2012 at 12:04 AM, Andrew Cunningham wrote:
>
> This is the nature of the PDF format. It is a preprint format that focuses on
> glyphs rather than characters.
>
> It partly depends on the font, and the OT features being used.
>
> In theory you can have ActualText in the PDF, but once you move to complex
> scripts all bets are off. Without a complete rewrite of the PDF standard
> .... fidelity to the text is not really possible. PDF format wasn't designed
> to do it.

I might be wrong, but pdfTeX-generated documents work fine (after
adding an encoding vector) even though the glyphs populate "random" slots
in the font (for example T1 encoding) that have nothing to do with
Unicode.

It should be possible to do something similar in XeTeX/LuaTeX.

I'm not saying that this would solve problems of copy-pasting Arabic
scripts, but it should be possible to cover alternate glyphs for Latin
scripts at least.

Mojca

PS: From http://blogs.adobe.com/insidepdf/2008/07/text_content_in_pdf_files.html

There is an optional auxiliary structure called the "ToUnicode" table
that was introduced into PDF to help with this text retrieval problem.
A ToUnicode table can be associated with a font that does not normally
have a way to determine the relationship between glyphs and Unicode
characters (some do). The table maps strings of glyph identifiers into
strings of Unicode characters, often just one to one, so that the
proper character strings can be made from the glyph references in the
file.
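
To make the idea in the excerpt above concrete, here is a rough sketch of
what a text extractor does with such a table. It is a toy model in Python,
not how any particular PDF reader is implemented: the glyph IDs are made
up, the names TO_UNICODE and extract_text are hypothetical, and for
simplicity each key is a single glyph ID rather than a string of them.

```python
# Toy model of a ToUnicode table: glyph ID -> Unicode string.
# The glyph IDs below are invented for illustration; real values
# depend entirely on the font embedded in the PDF.
TO_UNICODE = {
    0x0024: "A",
    0x0045: "b",
    0x00C0: "fi",  # a single ligature glyph maps to two characters
}

REPLACEMENT = "\ufffd"  # used when a glyph has no entry in the table


def extract_text(glyph_ids, table):
    """Turn a sequence of glyph IDs back into a Unicode string."""
    return "".join(table.get(gid, REPLACEMENT) for gid in glyph_ids)


if __name__ == "__main__":
    # The page content only stores glyph references like these.
    shown_glyphs = [0x0024, 0x00C0, 0x0045]
    print(extract_text(shown_glyphs, TO_UNICODE))
```

The ligature entry is the interesting part: the page shows one glyph, but
copy-paste and search get two characters back. A glyph without an entry
comes out as U+FFFD, which is roughly what happens when the table is
missing or incomplete.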