[XeTeX] how to do (better) searchable PDFs in xelatex?

Mon Oct 15 11:28:47 CEST 2012

2012/10/15 Mojca Miklavec <mojca.miklavec.lists at gmail.com>:
> On Mon, Oct 15, 2012 at 12:04 AM, Andrew Cunningham wrote:
>>
>> This is the nature of the PDF format. It is a preprint format that focuses on
>> glyphs rather than characters.
>>
>> It partly depends on the font, and the OT features being used.
>>
>> In theory you can have ActualText in the PDF, but once you move to complex
>> scripts all bets are off. Without a complete rewrite of the PDF standard
>> .... fidelity to the text is not really possible. PDF format wasn't designed
>> to do it.
>
> I might be wrong, but pdfTeX-generated documents work fine (after
> adding an encoding vector) even though the glyphs populate "random" slots
> in the font (for example T1 encoding) that have nothing to do with
> Unicode.
>
It works with good fonts in good viewers because these "good fonts"
assign proper names to the glyphs. I tested this many years ago not
only in pdftex but also with tex + dvips + either ps2pdf from GS or
Adobe Distiller.
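
To illustrate why glyph names matter, here is a minimal Python sketch of how
a viewer might turn a glyph name back into text. It follows only a subset of
the naming conventions in the Adobe Glyph List Specification (a suffix after
"." is dropped, "_" joins ligature components, "uniXXXX" and "uXXXX" encode
code points). SAMPLE_AGL and glyph_name_to_text are hypothetical names, and
the table is a tiny stand-in for the real glyph list, not the full data.

```python
import re

# Tiny stand-in for the Adobe Glyph List (assumption: a real tool would
# load the complete list instead of these few entries).
SAMPLE_AGL = {"A": "A", "a": "a", "f": "f", "i": "i", "eacute": "\u00E9"}


def glyph_name_to_text(name, agl=SAMPLE_AGL):
    """Map a glyph name to a Unicode string (subset of the AGL rules)."""
    base = name.split(".", 1)[0]          # "a.alt" -> "a"
    out = []
    for part in base.split("_"):          # "f_i" -> ["f", "i"]
        if part in agl:
            out.append(agl[part])
        elif re.fullmatch(r"uni(?:[0-9A-F]{4})+", part):
            digits = part[3:]             # groups of four hex digits
            out.append("".join(chr(int(digits[i:i + 4], 16))
                               for i in range(0, len(digits), 4)))
        elif re.fullmatch(r"u[0-9A-F]{4,6}", part):
            out.append(chr(int(part[1:], 16)))
        # Components that are not recognised contribute no text.
    return "".join(out)
```

With this sketch an alternate glyph named "a.alt" still resolves to "a" and
a ligature named "f_i" resolves to "fi", which is the behaviour one wants for
searching Latin text. A font whose glyphs carry arbitrary names gives the
viewer nothing to work with.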

> It should be possible to do something similar in XeTeX/LuaTeX.
>
> I'm not saying that this would solve problems of copy-pasting Arabic
> scripts, but it should be possible to cover alternate glyphs for Latin
> scripts at least.
>
> Mojca
>
> PS: From http://blogs.adobe.com/insidepdf/2008/07/text_content_in_pdf_files.html
>
> There is an optional auxiliary structure called the "ToUnicode" table
> that was introduced into PDF to help with this text retrieval problem.
> A ToUnicode table can be associated with a font that does not normally
> have a way to determine the relationship between glyphs and Unicode
> characters (some do). The table maps strings of glyph identifiers into
> strings of Unicode characters, often just one to one, so that the
> proper character strings can be made from the glyph references in the
> file.
>
ToUnicode can only replace a byte with a sequence of bytes. A Type 1 font
can encode only 256 characters, so such a mapping is possible.
Many years ago I developed a ToUnicode map for Velthuis Devanagari:
http://icebearsoft.euweb.cz/dvngpdf/
Complex scripts would require a many-to-many mapping, but that is
impossible with ToUnicode.
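
As a rough illustration of the one-to-many case, the following Python sketch
writes the "bfchar" section of a ToUnicode CMap for a font with single-byte
codes. The function names and the slot numbers in the example are
assumptions chosen for illustration; they are not taken from the Velthuis
Devanagari map. The sketch emits only the bfchar section, not the
surrounding CMap header and trailer.

```python
def utf16be_hex(text):
    """Return the UTF-16BE encoding of text as uppercase hex digits."""
    return text.encode("utf-16-be").hex().upper()


def bfchar_block(mapping):
    """Build one bfchar section from {single-byte code: Unicode string}.

    Each source code maps to a whole string, so one slot can stand for
    several characters (one-to-many), but never the other way round.
    """
    if len(mapping) > 100:
        # A single bfchar section is limited to 100 entries; a complete
        # tool would split longer mappings into several sections.
        raise ValueError("more than 100 entries in one bfchar section")
    lines = [f"{len(mapping)} beginbfchar"]
    for code in sorted(mapping):
        if not 0 <= code <= 0xFF:
            raise ValueError(f"code {code} does not fit in one byte")
        lines.append(f"<{code:02X}> <{utf16be_hex(mapping[code])}>")
    lines.append("endbfchar")
    return "\n".join(lines)
```

For example, bfchar_block({0x1C: "fi", 0x41: "A"}) yields the entries
"<1C> <00660069>" and "<41> <0041>": the ligature slot expands to two
characters. What cannot be written this way is a rule in which the text
depends on several glyphs taken together, or in which the glyph order
differs from the character order.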


-- 
Zdeněk Wagner
http://hroch486.icpf.cas.cz/wagner/
http://icebearsoft.euweb.cz