[XeTeX] turn off special characters in PDF

Khaled Hosny khaledhosny at eglug.org
Wed Jan 1 12:49:17 CET 2014


On Wed, Jan 01, 2014 at 10:07:54PM +1100, Ross Moore wrote:
> > ToUnicode supports one byte to many bytes, not many bytes
> > to many bytes.
> 
> Exactly. This is why /ActualText  is the structure to use.

My only issue with /ActualText is that using it to tag whole words
breaks fine-grained text selection: one cannot select individual
characters inside these words, and searching for one character will
highlight the whole word containing it. Otherwise it is the most
versatile mechanism for preserving the original text in PDF files.

Because of that, I think a better strategy is to use the /ToUnicode
mapping whenever applicable and resort to /ActualText for the
problematic cases, namely one-to-many substitutions, reordering, and
different substitutions leading to the same glyph (though the last one
can be handled by duplicating the glyph under a different name/encoding
when subsetting the font).
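
A rough sketch of that decision in Python, for illustration only. The
cluster model (a piece of text paired with the glyph IDs it was shaped
into) and the name plan_text_mapping are assumptions made for this
example, not XeTeX or driver code, and reordering is assumed to show up
as a multi-glyph cluster.

```python
from collections import defaultdict

def plan_text_mapping(clusters):
    """Split clusters between /ToUnicode and /ActualText (hypothetical helper).

    clusters: list of (text, glyph_ids) pairs.
    Returns (to_unicode, needs_actual_text) where to_unicode maps a glyph ID
    to its text and needs_actual_text lists indices of clusters to be tagged.
    """
    # Collect every text that each single-glyph cluster stands for.
    candidates = defaultdict(set)
    for text, glyphs in clusters:
        if len(glyphs) == 1:
            candidates[glyphs[0]].add(text)

    # A glyph gets a /ToUnicode entry only if it always means the same text.
    to_unicode = {
        glyph: next(iter(texts))
        for glyph, texts in candidates.items()
        if len(texts) == 1
    }

    # Multi-glyph clusters and ambiguous glyphs fall back to /ActualText.
    needs_actual_text = [
        index
        for index, (text, glyphs) in enumerate(clusters)
        if len(glyphs) != 1 or glyphs[0] not in to_unicode
    ]
    return to_unicode, needs_actual_text
```

With this model a ligature glyph for "fi" still gets a plain /ToUnicode
entry, since that is one glyph mapping to several characters, while a
character shaped into two glyphs, or a glyph shared by Latin A and Greek
Alpha, is left for /ActualText.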

The situation in XeTeX is more complex because the typesetting (where
the original text string is known) is done in XeTeX, while the PDF
generation is done by the PDF driver, and the communication channel
between the two (XDV files) passes only glyph IDs, not the original text
strings. So we can only rely on font encodings and glyph names (or try
to guess glyph names by examining simple font substitutions, as in the
upcoming patch).
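
To illustrate what relying on glyph names can look like, here is a
minimal Python sketch that follows the usual Adobe glyph naming
conventions: drop any suffix after the first period, split ligature
names on underscores, and decode uniXXXX and uXXXX to uXXXXXX
components. The function name glyph_name_to_text and the tiny
AGL_SAMPLE table are stand-ins for this example; a real implementation
would use the full Adobe Glyph List, and this is not the code from the
patch.

```python
import re

# Tiny stand-in for the Adobe Glyph List (assumption: the real list is far larger).
AGL_SAMPLE = {"A": "A", "f": "f", "i": "i", "space": " ", "adieresis": "\u00e4"}

_UNI = re.compile(r"uni((?:[0-9A-F]{4})+)")  # uniXXXX, possibly repeated
_U = re.compile(r"u([0-9A-F]{4,6})")         # uXXXX up to uXXXXXX

def glyph_name_to_text(name):
    """Guess the text for a glyph name, or return None if it cannot be derived."""
    base = name.split(".", 1)[0]          # "A.swash" -> "A", ".notdef" -> ""
    pieces = []
    for part in base.split("_"):          # "f_i" -> ["f", "i"]
        match = _UNI.fullmatch(part)
        if match:
            digits = match.group(1)
            pieces.append("".join(
                chr(int(digits[i:i + 4], 16)) for i in range(0, len(digits), 4)
            ))
            continue
        match = _U.fullmatch(part)
        if match:
            code = int(match.group(1), 16)
            if code > 0x10FFFF:           # outside the Unicode range
                return None
            pieces.append(chr(code))
            continue
        if part in AGL_SAMPLE:
            pieces.append(AGL_SAMPLE[part])
            continue
        return None                       # unknown component: give up
    return "".join(pieces)
```

Names that carry no such information, for example generic ones like
glyph123, yield None here, which is the case where the driver has
nothing to go on.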

Regards,
Khaled

