[XeTeX] turn off special characters in PDF
jfkthame at googlemail.com
Wed Jan 1 13:15:09 CET 2014
On 1/1/14 11:49, Khaled Hosny wrote:
> The situation in XeTeX is more complex because the typesetting (where
> the original text string is known) is done in XeTeX, while the PDF
> generation is done by the PDF driver and the communication channel
> between both (XDV files) passes only glyph ids not the original text
I'd suggest that the best way forward here would be to modify XeTeX so
that it includes the original Unicode text in the XDV stream, as well as
the positioned glyphs. Then the driver can write a correct ActualText
entry for each word.
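As a rough illustration of that last step (this is not XeTeX or driver
code): PDF attaches replacement text to a run of content through a
marked-content sequence carrying an ActualText property. The Python sketch
below wraps an already-built glyph-showing operator in such a span. The
function name actualtext_span and the sample glyph string are hypothetical,
and it assumes the driver has somehow been handed the word's Unicode text.

```python
def actualtext_span(word: str, glyph_ops: bytes) -> bytes:
    """Wrap glyph-showing operators in a /Span carrying ActualText.

    word      -- the original Unicode text for this run (assumed to be
                 supplied to the driver alongside the glyphs)
    glyph_ops -- content-stream operators that draw the glyphs,
                 e.g. b"<0123> Tj" (hypothetical glyph id)
    """
    # PDF text strings may be UTF-16BE prefixed with the byte order mark
    # FE FF; writing it as a hex string avoids any escaping problems.
    utf16 = b"\xfe\xff" + word.encode("utf-16-be")
    hex_text = utf16.hex().upper().encode("ascii")
    return (
        b"/Span <</ActualText <" + hex_text + b">>> BDC\n"
        + glyph_ops
        + b"\nEMC"
    )


if __name__ == "__main__":
    # A single ligature glyph standing for the two characters "fi".
    print(actualtext_span("fi", b"<0123> Tj").decode("ascii"))
```

A text extractor that honours ActualText would then report "fi" for this
span regardless of which glyph id the font used for the ligature.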
Including the Unicode text would carry some performance cost, of course;
it could be an optional feature, so that people who just want a
"throwaway" PDF in order to print a document don't have to suffer
slower generation and/or larger files.
This wouldn't address all the problems with PDF text extraction;
higher-level issues of text structure and flow would still be tricky in
the case of documents with any complex layout. But at least the basic
Unicode characters making up each word would be reliably correct.