[XeTeX] turn off special characters in PDF

Jonathan Kew jfkthame at googlemail.com
Wed Jan 1 13:15:09 CET 2014

On 1/1/14 11:49, Khaled Hosny wrote:

> The situation in XeTeX is more complex because the typesetting (where
> the original text string is known) is done in XeTeX, while the PDF
> generation is done by the PDF driver, and the communication channel
> between the two (XDV files) passes only glyph ids, not the original
> text strings.

I'd suggest that the best way forward here would be to modify XeTeX so
that it includes the original Unicode text in the XDV stream, as well as
the positioned glyphs. Then the driver can write a correct ActualText
for each word.
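
To make that concrete, here is a rough sketch in Python of the kind of
content-stream fragment a driver could emit for one word once it has both
the glyph ids and the original text. It is only an illustration, not what
any existing driver does: the function names are made up, the glyph id in
the usage note is hypothetical, and it assumes a CID font with Identity-H
encoding so that each glyph id is written as a two-byte code.

```python
from typing import Sequence


def actual_text_hex(text: str) -> str:
    """Encode text as a PDF hex string: UTF-16BE with a leading byte-order mark."""
    data = b"\xfe\xff" + text.encode("utf-16-be")
    return "<" + data.hex().upper() + ">"


def glyph_run_hex(glyph_ids: Sequence[int]) -> str:
    """Encode glyph ids as two-byte codes (assumes an Identity-H CID font)."""
    return "<" + "".join(f"{gid:04X}" for gid in glyph_ids) + ">"


def wrap_word(text: str, glyph_ids: Sequence[int]) -> str:
    """Wrap one word's glyph run in a marked-content span carrying ActualText."""
    return (
        f"/Span << /ActualText {actual_text_hex(text)} >> BDC\n"
        f"{glyph_run_hex(glyph_ids)} Tj\n"
        "EMC"
    )
```

For example, wrap_word("fi", [300]) would wrap a single ligature glyph
(id 300 is invented here) in a span whose ActualText is the two
characters "f" and "i", which is what a text extractor should then
report in place of whatever it would otherwise guess from the glyph.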

There'd be some performance cost to this, of course; the inclusion of
the Unicode text could be an optional feature, so that people who just
want a "throwaway" PDF in order to print a document don't have to suffer
slower generation and/or larger files.

This wouldn't address all the problems with PDF text extraction;
higher-level issues of text structure and flow would still be tricky in
the case of documents with any complex layout. But at least the basic
Unicode characters making up each word would be reliably correct.

