[XeTeX] turn off special characters in PDF

Zdenek Wagner zdenek.wagner at gmail.com
Mon Dec 30 18:03:14 CET 2013

2013/12/30 Joe Corneli <holtzermann17 at gmail.com>:
> Thanks Ross.
> I think in this case all I really need is to revise \href code to
> insert /ActualText  (because I'm using small caps for hyperlinks in
> this doc).  Pretty much everything else works fine already.
Small caps have nothing to do with the code points; they are just a
different shape of the characters. If you enter \textsc{something},
copy&paste should result in lowercase "something".
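
A quick way to check this on a particular PDF is to paste the copied
text into a short script that lists any non-ASCII code points it
contains. The following is only a minimal Python sketch; the function
name non_ascii_report is made up for illustration and is not part of
any TeX tool.

```python
import unicodedata

def non_ascii_report(text):
    """Return (character, code point, Unicode name) for each non-ASCII character.

    Hypothetical helper: an empty list means the pasted text is plain ASCII.
    """
    report = []
    for ch in text:
        if ord(ch) > 127:
            # Fall back to a placeholder for characters without a Unicode name.
            name = unicodedata.name(ch, "<unnamed>")
            report.append((ch, "U+%04X" % ord(ch), name))
    return report
```

If text copied from \textsc{something} gives an empty list, the small
caps were indeed only a change of glyph shape.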

> Joe
> On Sun, Dec 29, 2013 at 11:45 PM, Ross Moore <ross.moore at mq.edu.au> wrote:
>> Hi Joe,
>> On 30/12/2013, at 8:12 AM, Joe Corneli wrote:
>>> This answer talks about how to turn off ligatures:
>>> http://tex.stackexchange.com/a/5419/4357
>>> Is there a way to turn off *all* special characters (e.g. small caps)
>>> and just get ASCII characters in the copy-and-paste level of the PDF?
>> In short, no!
>>  — because this is against the idea of making more use of Unicode,
>> across all computing platforms.
>> Certainly a ligature can have an /ActualText replacement consisting
>> of the separate characters, but this requires the PDF producer
>> to have supplied this within the PDF, as it is being generated.
>> I've played a lot with this kind of thing, and think that this
>> is the wrong approach. One should use /ActualText to provide
>> the correct Unicode replacement, when one exists. Thus one
>> can extract textual information reliably, even when the PDF
>> uses legacy fonts that may not contain a /ToUnicode resource,
>> or if that resource is inadequate in special situations.
>> Besides, do you really mean *all* special characters?
>> What about simple symbols like: ß∑∂√∫Ω  and all the other
>> myriad foreign/accented letters and mathematical symbols?
>> If you want these to Copy/Paste as TeX coding (\beta  \sum \delta
>> \sqrt etc.) within documents that you write yourself, then I wrote
>> a package called  mmap  where this is an option for the original
>> Computer Modern fonts.
>> Alternatively, a PDF reader might supply a filtering mode that
>> converts the ligatures back to separate characters. Then the
>> user ought to be able to choose whether or not to use this filter.
>> I don't know of any that actually do this.
>> (In any case, you would want such a tool to allow you to specify
>> which characters to replace, and which to preserve.)
>> Your best option is surely to (get someone else to) write such
>> a filter that meets your needs, and use it to post-process the text
>> extracted via Copy/Paste or with other text-extraction tools.
>> Of course this is no use if your aim is to create documents for
>> which others get the desired result via Copy/Paste.
>> For this, the /ActualText approach is what you need.
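The post-processing filter described above could look roughly like
this in Python. It is a sketch under assumptions, not an existing
tool: the names LATIN_LIGATURES and expand_ligatures are hypothetical,
and only the Latin ligature block U+FB00..U+FB06 is treated as
replaceable by default. It relies on Unicode compatibility
decomposition (NFKD), which maps a ligature such as U+FB01 to "fi".

```python
import unicodedata

# Latin ligatures U+FB00..U+FB06 (ff, fi, fl, ffi, ffl, long s-t, st).
LATIN_LIGATURES = frozenset(chr(cp) for cp in range(0xFB00, 0xFB07))

def expand_ligatures(text, replace=LATIN_LIGATURES, preserve=frozenset()):
    """Expand the characters in `replace` to their separate letters.

    Characters listed in `preserve` are left alone, as is everything
    outside `replace` (accented letters, math symbols, and so on).
    """
    out = []
    for ch in text:
        if ch in replace and ch not in preserve:
            # Compatibility decomposition splits the ligature into letters.
            out.append(unicodedata.normalize("NFKD", ch))
        else:
            out.append(ch)
    return "".join(out)
```

The replace and preserve arguments are there so that the user can
choose which characters are converted and which are kept, for example
expand_ligatures(pasted_text, preserve={"\ufb02"}).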
>> Hope this helps,
>>         Ross
>> ------------------------------------------------------------------------
>> Ross Moore                                       ross.moore at mq.edu.au
>> Mathematics Department                           office: E7A-206
>> Macquarie University                             tel: +61 (0)2 9850 8955
>> Sydney, Australia  2109                          fax: +61 (0)2 9850 8114
>> ------------------------------------------------------------------------
>> --------------------------------------------------
>> Subscriptions, Archive, and List information, etc.:
>>   http://tug.org/mailman/listinfo/xetex

Zdeněk Wagner
