[XeTeX] turn off special characters in PDF

Zdenek Wagner zdenek.wagner at gmail.com
Mon Dec 30 18:03:14 CET 2013


2013/12/30 Joe Corneli <holtzermann17 at gmail.com>:
> Thanks Ross.
>
> I think in this case all I really need is to revise \href code to
> insert /ActualText  (because I'm using small caps for hyperlinks in
> this doc).  Pretty much everything else works fine already.
>
Small caps have nothing to do with the code points; they change only
the shape of the glyphs. If you enter \textsc{something}, copy&paste
should give you lowercase "something".
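
If some font or viewer nevertheless pastes something else, the
/ActualText route you mention could look roughly like the sketch below.
It is only an illustration under assumptions: it relies on the accsupp
package (not mentioned so far in this thread), \schref is a hypothetical
helper name, and I assume accsupp picks a driver that works under XeTeX
(if not, loading it with the dvipdfm option may help). The link text
must be plain ASCII for method=escape and must not break across pages.

```latex
% Sketch only, compile with xelatex. \schref is a hypothetical helper.
\documentclass{article}
\usepackage{hyperref}
\usepackage{accsupp} % assumption: driver is detected; else try [dvipdfm]

% \schref{URL}{text}: small-caps link whose extracted text is forced to #2
\newcommand*{\schref}[2]{%
  \href{#1}{%
    \BeginAccSupp{method=escape,ActualText={#2}}% open /ActualText span
    \textsc{#2}%                                  visible small-caps glyphs
    \EndAccSupp{}%                                close the span
  }%
}

\begin{document}
See the \schref{http://tug.org/mailman/listinfo/xetex}{xetex list}.
\end{document}
```

Copying the link text from the resulting PDF should then give the
second argument as typed, whatever glyphs the font uses for small caps.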

> Joe
>
> On Sun, Dec 29, 2013 at 11:45 PM, Ross Moore <ross.moore at mq.edu.au> wrote:
>> Hi Joe,
>>
>> On 30/12/2013, at 8:12 AM, Joe Corneli wrote:
>>
>>> This answer talks about how to turn off ligatures:
>>> http://tex.stackexchange.com/a/5419/4357
>>>
>>> Is there a way to turn off *all* special characters (e.g. small caps)
>>> and just get ASCII characters in the copy-and-paste level of the PDF?
>>
>> In short, no, because this goes against the idea of making more
>> use of Unicode across all computing platforms.
>>
>> Certainly a ligature can have an /ActualText replacement consisting
>> of the separate characters, but this requires the PDF producer
>> to have supplied this within the PDF, as it is being generated.
>>
>> I've played a lot with this kind of thing, and think that this
>> is the wrong approach. One should use /ActualText to provide
>> the correct Unicode replacement, when one exists. Thus one
>> can extract textual information reliably, even when the PDF
>> uses legacy fonts that may not contain a /ToUnicode resource,
>> or if that resource is inadequate in special situations.
>>
>>
>> Besides, do you really mean *all* special characters?
>> What about simple symbols like: ß∑∂√∫Ω  and all the other
>> myriad foreign/accented letters and mathematical symbols?
>>
>> If you want these to Copy/Paste as TeX coding (\beta  \sum \partial
>> \sqrt etc.) within documents that you write yourself, then I wrote
>> a package called  mmap  where this is an option for the original
>> Computer Modern fonts.
>>
>>
>> Alternatively, a PDF reader might supply a filtering mode that
>> converts the ligatures back to separate characters. Then the
>> user ought to be able to choose whether or not to use this filter.
>> I don't know of any that actually do this.
>> (In any case, you would want such a tool to allow you to specify
>> which characters to replace, and which to preserve.)
>>
>>
>> Your best option is surely to (get someone else to) write such
>> a filter that meets your needs, and use it to post-process the text
>> extracted via Copy/Paste or with other text-extraction tools.
>>
>> Of course this is no use if your aim is to create documents for
>> which others get the desired result via Copy/Paste.
>> For this, the /ActualText approach is what you need.
>>
>>
>>
>> Hope this helps,
>>
>>         Ross
>>
>> ------------------------------------------------------------------------
>> Ross Moore                                       ross.moore at mq.edu.au
>> Mathematics Department                           office: E7A-206
>> Macquarie University                             tel: +61 (0)2 9850 8955
>> Sydney, Australia  2109                          fax: +61 (0)2 9850 8114
>> ------------------------------------------------------------------------
>>
>> --------------------------------------------------
>> Subscriptions, Archive, and List information, etc.:
>>   http://tug.org/mailman/listinfo/xetex
>>



-- 
Zdeněk Wagner
http://hroch486.icpf.cas.cz/wagner/
http://icebearsoft.euweb.cz


