[XeTeX] turn off special characters in PDF

Ross Moore ross.moore at mq.edu.au
Wed Jan 1 12:07:54 CET 2014

Hi Zdenek, and others,

On 01/01/2014, at 11:53, Zdenek Wagner <zdenek.wagner at gmail.com> wrote:

>> The attached file (produced using pdfTeX, not XeTeX) is an example
>> that I've used in TUG talks, and elsewhere.
>> Try copy/paste of portions of the mathematics. Be aware that you can
>> get different results depending upon the PDF viewer used when
>> extracting the text.  (The file has uncompressed streams, so you
>> can view it in a decent text editor to see the tagging structures
>> used within the PDF content.)
> If I remember it well, ActualString supports only bytes, not
> codepoints. Thus accented characters cannot be encoded, neither Indic
> characters.

I don't know what you mean by this.
In my testing I can tag pretty much any piece of content, and map it to any string using /ActualText.
Mostly I use Adobe's Acrobat Pro as the PDF reader, and this works fine with it,
modulo some bugs that have been reported when using very long replacement strings.

In the example PDF that I attached to my previous message, each mathematical character is mapped to a big-endian UTF-16 hexadecimal string, with Plane-1 alphanumerics expressed using surrogate pairs. 
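
As a rough illustration of that encoding (a Python sketch, not the Perl preprocessing used for the attached file), the function below turns a Unicode string into a big-endian UTF-16 hexadecimal string with a leading byte-order mark. The function name actualtext_hex is made up for this example.

```python
def actualtext_hex(text: str) -> str:
    """Return text as a PDF hex string: UTF-16BE with a leading byte-order mark."""
    # Python's "utf-16-be" codec emits a surrogate pair for every
    # code point above U+FFFF, e.g. the Plane-1 math alphanumerics.
    payload = text.encode("utf-16-be")
    return "<FEFF" + payload.hex().upper() + ">"


print(actualtext_hex("x"))           # Latin small x, U+0078
print(actualtext_hex("\U0001D465"))  # mathematical italic small x, U+1D465
```

The first call prints <FEFF0078>; the second prints <FEFFD835DC65>, where D835 DC65 is the surrogate pair for U+1D465.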

I see no reason why Indic character strings could not be done similarly.
You probably need some on-the-fly preprocessing to work out the required strings to use.
This is certainly possible, and is what I do with mathematical expressions.
It should be possible to do it entirely within TeX, but the programming can get very tricky, so I use Perl instead.

> ToUnicode supports one byte to many bytes, not many bytes
> to many bytes.

Exactly. This is why /ActualText  is the structure to use.

> Indic scripts use reordering where a matra precedes the
> consonants or some scripts contain two-piece matras. Unless the
> specification was corrected the ToUnicode map is unable to handle the
> Indic scripts properly.

Agreed; /ToUnicode is not what is needed here.
This sounds like precisely the kind of situation where you want to tag an extended block of content and use /ActualText to map it to a pre-constructed Unicode string.
I'm no expert in Indic languages, so I cannot provide specific details or examples.
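
Purely as a sketch of the tagging idea, and not taken from the attached PDF, the Python below wraps some content-stream operators in a /Span marked-content sequence carrying /ActualText. The function name span_with_actualtext and the glyph IDs in glyph_ops are hypothetical; the replacement string is the Devanagari sequence U+0915 U+093F in logical order (consonant first), whatever order the glyphs are drawn in. How the resulting operators get into the PDF produced by XeTeX or pdfTeX is not shown here.

```python
def span_with_actualtext(replacement: str, content_ops: str) -> str:
    """Wrap content-stream operators in a /Span carrying /ActualText."""
    # Replacement text as a hex string: byte-order mark, then UTF-16BE.
    hex_string = "FEFF" + replacement.encode("utf-16-be").hex().upper()
    # BDC opens a marked-content sequence with a property list; EMC closes it.
    return f"/Span << /ActualText <{hex_string}> >> BDC\n{content_ops}\nEMC"


# Hypothetical glyph IDs, standing in for a matra glyph drawn before its consonant.
glyph_ops = "[<00A5 0034>] TJ"
print(span_with_actualtext("\u0915\u093F", glyph_ops))
```

A viewer that honours /ActualText should then extract the two code points in logical order for the whole tagged span, regardless of the visual glyph order inside it.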

>>> --
>>> Regards,
>>> Alexey Kryukov <anagnost at yandex dot ru>
>>> Moscow State University
>>> Faculty of History
>> Hope this helps,
>>        Ross

>> -- 
> Zdeněk Wagner
> http://hroch486.icpf.cas.cz/wagner/
> http://icebearsoft.euweb.cz

Happy New Year,
