[XeTeX] turn off special characters in PDF

Zdenek Wagner zdenek.wagner at gmail.com
Wed Jan 1 16:14:01 CET 2014


2014/1/1 Ross Moore <ross.moore at mq.edu.au>:
> Hi Zdenek, and others,
>
> On 01/01/2014, at 11:53, Zdenek Wagner <zdenek.wagner at gmail.com> wrote:
>
> The attached file (produced using pdfTeX, not XeTeX) is an example
> that I've used in TUG talks, and elsewhere.
> Try copy/paste of portions of the mathematics. Be aware that you can
> get different results depending upon the PDF viewer used when
> extracting the text.  (The file has uncompressed streams, so you
> can view it in a decent text editor to see the tagging structures
> used within the PDF content.)
>
>
> If I remember it well, ActualString supports only bytes, not
> codepoints. Thus accented characters cannot be encoded, nor can Indic
> characters.
>
>
> I don't know what you mean by this.
> In my testing I can tag pretty much any piece of content, and map it to any
> string using /ActualText .
> Mostly I use Adobe's Acrobat Pro as the PDF reader, and this works fine with
> it, modulo some bugs that have been reported when using very long
> replacement strings.
>
> In the example PDF that I attached to my previous message, each mathematical
> character is mapped to a big-endian UTF-16 hexadecimal string, with Plane-1
> alphanumerics expressed using surrogate pairs.
>
Thank you, now I see it. The book where I read about /ActualText did
not mention that I can use UTF-16 if I start the string with a BOM. Can I
see the source of the PDF? It would help me a lot to see how you do all
these things.
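
To make the encoding concrete, here is a minimal sketch in Python of how
such a replacement string could be built: the BOM (FEFF) followed by
big-endian UTF-16, written as a PDF hex string. The function name
pdf_utf16_hex is only an illustration and is not taken from Ross's
sources, which use Perl.

```python
def pdf_utf16_hex(text: str) -> str:
    """Return text as a PDF hex string: BOM (FEFF) + big-endian UTF-16.

    Hypothetical helper for illustration only.
    """
    data = b"\xfe\xff" + text.encode("utf-16-be")
    return "<" + data.hex().upper() + ">"


# U+1D465 MATHEMATICAL ITALIC SMALL X lies in Plane 1,
# so UTF-16 represents it as the surrogate pair D835 DC65.
print(pdf_utf16_hex("\U0001D465"))  # <FEFFD835DC65>

# An accented character from the BMP needs a single 16-bit unit.
print(pdf_utf16_hex("\u00e9"))      # <FEFF00E9>
```

The resulting string is what would go after /ActualText in the PDF.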

> I see no reason why Indic character strings could not be done similarly.
> You probably need some on-the-fly preprocessing to work out the required
> strings to use.
> This is certainly possible, and is what I do with mathematical expressions.
> It should be possible to do it entirely within TeX, but the programming can
> get very tricky, so I use Perl instead.
>
> ToUnicode supports one byte to many bytes, not many bytes
> to many bytes.
>
>
> Exactly. This is why /ActualText is the structure to use.
>
>
> Indic scripts use reordering where a matra precedes the
> consonants, and some scripts contain two-piece matras. Unless the
> specification has been corrected, the ToUnicode map is unable to
> handle the Indic scripts properly.
>
>
> Agreed; /ToUnicode is not what is needed here.
> This sounds like precisely the kind of situation where you want to tag an
> extended block of content and use /ActualText to map it to a
> pre-constructed Unicode string.
> I'm no expert in Indic languages, so cannot provide specific details or
> examples.
>
>
>
> --
>
> Regards,
>
> Alexey Kryukov <anagnost at yandex dot ru>
>
>
> Moscow State University
>
> Faculty of History
>
>
>
>
> Hope this helps,
>
>
>        Ross
>
>
> --
>
> Zdeněk Wagner
> http://hroch486.icpf.cas.cz/wagner/
> http://icebearsoft.euweb.cz
>
>
> Happy New Year,
>
>
>     Ross
>
>
>
>

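For the Indic case, a rough sketch (again in Python) of what tagging a
reordered cluster might look like: the glyphs are shown in visual order,
while /ActualText carries the logical Unicode order. The function name
actualtext_span and the glyph IDs in the example are hypothetical; real
glyph IDs depend on the font.

```python
def actualtext_span(logical_text: str, glyph_ops: str) -> str:
    """Wrap content-stream operators in a /Span with /ActualText.

    logical_text: the Unicode string that copy/paste should yield.
    glyph_ops:    the operators that paint the glyphs (visual order).
    Hypothetical helper for illustration only.
    """
    hex_text = (b"\xfe\xff" + logical_text.encode("utf-16-be")).hex().upper()
    return f"/Span << /ActualText <{hex_text}> >> BDC\n{glyph_ops}\nEMC"


# Devanagari "ki": logical order is KA (U+0915) + vowel sign I (U+093F),
# but the matra glyph is painted before the consonant glyph.
# The glyph IDs 0123 and 0045 are made up for this example.
print(actualtext_span("\u0915\u093F", "<01230045> Tj"))
```

This prints a marked-content sequence beginning with
/Span << /ActualText <FEFF0915093F> >> BDC and ending with EMC, so a
viewer that honours /ActualText should extract the two codepoints in
logical order regardless of the glyph order.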


-- 
Zdeněk Wagner
http://hroch486.icpf.cas.cz/wagner/
http://icebearsoft.euweb.cz


