[XeTeX] turn off special characters in PDF

Zdenek Wagner zdenek.wagner at gmail.com
Wed Jan 1 01:53:49 CET 2014


2014/1/1 Ross Moore <ross.moore at mq.edu.au>:
> Hi Alex,
>
> On 31/12/2013, at 5:20 PM, Alexey Kryukov wrote:
>
>> On Mon, 30 Dec 2013 10:45:39 +1100
>> Ross Moore wrote:
>>
>>> I've played a lot with this kind of thing, and think that this
>>> is the wrong approach. One should use /ActualText to provide
>>> the correct Unicode replacement, when one exists. Thus one
>>> can extract textual information reliably, even when the PDF
>>> uses legacy fonts that may not contain a /ToUnicode resource,
>>> or if that resource is inadequate in special situations.
>>
>> Well, the /ActualText approach looks like an overcomplication to me. I
>> think it is intended for very special cases, like treating the 'ck'
>> cluster in the old German hyphenation rules. For typical ligatures it
>> is sufficient to produce a ToUnicode CMap entry mapping the ligature to
>> its source characters. That's what xetex (or rather xdvipdfmx) actually
>> does... unless, as Khaled has correctly pointed out, the font maps its
>> substitution glyphs to PUA or has no glyph names.
>
> Sure. But if you use such fonts for which the CMap is limited
> in this way, then /ActualText is your best friend.
>
>>
>> And I don't fully understand your remark regarding legacy fonts that may
>> not contain a /ToUnicode resource, since it's up to the PDF generation
>> software (xdvipdfmx in our case) to produce such a resource.
>
> 1.
> Any time a font character is used in 2 or more different ways,
> corresponding to different Unicode points, you will face such issues.
>
> In legacy (e.g. pre-Unicode) fonts this is not uncommon.
>
> For example, in the original CM fonts, the same font character
> was used for both the dot-under and dot-above accents, using macros
> to put the accent within a box and position it either above or
> below the letter being accented.
> The CMap file can only specify a single value for this character.
> What should be the Unicode value?
> Should it be within the "Combining Character" range?
>
> But it is worse than this: for dot-above, the accent appears
> within the PDF *before* the letter being accented, while for
> the dot-under it comes afterwards. Thus combining characters
> will not work, and can result in the wrong letter being accented.
>
> Using an /ActualText is the only reasonable way to cope with
> this --- apart from switching fonts, of course.
>
>
> 2.
> Another example is the ellipsis '…', for which people
> often just type '...' in the source.
> One can use /ActualText to map this combination to the correct
> Unicode character.
>
>
> 3.
> Greek capitals, which look the same as Latin letters, are another
> example.
>
>
> 4.
> There are plenty more examples coming from mathematics; especially
> if you want variable names to copy/paste as Plane-1 alphanumerics.
>
> The attached file (produced using pdfTeX, not XeTeX) is an example
> that I've used in TUG talks, and elsewhere.
> Try copy/paste of portions of the mathematics. Be aware that you can
> get different results depending upon the PDF viewer used when
> extracting the text.  (The file has uncompressed streams, so you
> can view it in a decent text editor to see the tagging structures
> used within the PDF content.)
>
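To make items 2 and 4 above concrete, here is a minimal Python sketch of
what such a marked-content span might look like in a content stream. It
assumes the /ActualText value is written as a PDF text string in UTF-16BE
with a leading FEFF byte-order mark, which is one of the two encodings the
PDF specification defines for text strings. The function names
pdf_text_string_hex and actual_text_span are invented for illustration;
they are not part of pdfTeX, xdvipdfmx or any library.

```python
def pdf_text_string_hex(text: str) -> str:
    """Encode text as a PDF hex string: UTF-16BE preceded by the FEFF byte-order mark."""
    data = b"\xfe\xff" + text.encode("utf-16-be")
    return "<" + data.hex().upper() + ">"


def actual_text_span(replacement: str, shown_operators: str) -> str:
    """Wrap content-stream operators in a marked-content span carrying /ActualText."""
    return (
        f"/Span << /ActualText {pdf_text_string_hex(replacement)} >> BDC\n"
        f"{shown_operators}\n"
        "EMC"
    )


# Three full stops are drawn on the page; the span asks for U+2026 on extraction.
print(actual_text_span("\u2026", "(...) Tj"))

# A Plane-1 letter (U+1D465 MATHEMATICAL ITALIC SMALL X) becomes a surrogate pair.
print(pdf_text_string_hex("\U0001D465"))
```

Whether copy/paste then really yields the replacement text depends on the
PDF viewer, as noted above.
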
If I remember correctly, /ActualText supports only bytes, not
code points. Thus accented characters cannot be encoded, nor can Indic
characters. ToUnicode supports one byte to many bytes, not many bytes
to many bytes. Indic scripts use reordering, where a matra precedes the
consonants, and some scripts contain two-piece matras. Unless the
specification has been corrected, the ToUnicode map is unable to handle
Indic scripts properly.
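
The reordering problem can be sketched in a few lines of Python. This is
a toy model, not what xdvipdfmx does: the glyph IDs are made up, and
"extraction" is simply a per-glyph lookup in glyph order, which is
roughly what a ToUnicode map alone allows. The helper bfchar_line only
formats an entry in the style of a bfchar line (one glyph code mapped to
a UTF-16BE string).

```python
from typing import List

# Hypothetical glyph IDs; real values depend entirely on the font.
GID_KA = 0x0041        # glyph for DEVANAGARI LETTER KA (U+0915)
GID_MATRA_I = 0x0042   # glyph for DEVANAGARI VOWEL SIGN I (U+093F), drawn left of KA
GID_FFI = 0x0043       # glyph for an 'ffi' ligature

TO_UNICODE = {GID_KA: "\u0915", GID_MATRA_I: "\u093F", GID_FFI: "ffi"}


def bfchar_line(gid: int, text: str) -> str:
    """Format one bfchar-style entry: a single glyph code -> UTF-16BE string."""
    return f"<{gid:04X}> <{text.encode('utf-16-be').hex().upper()}>"


def extract(glyph_run: List[int]) -> str:
    """Naive extraction: map each glyph code in turn, keeping the glyph order."""
    return "".join(TO_UNICODE[g] for g in glyph_run)


# One glyph to several characters works fine for a ligature.
print(bfchar_line(GID_FFI, "ffi"))

# "ki" is stored as consonant + matra, but shaped as matra glyph + consonant glyph.
logical = "\u0915\u093F"
visual_run = [GID_MATRA_I, GID_KA]
print(extract(visual_run) == logical)  # the matra comes out before KA
```

No choice of per-glyph entries fixes the order here, because each entry
sees only one glyph code at a time.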
>
>
>
>
>>
>> --
>> Regards,
>> Alexey Kryukov <anagnost at yandex dot ru>
>>
>> Moscow State University
>> Faculty of History
>
>
>
> Hope this helps,
>
>         Ross
>
> ------------------------------------------------------------------------
> Ross Moore                                       ross.moore at mq.edu.au
> Mathematics Department                           office: E7A-206
> Macquarie University                             tel: +61 (0)2 9850 8955
> Sydney, Australia  2109                          fax: +61 (0)2 9850 8114
> ------------------------------------------------------------------------
>
>
>
>
>
>
> --------------------------------------------------
> Subscriptions, Archive, and List information, etc.:
>   http://tug.org/mailman/listinfo/xetex
>



-- 
Zdeněk Wagner
http://hroch486.icpf.cas.cz/wagner/
http://icebearsoft.euweb.cz


