[XeTeX] turn off special characters in PDF

Ross Moore ross.moore at mq.edu.au
Wed Jan 1 00:49:40 CET 2014

Hi Alex,

On 31/12/2013, at 5:20 PM, Alexey Kryukov wrote:

> On Mon, 30 Dec 2013 10:45:39 +1100
> Ross Moore wrote:
>> I've played a lot with this kind of thing, and think that this
>> is the wrong approach. One should use /ActualText to provide
>> the correct Unicode replacement, when one exists. Thus one
>> can extract textual information reliably, even when the PDF
>> uses legacy fonts that may not contain a /ToUnicode resource,
>> or if that resource is inadequate in special situations.
> Well, the /ActualText approach looks like an overcomplication to me. I
> think it is intended for very special cases, like treating the 'ck'
> cluster in the old German hyphenation rules. For typical ligatures it
> is sufficient to produce a ToUnicode CMap entry mapping the ligature to
> its source characters. That's what xetex (actually xdvipdfmx) actually
> does... unless, as Khaled has correctly specified, the font maps its
> substitution glyphs to PUA or has no glyph names.

Sure. But if you use fonts for which the CMap is limited
in this way, then /ActualText is your best friend.

> And I don't fully understand your remark regarding legacy fonts that may
> not contain a /ToUnicode resource, since it's up to the PDF generation
> software (xdvipdfmx in our case) to produce such a resource.

Any time a font character is used in 2 or more different ways,
corresponding to different Unicode points, you will face such issues.

In legacy (e.g. pre-Unicode) fonts this is not uncommon.

For example, in the original CM fonts, the same font character 
was used for both the dot-under and dot-above accents, using macros 
to put the accent within a box and position it either above or
below the letter being accented. 
The CMap file can only specify a single value for this character.
What should be the Unicode value?
Should it be within the "Combining Character" range?

But it is worse than this: for dot-above, the accent appears
within the PDF *before* the letter being accented, while for
dot-under it comes afterwards. Thus combining characters
will not work reliably, and can result in the wrong letter being accented.
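
To make the ordering problem concrete, here is a minimal Python sketch of
what a text consumer might see. It assumes, purely for illustration, that
the CMap maps the shared accent glyph to U+0307 COMBINING DOT ABOVE and
that extracted text is NFC-normalised; the letters "a" and "b" and the
name compose are made up for the example.

```python
import unicodedata

# Assumption: the one CMap entry for the shared accent glyph says U+0307.
ACCENT_FROM_CMAP = "\u0307"  # COMBINING DOT ABOVE


def compose(extracted: str) -> str:
    """Normalise extracted text to NFC, as a consumer typically would."""
    return unicodedata.normalize("NFC", extracted)


# Dot-above: the accent glyph precedes its base letter in the content
# stream, so for "a" followed by a dotted "b" the extracted order is
# a, accent, b. A combining mark attaches to the character before it.
dot_above = compose("a" + ACCENT_FROM_CMAP + "b")

# Dot-under: the accent glyph follows its base letter, which suits a
# combining mark, but the single CMap value still says "dot above".
dot_under = compose("b" + ACCENT_FROM_CMAP)

print(ascii(dot_above))  # '\u0227b' : the dot landed on "a", not on "b"
print(ascii(dot_under))  # '\u1e03'  : dot above, where dot below was meant
```

Under that assumed mapping, the dot-above case moves the accent onto the
preceding letter, and the dot-under case comes out as a dot above.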

Using /ActualText is the only reasonable way to cope with
this, apart from switching fonts, of course.

Another example is the ellipsis '…', for which people
often just use '...' in the source.
One can use /ActualText to map this combination to the correct
Unicode character.
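
As a rough sketch of what such tagging looks like inside the PDF content
stream, the Python below builds a marked-content span whose /ActualText
is the single character U+2026. The function names and the "(...) Tj"
operand are hypothetical, chosen only for illustration; a real stream
would carry whatever font-encoded bytes the three dots occupy.

```python
def pdf_hex_text_string(text: str) -> str:
    """Encode text as a PDF hex string: UTF-16BE with a leading BOM (FEFF)."""
    return "<FEFF" + text.encode("utf-16-be").hex().upper() + ">"


def actual_text_span(replacement: str, content: str) -> str:
    """Wrap content operators in a /Span carrying an /ActualText entry."""
    props = "<< /ActualText " + pdf_hex_text_string(replacement) + " >>"
    return "/Span " + props + " BDC\n" + content + "\nEMC"


# Three separate dots on the page, one HORIZONTAL ELLIPSIS on extraction.
span = actual_text_span("\u2026", "(...) Tj")
print(span)
# /Span << /ActualText <FEFF2026> >> BDC
# (...) Tj
# EMC
```

A viewer that honours /ActualText should then extract the one ellipsis
character in place of whatever the enclosed glyphs would map to.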

Greek capitals that look the same as Latin letters are another example.

There are plenty more examples coming from mathematics, especially
if you want variable names to copy/paste as Plane-1 alphanumerics.
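
For the Plane-1 case, here is a small Python sketch of the mapping from
an ASCII variable name to its Mathematical Italic code point, together
with the UTF-16BE hex (a surrogate pair) that an /ActualText string would
need. The function names are made up for the example, and only the italic
alphabet is covered.

```python
def math_italic(letter: str) -> str:
    """Map one ASCII letter to its Unicode Mathematical Italic counterpart."""
    if len(letter) != 1:
        raise ValueError("expected a single character")
    if letter == "h":
        # U+1D455 is reserved; italic h is U+210E PLANCK CONSTANT.
        return "\u210e"
    if "a" <= letter <= "z":
        return chr(0x1D44E + ord(letter) - ord("a"))
    if "A" <= letter <= "Z":
        return chr(0x1D434 + ord(letter) - ord("A"))
    raise ValueError("not an ASCII letter: " + repr(letter))


def utf16be_hex(text: str) -> str:
    """Hex digits of the UTF-16BE encoding, without a byte-order mark."""
    return text.encode("utf-16-be").hex().upper()


x = math_italic("x")
print(hex(ord(x)))     # 0x1d465
print(utf16be_hex(x))  # D835DC65
```

The /ActualText value for an italic x would then be <FEFFD835DC65>,
whereas the glyph in a legacy math italic font is just slot "x".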

The attached file (produced using pdfTeX, not XeTeX) is an example 
that I've used in TUG talks, and elsewhere.
Try copy/paste of portions of the mathematics. Be aware that you can 
get different results depending upon the PDF viewer used when 
extracting the text.  (The file has uncompressed streams, so you
can view it in a decent text editor to see the tagging structures 
used within the PDF content.)

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 2013-Assign2-soln.pdf
Type: application/pdf
Size: 744843 bytes
Desc: not available
URL: <http://tug.org/pipermail/xetex/attachments/20140101/de3112eb/attachment-0001.pdf>
-------------- next part --------------

> -- 
> Regards,
> Alexey Kryukov <anagnost at yandex dot ru>
> Moscow State University
> Faculty of History

Hope this helps,


Ross Moore                                       ross.moore at mq.edu.au 
Mathematics Department                           office: E7A-206      
Macquarie University                             tel: +61 (0)2 9850 8955
Sydney, Australia  2109                          fax: +61 (0)2 9850 8114
