[XeTeX] Hyperref \hyperlink and \hypertarget not working with accented characters

Thu Nov 3 01:51:34 CET 2011

Hi Zdeněk,

On 02/11/2011, at 9:19 PM, Zdenek Wagner wrote:

>> Don't follow that, Zdenek : the older PDFs will not change,
>> will still contain US ASCII strings and so on, but a newer
>> reader would be able to handle UTF-<whatever> strings as
>> well -- that was my thinking.
>> 
> No, it won't be that easy. Syntax (string) in links is in
> AdobeStandardEncoding and some of these characters are not valid in
> UTF-8.

What has AdobeStandardEncoding got to do with it?
We are not talking about font encodings here, but what
PDF uses internally as a string for destination names.

There are 2 ways this can be handled.

1. §7.3.5  Name Objects

>>> Beginning with PDF 1.2 a name object is an atomic symbol uniquely defined by a sequence of any characters (8-bit values) except null (character code 0). Uniquely defined means that any two name objects made up of the same sequence of characters denote the same object. Atomic means that a name has no internal structure; although it is defined by a sequence of characters, those characters are not considered elements of the name.
>>> 
>>> When writing a name in a PDF file, a SOLIDUS (2Fh) (/) shall be used to introduce a name. The SOLIDUS is not part of the name but is a prefix indicating that what follows is a sequence of characters representing the name in the PDF file and shall follow these rules:
>>> 
>>> a) A NUMBER SIGN (23h) (#) in a name shall be written by using its 2-digit hexadecimal code (23), preceded by the NUMBER SIGN.
>>> 
>>> b) Any character in a name that is a regular character (other than NUMBER SIGN) shall be written as itself or by using its 2-digit hexadecimal code, preceded by the NUMBER SIGN.
>>> 
>>> c) Any character that is not a regular character shall be written using its 2-digit hexadecimal code, preceded by the NUMBER SIGN only.
>>> 
>>> NOTE 1
>>> There is not a unique encoding of names into the PDF file because regular characters may be coded in either of two ways.
>>> White space used as part of a name shall always be coded using the 2-digit hexadecimal notation and no white space may intervene between the SOLIDUS and the encoded name.
>>> 
>>> Regular characters that are outside the range EXCLAMATION MARK(21h) (!) to TILDE (7Eh) (~) should be written using the hexadecimal notation.
>>> The token SOLIDUS (a slash followed by no regular characters) introduces a unique valid name defined by the empty sequence of characters.

© Adobe Systems Incorporated 2008 – All rights reserved 
PDF 32000-1:2008

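Note that many non-letter characters need to be written using a syntax of #.. for the hexadecimal code of that character.
Arbitrary byte-strings are *not* allowed verbatim in this form: bytes outside the printable range must be #-escaped, and the null byte cannot appear at all.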

But arbitrary UTF-8 sequences can be encoded this way.
For example,  "rAsociación"   could be encoded 
as a Name as    /rAsociaci#C3#B3n .
or equivalently as  /r#41sociaci#C3#B3n  and many other ways
     even  /#72#41#73#6F#63#69#61#63#69#C3#B3#6E .
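
As a rough illustration of rules a) to c) quoted above, here is a small
Python sketch. The function name is my own invention, and the choice to
escape every byte outside '!'..'~' follows the "should" recommendation
in the spec; it is not what any particular driver does.

```python
# Sketch only: names and escaping policy are assumptions for illustration.
PDF_DELIMITERS = b"()<>[]{}/%"   # delimiter characters are not "regular"

def encode_pdf_name(text: str) -> str:
    """Write `text` as a PDF name object, using its UTF-8 bytes."""
    out = ["/"]
    for byte in text.encode("utf-8"):
        if byte == 0:
            raise ValueError("a PDF name cannot contain the null character")
        printable = 0x21 <= byte <= 0x7E
        needs_escape = (
            not printable                # white space, controls, bytes >= 7Fh
            or byte == 0x23              # NUMBER SIGN itself
            or byte in PDF_DELIMITERS
        )
        out.append(f"#{byte:02X}" if needs_escape else chr(byte))
    return "".join(out)

print(encode_pdf_name("rAsociaci\u00f3n"))   # prints /rAsociaci#C3#B3n
```

This picks the shortest of the equivalent spellings; the longer ones
above denote exactly the same name object.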

2.  §7.9.2.4   Byte String Type

>>> The byte string type shall be used for binary data that shall be represented as a series of bytes, where each byte may be any value representable in 8 bits. Byte string type is a subtype of string type.
>>> NOTE  The string may represent characters but the encoding is not known. The bytes of the string may not represent characters.

This sounds like you can write any bytes whatever, but that
is actually not true, since the "byte string" type is a sub-type
of "string type", for which certain syntax rules apply.

§7.3.4 String Objects
>>> §7.3.4.1 General
>>> A string object shall consist of a series of zero or more bytes. String objects are not integer objects, but are stored in a more compact format. The length of a string may be subject to implementation limits; see Annex C.
>>> String objects shall be written in one of the following two ways:
>>> • As a sequence of literal characters enclosed in parentheses ( ) (using LEFT PARENTHESIS (28h) and RIGHT PARENTHESIS (29h)); see 7.3.4.2, "Literal Strings."
>>> • As hexadecimal data enclosed in angle brackets < > (using LESS-THAN SIGN (3Ch) and GREATER-THAN SIGN (3Eh)); see 7.3.4.3, "Hexadecimal Strings."

Literal strings are enclosed in parentheses (...)
so there is a need to worry about unbalanced parentheses '(' or ')'
and the backslash \ which introduces escapes.
The escapes \n \t \r \b \f   (newline, tab, etc. ) are available,
and any byte can be written as an octal code  \ddd  (d = octal digit)

For example,  "rAsociación"   could be encoded 
as a literal string as    (rAsociaci\303\263n) .
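
A minimal Python sketch of this form follows. The function name is
hypothetical, and for simplicity it escapes every parenthesis (balanced
ones could legally be left alone) and writes every non-printable byte
in octal.

```python
# Sketch only: a conservative literal-string writer, not a full implementation.
def encode_pdf_literal_string(data: bytes) -> str:
    """Write raw bytes as a PDF literal string, e.g. (abc\\303\\263)."""
    out = ["("]
    for byte in data:
        ch = chr(byte)
        if ch in "()\\":
            out.append("\\" + ch)          # escape parentheses and backslash
        elif 0x20 <= byte <= 0x7E:
            out.append(ch)                 # printable ASCII goes in as itself
        else:
            out.append(f"\\{byte:03o}")    # everything else as 3-digit octal
    out.append(")")
    return "".join(out)

print(encode_pdf_literal_string("rAsociaci\u00f3n".encode("utf-8")))
# prints (rAsociaci\303\263n)
```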

§ 7.3.4.3  Hexadecimal Strings
>>> Strings may also be written in hexadecimal form, which is useful for including arbitrary binary data in a PDF file. A hexadecimal string shall be written as a sequence of hexadecimal digits (0–9 and either A–F or a–f) encoded as ASCII characters and enclosed within angle brackets (using LESS-THAN SIGN (3Ch) and GREATER-THAN SIGN (3Eh)).
>>> EXAMPLE 1   <4E6F762073686D6F7A206B6120706F702E>

For example,  "rAsociación"   could be encoded 
as a hexadecimal string as  
     <7241736F6369616369C3B36E> .
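
In Python this form is close to a one-liner; the sketch below (with a
made-up function name) just hex-encodes the UTF-8 bytes and wraps them
in angle brackets.

```python
# Sketch only: writes raw bytes as a PDF hexadecimal string.
def encode_pdf_hex_string(data: bytes) -> str:
    """Return the bytes as uppercase hex digits inside < >."""
    return "<" + data.hex().upper() + ">"

print(encode_pdf_hex_string("rAsociaci\u00f3n".encode("utf-8")))
# prints <7241736F6369616369C3B36E>
```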

In all these cases the ó is represented as an ASCII-based
textual encoding of a pair of bytes:
   #C3#B3  or  C3B3  or  \303\263   
in different contexts.

It is up to the PDF application to realise that this is UTF-8
and show the ó character, if that is actually appropriate.

The PDF Spec says this *is* appropriate for  Name Objects:

>>> In such situations, the sequence of bytes (after expansion of NUMBER SIGN sequences, if any) should be interpreted according to UTF-8, a variable-length byte-encoded representation of Unicode in which the printable ASCII characters have the same representations as in ASCII. This enables a name object to represent text virtually in any natural language, subject to the implementation limit on the length of a name.
>>> 
>>> NOTE 4
>>> 
>>> PDF does not prescribe what UTF-8 sequence to choose for representing any given piece of externally specified text as a name object. In some cases, multiple UTF-8 sequences may represent the same logical text. Name objects defined by different sequences of bytes constitute distinct name objects in PDF, even though the UTF-8 sequences may have identical external interpretations.

Thus a UTF-8 interpretation probably cannot be assumed for the Literal and Hexadecimal string types.
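
To see what NOTE 4 is getting at, the ó itself makes a handy example.
The Python sketch below uses the standard unicodedata module to show
that the composed and decomposed forms of ó give different UTF-8 byte
sequences, and hence would give distinct PDF names. Which form any
particular TeX workflow emits is not something I have checked.

```python
import unicodedata

# The same visible letter, as one code point or as 'o' plus a combining accent.
composed = unicodedata.normalize("NFC", "o\u0301")    # U+00F3
decomposed = unicodedata.normalize("NFD", "\u00f3")   # U+006F U+0301

print(composed.encode("utf-8").hex())     # prints c3b3
print(decomposed.encode("utf-8").hex())   # prints 6fcc81
print(composed == decomposed)             # prints False
```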

The issue for XeTeX+xdvipdfmx+hyperref  is whether the specs
are really being followed.  I suspect that they are not.
Here's why.

In the attached PDF, based on Andy Black's original posting,
one can look inside the PDF at the hyperlink and destination.

The destination is given as an element in the /Names array:

<<
/Names[(Doc-Start)12 0 R(page.1)13 0 R(rAsociaci\303\263n)14 0 R]
>>

which uses the "Literal String" form.
This is certainly valid.

The hyperlink on the other hand is as follows:

 <<
/Type/Annot
/Subtype/Link
/Border[0 0 0]
/C[1 0 0]
/A<<
/S/GoTo
/D<feff007200410073006f0063006900610063006900f3006e>
>>
/Rect[91.801 435.627 153.761 453.471]
>>

which uses a UTF-16 Hexadecimal "text string" type, with BOM.
Here the ó is represented as 00f3 .

This could be where the problem lies, as this "text string"
may not be recognised as a valid type for a Named destination.
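
The mismatch is easy to check outside the PDF. The Python sketch below
decodes the two strings copied from the dictionaries above and compares
them. The helper name is my own, and it only understands \ddd octal
escapes, which is all this example needs. Whether a given viewer
compares destination names byte for byte is an assumption on my part.

```python
import re

def decode_literal_body(body: str) -> bytes:
    """Decode the inside of a literal string, handling only \\ddd escapes."""
    return re.sub(
        rb"\\([0-7]{1,3})",
        lambda m: bytes([int(m.group(1), 8)]),
        body.encode("latin-1"),
    )

# Destination name as stored in the /Names array (UTF-8 bytes).
dest_in_names = decode_literal_body(r"rAsociaci\303\263n")
# Destination name as used by the link's /D entry (UTF-16BE with BOM).
dest_in_link = bytes.fromhex("feff007200410073006f0063006900610063006900f3006e")

print(dest_in_names == dest_in_link)      # prints False
same_text = dest_in_names.decode("utf-8") == dest_in_link[2:].decode("utf-16-be")
print(same_text)                          # prints True
```

So both strings spell the same text, but as byte sequences they differ,
which would explain a failed lookup under a byte-wise comparison.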

-------------- next part --------------
A non-text attachment was scrubbed...
Name: spanish-dest.pdf
Type: application/pdf
Size: 82850 bytes
Desc: not available
URL: <http://tug.org/pipermail/xetex/attachments/20111103/647d090b/attachment-0001.pdf>
-------------- next part --------------

> 
>> ** Phil.
>> 

> -- 
> Zdeněk Wagner

Hope this helps,

	Ross

------------------------------------------------------------------------
Ross Moore                                       ross.moore at mq.edu.au 
Mathematics Department                           office: E7A-419      
Macquarie University                             tel: +61 (0)2 9850 8955
Sydney, Australia  2109                          fax: +61 (0)2 9850 8114
------------------------------------------------------------------------