[XeTeX] Anchor names

Sat Nov 5 01:59:29 CET 2011

Hi Heiko,

On 04/11/2011, at 12:15 PM, Heiko Oberdiek wrote:
>> 
>>> As we have learned, the PDF specification uses byte strings
>>> for anchor names. And there is a wish to use "normal" characters
>>> in anchor names.
>> 
>> Within the (La)TeX source, yes!
>> Of course it needs to be encoded to be safe within the PDF.
> 
> That's the problem, the anchor names could also be used as
> "official" part of the PDF file, because it could be referenced, e.g.:
>  mybeautifuldocument.pdf#Introduction

OK.
And the name of the destination might show up within an annotation,
such as a popup Tool Tip. You would like that to look correct.

>>> The link is not working. Looking into the PDF file we can find
>>> the link annotation:
>>> 
>>> 4 0 obj
>>> <<
>>> /Type/Annot
>>> /Subtype/Link
>>> /Border[0 0 1]
>>> /C[0 0 1]
>>> /A<<
>>> /S/GoTo
>>> /D<feff00e4006e0063006800f80072>
>> 
>> In my reading of the PDF Spec. I came to the conclusion
>> that this UTF-16BE based format is not supported for Name objects.
>> 
>> But maybe I'm wrong here.
> 
> My understanding is that it does not matter, whether the byte
> string could be interpreted in some encoding. The characters
> are just bytes.

My comment also need not matter, since TeX software is creating
strings, not Name objects --- which are of the form  /myname .

The PDF spec says that either can be used for Destination names,
but that since PDF 1.2, the string form is preferred.
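
To illustrate the two forms side by side, here is a minimal Python sketch
(Python is used only for illustration, it is not part of any TeX workflow).
The helper names `name_object` and `string_object` are made up for this
example; the escaping rule is my reading of the Name Objects section of the
PDF spec, namely that bytes outside the printable range `!` to `~`,
delimiters, and `#` are written as `#xx`.

```python
DELIMITERS = b"()<>[]{}/%"

def name_object(raw: bytes) -> str:
    """Write raw bytes as a PDF Name object, using #xx escapes where needed."""
    out = ["/"]
    for b in raw:
        if b == 0:
            raise ValueError("a null byte cannot appear in a PDF name")
        if 0x21 <= b <= 0x7E and b not in DELIMITERS and b != 0x23:
            out.append(chr(b))        # regular character, written as itself
        else:
            out.append(f"#{b:02X}")   # everything else as a 2-digit hex escape
    return "".join(out)

def string_object(raw: bytes) -> str:
    """Write the same bytes as a PDF hex string."""
    return "<" + raw.hex() + ">"

raw = "\u00e4nch\u00f8r".encode("utf-8")   # the test name from this thread, as UTF-8
print(name_object(raw))     # /#C3#A4nch#C3#B8r
print(string_object(raw))   # <c3a46e6368c3b872>
```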

> Also there are keys in the /Dests name tree
> and are compared at the byte level. Thus a name encoded as
> UTF-8, ISO-8859-1 or UTF-16BE are different strings and thus
> different names.
> 
>>> Destination: <c3a46e6368c3b872> ==> UTF-8
>>> Link annot.: <feff00e4006e0063006800f80072> ==> UTF-16BE with BOM

Yes. 
Now I agree with you that PDF just sees these as encoded bytes.
They do not occur within the correct contexts to have an interpretation 
based upon some encoding, no matter how obvious that may seem.
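
The byte-level point can be checked with a small Python sketch (again only
an illustration; `hex_string` is a made-up helper name). It encodes the test
name three ways and shows that the results are three different byte strings,
and hence three different keys in a name tree.

```python
import codecs

NAME = "\u00e4nch\u00f8r"   # the test name from this thread

def hex_string(data: bytes) -> str:
    """Write bytes as a PDF hex string."""
    return "<" + data.hex() + ">"

utf8 = NAME.encode("utf-8")
latin1 = NAME.encode("iso-8859-1")
utf16 = codecs.BOM_UTF16_BE + NAME.encode("utf-16-be")

print(hex_string(utf8))     # <c3a46e6368c3b872>
print(hex_string(latin1))   # <e46e6368f872>
print(hex_string(utf16))    # <feff00e4006e0063006800f80072>

# Keys are compared byte by byte, so none of these match each other.
print(len({utf8, latin1, utf16}))   # 3
```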

Figure 7 in the attached image indicates that "byte strings" and 
"text strings" are distinct. And the only references to UTF-16BE
within the spec are in the context of "text strings".
But just because  <feff00e4006e0063006800f80072>  looks like UTF-16BE
with BOM does not mean that it is always treated that way.
It can equally be treated as a string of encoded bytes.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Screen shot 2011-11-04 at 3.12.27 PM.png
Type: image/png
Size: 101768 bytes
Desc: not available
URL: <http://tug.org/pipermail/xetex/attachments/20111105/7cabbe31/attachment-0001.png>
-------------- next part --------------

In fact, I prefer UTF-16BE, even though it is longer.
This is what I use for /ActualText  replacements of mathematical
symbols, in my work for Tagged PDF.
It works very well in that context, as it has the obvious mapping
to Unicode, which is picked up with Copy-Paste as well as Save As XML.
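
As a rough sketch of what such a replacement string looks like (Python for
illustration only; `actual_text` is a hypothetical helper, and how the entry
is attached to marked content is left out), a symbol outside the Basic
Multilingual Plane simply becomes a surrogate pair:

```python
import codecs

def actual_text(text: str) -> str:
    """Return an /ActualText entry whose value is UTF-16BE with BOM, in hex."""
    data = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
    return "/ActualText <" + data.hex() + ">"

print(actual_text("\u221e"))       # INFINITY: /ActualText <feff221e>
print(actual_text("\U0001d465"))   # MATHEMATICAL ITALIC SMALL X: /ActualText <feffd835dc65>
```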

>> 
>> The spec reads that differences in Literal strings are allowed,
>> provided that they convert to the same thing in Unicode.
>> So there must be an internal representation that Adobe uses,
>> but is not visible to us, as builders of PDF documents.
> 
> Where, which section?

This was concerning §7.3.5 Name Objects.
But TeX isn't generating these, so it does not apply.
Sorry for the FUD.

> 
> A literal string can be written different ways at syntax level:
> 
>  (test) = <74657374> = (\164\145\163\164) = (\164e\163t)
> 
> Probably you are referring the "Text String Type" used for
> the text in the bookmarks, the document information and other
> places. These strings can be encoded either in PDFDocEncoding
> or UTF-16BE with BOM.
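
The syntax-level equivalence of those four spellings can be checked with a
small decoder. This is only a sketch in Python, with made-up function names;
it assumes each token is a complete string object containing only 8-bit
characters, and it skips the end-of-line normalisation that the spec
describes for literal strings.

```python
ESCAPES = {"n": 10, "r": 13, "t": 9, "b": 8, "f": 12, "(": 40, ")": 41, "\\": 92}

def decode_hex_string(token: str) -> bytes:
    """Decode a PDF hex string, given with its angle brackets."""
    digits = "".join(token[1:-1].split())   # white space inside is ignored
    if len(digits) % 2:
        digits += "0"                       # a missing final digit counts as 0
    return bytes.fromhex(digits)

def decode_literal_string(token: str) -> bytes:
    """Decode a PDF literal string, given with its outer parentheses."""
    body, out, i = token[1:-1], bytearray(), 0
    while i < len(body):
        ch = body[i]
        i += 1
        if ch != "\\":
            out.append(ord(ch))
        elif i < len(body):
            nxt = body[i]
            if nxt in "01234567":           # octal escape, one to three digits
                j = i
                while j < len(body) and j < i + 3 and body[j] in "01234567":
                    j += 1
                out.append(int(body[i:j], 8) & 0xFF)
                i = j
            elif nxt == "\n":               # backslash-newline: line continuation
                i += 1
            else:                           # named escape, or backslash ignored
                out.append(ESCAPES.get(nxt, ord(nxt)))
                i += 1
    return bytes(out)

def decode(token: str) -> bytes:
    if token.startswith("<"):
        return decode_hex_string(token)
    return decode_literal_string(token)

for tok in ["(test)", "<74657374>", r"(\164\145\163\164)", r"(\164e\163t)"]:
    print(tok, decode(tok))   # each one decodes to b'test'
```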

> 
>>> Conclusion:
>>> * The encoding mess with 8-bit characters remain even with XeTeX.
>> 
>> Well, surely it is manifest only in the driver part:  xdvipdfmx
> 
> No, the problem are both parts. XeTeX can only write UTF-8,
> the death for binary data.

But the bytes need to be encoded anyway, as hexadecimal.
So why cannot this be done before writing out the resulting string?
 ...

> 
>>> Then I tried to be clever and a workaround by using
>>> /D<c3a46e6368c3b872> for the link name in the source.
>>> But it got converted and the PDF file still contains:
>>> /D<feff00e4006e0063006800f80072>
>>> 
>>> Only the other way worked:
>>> 
>>> \special{pdf:dest <feff00e4006e0063006800f80072> ...}
>>> \special{pdf:ann ... /D(änchør) ...}

 ... as this seems to be doing.
I'd vote for *always* doing  pdf:dest  this way.
Then, for consistency, encode the name in  pdf:ann  as UTF-16BE too.

>> 
>> OK. 
>> Glad you did this test.
>> It shows two things:
>> 
>>  1.  that such text strings may well be valid for Names,
>>      and that the PDF spec. is unclear about this;
> 
> I can't follow. Both string representations are covered
> by the PDF specification, a literal string can be
> specified in parentheses with an escaping mechanism (backslash)
> or given as hex string in angle brackets. Unclear is the
> syntax of the argument for \special{pdf:dest ...}.

Agreed.
Can we standardise on the way that *looks like*  UTF-16BE with BOM?

> 
>>  2.  these UTF16-BE strings are *not* equivalent to other
>>      ways of encoding Name objects, after all.
>> 
>> This is something that should be reported as a bug to Adobe.
> 
> There is no problem with the PDF specification. A destination
> name is a byte string. You can use UTF-16BE, invalid UTF-8,
> a mixture of UTF-32BE with us-ascii, ... all are valid byte strings.
> The problem is with xdvipdfmx that recodes the UTF-8 string
> provided by XeTeX's specials in different ways.

Then convert the UTF-8 to the hex encoding of the corresponding UTF-16BE,
before passing it to  xdvipdfmx .

Surely that is feasible?
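
Here is the conversion I have in mind, sketched in Python rather than in TeX
macros (the function name `utf8_to_utf16be_hex` is made up, and the `...`
parts of the specials are left elided exactly as in your test). Whether
xdvipdfmx then leaves such a hex string alone in every context is an
assumption that would still need testing.

```python
import codecs

def utf8_to_utf16be_hex(raw: bytes) -> str:
    """Re-encode UTF-8 bytes as a PDF hex string holding UTF-16BE with BOM."""
    text = raw.decode("utf-8")   # raises UnicodeDecodeError on invalid UTF-8
    data = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
    return "<" + data.hex() + ">"

# The UTF-8 destination name from the test, as XeTeX would write it.
name = bytes.fromhex("c3a46e6368c3b872")
token = utf8_to_utf16be_hex(name)
print(token)   # <feff00e4006e0063006800f80072>

# Use the same token on both sides, so that the bytes match exactly.
print(r"\special{pdf:dest " + token + " ...}")
print(r"\special{pdf:ann ... /D" + token + " ...}")
```

The resulting token is pure ASCII, so nothing is lost when XeTeX writes it
out as UTF-8.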

>> Do it both with XeTeX and pdfTeX (with appropriate inputenc, 
>> to handle the UTF8 input), to test whether there are any 
>> differences.  
> 
> pdfTeX is fine, because it doesn't reencode the strings.
> Also \pdfescapestring, \pdfescapename, \pdfescapehex
> are available for syntactically correct literal strings.

I've not used these primitives.
Didn't you use to do such conversions within hyperref ?
Or with other utility packages in the 'oberdiek' bundle?

> 
>> I've not tested pdfTeX yet, because of the extra macro layer
>> required. Does  hyperref  handle the required conversions then? 
> 
> It depends on which part of hyperref you are looking.

Presumably you push off to the driver whatever conversions
it supports internally.

> 
> Yours sincerely
>  Heiko Oberdiek

Cheers,

	Ross

------------------------------------------------------------------------
Ross Moore                                       ross.moore at mq.edu.au 
Mathematics Department                           office: E7A-419      
Macquarie University                             tel: +61 (0)2 9850 8955
Sydney, Australia  2109                          fax: +61 (0)2 9850 8114
------------------------------------------------------------------------