[XeTeX] Anchor names

Fri Nov 4 02:15:00 CET 2011

On Fri, Nov 04, 2011 at 07:31:02AM +1100, Ross Moore wrote:

> On 04/11/2011, at 1:58 AM, Heiko Oberdiek wrote:
> 
> > Hello,
> > 
> > to get more to the point, I start a new thread.
> 
> Yes, very good idea.
> 
> > As we have learned, the PDF specification uses byte strings
> > for anchor names. And there is a wish to use "normal" characters
> > in anchor names.
> 
> Within the (La)TeX source, yes!
> Of course it needs to be encoded to be safe within the PDF.

That's the problem: anchor names can also be an "official" part
of the PDF file, because they can be referenced from outside, e.g.:
  mybeautifuldocument.pdf#Introduction

> 
> > Let's make an example:
> > 
> > xetex --ini --output-driver='xdvipdfmx -V4' test
> > 
> >      \special{pdf:dest (änchør) [@thispage /XYZ @xpos @ypos null]}%
> 
> >       \special{%
> >         pdf:ann width 4bp height 2bp depth 2bp<<%
> >           /Type/Annot%
> >           /Subtype/Link%
> >           /Border[0 0 1]%
> >           /C[0 0 1]% blue border
> >           /A<<%
> >             /S/GoTo%
> >             /D(änchør)%
> 
> > The link is not working. Looking into the PDF file we can find
> > the link annotation:
> > 
> >  4 0 obj
> >  <<
> >  /Type/Annot
> >  /Subtype/Link
> >  /Border[0 0 1]
> >  /C[0 0 1]
> >  /A<<
> >  /S/GoTo
> >  /D<feff00e4006e0063006800f80072>
> 
> In my reading of the PDF Spec. I came to the conclusion
> that this UTF-16BE based format is not supported for Name objects.
> 
> But maybe I'm wrong here.

My understanding is that it does not matter whether the byte
string can be interpreted in some encoding. The characters
are just bytes. They are also the keys in the /Dests name tree
and are compared at the byte level. Thus a name encoded as
UTF-8, ISO-8859-1, or UTF-16BE yields three different strings
and thus three different names.
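
The byte-level comparison can be checked outside TeX. The following
Python sketch (not part of the original test file) encodes the example
name in three ways and shows that the results are distinct byte
strings. The UTF-8 and UTF-16BE forms are the ones that appear in the
PDF file discussed in this thread.

```python
# Encode the same anchor name in three encodings and compare the bytes.
name = "\u00e4nch\u00f8r"  # "änchør", written with escapes to pin the code points

utf8 = name.encode("utf-8")
latin1 = name.encode("iso-8859-1")
utf16be = b"\xfe\xff" + name.encode("utf-16-be")  # BOM followed by UTF-16BE

for label, data in [("UTF-8", utf8),
                    ("ISO-8859-1", latin1),
                    ("UTF-16BE+BOM", utf16be)]:
    print(f"{label:13} <{data.hex()}>")

# Compared byte by byte, these are three different names.
all_different = len({utf8, latin1, utf16be}) == 3
print(all_different)
```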

> > Destination: <c3a46e6368c3b872> ==> UTF-8
> > Link annot.: <feff00e4006e0063006800f80072> ==> UTF-16BE with BOM
> 
> The spec reads that differences in Literal strings are allowed,
> provided that they convert to the same thing in Unicode.
> So there must be an internal representation that Adobe uses,
> but is not visible to us, as builders of PDF documents.

Where? Which section?

A string can be written in different ways at the syntax level:

  (test) = <74657374> = (\164\145\163\164) = (\164e\163t)
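
The equivalence above can be verified with a small decoder. This is a
simplified Python sketch, not a full PDF parser: the function name
decode_pdf_string is made up for illustration, and nested parentheses
and line continuations are not handled.

```python
def decode_pdf_string(token: str) -> bytes:
    """Decode a PDF string token, (literal) or <hex>, to its raw bytes.

    Simplified: supports octal escapes and single-character escapes only.
    Input is assumed to be ASCII.
    """
    if token.startswith("<") and token.endswith(">"):
        digits = "".join(token[1:-1].split())
        if len(digits) % 2:      # odd count: a missing last digit counts as 0
            digits += "0"
        return bytes.fromhex(digits)
    if not (token.startswith("(") and token.endswith(")")):
        raise ValueError("not a PDF string token")

    body = token[1:-1]
    simple = {"n": 10, "r": 13, "t": 9, "b": 8, "f": 12}
    out = bytearray()
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out.append(ord(ch))
            i += 1
            continue
        i += 1                   # skip the backslash
        if i >= len(body):
            break
        j = i
        while j < len(body) and j < i + 3 and body[j] in "01234567":
            j += 1
        if j > i:                # one to three octal digits
            out.append(int(body[i:j], 8) & 0xFF)
            i = j
        else:                    # \n, \t, \(, \), \\ and similar
            out.append(simple.get(body[i], ord(body[i])))
            i += 1
    return bytes(out)


forms = ["(test)", "<74657374>", r"(\164\145\163\164)", r"(\164e\163t)"]
decoded = [decode_pdf_string(f) for f in forms]
print(decoded)
```

All four tokens decode to the same four bytes, which is why the choice
of notation does not affect name comparison.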

Probably you are referring to the "Text String Type" used for
the text in the bookmarks, the document information and other
places. These strings can be encoded either in PDFDocEncoding
or UTF-16BE with BOM.
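
For such text strings (not for destination names), a reader decides
between the two encodings by looking at the first two bytes. A minimal
Python sketch of that decision follows; the function name
describe_text_string is made up, and since Python ships no
PDFDocEncoding codec, that branch leaves the bytes undecoded.

```python
def describe_text_string(data: bytes) -> str:
    """Classify a PDF text string by its leading byte order mark."""
    if data.startswith(b"\xfe\xff"):
        # BOM present: the rest is UTF-16BE.
        return "UTF-16BE: " + data[2:].decode("utf-16-be")
    # No BOM: PDFDocEncoding. No stdlib codec exists, so show the bytes only.
    return "PDFDocEncoding (not decoded here): " + data.hex()


print(describe_text_string(bytes.fromhex("feff00e4006e0063006800f80072")))
print(describe_text_string(b"test"))
```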

> > Conclusion:
> > * The encoding mess with 8-bit characters remains even with XeTeX.
> 
> Well, surely it is manifest only in the driver part:  xdvipdfmx

No, the problem lies in both parts. XeTeX can only write UTF-8,
which is fatal for binary data.

> > Then I tried to be clever and a workaround by using
> > /D<c3a46e6368c3b872> for the link name in the source.
> > But it got converted and the PDF file still contains:
> > /D<feff00e4006e0063006800f80072>
> > 
> > Only the other way worked:
> > 
> >  \special{pdf:dest <feff00e4006e0063006800f80072> ...}
> >  \special{pdf:ann ... /D(änchør) ...}
> 
> OK. 
> Glad you did this test.
> It shows two things:
> 
>   1.  that such text strings may well be valid for Names,
>       and that the PDF spec. is unclear about this;

I can't follow. Both string representations are covered
by the PDF specification: a string can be written as a literal
string in parentheses with an escaping mechanism (backslash)
or as a hex string in angle brackets. What is unclear is the
syntax of the argument for \special{pdf:dest ...}.

>   2.  these UTF-16BE strings are *not* equivalent to other
>       ways of encoding Name objects, after all.
> 
> This is something that should be reported as a bug to Adobe.

There is no problem with the PDF specification. A destination
name is a byte string. You can use UTF-16BE, invalid UTF-8,
a mixture of UTF-32BE and US-ASCII, ...: all are valid byte strings.
The problem is xdvipdfmx, which recodes the UTF-8 strings
provided by XeTeX's specials in different ways.
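
Given the behaviour observed in this test, the workaround quoted above
amounts to writing the destination as the hex form of the UTF-16BE
string that ends up in the link annotation. The Python sketch below
generates such a \special line. The helper name dest_special is
hypothetical, and the assumption that /D(...) is recoded to UTF-16BE
with BOM comes only from this test; other xdvipdfmx versions may behave
differently.

```python
def dest_special(name: str) -> str:
    """Build a pdf:dest special whose name matches the recoded link name.

    Assumption (from the test in this thread only): the driver rewrites
    /D(...) in the link annotation as UTF-16BE with a leading BOM.
    """
    hexname = (b"\xfe\xff" + name.encode("utf-16-be")).hex()
    return r"\special{pdf:dest <%s> [@thispage /XYZ @xpos @ypos null]}" % hexname


line = dest_special("\u00e4nch\u00f8r")  # "änchør"
print(line)
```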

> Can you produce a set of 3 or more PDFs that show the different 
> behaviours ?
> 
> Better still: a single PDF that illustrates the (non-)working
> of hyperlinks according to the encodings of the Name objects
> and Destinations.

Save my example as "test.tex" and run
"xetex --ini --output-driver='xdvipdfmx -V4' test"
(I miss an easy switch for XeTeX to set the PDF version.)
With PDF-1.4, object stream compression is not available
and the PDF file can be analyzed directly using a simple
text viewer. (Otherwise the destination and annotation
objects are compressed.)
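
Instead of a text viewer, the uncompressed file can also be scanned
with a few lines of Python. This sketch only finds hex strings that
directly follow /D, as in the link annotation shown earlier; it does
not walk the /Dests name tree, and the function name find_hex_names is
made up. The sample bytes are copied from the annotation object quoted
in this thread.

```python
import re


def find_hex_names(pdf_bytes: bytes) -> list:
    """Return the raw bytes of every hex string that follows /D."""
    found = []
    for match in re.findall(rb"/D\s*<([0-9A-Fa-f]+)>", pdf_bytes):
        digits = match.decode("ascii")
        if len(digits) % 2:      # odd count: a missing last digit counts as 0
            digits += "0"
        found.append(bytes.fromhex(digits))
    return found


sample = b"""4 0 obj
<<
/Type/Annot
/Subtype/Link
/A<<
/S/GoTo
/D<feff00e4006e0063006800f80072>
"""
names = find_hex_names(sample)
print([n.hex() for n in names])
```

For a real file, read it in binary mode, e.g.
find_hex_names(open("test.pdf", "rb").read()).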

> Do it both with XeTeX and pdfTeX (with appropriate inputenc, 
> to handle the UTF8 input), to test whether there are any 
> differences.  

pdfTeX is fine, because it doesn't reencode the strings.
Also, \pdfescapestring, \pdfescapename, and \pdfescapehex
are available for writing syntactically correct strings and names.
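
To illustrate what this kind of escaping achieves, here is a Python
approximation of the three forms, following the PDF syntax rules for
literal strings, names, and hex strings. The function names are made
up, and this is not a reimplementation of the pdfTeX primitives: the
exact set of bytes that pdfTeX escapes may differ.

```python
def escape_string(data: bytes) -> str:
    """Body of a literal string: backslash-escape \\ ( ), octal for non-ASCII."""
    out = []
    for b in data:
        if b in b"\\()":
            out.append("\\" + chr(b))
        elif 32 <= b <= 126:
            out.append(chr(b))
        else:
            out.append("\\%03o" % b)
    return "".join(out)


def escape_name(data: bytes) -> str:
    """Body of a name: #xx for delimiters, '#', and bytes outside 33..126."""
    delimiters = b"()<>[]{}/%#"
    out = []
    for b in data:
        if 33 <= b <= 126 and b not in delimiters:
            out.append(chr(b))
        else:
            out.append("#%02X" % b)
    return "".join(out)


def escape_hex(data: bytes) -> str:
    """Body of a hex string: two hex digits per byte."""
    return data.hex().upper()


raw = "\u00e4nch\u00f8r".encode("utf-8")  # UTF-8 bytes of "änchør"
print("(" + escape_string(raw) + ")")
print("/" + escape_name(raw))
print("<" + escape_hex(raw) + ">")
```

All three outputs are pure ASCII but denote the same eight bytes, so
the name survives unchanged whatever the input encoding of the TeX
source.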

> I've not tested pdfTeX yet, because of the extra macro layer
> required. Does  hyperref  handle the required conversions then? 

It depends on which part of hyperref you are looking at.

Yours sincerely
  Heiko Oberdiek