[XeTeX] Anchor names

Sat Nov 5 17:14:03 CET 2011

On 5 Nov 2011, at 15:24, Heiko Oberdiek wrote:

> On Sat, Nov 05, 2011 at 02:45:32PM +0000, Jonathan Kew wrote:
> 
>> On 5 Nov 2011, at 10:24, Akira Kakuto wrote:
>> 
>>> Dear Heiko,
>>> 
>>>>>>>> Conclusion:
>>>>>>>> * The encoding mess with 8-bit characters remains even with XeTeX.
>>> 
>>> I have disabled the reencoding of PDF strings to UTF-16 in xdvipdfmx: TL trunk r24508.
>>> Now
>>> /D<c3a46e6368c3b872>
>>> and
>>> /Names[<c3a46e6368c3b872>7 0 R]
> 
> Thanks Akira. But caution, it could break bookmark strings that
> currently work more or less accidentally, sometimes with warnings.

IIRC (it's a while since I looked at any of this), Unicode bookmark strings work deliberately, not accidentally - I think this came up early on as an issue, and encoding-form conversion was implemented to ensure that it works. (It's possible there are bugs, of course, but it was _supposed_ to work!)

> Perhaps the problem can be solved with a syntax extension, see below.
> 
>> Unfortunately, I have not had time to follow this thread in detail or
>> investigate the issue properly, but I'm concerned this may break other
>> things that currently work, and rely on this conversion between the
>> encoding form in \specials, and the representation needed in PDF.
>> 
>> However, by way of background: xetex was never intended to be a tool for
>> reading and writing arbitrary binary files.
> 
> The PDF file format is a binary file format. To some degree us-ascii
> can be used, but at the cost of flexibility and some restrictions.

Yes, PDF is a binary format; xetex was not designed to write PDF. It writes its output as XDV - also a binary format, of course, but a very specific one designed for this purpose - and XDV provides an extension mechanism that involves writing "special" strings that a driver is expected to understand. The key issue is that the "special" strings xetex writes are Unicode strings, not byte strings.

> 
>> It is a tool for processing
>> text, and is specifically based on Unicode as the encoding for text, with
>> UTF-8 being its default/preferred encoding form for Unicode, and (more
>> importantly) the ONLY encoding form that it uses to write output files.
>> It's possible to READ other encoding forms (UTF-16), or even other
>> codepages, and have them mapped to Unicode internally, but output is
>> always written as UTF-8.
>> 
>> Now, this should include not only .log file and \write output, but also
>> text embedded in the .xdv output using \special. Remember that \special
>> basically writes a sequence of *characters* to the output, and in xetex
>> those characters are *Unicode* characters. So my expectation would be that
>> arbitrary Unicode text can be written using \special, and will be
>> represented using UTF-8 in the argument of the xxxN operation in .xdv. 
> 
> That means that arbitrary bytes can't be written using \special,
> a restriction that is not present in vanilla TeX.

That's correct. Perhaps regrettable, but that was the design. The argument of \special{....} is ultimately represented, after macro expansion, etc., as (Unicode) text, and Unicode text != arbitrary bytes.
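
To make "Unicode text != arbitrary bytes" concrete, here is a small Python sketch (an illustration only, not xetex or xdvipdfmx code). It assumes a macro emits six 8-bit values meant as the latin1 byte string for "änchør", and that each value is then treated as a Unicode character and written out as UTF-8. The variable names are made up for the example.

```python
# Bytes a macro author might intend to send through \special (latin1 "änchør").
intended = bytes([0xE4, 0x6E, 0x63, 0x68, 0xF8, 0x72])

# Assumption for this sketch: each 8-bit value becomes the Unicode
# character with the same code point, i.e. it is text, not a byte.
as_text = "".join(chr(b) for b in intended)

# Text is then serialized as UTF-8, so every value above 0x7F
# turns into a two-byte sequence.
written = as_text.encode("utf-8")

print(intended.hex())  # e46e6368f872
print(written.hex())   # c3a46e6368c3b872
```

The six intended bytes come out as eight, so anything downstream that expected the original byte values sees different data.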

> 
>> If
>> that \special is destined to be converted to a fragment of PDF data by the
>> xdv-to-pdf output driver (xdvipdfmx), and needs a different encoding form,
>> I'd expect the driver to be responsible for that conversion.
> 
> Suggestions for some of PDF's data structures:
> 
> * Strings: It seems that both (...) and the hex form <...> can be
>  used. In the hex form spaces are ignored, thus a space right
>  after the opening angle could be used for a syntax extension.
>  In this case the driver unescapes the hex string to get the
>  byte string without reencoding to Unicode.
>  Example:
>  \special{pdf:dest < c3a46e6368c3b872> [...]}
>    The destination name would be "änchør" as byte string in UTF-8.
>  \special{pdf:dest < e46e6368f872> [...]}
>    The destination name would be "änchør" as byte string in latin1.

I don't understand this proposal. How can you (or rather, a driver) tell which encoding is the intended interpretation of an arbitrary sequence of byte values?
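
A quick Python sketch of what I mean, using the two hex strings from the examples above. The helper `unhex` is hypothetical (it is not an xdvipdfmx function); it only strips whitespace and decodes the hex digits. The bytes themselves carry no label saying which encoding was meant.

```python
def unhex(pdf_hex: str) -> bytes:
    """Decode the inside of a PDF hex string <...>, ignoring whitespace."""
    return bytes.fromhex("".join(pdf_hex.split()))

utf8_name = unhex(" c3a46e6368c3b872")
latin1_name = unhex(" e46e6368f872")

# Each byte string reads as "änchør" only under the encoding its author assumed.
print(utf8_name.decode("utf-8"))      # änchør
print(latin1_name.decode("latin-1"))  # änchør

# The UTF-8 bytes are also perfectly valid latin1, just a different text.
print(utf8_name.decode("latin-1"))    # Ã¤nchÃ¸r

# The reverse direction happens to fail here, but a decoder cannot rely on that.
try:
    latin1_name.decode("utf-8")
except UnicodeDecodeError:
    print("not valid UTF-8")
```

So a driver that unhexes such a string gets bytes and nothing more; any textual interpretation has to come from somewhere else.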

>  \special{pdf:dest <c3a46e6368c3b872> [...]}
>    The destination name would be the result of the current
>    implementation.
> 
> * Streams (\special{pdf: object ...<<...>>stream...endstream}):
>  Instead of the keyword "stream" "hexstream" could be introduced.
>  The driver then takes a hex string, unhexes it to get the byte
>  data for the stream, also without reencoding to Unicode.

I'm only vaguely aware of the various \special{}s that are supported by xdvipdfmx (this stuff is inherited from DVIPDFMx), but yes, I think that's where this issue should be fixed. But it _also_ needs the cooperation of macro package authors, in that macros designed to directly generate binary PDF streams and send them out via \special cannot be expected to work unchanged - they're assuming that the argument of \special{...} expands to a string of 8-bit bytes, not a string of Unicode characters, and that's not true in xetex.
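
As a rough sketch of why a hex form would get through (Python, purely illustrative; the function names are hypothetical and "hexstream" is only the keyword proposed above, not something xdvipdfmx implements as far as I know): hex digits are plain ASCII, and ASCII is unchanged by UTF-8, so the macro side can hex-encode its binary data and the driver side can recover the exact bytes.

```python
import zlib

def to_hex_payload(data: bytes) -> str:
    """Macro side (hypothetical): turn binary data into ASCII hex digits."""
    return data.hex()

def from_hex_payload(text: str) -> bytes:
    """Driver side (hypothetical): recover the original bytes."""
    return bytes.fromhex(text)

# Some binary stream data, e.g. a compressed content stream.
stream = zlib.compress(b"BT /F1 12 Tf (Hello) Tj ET")

payload = to_hex_payload(stream)

# Hex digits are ASCII, so writing them as UTF-8 changes nothing.
assert payload.isascii()
assert payload.encode("utf-8") == payload.encode("ascii")

# The driver gets back exactly the bytes the macro started with.
assert from_hex_payload(payload) == stream
```

The cost is that the payload doubles in size inside the .xdv file, and the macros have to be changed to produce it.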

JK