[XeTeX] Anchor names

Sat Nov 5 19:06:42 CET 2011

On Sat, Nov 05, 2011 at 04:14:03PM +0000, Jonathan Kew wrote:

> > Thanks Akira. But caution, it could break bookmark strings that
> > currently work more or less accidentally, sometimes with warnings.
> 
> IIRC (it's a while since I looked at any of this), I believe Unicode
> bookmark strings work deliberately (not accidentally) - I think this came
> up early on as an issue, and encoding-form conversion was implemented to
> ensure that it works. (It's possible there are bugs, of course, but it was
> _supposed_ to work!)

The bookmark code suffers from the same main problem with arbitrary byte
strings. Take hyperref, for example: hxetex.def is the only driver where
I had to disable PDFDocEncoding.

> Yes, PDF is a binary format; xetex was not designed to write PDF. It
> writes its output as XDV - also a binary format, of course, but a very
> specific one designed for this purpose - and XDV provides an extension
> mechanism that involves writing "special" strings that a driver is
> expected to understand. The key issue is that the "special" strings xetex
> writes are Unicode strings, not byte strings.

As long as the \special supports a syntax that is free of "big chars",
this is not a problem. Example: PDF strings specified in hex form <...>.
If the unhexed byte string is then kept without reencoding, the problem
is solved. Thus XeTeX can be left unchanged; the problem can be solved
entirely in the driver by:
* Providing a special syntax in which arbitrary bytes can be
  specified in us-ascii (hex form or other escape mechanisms).
* Processing the resulting byte strings further without forcing
  them into a particular encoding.
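
The two driver-side points can be sketched in Python as follows. This is
only an illustration of the idea, not xdvipdfmx code; the function name
unhex_pdf_string is made up for the example, and the padding of an odd
final digit follows the usual PDF rule for hex strings.

```python
import binascii

def unhex_pdf_string(token: str) -> bytes:
    """Decode a PDF hex string such as '<c3a4 6e>' into raw bytes.

    Hypothetical helper: the special arrives as us-ascii text, and the
    result is kept as bytes without any reencoding.
    """
    if not (token.startswith("<") and token.endswith(">")):
        raise ValueError("not a hex string")
    digits = "".join(token[1:-1].split())   # white space inside <...> is ignored
    if len(digits) % 2:                     # an odd final digit is padded with 0
        digits += "0"
    return binascii.unhexlify(digits)

raw = unhex_pdf_string("< e46e6368f872>")
```

The result stays a bytes object from here on; nothing in the sketch ever
turns it into a character string.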

> >> It is a tool for processing
> >> text, and is specifically based on Unicode as the encoding for text, with
> >> UTF-8 being its default/preferred encoding form for Unicode, and (more
> >> importantly) the ONLY encoding form that it uses to write output files.
> >> It's possible to READ other encoding forms (UTF-16), or even other
> >> codepages, and have them mapped to Unicode internally, but output is
> >> always written as UTF-8.
> >> 
> >> Now, this should include not only .log file and \write output, but also
> >> text embedded in the .xdv output using \special. Remember that \special
> >> basically writes a sequence of *characters* to the output, and in xetex
> >> those characters are *Unicode* characters. So my expectation would be that
> >> arbitrary Unicode text can be written using \special, and will be
> >> represented using UTF-8 in the argument of the xxxN operation in .xdv. 
> > 
> > That means that arbitrary bytes can't be written using \special,
> > a restriction that does not exist in vanilla TeX.
> 
> That's correct. Perhaps regrettable, but that was the design. The argument
> of \special{....} is ultimately represented, after macro expansion, etc,
> as (Unicode) text, and Unicode text != arbitrary bytes.

I don't criticize the way \special works with big chars. Of course, these
have to be encoded somehow to get byte data for storing in the .xdv
format. Arbitrary byte data can be encoded in many different
ways (hex, ascii85, \ooo, ...) to fit into a us-ascii string. This
way the restriction of XeTeX's \special does not matter at all.
The driver would then decode the string to get the byte string.
  The syntax for encoding arbitrary bytes is already partly present
(<>-hex-notation for strings), but partly missing (stream data).
And the main problem remains: the decoded strings get reencoded and the
binary data is destroyed in the process.
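
Both halves of this can be illustrated with a small Python sketch (an
illustration only, not driver code). The same arbitrary bytes survive a
round trip through hex, ascii85 or \ooo escapes, whereas treating them
as characters and reencoding changes them. The latin-1 decoding step is
an assumption that stands in for whatever reencoding a driver applies.

```python
import base64
import binascii

data = bytes([0x00, 0xE4, 0xFF, 0x28])          # arbitrary binary data

# Three us-ascii representations of the same bytes.
as_hex = binascii.hexlify(data).decode("ascii")      # hex digits
as_a85 = base64.a85encode(data).decode("ascii")      # ascii85
as_octal = "".join("\\%03o" % b for b in data)       # \ooo escapes

# Decoding each form gives back the original bytes unchanged.
assert binascii.unhexlify(as_hex) == data
assert base64.a85decode(as_a85) == data
assert bytes(int(o, 8) for o in as_octal.split("\\")[1:]) == data

# Reencoding is what destroys the data: bytes interpreted as latin-1
# characters and written out as UTF-8 are no longer the same bytes.
reencoded = data.decode("latin-1").encode("utf-8")
```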

> >> If
> >> that \special is destined to be converted to a fragment of PDF data by the
> >> xdv-to-pdf output driver (xdvipdfmx), and needs a different encoding form,
> >> I'd expect the driver to be responsible for that conversion.
> > 
> > Suggestions for some of PDF's data structures:
> > 
> > * Strings: It seems that both (...) and the hex form <...> can be
> >  used. In the hex form spaces are ignored, thus a space right
> >  after the opening angle could be used for a syntax extension.
> >  In this case the driver unescapes the hex string to get the
> >  byte string without reencoding to Unicode.
> >  Example:
> >  \special{pdf:dest < c3a46e6368c3b872> [...]}
> >    The destination name would be "änchør" as byte string in UTF-8.
> >  \special{pdf:dest < e46e6368f872> [...]}
> >    The destination name would be "änchør" as byte string in latin1.
> 
> I don't understand this proposal. How can you (or rather, a driver) tell
> which encoding is the intended interpretation of an arbitrary sequence of
> byte values?

The byte string data type of PDF doesn't have an encoding at all.
Applying an encoding is wrong in the first place and destroys the
data.
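
The two destination names from the quoted example make the point
concrete. In this Python sketch (illustration only), both hex strings
are simply byte strings; they only look like the same word if somebody
chooses an encoding for each of them, and a wrong guess changes the data.

```python
utf8_name = bytes.fromhex("c3a46e6368c3b872")
latin1_name = bytes.fromhex("e46e6368f872")

# As byte strings these are simply two different names.
same_bytes = utf8_name == latin1_name

# They only become the same word once somebody chooses an encoding for each.
same_text = utf8_name.decode("utf-8") == latin1_name.decode("latin-1")

# A driver that guesses wrongly and reencodes changes the data.
guessed = utf8_name.decode("latin-1").encode("utf-8")
```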

The conversion of UTF-8 strings of the special to PDFDocEncoding/UTF-16BE
would be an additional task of the driver for *text strings*. But then
the driver has to *know* the string type of a given string
(text string, binary string, ascii string, string) to decide where
a conversion is allowed. That means implementing part of the PDF
specification in the driver. It must also be clear to the user
what happens to a string he provides.
  Much easier to implement/document is a "passing through" behaviour.
The application (macro package) constructs the correct strings
and they are just copied through the \special and driver to the PDF file.
The application knows which string is a text string and which is a byte
string.
  If a different syntax exists for strings that should be converted
from UTF-8 to UTF-16BE+BOM, then it simplifies the implementation of
the macro package and makes it faster, because
the conversion doesn't need to be done at TeX level.
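
For text strings, the conversion in question is small when done outside
TeX. A minimal Python sketch, assuming the driver receives the string as
UTF-8 bytes and is told by some separate syntax that it is a text
string; the function name to_pdf_text_string is hypothetical:

```python
import codecs

def to_pdf_text_string(utf8_bytes: bytes) -> bytes:
    """Convert UTF-8 input to UTF-16BE with a leading byte order mark."""
    text = utf8_bytes.decode("utf-8")
    return codecs.BOM_UTF16_BE + text.encode("utf-16-be")

converted = to_pdf_text_string(bytes.fromhex("c3a46e6368c3b872"))
hex_form = "<" + converted.hex() + ">"   # written to the PDF file as a hex string
```

Strings that do not use this syntax would not pass through such a
function at all and would be copied unchanged.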

> >  \special{pdf:dest <c3a46e6368c3b872> [...]}
> >    The destination name would be the result of the current
> >    implementation.
> > 
> > * Streams (\special{pdf: object ...<<...>>stream...endstream}):
> >  Instead of the keyword "stream" "hexstream" could be introduced.
> >  The driver then takes a hex string, unhexes it to get the byte
> >  data for the stream, also without reencoding to Unicode.
> 
> I'm only vaguely aware of the various \special{}s that are supported by
> xdvipdfmx (this stuff is inherited from DVIPDFMx), but yes, I think that's
> where this issue should be fixed. But it _also_ needs the cooperation of
> macro package authors, in that macros designed to directly generate binary
> PDF streams and send them out via \special cannot be expected to work
> unchanged - they're assuming that the argument of \special{...} expands to
> a string of 8-bit bytes, not a string of Unicode characters, and that's
> not true in xetex.

Thus xdvipdfmx knows that it can only get UTF-8, not arbitrary 8-bit data.
If its supported specials then offered a syntax in which arbitrary
8-bit data can be expressed, the macro package authors would be happy,
too. :-) Currently this is not the case.

Yours sincerely
  Heiko Oberdiek