[XeTeX] Anchor names

Heiko Oberdiek heiko.oberdiek at googlemail.com
Sat Nov 5 16:24:47 CET 2011


On Sat, Nov 05, 2011 at 02:45:32PM +0000, Jonathan Kew wrote:

> On 5 Nov 2011, at 10:24, Akira Kakuto wrote:
> 
> > Dear Heiko,
> > 
> >>>>>> Conclusion:
> >>>>>> * The encoding mess with 8-bit characters remains even with XeTeX.
> > 
> > I have disabled the reencoding of PDF strings to UTF-16 in xdvipdfmx: TL trunk r24508.
> > Now
> > /D<c3a46e6368c3b872>
> > and
> > /Names[<c3a46e6368c3b872>7 0 R]

Thanks Akira. But caution: this could break bookmark strings that
currently work more or less by accident, sometimes with warnings.
Perhaps the problem can be solved with a syntax extension; see below.

> Unfortunately, I have not had time to follow this thread in detail or
> investigate the issue properly, but I'm concerned this may break other
> things that currently work, and rely on this conversion between the
> encoding form in \specials, and the representation needed in PDF.
> 
> However, by way of background: xetex was never intended to be a tool for
> reading and writing arbitrary binary files.

The PDF file format is a binary file format. To some degree US-ASCII
can be used, but at the cost of flexibility and with some restrictions.

> It is a tool for processing
> text, and is specifically based on Unicode as the encoding for text, with
> UTF-8 being its default/preferred encoding form for Unicode, and (more
> importantly) the ONLY encoding form that it uses to write output files.
> It's possible to READ other encoding forms (UTF-16), or even other
> codepages, and have them mapped to Unicode internally, but output is
> always written as UTF-8.
> 
> Now, this should include not only .log file and \write output, but also
> text embedded in the .xdv output using \special. Remember that \special
> basically writes a sequence of *characters* to the output, and in xetex
> those characters are *Unicode* characters. So my expectation would be that
> arbitrary Unicode text can be written using \special, and will be
> represented using UTF-8 in the argument of the xxxN operation in .xdv. 
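Jonathan's description can be checked directly: the hex string in Akira's /D example is exactly the UTF-8 encoding of the name "änchør". A quick sketch (Python here, purely for illustration; it is not part of any TeX or driver code):

```python
# The destination name as a Unicode string.
name = "änchør"

# UTF-8 encoding, as xetex writes it into the .xdv special:
print(name.encode("utf-8").hex())    # c3a46e6368c3b872 -> /D<c3a46e6368c3b872>

# The same name in latin1 would be a different byte string:
print(name.encode("latin-1").hex())  # e46e6368f872
```

This is why the two hex forms in the examples below denote the "same" name in two different byte encodings.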

That means that arbitrary bytes cannot be written using \special,
a restriction that does not exist in vanilla TeX.

> If
> that \special is destined to be converted to a fragment of PDF data by the
> xdv-to-pdf output driver (xdvipdfmx), and needs a different encoding form,
> I'd expect the driver to be responsible for that conversion.

Suggestions for some of PDF's data structures:

* Strings: It seems that both (...) and the hex form <...> can be
  used. In the hex form spaces are ignored, thus a space right
  after the opening angle bracket could be used as a syntax extension.
  In that case the driver unhexes the string to get the byte
  string without reencoding it to Unicode.
  Example:
  \special{pdf:dest < c3a46e6368c3b872> [...]}
    The destination name would be "änchør" as byte string in UTF-8.
  \special{pdf:dest < e46e6368f872> [...]}
    The destination name would be "änchør" as byte string in latin1.
  \special{pdf:dest <c3a46e6368c3b872> [...]}
    The destination name would be the result of the current
    implementation.
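The proposed string extension could be sketched as follows. This is an assumption about how a driver might detect the marker, not actual xdvipdfmx code; parse_hexstring is a hypothetical helper:

```python
def parse_hexstring(arg: str):
    """Extract the hex string between '<' and '>'.

    A space immediately after the opening angle bracket is the
    proposed extension marker: it tells the driver to keep the
    unhexed bytes as-is, without reencoding to Unicode.
    """
    start = arg.index("<")
    end = arg.index(">", start)
    inner = arg[start + 1:end]
    raw = inner.startswith(" ")                 # extension marker?
    data = bytes.fromhex("".join(inner.split()))  # hex ignores spaces anyway
    return raw, data

# Latin1 byte string, to be passed through untouched:
print(parse_hexstring("pdf:dest < e46e6368f872> [...]"))
# Current behavior (no marker), reencoding applies:
print(parse_hexstring("pdf:dest <c3a46e6368c3b872> [...]"))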

* Streams (\special{pdf: object ...<<...>>stream...endstream}):
  Instead of the keyword "stream", a keyword "hexstream" could be
  introduced. The driver then takes a hex string and unhexes it to
  get the byte data for the stream, again without reencoding to
  Unicode.
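The driver-side step for the proposed "hexstream" keyword would amount to a single unhex pass over the stream body (again a sketch under the assumption above, with a hypothetical stream content):

```python
# Hypothetical stream body written by the macro package as hex:
hex_body = "48656c6c6f"

# The driver unhexes it to obtain the raw stream bytes; no
# Unicode reencoding is involved at any point.
stream_bytes = bytes.fromhex(hex_body)
print(stream_bytes)  # b'Hello'
```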

> What I would NOT expect to work is for a TeX macro package to generate
> arbitrary binary data (byte streams) and expect these to be passed
> unchanged to the output. I suspect that's what Heiko's macros probably do,
> and it worked in pdftex where "tex character" == "byte", but it's
> problematic when "tex character" == "Unicode character".

Yes, that's the problem. PDF is a binary format, not a Unicode text format.

Yours sincerely
  Heiko Oberdiek
