[XeTeX] Anchor names
Jonathan Kew
jfkthame at googlemail.com
Sat Nov 5 17:14:03 CET 2011
On 5 Nov 2011, at 15:24, Heiko Oberdiek wrote:
> On Sat, Nov 05, 2011 at 02:45:32PM +0000, Jonathan Kew wrote:
>
>> On 5 Nov 2011, at 10:24, Akira Kakuto wrote:
>>
>>> Dear Heiko,
>>>
>>>>>>>> Conclusion:
>>>>>>>> * The encoding mess with 8-bit characters remains even with XeTeX.
>>>
>>> I have disabled the re-encoding of PDF strings to UTF-16 in xdvipdfmx: TL trunk r24508.
>>> Now
>>> /D<c3a46e6368c3b872>
>>> and
>>> /Names[<c3a46e6368c3b872>7 0 R]
>
> Thanks Akira. But caution, it could break bookmark strings that
> currently work more or less accidentally, sometimes with warnings.
IIRC (it's a while since I looked at any of this), Unicode bookmark strings work deliberately, not accidentally. I think this came up early on as an issue, and encoding-form conversion was implemented to ensure that it works. (It's possible there are bugs, of course, but it was _supposed_ to work!)
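
As a rough illustration of what "encoding-form conversion" means here, the following Python sketch re-encodes UTF-8 text as UTF-16BE with a leading byte-order mark, which is one of the forms PDF accepts for text strings such as bookmark titles. It is only a model of the idea, not code from xdvipdfmx; the function name `to_pdf_text_string` is made up for this example.

```python
import codecs


def to_pdf_text_string(utf8_bytes: bytes) -> bytes:
    """Re-encode UTF-8 bytes as UTF-16BE prefixed with the FE FF byte-order mark.

    Hypothetical helper for illustration; not xdvipdfmx's implementation.
    """
    text = utf8_bytes.decode("utf-8")
    return codecs.BOM_UTF16_BE + text.encode("utf-16-be")


# "\u00e4nch\u00f8r" is the example name used in this thread.
title = "\u00e4nch\u00f8r".encode("utf-8")
print(title.hex())                      # c3a46e6368c3b872
print(to_pdf_text_string(title).hex())  # feff00e4006e0063006800f80072
```

The first line of output matches the hex string Akira quotes above, i.e. the UTF-8 bytes passed through without conversion.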
> Perhaps the problem can be solved with a syntax extension, see below.
>
>> Unfortunately, I have not had time to follow this thread in detail or
>> investigate the issue properly, but I'm concerned this may break other
>> things that currently work, and rely on this conversion between the
>> encoding form in \specials, and the representation needed in PDF.
>>
>> However, by way of background: xetex was never intended to be a tool for
>> reading and writing arbitrary binary files.
>
> The PDF file format is a binary file format. To some degree us-ascii
> can be used, but at the cost of flexibility and some restrictions.
Yes, PDF is a binary format, but xetex was not designed to write PDF. It writes its output as XDV (also a binary format, of course, but a very specific one designed for this purpose), and XDV provides an extension mechanism that involves writing "special" strings that a driver is expected to understand. The key issue is that the "special" strings xetex writes are Unicode strings, not byte strings.
>
>> It is a tool for processing
>> text, and is specifically based on Unicode as the encoding for text, with
>> UTF-8 being its default/preferred encoding form for Unicode, and (more
>> importantly) the ONLY encoding form that it uses to write output files.
>> It's possible to READ other encoding forms (UTF-16), or even other
>> codepages, and have them mapped to Unicode internally, but output is
>> always written as UTF-8.
>>
>> Now, this should include not only .log file and \write output, but also
>> text embedded in the .xdv output using \special. Remember that \special
>> basically writes a sequence of *characters* to the output, and in xetex
>> those characters are *Unicode* characters. So my expectation would be that
>> arbitrary Unicode text can be written using \special, and will be
>> represented using UTF-8 in the argument of the xxxN operation in .xdv.
>
> That means that arbitrary bytes can't be written using \special,
> a restriction that does not exist in vanilla TeX.
That's correct. Perhaps regrettable, but that was the design. The argument of \special{....} is ultimately represented, after macro expansion, etc, as (Unicode) text, and Unicode text != arbitrary bytes.
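
A small Python sketch of why the two differ, under the assumption described above that the \special argument is a sequence of Unicode characters serialized as UTF-8. The function `special_as_written` is a hypothetical stand-in for that serialization, not anything in xetex itself.

```python
def special_as_written(chars: str) -> bytes:
    """Model: the special's argument is Unicode text, written out as UTF-8."""
    return chars.encode("utf-8")


# A macro that "means" the bytes E4 6E can only hand over the characters
# U+00E4 and U+006E; there is no way to say "the raw byte 0xE4".
intended = bytes([0xE4, 0x6E])
as_chars = intended.decode("latin-1")  # one character per intended byte
written = special_as_written(as_chars)

print(intended.hex())  # e46e
print(written.hex())   # c3a46e
```

Every value below 0x80 survives unchanged, but each value from 0x80 upward turns into two bytes, so an 8-bit payload does not arrive as written.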
>
>> If
>> that \special is destined to be converted to a fragment of PDF data by the
>> xdv-to-pdf output driver (xdvipdfmx), and needs a different encoding form,
>> I'd expect the driver to be responsible for that conversion.
>
> Suggestions for some of PDF's data structures:
>
> * Strings: It seems that both (...) and the hex form <...> can be
> used. In the hex form spaces are ignored, thus a space right
> after the opening angle could be used for a syntax extension.
> In this case the driver unescapes the hex string to get the
> byte string without reencoding to Unicode.
> Example:
> \special{pdf:dest < c3a46e6368c3b872> [...]}
> The destination name would be "änchør" as byte string in UTF-8.
> \special{pdf:dest < e46e6368f872> [...]}
> The destination name would be "änchør" as byte string in latin1.
I don't understand this proposal. How can you (or rather, a driver) tell which encoding is the intended interpretation of an arbitrary sequence of byte values?
> \special{pdf:dest <c3a46e6368c3b872> [...]}
> The destination name would be the result of the current
> implementation.
>
> * Streams (\special{pdf: object ...<<...>>stream...endstream}):
> Instead of the keyword "stream" "hexstream" could be introduced.
> The driver then takes a hex string, unhexes it to get the byte
> data for the stream, also without reencoding to Unicode.
I'm only vaguely aware of the various \special{}s that are supported by xdvipdfmx (this stuff is inherited from DVIPDFMx), but yes, I think that's where this issue should be fixed. But it _also_ needs the cooperation of macro package authors, in that macros designed to directly generate binary PDF streams and send them out via \special cannot be expected to work unchanged - they're assuming that the argument of \special{...} expands to a string of 8-bit bytes, not a string of Unicode characters, and that's not true in xetex.
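
To make the hex-string idea concrete, here is a minimal Python sketch of the un-hexing step a driver would need. It is not xdvipdfmx code: `unhex_pdf_string` and `is_raw_marker` are hypothetical names, and the "space right after the opening angle" test simply follows the syntax extension proposed above. The last lines also show the ambiguity raised earlier: the same bytes decode without error under more than one encoding, so the bytes alone do not say which was meant.

```python
def unhex_pdf_string(token: str) -> bytes:
    """Decode a PDF hex string such as '< c3a4 6e>' into raw bytes.

    Whitespace between the angle brackets is ignored, and an odd number of
    hex digits is completed with a trailing 0, as PDF hex strings allow.
    """
    if not (token.startswith("<") and token.endswith(">")):
        raise ValueError("not a hex string")
    digits = "".join(token[1:-1].split())
    if len(digits) % 2:
        digits += "0"
    return bytes.fromhex(digits)


def is_raw_marker(token: str) -> bool:
    """Hypothetical check for the proposed extension: a space right after '<'."""
    return token.startswith("< ")


token = "< c3a46e6368c3b872>"
name = unhex_pdf_string(token)
print(is_raw_marker(token))            # True
print(name.hex())                      # c3a46e6368c3b872
print(ascii(name.decode("utf-8")))     # '\xe4nch\xf8r'
print(ascii(name.decode("latin-1")))   # '\xc3\xa4nch\xc3\xb8r'
```

A "hexstream" keyword could reuse the same un-hexing step on the stream body; the point in both cases is that the bytes are produced by the driver rather than carried through \special directly.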
JK