Jonathan Kew
Sat Nov 5 15:45:32 CET 2011

On 5 Nov 2011, at 10:24, Akira Kakuto wrote:

> Dear Heiko,
>>>>>> Conclusion:
>>>>>> * The encoding mess with 8-bit characters remain even with XeTeX.
> I have disabled to reencode pdf strings to UTF-16 in xdvipdfmx: TL trunk r24508.
> Now
> /D<c3a46e6368c3b872>
> and
> /Names[<c3a46e6368c3b872>7 0 R]
> Thanks,
> Akira

Unfortunately, I have not had time to follow this thread in detail or investigate the issue properly, but I'm concerned that this change may break other things that currently work and that rely on the conversion between the encoding form used in \specials and the representation needed in PDF.

However, by way of background: xetex was never intended to be a tool for reading and writing arbitrary binary files. It is a tool for processing text, and is specifically based on Unicode as the encoding for text, with UTF-8 being its default/preferred encoding form for Unicode, and (more importantly) the ONLY encoding form that it uses to write output files. It's possible to READ other encoding forms (UTF-16), or even other codepages, and have them mapped to Unicode internally, but output is always written as UTF-8.

Now, this should include not only .log file and \write output, but also text embedded in the .xdv output using \special. Remember that \special basically writes a sequence of *characters* to the output, and in xetex those characters are *Unicode* characters. So my expectation would be that arbitrary Unicode text can be written using \special, and will be represented using UTF-8 in the argument of the xxxN operation in .xdv. If that \special is destined to be converted to a fragment of PDF data by the xdv-to-pdf output driver (xdvipdfmx), and needs a different encoding form, I'd expect the driver to be responsible for that conversion.
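
To make that division of labour concrete, here is a minimal Python sketch of the kind of conversion I would expect a driver to perform. It is an illustration only, not xdvipdfmx's actual code; the function name is hypothetical, and it assumes the target is a PDF text string written as a hex string in UTF-16BE with a leading FE FF byte order mark.

```python
def utf8_special_to_pdf_hex(special_bytes: bytes) -> str:
    """Hypothetical helper: re-encode UTF-8 \\special text as a PDF hex string."""
    # The \special argument arrives in the .xdv file as UTF-8 bytes.
    text = special_bytes.decode("utf-8")
    # Assumption: the PDF side wants UTF-16BE preceded by the FE FF byte order mark.
    utf16 = b"\xfe\xff" + text.encode("utf-16-be")
    return "<" + utf16.hex() + ">"


# The bytes from the /D and /Names entries quoted above.
raw = bytes.fromhex("c3a46e6368c3b872")
print(raw.decode("utf-8"))           # änchør
print(utf8_special_to_pdf_hex(raw))  # <feff00e4006e0063006800f80072>
```

With the re-encoding disabled, the UTF-8 bytes appear in the PDF unchanged, which is what the /D and /Names entries quoted above show.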

What I would NOT expect to work is for a TeX macro package to generate arbitrary binary data (byte streams) and expect these to be passed unchanged to the output. I suspect that's what Heiko's macros probably do, and it worked in pdftex where "tex character" == "byte", but it's problematic when "tex character" == "Unicode character".
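
The difference can be sketched in a few lines of Python. This is a deliberately simplified model of the two engines' output stages, with hypothetical function names, and the payload is an invented example rather than anything taken from Heiko's macros. It shows a sequence of character codes in the range 0-255 that a macro package intends as raw bytes.

```python
def write_like_8bit_engine(char_codes):
    """Simplified model: each TeX character is written as exactly one byte."""
    return bytes(char_codes)


def write_like_unicode_engine(char_codes):
    """Simplified model: each TeX character is a Unicode code point, written as UTF-8."""
    return "".join(chr(c) for c in char_codes).encode("utf-8")


# Invented payload: the macro package intends the raw bytes FE FF 00 E4.
payload = [0xFE, 0xFF, 0x00, 0xE4]

print(write_like_8bit_engine(payload).hex())     # feff00e4
print(write_like_unicode_engine(payload).hex())  # c3bec3bf00c3a4
```

Every code above 0x7F turns into a two-byte UTF-8 sequence in the second case, so the byte stream that reaches the driver is no longer the one the macros intended.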


