[XeTeX] Bug fixes and new features related to Unicode character codes, surrogates, etc

David Carlisle d.p.carlisle at gmail.com
Thu May 7 10:43:32 CEST 2015

On 7 May 2015 at 02:07, Ross Moore <ross.moore at mq.edu.au> wrote:

> Hi David,
> ......

> No disagreement to this.

>> In the current versions ^^^^d835^^^^dc00 is two characters in luatex
>> and one character in xetex
>> as the implementation detail that xetex's underlying storage is mostly
>> UTF-16 is exposed.
>
> This seems to be premature of XeTeX then.
> It seems to be making an assumption on how those bytes
> will ultimately be used.

I don't think it's so much an assumption; it's just that choosing UTF-16
as an internal string format tends to lead that way. Unlike UTF-8, UTF-16
cannot represent every sequence of code points in the 0-10FFFF range: a
high surrogate followed by a low surrogate always reads as one character.
If I switch to Java(Script) notation, which does define numeric references
as UTF-16 units rather than Unicode code points, then (if you do not make
it an error) you can encode an isolated surrogate such as "\ud835", but
there is no way to store the two-character sequence U+D835 U+DC00:
"\ud835\udc00" is the single character U+1D400. So you can only store
such a character sequence if you store each text block as a sequence of
separate strings, keeping unpaired surrogates apart ("\ud835", "\udc00"),
which is a lot of effort for supporting input that should never appear.
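
To make the pairing concrete, here is a small Python sketch (not XeTeX or
Java code). Python strings are sequences of code points, so they can hold
the two surrogates separately; pushing them through UTF-16 storage and
back shows the merge. The helper names utf16_round_trip and
combine_surrogates are made up for this illustration.

```python
def utf16_round_trip(text: str) -> str:
    """Store text as UTF-16 code units, then read it back as code points."""
    # "surrogatepass" lets Python encode/decode lone surrogates instead of
    # raising an error, which mimics an engine that does not reject them.
    units = text.encode("utf-16-le", errors="surrogatepass")
    return units.decode("utf-16-le", errors="surrogatepass")


def combine_surrogates(high: int, low: int) -> int:
    """Standard UTF-16 arithmetic for turning a surrogate pair into a code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


two_surrogates = "\ud835\udc00"              # two code points in a Python str
stored = utf16_round_trip(two_surrogates)    # one code point after UTF-16 storage
lone = utf16_round_trip("\ud835")            # an isolated surrogate survives

print(len(two_surrogates), len(stored), hex(ord(stored)))
print(hex(combine_surrogates(0xD835, 0xDC00)))
```

The isolated surrogate comes back unchanged, but the two-surrogate
sequence comes back as the single character U+1D400, which is the
behaviour described above.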

>> If it is
>> not possible to prevent ^^^ or utf8 encoded surrogate pairs combining
>> then it is better to
>> prevent them being formed.
>
> Hmm.
> What if you have an entirely different purpose in mind for those bytes?
> You still need to be able to create them and do further processing with
> them.

luatex has a different mechanism for this: it allows utf8 encoding and ^^^
references to access the first 256 slots _above_ "10FFFF.

Quoting the luatex manual:

> Output in byte-sized chunks can be achieved by using characters just
> outside of the valid Unicode range,
> starting at the value 1 114 112 (0x110000). When the time comes to print a
> character c >= 1 114 112,
> LuaTeX will actually print the single byte corresponding to c minus
> 1,114,112.

This allows explicit byte-level access to file writing (so you can write
binary data such as images) without having to second guess and invert the
character encoding the system uses to write characters to a file.
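
As a rough model of the rule quoted from the manual (a Python sketch, not
LuaTeX's actual code): values from 0x110000 up to 0x1100FF come out as
the single byte c - 0x110000, while ordinary characters are written in
UTF-8. The names BYTE_BASE and emit are invented here, and rejecting
values beyond the 256 byte slots is an assumption of this sketch, since
the quoted passage does not say what happens there.

```python
BYTE_BASE = 0x110000  # 1,114,112: first value above the Unicode range


def emit(codes):
    """Model of the output rule: turn a list of character codes into bytes."""
    out = bytearray()
    for c in codes:
        if BYTE_BASE <= c < BYTE_BASE + 256:
            # Byte slot: written as the single raw byte c - 1,114,112.
            out.append(c - BYTE_BASE)
        elif 0 <= c < BYTE_BASE:
            # Ordinary character: written as UTF-8 (this sketch uses strict
            # encoding, so surrogate code points are rejected).
            out += chr(c).encode("utf-8")
        else:
            # Assumption of this sketch: anything else is an error.
            raise ValueError(f"code {c:#x} is outside the modelled range")
    return bytes(out)


as_character = emit([0xFF])           # U+00FF written as a character
as_raw_byte = emit([BYTE_BASE + 0xFF])  # the raw byte 0xFF
print(as_character, as_raw_byte)
```

The contrast is the point: the character U+00FF is written as the two
UTF-8 bytes C3 BF, whereas the slot 0x1100FF is written as the single
byte FF, so no guessing about the output encoding is needed.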
