[XeTeX] Bug fixes and new features related to Unicode character codes, surrogates, etc

Thu May 7 01:26:34 CEST 2015

> The character itself, as bytes that is, is not wrong and users should be able to create these.
> But preferably through macros that ensure that they come correctly paired.

Placing two character tokens that represent a surrogate pair should
not, though, magically turn them into a single character. The UTF-8 or
^^^^ encoding should refer to the Unicode code point, not to the
UTF-16 encoding.

In the current versions ^^^^d835^^^^dc00 is two characters in luatex
and one character in xetex, because the implementation detail that
xetex's underlying storage is mostly UTF-16 is exposed. If it is not
possible to prevent ^^^^ or UTF-8 encoded surrogate pairs from
combining, then it is better to prevent them from being formed.
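
To make the distinction concrete, here is a minimal sketch in Python
(not TeX; the helper name to_surrogates is made up for illustration).
It only shows the standard UTF-16 arithmetic and how a code-point-based
string type keeps the two surrogates apart until they are forced into
UTF-16 storage; it says nothing about how xetex or luatex are actually
implemented.

```python
def to_surrogates(cp):
    """Split a supplementary-plane code point into its UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    v = cp - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


# U+1D400 is stored in UTF-16 as the pair D835 DC00.
hi, lo = to_surrogates(0x1D400)

# At the code point level these are different strings:
one_char = chr(0x1D400)        # a single code point
two_chars = chr(hi) + chr(lo)  # two (lone) surrogate code points

# Forcing both into UTF-16 code units erases the difference, which is
# the kind of leak described above. "surrogatepass" tells Python to
# encode lone surrogates instead of rejecting them.
as_utf16_one = one_char.encode("utf-16-be")
as_utf16_two = two_chars.encode("utf-16-be", "surrogatepass")

# A strict UTF-8 encoder refuses the surrogates outright.
try:
    two_chars.encode("utf-8")
    utf8_rejected = False
except UnicodeEncodeError:
    utf8_rejected = True
```

Here one_char and two_chars have lengths 1 and 2 and compare unequal,
yet as_utf16_one and as_utf16_two are the same four bytes. Rejecting
the surrogates up front, as the strict UTF-8 encoder does, corresponds
to the "prevent them from being formed" option.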

This is no different from XML, where &#xd835;&#xdc00; always refers to
two (invalid) characters, not to &#x1d400;.

David