[XeTeX] Bug fixes and new features related to Unicode character codes, surrogates, etc

Arthur Reutenauer arthur.reutenauer at normalesup.org
Thu May 7 00:04:40 CEST 2015


  While working on these bugs, we also discussed how surrogate
characters are handled in XeTeX.  Surrogate characters are the 2048
code points (U+D800 to U+DFFF) that UTF-16 uses to encode characters
with code points above U+FFFF (65535): a pair of them makes up one
Unicode character.  They are not meant to be used in isolation, even
though they have code points like other characters (they're not just
byte sequences).
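
  To make the definition concrete, here is a small Python sketch of how
a code point above U+FFFF is split into a surrogate pair.  It only
illustrates the UTF-16 arithmetic; the function names are made up for
this example and have nothing to do with XeTeX's internals.

```python
# Illustration of UTF-16 surrogate arithmetic (hypothetical helper names).
HIGH_START, HIGH_END = 0xD800, 0xDBFF   # high (leading) surrogates
LOW_START, LOW_END = 0xDC00, 0xDFFF     # low (trailing) surrogates


def is_surrogate(cp):
    """Return True if cp is one of the 2048 surrogate code points."""
    return HIGH_START <= cp <= LOW_END


def to_surrogate_pair(cp):
    """Split a code point above U+FFFF into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF need a pair")
    offset = cp - 0x10000                # 20-bit value
    high = HIGH_START + (offset >> 10)   # top 10 bits
    low = LOW_START + (offset & 0x3FF)   # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1D400)
print(f"{high:04x} {low:04x}")           # d835 dc00
```

The two ranges hold 1024 code points each, which is where the figure of
2048 comes from.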

  Right now, XeTeX allows isolated surrogate characters, and also
combines sequences such as ^^^^d835^^^^dc00 into one Unicode character.
We want to flag isolated surrogates but are not sure how: should we make
the characters invalid (with catcode 15)?  Or we could map them to the
standard replacement ("unknown") character (U+FFFD).  The combining of
surrogate sequences is nastier and should definitely be forbidden -- the
^^ notation should only be used for "proper" characters (so instead of
the above, the code point of the resulting Unicode character should be
used, in this case ^^^^^1d400).
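
  For reference, the correspondence between the pair and the single
code point can be checked with a short Python sketch.  The helper names
are hypothetical, and caret_notation simply follows the pattern used in
this message (as many carets as lowercase hex digits, at least four); it
is not taken from the XeTeX sources.

```python
# Illustration only: combine a surrogate pair and print the code point
# in the caret notation used in this message (hypothetical helpers).

def combine_surrogates(high, low):
    """Return the code point encoded by a (high, low) surrogate pair."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("first value is not a high surrogate")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("second value is not a low surrogate")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def caret_notation(cp):
    """Format cp as carets followed by the same number of hex digits."""
    digits = format(cp, "x")
    width = max(4, len(digits))          # assumption: 4, 5 or 6 digits
    return "^" * width + digits.zfill(width)


cp = combine_surrogates(0xD835, 0xDC00)
print(caret_notation(cp))                # ^^^^^1d400
```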

  Any thoughts?

	Best,

		Arthur

