[XeTeX] Bug fixes and new features related to Unicode character codes, surrogates, etc
Jonathan Kew
jfkthame at gmail.com
Mon May 4 17:27:52 CEST 2015
On 23/4/15 20:59, David Carlisle wrote:
> I can confirm that \string does convert character tokens
> to two tokens giving the UTF-16 representation.
>
> With the attached file luatex produces
>
> 90,33
> 34,33
> 233,33
> 233,33
> 65530,33
> 65537,33
> 65537,33
>
>
> which is in each case the unicode value of the character followed by
> that of !
>
> xetex produces
>
> 90,33
> 34,33
> 233,33
> 233,33
> 65530,33
> 55296,56321
> 55296,56321
>
>
> where the last two lines show that \string has generated U+D800 U+DC01
> which does correspond to the UTF-16 encoding of U+10001 confirming
> that \string on a character token has produced two tokens that have been
> picked up separately as #1 and #2 of the \test macro.
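
For reference, the two values in the last lines of the xetex output can be checked against the standard UTF-16 surrogate arithmetic. The following is only an illustrative sketch in Python, not XeTeX code; the function name utf16_surrogates is made up for this example.

```python
def utf16_surrogates(code_point):
    """Split a code point above U+FFFF into its UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF use surrogates")
    offset = code_point - 0x10000        # 20-bit offset into the supplementary planes
    high = 0xD800 + (offset >> 10)       # high surrogate carries the top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # low surrogate carries the bottom 10 bits
    return high, low


high, low = utf16_surrogates(65537)      # 65537 is U+10001
print(high, low)                         # prints: 55296 56321
```

These are the same two numbers that xetex reports in place of 65537,33, which is consistent with \string having emitted the two surrogates as separate tokens.
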
A fix for this bug, so that \string generates single Unicode characters
even for values above U+FFFF, is currently on the utf16-issues branch in
the XeTeX repository on SourceForge.[1]

A bug with characters above U+FFFF within \scantokens[2] is also fixed
on this branch.

There are also a couple of new primitives available in this branch:

(1) \Uchar <number>
      where <number> is a number in the range 0.."10FFFF
    is an expandable command that produces a character token with the given
    Unicode value, and catcode=12 (other character). This is different from
    TeX's \char primitive from a macro-programming point of view, in that it
    expands to a character token rather than being a typesetting command.
    (I believe this is similar to the \Uchar command available in luatex.)

(2) \Ucharcat <number1> <number2>
      where <number1> is a number in the range 0.."10FFFF
      and <number2> is a number in the ranges 1..4, 6..8, 10..12
    is an expandable command that produces a character token with Unicode
    value <number1> and catcode <number2>. This allows macro programmers to
    create character tokens with various catcode assignments much more
    easily than is otherwise possible.
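
To make the argument ranges concrete, here is a small Python model of them. It is a sketch of the ranges as stated above only, not of how XeTeX checks them internally; the names ALLOWED_CATCODES and ucharcat_args_ok are hypothetical.

```python
MAX_CODE_POINT = 0x10FFFF                # written "10FFFF in TeX notation

# Catcode ranges 1..4, 6..8, 10..12, as listed for <number2>.
ALLOWED_CATCODES = set(range(1, 5)) | set(range(6, 9)) | set(range(10, 13))


def ucharcat_args_ok(char_code, catcode):
    """Return True if both arguments fall inside the stated ranges."""
    return 0 <= char_code <= MAX_CODE_POINT and catcode in ALLOWED_CATCODES


# Catcodes in 0..15 that fall outside the stated ranges.
excluded = sorted(set(range(16)) - ALLOWED_CATCODES)
print(excluded)                          # prints: [0, 5, 9, 13, 14, 15]
```

In standard TeX terms the excluded values are escape (0), end of line (5), ignored (9), active (13), comment (14) and invalid (15).
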

Feedback and testing are invited, but note that for now this requires
pulling the code from SourceForge and building the new xetex yourself, as
binary packages are not available.

If testing in the next day or two doesn't uncover any alarming problems,
these fixes/features will be merged to the master branch and to TeX Live,
in preparation for the TL2015 release.

JK
[1] https://sourceforge.net/p/xetex/code/ci/utf16-issues/tree/
[2] https://sourceforge.net/p/xetex/bugs/80/