[XeTeX] Bug fixes and new features related to Unicode character codes, surrogates, etc

Jonathan Kew jfkthame at gmail.com
Mon May 4 17:27:52 CEST 2015


On 23/4/15 20:59, David Carlisle wrote:
> I can confirm that \string does convert character tokens
> to two tokens giving the UTF-16 representation.
>
> With the attached file luatex produces
>
> 90,33
> 34,33
> 233,33
> 233,33
> 65530,33
> 65537,33
> 65537,33
>
>
> which is in each case the unicode value of the character followed by
> that of !
>
> xetex produces
>
> 90,33
> 34,33
> 233,33
> 233,33
> 65530,33
> 55296,56321
> 55296,56321
>
>
> where the last two lines show that \string has generated U+D800 U+DC01
> which does correspond to the UTF-16 encoding of U+10001 confirming
> that \string on a character token has produced two tokens that have been
> picked up separately as #1 and #2 of the \test macro.

A fix for this bug, so that \string generates single Unicode characters 
even for values above U+FFFF, is currently on the utf16-issues branch in 
the XeTeX repository on sourceforge.[1]

A bug with characters above U+FFFF within \scantokens[2] is also fixed 
on this branch.


There are also a couple of new primitives available in this branch:

(1) \Uchar <number>

     where <number> is a number in the range 0.."10FFFF

is an expandable command that produces a character token with the given 
Unicode value, and catcode=12 (other character). This is different from 
TeX's \char primitive from a macro-programming point of view, in that it 
expands to a character token rather than being a typesetting command.

(I believe this is similar to the \Uchar command available in luatex.)
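
As a minimal sketch of the difference under expansion, the following 
plain XeTeX file compares \Uchar and \char inside \edef. It assumes a 
xetex binary built from this branch; the macro names \viaUchar, 
\viachar and \astral are purely illustrative and not part of XeTeX.

```tex
% Plain XeTeX sketch; needs a xetex binary that provides \Uchar.
% \viaUchar, \viachar and \astral are hypothetical names used only here.

% \Uchar is expandable, so \edef stores the character token itself
% (U+00E9, catcode 12) in the macro body.
\edef\viaUchar{\Uchar"E9 }

% \char is not expandable, so \edef stores the command and its
% argument tokens unchanged; no character token is created.
\edef\viachar{\char"E9 }

% A value above U+FFFF should give one token, not a surrogate pair.
\edef\astral{\Uchar"10001 }

% Write the three definitions to the terminal and log for comparison.
\message{\meaning\viaUchar}
\message{\meaning\viachar}
\message{\meaning\astral}
\bye
```

The first and third macros should contain a single character each, 
while the second should still contain the \char command followed by 
its argument.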


(2) \Ucharcat <number1> <number2>

     where <number1> is a number in the range 0.."10FFFF
     and <number2> is a number in the ranges 1..4, 6..8, 10..12

is an expandable command that produces a character token with Unicode 
value <number1> and catcode <number2>. This allows macro programmers to 
create character tokens with various catcode assignments much more 
easily than is otherwise possible.
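
A short sketch of how this might be used, again assuming a xetex binary 
built from this branch: it creates an underscore with catcode 11 
(letter) without changing any \catcode setting, then checks the result 
with \ifcat. The macro name \letterus is illustrative only.

```tex
% Plain XeTeX sketch; needs a xetex binary that provides \Ucharcat.
% \letterus is a hypothetical name used only in this example.

% "_" (U+005F) normally has catcode 8 in plain TeX; ask for
% catcode 11 (letter) instead, without touching \catcode.
\edef\letterus{\Ucharcat"5F 11 }

% \ifcat compares category codes: "a" is a letter, and the token
% stored in \letterus should be one too.
\message{\ifcat a\letterus letter\else not a letter\fi}
\bye
```

If the primitive behaves as described, the \message line should report 
"letter". The usual alternative is to change \catcode or use 
\lccode with \lowercase inside a group, which is what makes the direct 
form easier.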


Feedback and testing are invited, but note that for now this requires 
pulling the code from sourceforge and building the new xetex yourself, 
as binary packages are not available.

If testing in the next day or two doesn't uncover any alarming problems, 
these fixes and features will be merged to the master branch and into 
TeX Live, in preparation for the TL2015 release.

JK


[1] https://sourceforge.net/p/xetex/code/ci/utf16-issues/tree/
[2] https://sourceforge.net/p/xetex/bugs/80/


