[XeTeX] printing of characters above "FFFF with \string \meaning (and potentially \Uchar)

Thu Apr 23 21:59:37 CEST 2015

I can confirm that \string converts a character token above "FFFF
into two tokens giving its UTF-16 representation (a surrogate pair).

With the attached file luatex produces

90,33
34,33
233,33
233,33
65530,33
65537,33
65537,33

which is in each case the Unicode value of the character followed by that
of !

xetex produces

90,33
34,33
233,33
233,33
65530,33
55296,56321
55296,56321

where the last two lines show that \string has generated U+D800 U+DC01
(55296 = "D800, 56321 = "DC01), which is the UTF-16 encoding of U+10001
(65537). This confirms that \string on a character token has produced two
tokens, which have been picked up separately as #1 and #2 of the \test macro.
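
As a cross-check of the arithmetic, here is a minimal Python sketch of the
standard UTF-16 surrogate-pair mapping. The function names are my own, purely
for illustration; nothing here is taken from the XeTeX sources.

```python
def to_surrogates(cp):
    """Split a code point above 0xFFFF into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = cp - 0x10000
    return 0xD800 + offset // 0x400, 0xDC00 + offset % 0x400


def from_surrogates(high, low):
    """Recombine a (high, low) surrogate pair into the original code point."""
    return 0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00)


high, low = to_surrogates(65537)        # U+10001, the last test character
print(high, low)                        # 55296 56321
print(from_surrogates(55296, 56321))    # 65537
```

So the xetex pair 55296,56321 and the luatex value 65537 describe the same
character; only the number of tokens differs.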

If I am reading it right, the UTF-16 comes from here:

procedure print_char(@!s:integer); {prints a single character}
label exit;
var l: small_number;
begin if (selector>pseudo) and (not doing_special) then
  {``printing'' to a new string, encode as UTF-16 rather than UTF-8}
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  begin if s>=@"10000 then
   begin print_visible_char(@"D800 + (s - @"10000) div @"400);
   print_visible_char(@"DC00 + (s - @"10000) mod @"400);
   end else print_visible_char(s);
   return;
  end;

So XeTeX could skip that branch and instead just call print_visible_char(s).
But perhaps some other context requires UTF-16, in which case perhaps the
selector needs another state, to allow a code path that doesn't encode as
UTF-8 or UTF-16 but just generates the internal UTF-32 representation?
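
To make the two code paths concrete, here is a rough Python model of the
branch quoted above. It is only a sketch: print_char_units and first_two are
made-up names, and the utf16 flag stands in for the hypothetical extra
selector state, which does not exist in XeTeX as far as I know.

```python
def print_char_units(s, utf16=True):
    """Return the values that would be handed to print_visible_char.

    utf16=True models the quoted branch (surrogate pair above 0xFFFF);
    utf16=False models the hypothetical state that emits the code point as is.
    """
    if utf16 and s >= 0x10000:
        return [0xD800 + (s - 0x10000) // 0x400,
                0xDC00 + (s - 0x10000) % 0x400]
    return [s]


def first_two(s, utf16):
    """Model \\test#1#2 grabbing the first two tokens of \\string<char> then !"""
    units = print_char_units(s, utf16) + print_char_units(ord("!"), utf16)
    return units[0], units[1]


print(first_two(65537, utf16=True))    # (55296, 56321), as xetex prints now
print(first_two(65537, utf16=False))   # (65537, 33), matching luatex
print(first_two(65530, utf16=True))    # (65530, 33), below 0x10000 is unaffected
```

With the flag off, the model reproduces the luatex listing for every line;
with it on, only the characters above "FFFF change, as in the xetex listing.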

David
-------------- next part --------------
A non-text attachment was scrubbed...
Name: nonbmp2.tex
Type: application/x-tex
Size: 451 bytes
Desc: not available
URL: <http://tug.org/pipermail/xetex/attachments/20150423/9029712f/attachment.tex>