[XeTeX] printing of characters above "FFFF with \string \meaning (and potentially \Uchar)
David Carlisle
d.p.carlisle at gmail.com
Thu Apr 23 21:59:37 CEST 2015
I can confirm that \string does convert character tokens above "FFFF
to two tokens giving the UTF-16 representation.
With the attached file luatex produces
90,33
34,33
233,33
233,33
65530,33
65537,33
65537,33
which is in each case the Unicode value of the character followed by that of !
xetex produces
90,33
34,33
233,33
233,33
65530,33
55296,56321
55296,56321
where the last two lines show that \string has generated U+D800 U+DC01,
which does correspond to the UTF-16 encoding of U+10001, confirming
that \string on a character token has produced two tokens that have been
picked up separately as #1 and #2 of the \test macro.
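As a quick check of the arithmetic, here is a minimal Python sketch of the standard UTF-16 surrogate calculation (the function name utf16_surrogates is just for illustration, it is not anything in XeTeX). It shows that 65537 (U+10001) splits into exactly the 55296 and 56321 seen in the xetex output:

```python
def utf16_surrogates(code_point):
    """Split a code point above U+FFFF into its (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF are encoded as surrogate pairs")
    offset = code_point - 0x10000
    high = 0xD800 + offset // 0x400  # top 10 bits of the offset
    low = 0xDC00 + offset % 0x400    # bottom 10 bits of the offset
    return high, low


high, low = utf16_surrogates(0x10001)
print(high, low)            # 55296 56321
print(hex(high), hex(low))  # 0xd800 0xdc01
```

Python's own encoder agrees: chr(0x10001).encode("utf-16-be") gives the bytes D8 00 DC 01.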
If I am reading it right the UTF-16 comes from here
procedure print_char(@!s:integer); {prints a single character}
label exit;
var l: small_number;
begin if (selector>pseudo) and (not doing_special) then
  {``printing'' to a new string, encode as UTF-16 rather than UTF-8}
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  begin if s>=@"10000 then
    begin print_visible_char(@"D800 + (s - @"10000) div @"400);
    print_visible_char(@"DC00 + (s - @"10000) mod @"400);
    end else print_visible_char(s);
  return;
  end;
so one could drop that branch and instead just call print_visible_char(s);
but perhaps some other context requires UTF-16, in which case perhaps the
selector needs another state to allow a code path that doesn't encode as
UTF-8 or UTF-16 but just generates the internal UTF-32 representation?
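To make the suggestion concrete, here is a rough Python model of the quoted branch only. It is a sketch under assumptions, not XeTeX code: to_new_string stands in for the test selector>pseudo (going by the comment in the source), raw_code_points is a hypothetical flag for the extra selector state suggested above, and the function returns the values that would be passed to print_visible_char instead of printing them.

```python
def print_char(s, to_new_string=True, doing_special=False, raw_code_points=False):
    """Model of the quoted print_char branch.

    Returns the list of values that would be handed to print_visible_char.
    raw_code_points is hypothetical: it models a selector state that skips
    the UTF-16 split and emits the internal code point unchanged.
    """
    if to_new_string and not doing_special:
        if s >= 0x10000 and not raw_code_points:
            # Current behaviour: emit a UTF-16 surrogate pair.
            return [0xD800 + (s - 0x10000) // 0x400,
                    0xDC00 + (s - 0x10000) % 0x400]
        # Characters up to "FFFF, or the hypothetical raw state.
        return [s]
    # The rest of print_char is not quoted in this message, so not modelled.
    raise NotImplementedError("only the new-string branch is sketched here")


print(print_char(0x10001))                        # [55296, 56321]
print(print_char(0x10001, raw_code_points=True))  # [65537]
print(print_char(65530))                          # [65530]
```

With the hypothetical flag set, the result for U+10001 is the single value 65537, matching what luatex reports for the same test.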
David
-------------- next part --------------
A non-text attachment was scrubbed...
Name: nonbmp2.tex
Type: application/x-tex
Size: 451 bytes
Desc: not available
URL: <http://tug.org/pipermail/xetex/attachments/20150423/9029712f/attachment.tex>