[XeTeX] Bug fixes and new features related to Unicode character codes, surrogates, etc

Thu May 7 01:06:00 CEST 2015

Hi Arthur,

On 07/05/2015, at 8:04, Arthur Reutenauer <arthur.reutenauer at normalesup.org> wrote:

>  While working on these bugs, we also discussed how surrogate
> characters were handled in XeTeX.  Surrogate characters are the 2048
> code points that are used in UTF-16 to encode characters with code
> points above 65535: a pair of them makes up one Unicode character;
> however they're not meant to be used in isolation, even though they have
> code points like other characters (they're not just byte sequences).
> 
>  Right now, XeTeX allows isolated surrogate characters, and also
> combines sequences such as ^^^^d835^^^^dc00 into one Unicode character.
> We want to flag the former case but are not sure how: should we make the
> characters invalid (with catcode 15)?  

That would definitely be wrong.
The character itself (that is, its bytes) is not wrong, and users should be able to create such characters,
preferably through macros that ensure they come correctly paired.
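
To make "correctly paired" concrete, here is a minimal sketch of the pairing rule, written in Python rather than TeX purely for illustration. The function names are hypothetical; only the ranges and the arithmetic come from the UTF-16 definition.

```python
def is_high_surrogate(cp: int) -> bool:
    """True for the first half of a pair (U+D800..U+DBFF)."""
    return 0xD800 <= cp <= 0xDBFF


def is_low_surrogate(cp: int) -> bool:
    """True for the second half of a pair (U+DC00..U+DFFF)."""
    return 0xDC00 <= cp <= 0xDFFF


def combine_surrogates(high: int, low: int) -> int:
    """Return the code point encoded by a (high, low) surrogate pair."""
    if not (is_high_surrogate(high) and is_low_surrogate(low)):
        raise ValueError(f"not a valid surrogate pair: {high:04x} {low:04x}")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The pair from Arthur's message: ^^^^d835^^^^dc00
print(hex(combine_surrogates(0xD835, 0xDC00)))  # 0x1d400
```

A macro-level check would do the same range tests before emitting the second half.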

IMHO, this is a macro issue, not an engine issue.

The same kind of thing applies with combining accents and diacritics.
I've written macros that take an argument and follow it with a combining character.
This is useful for generating correct UTF-8 bytes to put into XML packets, as needed for the XMP metadata that is required in PDF files that must validate against ISO specifications.
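
As a rough illustration of what such a macro produces (a Python sketch, not the actual TeX macros; `with_combining` is a hypothetical name), a base letter followed by a combining mark comes out as these UTF-8 bytes:

```python
import unicodedata

COMBINING_ACUTE = "\u0301"  # COMBINING ACUTE ACCENT


def with_combining(base: str, mark: str = COMBINING_ACUTE) -> bytes:
    """Follow `base` with a combining mark and return the UTF-8 bytes."""
    if len(mark) != 1 or unicodedata.combining(mark) == 0:
        raise ValueError("mark must be a single combining character")
    return (base + mark).encode("utf-8")


print(with_combining("e"))  # b'e\xcc\x81'
```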

Similar macros could be used to construct upper-plane characters from surrogates, given only the math style and the Latin letter. For these, single surrogate characters will be needed in the macro definitions, with the matching second half of the pair determined algorithmically, probably using an \ifcase construct. Single surrogate characters thus need to be accepted as input, so that the macro definition can be written at all.
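
The algorithmic step can be sketched as follows, again in Python for illustration only. The names `MATH_CAPITAL_A`, `math_letter` and `to_surrogates` are hypothetical, and the table assumes only two styles (bold and sans-serif) whose 52 letters are contiguous in the Mathematical Alphanumeric Symbols block; styles with gaps, such as italic or script, would need special cases.

```python
# Code point of capital A for each style (assumption: contiguous A-Z, a-z).
MATH_CAPITAL_A = {"bold": 0x1D400, "sans": 0x1D5A0}


def math_letter(style: str, letter: str) -> int:
    """Code point of the styled math letter for an ASCII Latin letter."""
    base = MATH_CAPITAL_A[style]
    if len(letter) == 1 and "A" <= letter <= "Z":
        return base + ord(letter) - ord("A")
    if len(letter) == 1 and "a" <= letter <= "z":
        return base + 26 + ord(letter) - ord("a")
    raise ValueError("expected a single ASCII Latin letter")


def to_surrogates(cp: int) -> tuple[int, int]:
    """Split a supplementary-plane code point into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


high, low = to_surrogates(math_letter("bold", "B"))
print(f"{high:04x} {low:04x}")  # d835 dc01
```

For these styles the high surrogate is always d835, so only the low surrogate has to be selected per letter, which is where an \ifcase would come in.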

OK, a clever macro programmer could change the catcodes locally, so that these characters are valid within the macro definition. But that really complicates things.

> Or we could map them to the
> standard "unknown" character (U+FFFD).  The latter case is more nasty
> and should definitely be forbidden -- the ^^ notation should only be
> used for "proper" characters (so instead of the above, the Unicode code
> point of the resulting Unicode character should be used, in this case
> ^^^^^1d400).

I disagree. 
The ^^ notation can be used in macros to create the required bytes, for writing out to a file other than the .dvi or .pdf output.
pdfTeX (or another engine) can then embed that file as a file object stream in the final PDF.
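
To show which bytes are at stake when such characters are written to an external file, here is a small Python sketch (the helper `utf8_bytes` is hypothetical). It relies on Python's strict UTF-8 codec rejecting a lone surrogate and on its `surrogatepass` error handler encoding one anyway; whether a given TeX engine behaves either way is not something this sketch establishes.

```python
def utf8_bytes(text: str, allow_lone_surrogates: bool = False) -> bytes:
    """Encode text as UTF-8, optionally letting lone surrogates through."""
    errors = "surrogatepass" if allow_lone_surrogates else "strict"
    return text.encode("utf-8", errors)


# The properly paired character U+1D400 becomes four bytes.
print(utf8_bytes("\U0001d400"))  # b'\xf0\x9d\x90\x80'

# A lone high surrogate is rejected by the strict codec ...
try:
    utf8_bytes("\ud835")
except UnicodeEncodeError:
    print("lone surrogate rejected")

# ... but can still be forced out as three bytes.
print(utf8_bytes("\ud835", allow_lone_surrogates=True))  # b'\xed\xa0\xb5'
```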

> 
>  Any thoughts?
> 
>    Best,
> 
>        Arthur

Hope this helps,

    Ross