[XeTeX] Bug fixes and new features related to Unicode character codes, surrogates, etc

Thu May 7 00:46:28 CEST 2015

On 6 May 2015 at 23:04, Arthur Reutenauer
<arthur.reutenauer at normalesup.org> wrote:
>   While working on these bugs, we also discussed how surrogate
> characters were handled in XeTeX.  Surrogate characters are the 2048
> code points that are used in UTF-16 to encode characters with code
> points above 65535: a pair of them makes up one Unicode character;
> however they're not meant to be used in isolation, even though they have
> code points like other characters (they're not just byte sequences).
>
>   Right now, XeTeX allows isolated surrogate characters, and also
> combines sequences such as ^^^^d835^^^^dc00 into one Unicode character.
> We want to flag the former case but are not sure how: should we make the
> characters invalid (with catcode 15)?  Or we could map them to the
> standard "unknown" character (U+FFFD).  The latter case is more nasty
> and should definitely be forbidden -- the ^^ notation should only be
> used for "proper" characters (so instead of the above, the Unicode code
> point of the resulting Unicode character should be used, in this case
> ^^^^^1d400).
>
>   Any thoughts?
>
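
For reference, the arithmetic behind the pair quoted above can be
sketched as follows.  This is only an illustration of the standard
UTF-16 decoding formula, not XeTeX's actual code; the function name
combine_surrogates is made up for this example.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Return the code point encoded by a UTF-16 surrogate pair."""
    # High (lead) surrogates occupy U+D800..U+DBFF.
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"not a high surrogate: {high:#06x}")
    # Low (trail) surrogates occupy U+DC00..U+DFFF.
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"not a low surrogate: {low:#06x}")
    # Each surrogate carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


code_point = combine_surrogates(0xD835, 0xDC00)
print(f"U+{code_point:04X}")  # U+1D400
```

So ^^^^d835^^^^dc00 denotes the same character as ^^^^^1d400, which
is why the latter is the form that ought to be written.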

A major difference between using catcode 15 and having the engine's
input filter substitute U+FFFD is that the former could be overridden
at the macro layer.  Whether that's a good thing or not depends a bit
on what happens if a document sets the catcodes back to (say) 12.

If you just get undefined characters and missing glyphs, then you get
what you asked for, and that should probably be allowed.  If the
internals can't reliably deal with an unpaired surrogate (e.g. it
crashes some font library API), then the engine had better ensure that
it doesn't easily happen, and U+FFFD is probably as good as anything.

If you do go for catcode 15, then (as suggested in the thread on
unicode-letters.def) it could be set in the macro layer, or the engine
could initialise these catcodes.  Doing it at the macro layer is
probably more in the spirit of the traditional catcode initialisation,
which is very minimalist.

As you say, combining ^^^^d835^^^^dc00 into one token is just wrong,
and I think it should do (twice) whatever you decide to do for
unpaired surrogates.
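
To make the "twice" concrete, here is a rough sketch of the U+FFFD
option applied per code point.  It is a model of the policy only, not
of how XeTeX's input filter is written; is_surrogate and
filter_surrogates are hypothetical names, and the input is assumed to
be a list of code points as produced by the ^^ notation.

```python
REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER


def is_surrogate(cp: int) -> bool:
    """True for any code point in the surrogate range U+D800..U+DFFF."""
    return 0xD800 <= cp <= 0xDFFF


def filter_surrogates(code_points: list[int]) -> list[int]:
    """Replace every surrogate code point, paired or not, with U+FFFD."""
    # Adjacent surrogates are deliberately NOT combined into one character.
    return [REPLACEMENT if is_surrogate(cp) else cp for cp in code_points]


# "A", then the pair d835 dc00, then "B".
filtered = filter_surrogates([0x41, 0xD835, 0xDC00, 0x42])
print([f"{cp:04X}" for cp in filtered])  # ['0041', 'FFFD', 'FFFD', '0042']
```

The pair is treated as two separate offenders rather than being
merged into U+1D400; the catcode 15 option would flag each of the two
in the same way.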

David