[XeTeX] xetex and the unicode bidirectional algorithm.

mskala at ansuz.sooke.bc.ca
Mon Dec 9 15:16:03 CET 2013


On Mon, 9 Dec 2013, Philip Taylor wrote:
> Keith -- could you possibly supply an example of
> "a properly encoded utf-8 string" from which it
> can be unambiguously determined whether the string
> "sang" is an English word (the past tense of "sing")

I'll probably regret pointing this out, and the characters involved have
been deprecated since Unicode 5, but:

   U+E0001 U+E0065 U+E006E U+0073 U+0061 U+006E U+0067

or in UTF-8 bytes:

   f3 a0 80 81 f3 a0 81 a5 f3 a0 81 ae 73 61 6e 67
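
For anyone who wants to check those bytes, here is a minimal Python sketch that
rebuilds the tagged string and reads the language code back out.  The function
and constant names are made up for illustration and do not come from any
library; the only thing assumed is that tag characters mirror ASCII at an
offset of 0xE0000, with U+E0001 LANGUAGE TAG as the introducer.

```python
# Illustrative sketch only: the names below are hypothetical, not a library API.
LANGUAGE_TAG = "\U000E0001"   # U+E0001 LANGUAGE TAG (introducer)
TAG_OFFSET = 0xE0000          # tag characters mirror ASCII at this offset


def tag_language(lang: str, text: str) -> str:
    """Prefix text with a Plane 14 language tag such as 'en'."""
    tag_chars = "".join(chr(TAG_OFFSET + ord(c)) for c in lang)
    return LANGUAGE_TAG + tag_chars + text


def read_language_tag(s: str) -> str:
    """Return the ASCII language code of a leading tag, or '' if none."""
    if not s.startswith(LANGUAGE_TAG):
        return ""
    code = []
    for c in s[1:]:
        # Stop at the first character outside the tag range U+E0020..U+E007E.
        if not 0xE0020 <= ord(c) <= 0xE007E:
            break
        code.append(chr(ord(c) - TAG_OFFSET))
    return "".join(code)


tagged = tag_language("en", "sang")
utf8_hex = " ".join(f"{b:02x}" for b in tagged.encode("utf-8"))
print(utf8_hex)                   # f3 a0 80 81 f3 a0 81 a5 f3 a0 81 ae 73 61 6e 67
print(read_language_tag(tagged))  # en
```

An untagged "sang" gives an empty language code, which is the ambiguity the
original question was about.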

The Web form you mentioned sanitizes away the special characters.  I don't
think that's unique to "tags" - it seems to also block everything outside
the Basic Multilingual Plane.  Bad form for something claiming to be an
authoritative analyser of Unicode strings.
-- 
Matthew Skala
mskala at ansuz.sooke.bc.ca                 People before principles.
http://ansuz.sooke.bc.ca/

