[XeTeX] xetex and the unicode bidirectional algorithm.

mskala at ansuz.sooke.bc.ca
Mon Dec 9 16:32:05 CET 2013

On Mon, 9 Dec 2013, Khaled Hosny wrote:
> >    U+E0001 U+E0065 U+E006E U+0073 U+0061 U+006E U+0067
> And it is a kind of tagging, so beyond the scope of identifying the
> language of *untagged* text (which is the claim that spurred all this
> discussion).

The claim was "A properly encoded utf-8 string should contain everything
you need!".  If you forbid using Unicode tag characters, then you're
saying "It is impossible to encode language in Unicode when you're not
allowed to use the features designed for that purpose," which is not
an interesting statement.

Yes, of course some kind of tagging is needed.  Keith seems to think that
the tagging will magically come from "proper" UTF-8, and of course he's
wrong.  I think language tagging would be possible in pure Unicode, as the
string above demonstrates, but that's not a good way to do it.  The
original question, though, was about RTL versus LTR detection, not
language detection, and that's a different issue.
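
To make the quoted string concrete, here is a minimal Python sketch of how
software could read such an in-band tag.  It assumes the tag-character
layout as I understand it: U+E0001 introduces a language tag, and the
following code points in U+E0020..U+E007E mirror printable ASCII at an
offset of 0xE0000.  The function name split_language_tag is made up for
this illustration, and as far as I know U+E0001 is deprecated for this
purpose, which fits the point that this is not a good way to do it.

```python
LANGUAGE_TAG = 0xE0001                  # introducer for a language tag
TAG_FIRST, TAG_LAST = 0xE0020, 0xE007E  # tag characters mirroring ASCII


def split_language_tag(text):
    """Return (language, rest); language is None if text has no leading tag."""
    if not text or ord(text[0]) != LANGUAGE_TAG:
        return None, text
    i = 1
    tag_chars = []
    # Collect tag characters and map each one back to its ASCII counterpart.
    while i < len(text) and TAG_FIRST <= ord(text[i]) <= TAG_LAST:
        tag_chars.append(chr(ord(text[i]) - 0xE0000))
        i += 1
    return "".join(tag_chars), text[i:]


# The string quoted at the top of this message.
sample = "\U000E0001\U000E0065\U000E006E\u0073\u0061\u006E\u0067"
print(split_language_tag(sample))  # ('en', 'sang')
```

The tag says "en" and the payload is "sang", so the language really is
carried inside the string, at the cost of invisible characters that most
software will not expect.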

Unicode specifies a way to detect RTL versus LTR, such that in many cases
it doesn't require tagging.  Unicode's way of doing it may or may not be a
good one, but we cannot reasonably pretend that it doesn't exist.  The
Unicode bidi algorithm does exist.  XeTeX does not implement the Unicode
bidi algorithm.  The interesting remaining question is whether XeTeX
should implement it.  I tend to think not - because if we implement it,
people will blame us for its failings.  It'd also be a lot of work, break
compatibility with the rest of the TeX world, STILL require tagging in
many cases, and so on.
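
For readers who have not looked at how the untagged detection works, the
sketch below shows only the simplest piece of it: picking a paragraph's
base direction from its first strong character, roughly what rules P2 and
P3 of the bidi algorithm describe.  It is a deliberate simplification, not
the full algorithm: it ignores isolates, explicit embeddings, and
everything that happens after the base direction is chosen.  The function
name base_direction and the default argument are my own choices for the
example; unicodedata.bidirectional is the standard-library call that
returns a character's bidi class.

```python
import unicodedata


def base_direction(text, default="ltr"):
    """Guess paragraph direction from the first strong character."""
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi == "L":            # strong left-to-right (e.g. Latin letters)
            return "ltr"
        if bidi in ("R", "AL"):    # strong right-to-left (Hebrew, Arabic)
            return "rtl"
    # No strong character at all (digits, punctuation, empty string).
    return default


print(base_direction("hello"))                          # ltr
print(base_direction("\u05E9\u05DC\u05D5\u05DD"))       # rtl
print(base_direction("TeX \u05E2\u05D1\u05E8\u05D9\u05EA"))  # ltr
```

The last call illustrates why tagging would still be needed in many
cases: a Hebrew paragraph that happens to begin with the word "TeX" is
classified as left-to-right, and only out-of-band information can say
otherwise.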

Matthew Skala
mskala at ansuz.sooke.bc.ca                 People before principles.
