[XeTeX] xetex and the unicode bidirectional algorithm.

Mon Dec 9 23:30:43 CET 2013

On Mon, Dec 09, 2013 at 09:32:05AM -0600, mskala at ansuz.sooke.bc.ca wrote:
> On Mon, 9 Dec 2013, Khaled Hosny wrote:
> > >    U+E0001 U+E0065 U+E006E U+0073 U+0061 U+006E U+0067
> >
> > And it is a kind of tagging, so beyond the scope of identifying the
> > language of *untagged* text (which is the claim that spurred all this
> > discussion).
> 
> The claim was "A properly encoded utf-8 string should contain everything
> you need!".

You are reading too much into this statement. The original claim was
that you don’t need to tag a Unicode string to be able to identify its
language, which is not the case, and your method is just a (deprecated)
form of tagging, so it does not prove the original claim.
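
To make the point concrete, here is a minimal Python sketch of what
reading such a tag involves. It assumes the deprecated scheme in which
U+E0001 opens a language tag and U+E0020..U+E007E mirror ASCII; the
function name split_language_tag is made up for this illustration and
is not part of any library.

```python
# Hypothetical helper, for illustration only.
LANGUAGE_TAG = 0xE0001                  # U+E0001 LANGUAGE TAG (deprecated)
TAG_FIRST, TAG_LAST = 0xE0020, 0xE007E  # tag characters mirroring ASCII


def split_language_tag(text):
    """Return (tag, rest); tag is None when text has no leading U+E0001."""
    if not text or ord(text[0]) != LANGUAGE_TAG:
        return None, text
    i = 1
    tag_chars = []
    # Collect tag characters and map each one back to its ASCII counterpart.
    while i < len(text) and TAG_FIRST <= ord(text[i]) <= TAG_LAST:
        tag_chars.append(chr(ord(text[i]) - 0xE0000))
        i += 1
    return "".join(tag_chars), text[i:]


# The string quoted at the top of this message.
sample = "\U000E0001\U000E0065\U000E006E\u0073\u0061\u006E\u0067"
print(split_language_tag(sample))    # tagged input
print(split_language_tag("sang"))    # untagged input: no language to recover
```

The tag is only recovered because someone put it there in the first
place; for untagged input there is nothing to read.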

> If you forbid using Unicode tag characters, then you're
> saying "It is impossible to encode language in Unicode when you're not
> allowed to use the features designed for that purpose," which is not
> an interesting statement.

I’m not forbidding anything, but the grand OP’s issue was that he cannot
manually tag the text, and I don’t see how changing the form of tagging
solves anything, since one still needs to do it manually.

> Yes, of course some kind of tagging is needed.  Keith seems to think that
> the tagging will magically come from "proper" UTF-8, and of course he's
> wrong.  I think language tagging would be possible in pure Unicode, as the
> string above demonstrates, but that's not a good way to do it.  The really
> original question had to do with RTL versus LTR detection, not language
> detection, and that's a different issue.

We are not even limited to plain text, since we are dealing with a
Wikipedia article here, which is tagged text, so what form of tagging
to use is not even an issue. The tagging itself is the issue.

> Unicode specifies a way to detect RTL versus LTR, such that in many cases
> it doesn't require tagging.

Right, and the grand OP was advised to use that, and it is very
reliable, but it solves only half the issue, since it does not help with
the language tagging that is needed for other things like hyphenation
patterns or using different typographic conventions, different fonts and
so on for different languages, which IMO is a requirement for any
typesetting job for anything but the most trivial of texts.
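
As a rough illustration of why direction detection needs no tagging
while language detection does, here is a simplified Python sketch of
the "first strong character" idea behind the Unicode bidi algorithm's
paragraph-level rules. It is an assumption-laden simplification (it
ignores directional isolates and everything past the paragraph level),
and the name base_direction is made up for this example.

```python
import unicodedata


def base_direction(text, default="ltr"):
    """Guess the paragraph direction from the first strong character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":            # strong left-to-right
            return "ltr"
        if bidi_class in ("R", "AL"):    # strong right-to-left (AL = Arabic letter)
            return "rtl"
    # No strong character at all (digits, punctuation, empty string).
    return default


print(base_direction("hello"))
print(base_direction("123 \u05e9\u05dc\u05d5\u05dd"))  # digits, then Hebrew
print(base_direction("\u0633\u0644\u0627\u0645"))      # Arabic-script word
```

The last word is spelled identically in Arabic and Persian, and the
function reports the same direction either way. The character
properties tell you which way the text runs, not which hyphenation
patterns or fonts to load for it.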

> Unicode's way of doing it may or may not be a good one, but we cannot
> reasonably pretend that it doesn't exist.  The Unicode bidi algorithm
> does exist.  XeTeX does not implement the Unicode bidi algorithm.

No one claimed otherwise anywhere in this thread, so I’m not sure what
you are trying to disprove here.

> The interesting remaining question is whether XeTeX should implement
> it.  I tend to think not - because if we implement it, people will
> blame us for its failings.  It'd also be a lot of work, break
> compatibility with the rest of the TeX world, STILL require tagging in
> many cases, and so on.

On the contrary, I think XeTeX should, but it is not a trivial job and
so is unlikely to be done.

Regards,
Khaled