[XeTeX] xetex and the unicode bidirectional algorithm.

Mon Dec 9 16:56:15 CET 2013

In my particular case, I have citations in (for example) the Arabic
Wikipedia which cite references on English or Turkish webpages (the
arwiki article on 'Istanbul' is one example).  The original author of
the article did not explicitly mark the language of the reference,
because the Unicode bidirectional algorithm did a perfect job of
rendering the cited page title LTR in an otherwise RTL context.
When I translate this to XeLaTeX, the entire citation is garbled.
XeLaTeX/polyglossia does render each individual word LTR (using the
directionality implied by its Unicode code block), but the words
themselves are laid out RTL and the punctuation is a mess, because
XeLaTeX does not implement the bidi algorithm's mechanism for
inferring the directionality of 'weak' and 'neutral' ('soft')
characters.  (The original citations also don't necessarily add <bdi>
tags where necessary, but that appears to be an easily fixed fault of
the citation template.)
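
To make the strong/weak/neutral distinction concrete, here is a small
sketch using Python's standard unicodedata module.  The grouping of
bidi classes follows my reading of UAX #9; the helper names
(bidi_category, describe) are made up for illustration.

```python
import unicodedata

# Bidi classes grouped as in UAX #9 (my reading of the spec).
STRONG = {"L", "R", "AL"}
WEAK = {"EN", "ES", "ET", "AN", "CS", "NSM", "BN"}
NEUTRAL = {"B", "S", "WS", "ON"}


def bidi_category(ch):
    """Return 'strong', 'weak', 'neutral' or 'other' for one character."""
    cls = unicodedata.bidirectional(ch)
    if cls in STRONG:
        return "strong"
    if cls in WEAK:
        return "weak"
    if cls in NEUTRAL:
        return "neutral"
    # Explicit formatting characters (LRE, RLI, ...) or unassigned code points.
    return "other"


def describe(text):
    """List (character, bidi class, category) for every character in text."""
    return [(ch, unicodedata.bidirectional(ch), bidi_category(ch)) for ch in text]


if __name__ == "__main__":
    for ch, cls, cat in describe("Istanbul, (2013)"):
        print(repr(ch), cls, cat)
```

Only the letters carry a direction of their own; the comma, space,
parentheses and digits have to get theirs from context, which is the
step XeLaTeX leaves out.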

My understanding from this discussion is that I should implement the
Unicode bidi algorithm myself in my article preprocessor, so that it
explicitly annotates the directionality of weak and neutral characters
before feeding the output to xelatex.  That work won't help others who
find themselves in a similar situation (or document authors who would
prefer not to have to explicitly annotate every LTR embedding), but it
should be a reasonable solution to my particular problem.
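
A very rough sketch of what such a preprocessor pass might look like,
in Python.  This is not the full Unicode bidi algorithm: it only
handles the simple case of LTR runs inside an RTL paragraph, ignores
the special rules for numbers and paired brackets, and does no LaTeX
escaping.  The function name wrap_ltr_runs is made up, and I am
assuming that \LR{...} (from the bidi package that polyglossia loads
for RTL languages) is the right markup to emit.

```python
import unicodedata


def strong_direction(ch):
    """Return 'L' or 'R' for strong characters, None for weak/neutral ones."""
    cls = unicodedata.bidirectional(ch)
    if cls == "L":
        return "L"
    if cls in ("R", "AL"):
        return "R"
    return None


def wrap_ltr_runs(text, macro="\\LR"):
    """Wrap each LTR run of an RTL paragraph in macro{...}.

    A run starts at a strong LTR character and extends to the last strong
    LTR character before the next strong RTL character, so punctuation and
    spaces *between* LTR words stay inside the run, while trailing
    punctuation is left to the surrounding RTL context.
    """
    out = []
    i, n = 0, len(text)
    while i < n:
        if strong_direction(text[i]) != "L":
            out.append(text[i])
            i += 1
            continue
        end = i  # index of the last strong LTR character seen so far
        j = i + 1
        while j < n and strong_direction(text[j]) != "R":
            if strong_direction(text[j]) == "L":
                end = j
            j += 1
        out.append(macro + "{" + text[i:end + 1] + "}")
        i = end + 1
    return "".join(out)


if __name__ == "__main__":
    # Arabic word, an English page title, then more Arabic (written as escapes).
    citation = "\u0645\u0631\u062c\u0639 Istanbul, Turkey. \u0646\u0635"
    print(wrap_ltr_runs(citation))
```

For that sample citation the Latin run comes out as
\LR{Istanbul, Turkey}, with the trailing full stop left outside the
braces for the RTL context to place.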
 --scott

On Mon, Dec 9, 2013 at 10:32 AM,  <mskala at ansuz.sooke.bc.ca> wrote:
> On Mon, 9 Dec 2013, Khaled Hosny wrote:
>> >    U+E0001 U+E0065 U+E006E U+0073 U+0061 U+006E U+0067
>>
>> And it is a kind of tagging, so beyond the scope of identifying the
>> language of *untagged* text (which is the claim that spurred all this
>> discussion).
>
> The claim was "A properly encoded utf-8 string should contain everything
> you need!".  If you forbid using Unicode tag characters, then you're
> saying "It is impossible to encode language in Unicode when you're not
> allowed to use the features designed for that purpose," which is not
> an interesting statement.
>
> Yes, of course some kind of tagging is needed.  Keith seems to think that
> the tagging will magically come from "proper" UTF-8, and of course he's
> wrong.  I think language tagging would be possible in pure Unicode, as the
> string above demonstrates, but that's not a good way to do it.  The really
> original question had to do with RTL versus LTR detection, not language
> detection, and that's a different issue.
>
> Unicode specifies a way to detect RTL versus LTR, such that in many cases
> it doesn't require tagging.  Unicode's way of doing it may or may not be a
> good one, but we cannot reasonably pretend that it doesn't exist.  The
> Unicode bidi algorithm does exist.  XeTeX does not implement the Unicode
> bidi algorithm.  The interesting remaining question is whether XeTeX
> should implement it.  I tend to think not - because if we implement it,
> people will blame us for its failings.  It'd also be a lot of work, break
> compatibility with the rest of the TeX world, STILL require tagging in
> many cases, and so on.
>
> --
> Matthew Skala
> mskala at ansuz.sooke.bc.ca                 People before principles.
> http://ansuz.sooke.bc.ca/

-- 
                         ( http://cscott.net/ )