[XeTeX] xetex and the unicode bidirectional algorithm.

C. Scott Ananian cscott at cscott.net
Tue Dec 10 17:11:27 CET 2013

On Tue, Dec 10, 2013 at 6:09 AM, Zdenek Wagner <zdenek.wagner at gmail.com> wrote:
> 2013/12/10 Keith J. Schultz <keithjschultz at web.de>:
>> I will repeat I do not know Vietnamese so I can not give you
>> Now, if "sang" is true Vietnamese and not a latinized form stand corrected! Though I have
> Yes, it is true Vietnamese word. I do not know Vietnamese, I could


...which is indeed the issue I am attempting to deal with (trying to
put the discussion back on track) -- a bunch of user-authored content
which looks correct to a native speaker when using the Unicode bidi
algorithm (implemented in the browser).  Language tags are only
applied sporadically when needed to correct some obvious issue --
although the future Visual Editor project at Wikimedia hopes to make
language tagging a more integrated part of the editing process.

Language tagging uses the HTML <span lang="...." dir="...."> standard.
Directionality tagging uses <bdo> and <bdi> where necessary.  But
again, the point of the bidi algorithm is to avoid the necessity of
manual tagging in many cases.
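
To make concrete why manual tagging is often unnecessary: the bidi
algorithm can infer a paragraph's base direction from the text itself,
by looking at its first strongly directional character.  What follows
is only a minimal sketch of that one step (a simplified reading of
rules P2/P3 of UAX #9), not what any browser actually runs.  The
function name base_direction is made up for illustration, and the
sketch ignores isolates and explicit embedding controls.

```python
import unicodedata


def base_direction(text: str) -> str:
    """Guess the paragraph direction from the first strong character.

    Simplified: isolates and explicit embedding controls are not
    handled, unlike the full Unicode bidirectional algorithm.
    """
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("R", "AL"):  # Hebrew-like or Arabic-like letter
            return "rtl"
        if bidi_class == "L":  # Latin, Cyrillic, Greek, etc.
            return "ltr"
    # No strong character found: fall back to left-to-right.
    return "ltr"
```

Digits, spaces and punctuation are not strong characters, so a
paragraph that opens with a number followed by Arabic text still comes
out as right-to-left without any tagging by the author.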

Ultimately, Wikipedia's goal is to let the largest possible number of
individual authors create encyclopedic content in their language as
easily as possible.  Our greatest challenge is the "as easily as
possible" part.  We can't impose language tagging as a barrier to
entry when it is not necessary for the author's text to be readable
and useful to the public.  We can encourage it in order to obtain
good hyphenation of embedded texts, but in our case that must be an
optional enhancement, not a requirement in order for the text to be
read.  (Which is why, if we did do automated language guessing, it
would likely be primarily to *disable* hyphenation when we detect an
embedded text whose language differs from the one currently selected.
That is the safe option; we'll sacrifice some beauty but preserve the
legibility of the text -- which is our foremost concern.  We can't use
automated language guessing to second-guess the Unicode bidi
algorithm, because the text *as it appears in the browser* is the text
which has been proof-read by our editors, and must be considered
canonically correct.)
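
A rough sketch of the "safe option" above, under loud assumptions: it
guesses the *script* of a word from Unicode character names, which is
a much cruder signal than real language detection, and the names
script_of and may_hyphenate are hypothetical.  It only ever switches
hyphenation off; it never changes the text or its direction.

```python
import unicodedata


def script_of(ch: str):
    """Crude script guess: the first word of the Unicode character name.

    This is a heuristic assumption, not a proper Script property lookup.
    """
    if not ch.isalpha():
        return None  # digits, punctuation and spaces carry no script
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else None


def may_hyphenate(word: str, expected_script: str) -> bool:
    """Allow hyphenation only if every letter is in the expected script."""
    scripts = {script_of(ch) for ch in word} - {None}
    return scripts <= {expected_script}
```

Note the limitation: a Vietnamese word such as "sang" inside English
text is still Latin script, so this check would not catch it; a real
implementation would need an actual language guesser for that case.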

                         ( http://cscott.net/ )
