[XeTeX] xetex and the unicode bidirectional algorithm.

Philip Taylor P.Taylor at Rhul.Ac.Uk
Mon Dec 9 23:51:58 CET 2013



Keith J. Schultz wrote:
> Hi Philip,
> 
> 1) I do not know Vietnamese!
> 
> 2) If I did, using the proper BMP would give me the answer,
>      as "sang" would be a sequence of singular octets, and Vietnamese
>      would use multi-byte sequences!
> 
> Case closed! As I mentioned, there are often ways used to reduce the length of
> the multi-byte sequences. In that case one has to know the process used to get the proper
> Unicode character code!

It is not necessary to "know" a language in order to be able to
algorithmically determine in which language a particular stretch
of text is written, if such algorithmic determination is possible.
I do not "know" Hebrew, but even I know that "בית דין‎" is Hebrew
and that "你好" is not.  What I do not know (and what I challenge
you to tell us) is whether "sang" is English or Vietnamese.
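
A minimal Python sketch of the distinction drawn here, assuming that the
first word of each character's Unicode name is a good enough stand-in for
its script. The helper name scripts_used is hypothetical, and this is a
crude heuristic rather than a real script or language detector.

```python
import unicodedata

def scripts_used(text):
    """Return the set of first words of the Unicode names of the letters in text."""
    found = set()
    for ch in text:
        # Skip spaces, punctuation and invisible directional marks.
        if ch.isalpha():
            # e.g. "HEBREW LETTER BET" -> "HEBREW", "LATIN SMALL LETTER S" -> "LATIN"
            found.add(unicodedata.name(ch).split()[0])
    return found

print(scripts_used("\u05d1\u05d9\u05ea \u05d3\u05d9\u05df"))  # the Hebrew example
print(scripts_used("\u4f60\u597d"))                            # the Chinese example
print(scripts_used("sang"))                                    # Latin, language unknown
```

The Hebrew and Chinese strings can be told apart because they use different
characters. "sang" reports only Latin letters, which says nothing about
whether the word is English or Vietnamese.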

You wrote: "for efficiency reasons, utf-8 strings are not properly
encoded and programs assume a particular language, to save space."

I invited you to tell us (the XeTeX list members, that is) what
would be a "properly encoded utf-8 string" for the sequence
"sang" which would enable a computer algorithm to determine
whether that string was "sang" (Vietnamese) or "sang" (English).
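
For concreteness, a small Python sketch of what UTF-8 produces for this
string. The helper name utf8_octets is hypothetical, and the toned
spelling "s\u00e1ng" is included only as an assumed contrast: it is a
different sequence of characters, not another encoding of "sang".

```python
def utf8_octets(text):
    """Return the UTF-8 encoding of text as a list of integer octets."""
    return list(text.encode("utf-8"))

plain = utf8_octets("sang")
print(plain)                          # [115, 97, 110, 103]
print(all(b < 0x80 for b in plain))   # True: every octet is plain ASCII

# A different string: U+00E1 (a with acute) takes two octets in UTF-8.
print(utf8_octets("s\u00e1ng"))       # [115, 195, 161, 110, 103]
```

The four characters s, a, n, g map to the same four octets whatever
language the writer had in mind, so the encoded bytes alone carry no
language information.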

I am still hoping that you will be able to tell us what that
properly encoded utf-8 string is, rather than just metaphorically
waving your arms in the air while throwing around phrases such as
"proper BMP", "singular octets" and "multi-byte sequences".

Philip Taylor




