[XeTeX] xetex and the unicode bidirectional algorithm.
P.Taylor at Rhul.Ac.Uk
Mon Dec 9 23:51:58 CET 2013
Keith J. Schultz wrote:
> Hi Phillip,
> 1) I do not know Vietnamese!
> 2) If I did uses the proper BMP would give me the answer.
> As "sang" would be a sequence of singular octets, and Vietnamese
> would use multi-byte sequences!
> case closed! Like I mentioned there are often ways used to reduce the length of
> the multibyte sequences. In that case one has to know the processed use to get the proper
> unicode character code!
It is not necessary to "know" a language in order to be able to
algorithmically determine in which language a particular stretch
of text is written, if such algorithmic determination is possible.
I do not "know" Hebrew, but even I know that "בית דין" is Hebrew
and that "你好" is not. What I do not know (and what I challenge
you to tell us) is whether "sang" is English or Vietnamese.
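
To make the distinction concrete, here is a minimal sketch in Python. It is an illustration, not part of the original exchange. The helper name `scripts` is hypothetical, and taking the first word of each character's Unicode name is only a rough stand-in for a real script property lookup. It separates the Hebrew and Chinese examples but can say no more about "sang" than that its letters are Latin.

```python
import unicodedata

def scripts(text):
    """Return the first word of the Unicode name of each letter in text.

    Assumption: the first word of the character name ("HEBREW", "CJK",
    "LATIN") is used as a rough proxy for the script. This is a sketch,
    not a full implementation of the Unicode Script property.
    """
    found = set()
    for ch in text:
        if ch.isalpha():  # skip spaces and punctuation
            found.add(unicodedata.name(ch).split()[0])
    return found

print(scripts("בית דין"))  # Hebrew letters
print(scripts("你好"))      # CJK ideographs
print(scripts("sang"))     # Latin letters, whether English or Vietnamese
```

The script narrows the candidate languages, but it cannot choose between two languages that share the Latin script.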
You wrote : "for efficiency reasons, utf-8 strings are not properly
encoded and programs assume a particular language, to save space."
I invited you to tell us (the XeTeX list members, that is) what
would be a "properly encoded utf-8 string" for the sequence
"sang" which would enable a computer algorithm to determine
whether that string was "sang" (Vietnamese) or "sang" (English).
I am still hoping that you will be able to tell us what that
properly encoded utf-8 string is, rather than just metaphorically
waving your arms in the air while throwing around phrases such as
"proper BMP", "singular octets" and "multi-byte sequences".
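
The encoding question can be checked directly. The following Python sketch is an illustration under stated assumptions: the helper `utf8_hex` and the variable names are hypothetical, and the string with a diacritic is included only to show what a multi-byte sequence looks like. It shows that the four letters "sang" encode to the same four single-byte values in UTF-8 whichever language the writer had in mind.

```python
def utf8_hex(text):
    """Return the UTF-8 encoding of text as space-separated hex bytes."""
    return " ".join(f"{b:02x}" for b in text.encode("utf-8"))

english = "sang"     # intended by the writer as English
vietnamese = "sang"  # intended by the writer as Vietnamese

print(utf8_hex(english))     # 73 61 6e 67
print(utf8_hex(vietnamese))  # 73 61 6e 67

# A multi-byte sequence appears only when a character lies outside ASCII,
# for example U+00E1 (a with acute), which encodes as the two bytes c3 a1.
print(utf8_hex("s\u00e1ng"))  # 73 c3 a1 6e 67
```

Since the two byte sequences for "sang" are identical, nothing in the encoding itself records the language. That information would have to come from outside the string, such as markup or a language tag.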