[XeTeX] xetex and the unicode bidirectional algorithm.

Tue Dec 10 11:40:29 CET 2013

Hi Philip,

I will repeat: I do not know Vietnamese, so I cannot give you
the UTF-8 sequence for it. All I can say is that in UTF-8 the letters particular to
Vietnamese will be encoded as multi-byte sequences, whereas the English letters will be just one byte each.
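
A minimal sketch in Python of the byte counts described above. The helper name
`utf8_lengths` and the accented sample string are assumptions chosen purely for
illustration; they are not taken from this thread.

```python
# Sketch: count the UTF-8 bytes used by each character of a string.
def utf8_lengths(text):
    """Return (character, byte count) pairs for the UTF-8 encoding of text."""
    return [(ch, len(ch.encode("utf-8"))) for ch in text]


plain = "sang"            # ASCII letters only
accented = "s\u00e1ng"    # assumed sample: precomposed U+00E1 (a with acute)

print(utf8_lengths(plain))       # every letter takes one byte
print(utf8_lengths(accented))    # the accented letter takes two bytes
print(plain.encode("utf-8").hex(" "))
print(accented.encode("utf-8").hex(" "))
```

Note that the unaccented string produces the same four bytes whatever language
it is meant to be in; only the accented letter changes the byte sequence.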

Now, I also mentioned that differentiating Western languages is a different matter!
"sang" in English and "sang" in German and Austrian cannot, taken on its own, be differentiated
as to which language it belongs to! All Latin characters/letters.
Now, if "sang" is true Vietnamese and not a latinized form, I stand corrected! Though I have
a feeling it is latinized! If we are talking of the phonetic representation, then an analysis
at the text level, beyond the level of the single word, is required.

It has been mentioned by others that there seems to be a lack of multilingual UTF-8
editors (input methods) on the one side and, on the other side, that Xe(La)TeX lacks a
proper implementation of the Unicode standard.

It is not the standard that is the problem, but the implementation of input and the
implementation of the output method. 

True enough, Unicode is by far not finished and is still evolving, with all the caveats
involved. Yet the problem here does not arise from the Unicode standard
or UTF-8 encoding/decoding being inadequate, but from their implementation.
The culprit is not UTF-8!

On 09.12.2013 at 23:51, Philip Taylor <P.Taylor at Rhul.Ac.Uk> wrote:

> 
> 
> Keith J. Schultz wrote:
>> Hi Philip,
>> 
>> 1) I do not know Vietnamese!
>> 
>> 2) If I did, using the proper BMP would give me the answer,
>>     as "sang" would be a sequence of singular octets, and Vietnamese
>>     would use multi-byte sequences!
>> 
>> Case closed! Like I mentioned, there are often ways used to reduce the length of
>> the multi-byte sequences. In that case one has to know the process used to get the proper
>> Unicode character code!
> 
> It is not necessary to "know" a language in order to be able to
> algorithmically determine in which language a particular stretch
> of text is written, if such algorithmic determination is possible.
> I do not "know" Hebrew, but even I know that "בית דין‎" is Hebrew
> and that "你好" is not.  What I do not know (and what I challenge
> you to tell us) is whether "sang" is English or Vietnamese.
> 
> You wrote :  "for efficiency reasons, utf-8 strings are not properly
> encoded and programs assume a particular language, to save space."
> 
> I invited you to tell us (the XeTeX list members, that is) what
> would be a "properly encoded utf-8 string" for the sequence
> "sang" which would enable a computer algorithm to determine
> whether that string was "sang" (Vietnamese) or "sang" (English).
> 
> I am still hoping that you will be able to tell us what that
> properly encoded utf-8 string is, rather than just metaphorically
> waving your arms in the air while throwing around phrases such as
> "proper BMP", "singular octets" and "multi-byte sequences".
> 
> Philip Taylor
> 
> 
>