[XeTeX] xetex and the unicode bidirectional algorithm.

Zdenek Wagner zdenek.wagner at gmail.com
Tue Dec 10 12:09:40 CET 2013

2013/12/10 Keith J. Schultz <keithjschultz at web.de>:
> Hi Philip,
> I will repeat: I do not know Vietnamese, so I cannot give you
> the utf-8 sequence for it. All I can say is that in utf-8 the individual letters will
> be encoded as multi-byte sequences, whereas the English letters will be just one byte.
It has no relation to English; it is just because these characters
have codepoints less than 128. In Czech some characters will be
encoded as one byte, some as two bytes. The character "s" may appear
in English, German, Czech, Hungarian, Spanish and many other
languages. You have not answered Philip's question: what is the UTF-8
sequence that distinguishes English "s" from Czech "s", from Vietnamese
"s", from Hungarian "s", etc.?
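
A small Python sketch can make the point concrete. The sample characters and the helper name utf8_len are chosen only for illustration; the sketch shows that the byte length depends on the codepoint alone and that "sang" yields the same bytes whatever language the writer had in mind.

```python
# Illustration only: UTF-8 byte length is a function of the codepoint,
# not of the language the text is written in.

def utf8_len(ch):
    """Number of bytes needed to encode one character in UTF-8."""
    return len(ch.encode("utf-8"))

# Hypothetical sample: plain "s", Czech s-caron, Czech r-caron,
# a-acute (Czech and Vietnamese), Vietnamese e with circumflex and dot below.
samples = ["s", "\u0161", "\u0159", "\u00e1", "\u1ec7"]

for ch in samples:
    print(f"U+{ord(ch):04X} {ch}: {utf8_len(ch)} byte(s)")

# The same four letters typed as an English or as a Vietnamese word
# produce exactly the same byte sequence.
english = "sang"
vietnamese = "sang"
print(english.encode("utf-8") == vietnamese.encode("utf-8"))  # True
```

Codepoints below 128 take one byte, the Czech accented letters take two, and the Vietnamese letter with two diacritics takes three. Nothing in the byte sequence records the language.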

> Now, I also mentioned that differentiating Western languages poses a different matter!
> "sang" in English and "sang" in German and Austrian cannot be individually differentiated
> as to which language it belongs to! All Latin characters/letters.
> Now, if "sang" is true Vietnamese and not a latinized form, I stand corrected! Though I have
> a feeling it is latinized! If we are talking of the phonetic representation, then an analysis
> on text and beyond singular text level is required.
Yes, it is a true Vietnamese word. I do not know Vietnamese; I could
only verify it with Google Translate, but I know that Vietnamese uses
the Latin alphabet with accents. And of course, some words do not have
accents. It is the same in Czech: we also use accented characters, but
many words do not have them. And, for instance, "strom" in Czech has a
different meaning than "Strom" in German.
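
As a rough illustration, the following Python sketch guesses the script of a string from the first word of each character's Unicode name. This is a heuristic assumed here for brevity, not the real Unicode Script property, and the function name scripts is hypothetical. It can separate Hebrew from Chinese, but it returns the same answer for every Latin-script word.

```python
import unicodedata

def scripts(text):
    """Guess the scripts used in text from Unicode character names.

    Heuristic only: takes the first word of each letter's name
    (e.g. "LATIN", "HEBREW", "CJK"); non-letters are ignored.
    """
    return {unicodedata.name(ch).split()[0] for ch in text if ch.isalpha()}

print(scripts("\u05d1\u05d9\u05ea"))  # Hebrew letters
print(scripts("\u4f60\u597d"))        # Chinese characters
print(scripts("sang"))                # English or Vietnamese?
print(scripts("strom"))               # Czech or German?
```

The script is recoverable from the characters, but "sang" and "strom" only ever report Latin, so the language has to be supplied from outside the text.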

> It has been mentioned by others that there seems to be a lack of multi-lingual utf-8
> editors (input methods); on the other side, Xe(La)TeX also lacks an implementation that
> properly handles the Unicode standard.
Unicode is not a typographic standard, and programs from the TeX world
deal with typography. If you want to achieve typographically good
output, you have to use language-specific rules, i.e. the languages must
be properly tagged. Once you tag the language, it will appear right in
the Xe(La)TeX output. If you are interested in Unicode only and not in
typography, why do you wish to use a typographic tool?

I can explain it another way. If you wish to connect two pieces of
wood, you can use either a nail or a screw. If you use a screw, you
must first make a hole and then screw the pieces together. However, if
you do not like to make a hole and want to use a hammer only, why do you
bother with a screw instead of using a nail?

> It is not the standard that is the problem, but the implementation of input and the
> implementation of the output method.
> True enough, Unicode is by far not finished and is still evolving with all the caveats
> involved. Yet, the problem here does not arise out of the fact that the Unicode standard
> and utf-8 encoding/decoding is inadequate, but from its implementation.
> The culprit is not utf-8!
> On 09.12.2013 at 23:51, Philip Taylor <P.Taylor at Rhul.Ac.Uk> wrote:
>> Keith J. Schultz wrote:
>>> Hi Philip,
>>> 1) I do not know Vietnamese!
>>> 2) If I did, using the proper BMP would give me the answer.
>>>     As "sang" would be a sequence of singular octets, and Vietnamese
>>>     would use multi-byte sequences!
>>> case closed! Like I mentioned, there are often ways used to reduce the length of
>>> the multibyte sequences. In that case one has to know the process used to get the proper
>>> unicode character code!
>> It is not necessary to "know" a language in order to be able to
>> algorithmically determine in which language a particular stretch
>> of text is written, if such algorithmic determination is possible.
>> I do not "know" Hebrew, but even I know that "בית דין‎" is Hebrew
>> and that "你好" is not.  What I do not know (and what I challenge
>> you to tell us) is whether "sang" is English or Vietnamese.
>> You wrote :  "for efficiency reasons, utf-8 strings are not properly
>> encoded and programs assume a particular language, to save space."
>> I invited you to tell us (the XeTeX list members, that is) what
>> would be a "properly encoded utf-8 string" for the sequence
>> "sang" which would enable a computer algorithm to determine
>> whether that string was "sang" (Vietnamese) or "sang" (English).
>> I am still hoping that you will be able to tell us what that
>> properly encoded utf-8 string is, rather than just metaphorically
>> waving your arms in the air while throwing around phrases such as
>> "proper BMP", "singular octets" and "multi-byte sequences".
>> Philip Taylor

Zdeněk Wagner
