[XeTeX] xetex and the unicode bidirectional algorithm.

Keith J. Schultz schultzk at uni-trier.de
Mon Dec 9 13:28:46 CET 2013


Hi Khaled,

I would agree with you if the text were not encoded in Unicode!
A properly encoded UTF-8 string should contain everything you need!
Unfortunately, for efficiency reasons, UTF-8 strings are often not
properly encoded: to save space, programs assume a particular language.
In multi-language environments, methods are used, for efficiency, to make
sure the system uses the correct language!

This is not the fault of UTF-8, but of the way it is implemented.

As for the methods you point to, they are for identifying texts of unknown
origin, possibly of unknown encoding, or in an encoding that has not already
identified the language.

On 09.12.2013 at 10:38, Khaled Hosny <khaledhosny at eglug.org> wrote:

> On Mon, Dec 09, 2013 at 09:22:10AM +0100, Keith J. Schultz wrote:
>> Hi Khaled,
>> 
>> your question cannot be serious!
> 
> No, it is.
> 
>> It is pretty much in the standard! 
> 
> No.
> 
>> True enough, for most Western languages (American, English, Spanish,
>> German, Austrian, etc.) this is somewhat difficult. Yet these are not the
>> ones causing the problems.
> 
> You can’t identify the language of a Unicode string just by examining
> the Unicode properties of the characters in that string, simply because
> no such Unicode property exists. Language identification involves
> quite some statistical analysis[1]. You can identify scripts using
> Unicode properties quite reliably, though.
> 
> 1. https://en.wikipedia.org/wiki/Language_identification#Statistical_approaches
> 
> Regards,
> Khaled
[snip, snip]
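
To make the distinction in the quoted message concrete, here is a minimal
Python sketch. It only illustrates what per-character Unicode properties
expose; it is not something XeTeX does. Python's standard unicodedata module
has no accessor for the Script property, so the first word of the character
name is used as a rough stand-in (an assumption of this sketch). The helper
names script_guess and profile are made up for the example.

```python
import unicodedata


def script_guess(ch):
    # Assumption: the first word of the character name ("LATIN", "ARABIC", ...)
    # stands in for the Script property, which unicodedata does not expose.
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"


def profile(text):
    """Return (set of guessed scripts, set of bidi classes) for the letters in text."""
    letters = [ch for ch in text if ch.isalpha()]
    scripts = {script_guess(ch) for ch in letters}
    bidi_classes = {unicodedata.bidirectional(ch) for ch in letters}
    return scripts, bidi_classes


# "book" in four languages; the right-to-left words are written as escapes.
samples = {
    "English": "book",
    "German": "Buch",
    "Arabic": "\u0643\u062a\u0627\u0628",
    "Persian": "\u06a9\u062a\u0627\u0628",
}

for language, word in samples.items():
    scripts, bidi_classes = profile(word)
    print(language, sorted(scripts), sorted(bidi_classes))
```

English and German give the same profile (LATIN, bidi class L), and so do
Arabic and Persian (ARABIC, bidi class AL). The properties are enough to pick
the script and the direction, but they do not separate the languages within
a script.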