[XeTeX] xetex and the unicode bidirectional algorithm.

Khaled Hosny khaledhosny at eglug.org
Mon Dec 9 14:58:46 CET 2013


On Mon, Dec 09, 2013 at 01:28:46PM +0100, Keith J. Schultz wrote:
> Hi Khaled,
> 
> I would agree with you if the text was not encoded in unicode!
> A properly encoded utf-8 string should contain everything you need!

No, it doesn’t. Otherwise, please prove me wrong and tell me how you can
programmatically identify the language of this paragraph using Unicode
properties.
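
To make the point concrete, here is a rough Python sketch (not anything
XeTeX itself does). The function name rough_scripts and the sample strings
are made up for illustration. Python's standard unicodedata module does not
expose the Script property, so the sketch assumes that the first word of
each character name (LATIN, ARABIC, ...) is a good enough stand-in for it.

```python
import unicodedata
from collections import Counter


def rough_scripts(text):
    """Count the letters of `text` by a rough script label.

    Assumption: the first word of the Unicode character name (LATIN,
    ARABIC, GREEK, ...) approximates the Script property, which the
    standard library does not provide directly.
    """
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts


# Hypothetical sample strings, chosen only for illustration.
english = "The quick brown fox"
german = "Der schnelle braune Fuchs"
arabic = "مرحبا بالعالم"

for sample in (english, german, arabic):
    print(sample, dict(rough_scripts(sample)))
```

The English and German samples both come out as purely LATIN, so the
character data alone says nothing about which language they are in, while
the Arabic sample is told apart by script without difficulty.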

> Unfortunately, for efficiency reasons, utf-8 strings are not properly
> encoded and programs assume a particular language, to save space.
> In multi-language environments methods are used for efficiency to make
> sure the system uses the correct language! 
>
> It is not the fault of utf-8, but the way it is implemented.  

Encodings have nothing to do with language identification; you can always
convert text to Unicode prior to processing it.

> As far as the methods you point to, they are for identifying texts of unknown
> origin and possibly of unknown encoding, or an encoding that has not already
> identified the language.

If the language of the text is already known (i.e. properly tagged
text), we don’t need to identify it.

> On 09.12.2013 at 10:38, Khaled Hosny <khaledhosny at eglug.org> wrote:
> 
> > On Mon, Dec 09, 2013 at 09:22:10AM +0100, Keith J. Schultz wrote:
> >> Hi Khaled,
> >> 
> >> your question can not be serious!
> > 
> > No, it is.
> > 
> >> It is pretty much in the standard! 
> > 
> > No.
> > 
> >> True enough that for most Western languages (American, English, Spanish,
> >> German, Austrian, etc.) this is somewhat difficult. Yet these are not causing the problems.
> > 
> > You can’t identify the language of a Unicode string just by examining
> > the Unicode properties for the characters in that string, simply because
> > no such Unicode property exists. Language identification involves
> > quite some statistical analysis[1]. You can identify scripts using
> > Unicode properties quite reliably, though.
> > 
> > 1. https://en.wikipedia.org/wiki/Language_identification#Statistical_approaches
> > 
> > Regards,
> > Khaled
> [snip, snip]



