<html><head><meta http-equiv="Content-Type" content="text/html charset=windows-1252"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;">Hi Khaled,<div><br></div><div>I would agree with you if the text were not encoded in Unicode!</div><div>A properly encoded UTF-8 string should contain everything you need!</div><div>Unfortunately, for efficiency reasons, UTF-8 strings are not properly</div><div>encoded, and programs assume a particular language to save space.</div><div>In multi-language environments, methods are used, for efficiency, to make</div><div>sure the system uses the correct language!</div><div><br></div><div>It is not the fault of UTF-8, but of the way it is implemented.</div><div><br></div><div>As for the methods you point to, they are for identifying texts of unknown</div><div>origin and possibly of unknown encoding, or with an encoding that has not already identified</div><div>the language. <br><div><div>On 09.12.2013, at 10:38, Khaled Hosny &lt;<a href="mailto:khaledhosny@eglug.org">khaledhosny@eglug.org</a>&gt; wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite"><div style="font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;">On Mon, Dec 09, 2013 at 09:22:10AM +0100, Keith J. Schultz wrote:<br><blockquote type="cite">Hi Khaled,<br><br>your question cannot be serious!<br></blockquote><br>No, it is.<br><br><blockquote type="cite">It is pretty much in the standard!<span class="Apple-converted-space"> </span><br></blockquote><br>No.<br><br><blockquote type="cite">True enough that for most Western languages, American, English, Spanish,<br>German, Austrian, etc., this is somewhat difficult. 
Yet, these are not causing the problems.<br></blockquote><br>You can’t identify the language of a Unicode string just by examining<br>the Unicode properties of the characters in that string, simply because<br>no such Unicode property exists. Language identification involves<br>a fair amount of statistical analysis[1]. You can identify scripts using<br>Unicode properties quite reliably, though.<br><br>1.<span class="Apple-converted-space"> </span><a href="https://en.wikipedia.org/wiki/Language_identification#Statistical_approaches">https://en.wikipedia.org/wiki/Language_identification#Statistical_approaches</a><br><br>Regards,<br>Khaled<br></div></blockquote></div>[snip, snip]</div></body></html>