<div dir="ltr"><font face="comic sans ms,sans-serif">A small note: in RTL languages the paired brackets () {} [] are mirrored, so I don't know how you would handle them if they come from the "wrong" Unicode block.</font><div><font face="comic sans ms,sans-serif"><br>
</font></div><div><font face="comic sans ms,sans-serif">Avi<br></font><br><div class="gmail_quote">On Sun, Mar 6, 2011 at 6:46 PM, Gerrit <span dir="ltr">&lt;<a href="mailto:z0idberg@gmx.de">z0idberg@gmx.de</a>&gt;</span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">Hello,<br>
<br>
I have been thinking about the following (relatively simple) method for script detection (and thus font selection) in Polyglossia. It would drastically reduce the need to declare the language explicitly for non-Latin scripts.<br>
<br>
Basically, a list that maps each script to its Unicode blocks is needed. For example:<br>
<br>
Latin: Basic Latin, Latin-1 Supplement, Spacing Modifier Letters, Combining Diacritical Marks, General Punctuation etc.<br>
Arabic: Arabic, Arabic Presentation Forms-A, Arabic Supplement etc.<br>
Cyrillic: Basic Latin, Cyrillic, etc.<br>
Japanese: Hiragana, Katakana, CJK Unified Ideographs, CJK Symbols and Punctuation, etc.<br>
<br>
I am not quite sure whether XeTeX offers a way to find the code point of a given character in the text, but if it does, this would not be hard to implement: just check which block the character is in, and then select the corresponding script.<br>
<br>
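The block lookup described above could be sketched roughly as follows. This is only an illustration in Python, not XeTeX or Polyglossia code, and it does not answer whether XeTeX exposes the code point. The names BLOCKS, SCRIPTS, block_of and candidate_scripts are hypothetical, the block table is deliberately incomplete, and the script lists simply repeat the examples given above.<br>
<pre>
```python
# Hypothetical, incomplete block table: (first code point, last code point, name).
# The ranges are the standard Unicode block boundaries.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x02B0, 0x02FF, "Spacing Modifier Letters"),
    (0x0300, 0x036F, "Combining Diacritical Marks"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x0600, 0x06FF, "Arabic"),
    (0x0750, 0x077F, "Arabic Supplement"),
    (0x2000, 0x206F, "General Punctuation"),
    (0x3000, 0x303F, "CJK Symbols and Punctuation"),
    (0x3040, 0x309F, "Hiragana"),
    (0x30A0, 0x30FF, "Katakana"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0xFB50, 0xFDFF, "Arabic Presentation Forms-A"),
]

# Script-to-block lists, copied from the examples in the message (assumption:
# a real table would be much longer).
SCRIPTS = {
    "Latin": {"Basic Latin", "Latin-1 Supplement", "Spacing Modifier Letters",
              "Combining Diacritical Marks", "General Punctuation"},
    "Arabic": {"Arabic", "Arabic Supplement", "Arabic Presentation Forms-A"},
    "Cyrillic": {"Basic Latin", "Cyrillic"},
    "Japanese": {"Hiragana", "Katakana", "CJK Unified Ideographs",
                 "CJK Symbols and Punctuation"},
}


def block_of(ch):
    """Return the name of the block containing ch, or None if it is not listed."""
    cp = ord(ch)
    for first, last, name in BLOCKS:
        if cp in range(first, last + 1):
            return name
    return None


def candidate_scripts(ch):
    """Return, sorted by name, every script whose block list contains ch's block."""
    block = block_of(ch)
    return sorted(s for s, blocks in SCRIPTS.items() if block in blocks)
```
</pre>
With this table, candidate_scripts("М") gives ['Cyrillic'], while candidate_scripts("(") gives ['Cyrillic', 'Latin'], which is exactly the ambiguity discussed next.<br>
<br>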
The problem now is that a character may belong to more than one script and more than one language. If I write an English text and use a Russian sentence in it, a punctuation mark could belong to either language. Or imagine something like “The capital of Russia is Moscow (Москва́).” Should the brackets then belong to English or Russian?<br>
<br>
In this case, Polyglossia could give priority to the main language of the text. If the document's main language is English and its second language is Russian, these brackets would belong to English. But if a bracket is between two Russian words, it would be a Russian bracket.<br>
<br>
Here are some examples, with English written in lowercase letters and Russian in capital letters (I do not know Russian, so I cannot create real examples):<br>
<br>
“the capital of russia is moscow (MOSCOW).” Because there is only one Russian word, all punctuation marks would belong to English.<br>
<br>
“a famous russian saying is ‘TO BE, OR NOT TO BE’.” (I know this example is ... :D) Because the comma is between two Russian words, it would also be set in a Cyrillic font. The ‘ and ’ would be in English, though, because what precedes ‘ is Latin, and what follows ’ is also Latin.<br>
<br>
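The neighbour rule from these examples could be sketched like this, again in Python and purely as an illustration. It assumes that "between two Russian words" means the nearest letter on each side is Cyrillic, handles only Latin and Cyrillic, and uses the hypothetical helper names letter_script and resolve.<br>
<pre>
```python
def letter_script(ch):
    """Return the script of an unambiguous letter, or None for shared characters."""
    if not ch.isalpha():
        return None  # punctuation, spaces, digits and marks are shared
    cp = ord(ch)
    if cp in range(0x0400, 0x0500):  # Cyrillic block
        return "Cyrillic"
    if cp in range(0x0000, 0x0100):  # Basic Latin and Latin-1 Supplement
        return "Latin"
    return None


def resolve(text, main="Latin"):
    """Assign a script to every character of text.

    A shared character takes the script of its neighbours only when the
    nearest letter on the left and the nearest letter on the right agree;
    in every other case it falls back to the main language's script.
    """
    known = [letter_script(ch) for ch in text]
    resolved = []
    for i, script in enumerate(known):
        if script is None:
            left = next((s for s in reversed(known[:i]) if s), None)
            right = next((s for s in known[i + 1:] if s), None)
            script = left if left == right and left is not None else main
        resolved.append(script)
    return resolved
```
</pre>
For "moscow (Москва)." both brackets resolve to Latin, because no Cyrillic letter follows the closing bracket. In "saying: ‘БЫТЬ, ИЛИ НЕ БЫТЬ’." the comma resolves to Cyrillic and the two quotation marks to Latin, matching the examples above.<br>
<br>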
OK, there may be other situations where the selection can be decided automatically. In some cases explicit markup would still be necessary, but that is no different from today. Polyglossia could issue a warning when the document is compiled, stating that a character is ambiguous. For other languages, such as Chinese or Japanese, the punctuation marks differ from the Latin ones, so not many rules would be needed.<br>
<br>
In my opinion, this would make it easier to write several languages in one text. In most cases, the only language declaration needed would be in the preamble. Based on that declaration, Polyglossia could detect the individual script and language. In cases where this is not possible, a manual declaration would be necessary.<br>
<br>
Gerrit<br>
<br>
<br>
</blockquote></div><br><br clear="all"><br>-- <br>--------------------------------------------------------<br>Avi Wollman אבי וולמן<br><a href="http://www.google.com/profiles/avi.wollman">http://www.google.com/profiles/avi.wollman</a><br>
--------------------------------------------------------<br><br>
</div></div>