[XeTeX] Polyglossia: Automatic script detection

Sun Mar 6 17:46:23 CET 2011

Hello,

I have thought of the following (relatively simple) method for script 
detection (and thus font selection) in Polyglossia. It would 
drastically reduce the need to explicitly declare the language for 
non-Latin scripts.

Basically, a list that maps each script to its Unicode blocks is needed. 
For example:

Latin: Basic Latin, Latin-1 Supplement, Spacing Modifier Letters, 
Combining Diacritical Marks, General Punctuation etc.
Arabic: Arabic, Arabic Presentation Forms-A, Arabic Supplement etc.
Cyrillic: Basic Latin, Cyrillic, etc.
Japanese: Hiragana, Katakana, CJK Unified Ideographs, CJK Symbols and 
Punctuation, etc.

I am not quite sure whether XeTeX offers a way to find the code point 
of a given character in the text, but if so, this would not be that 
hard to implement: just check which block the character is in, and 
then select the corresponding script.
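
To make the lookup concrete, here is a minimal sketch in Python (not 
XeTeX code). The names BLOCKS, SCRIPTS, block_of and candidate_scripts 
are made up for illustration; the table only covers the blocks named 
above, with ranges taken from the Unicode block list. Because Basic 
Latin is listed under both Latin and Cyrillic, the lookup returns every 
candidate script rather than a single one.

```python
# Hypothetical block table: (first code point, last code point, block name).
# Only the blocks mentioned in the list above are included.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x02B0, 0x02FF, "Spacing Modifier Letters"),
    (0x0300, 0x036F, "Combining Diacritical Marks"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x0600, 0x06FF, "Arabic"),
    (0x0750, 0x077F, "Arabic Supplement"),
    (0x2000, 0x206F, "General Punctuation"),
    (0x3000, 0x303F, "CJK Symbols and Punctuation"),
    (0x3040, 0x309F, "Hiragana"),
    (0x30A0, 0x30FF, "Katakana"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0xFB50, 0xFDFF, "Arabic Presentation Forms-A"),
]

# Script -> blocks, copied from the example list (the "etc." is left out).
SCRIPTS = {
    "Latin": {"Basic Latin", "Latin-1 Supplement", "Spacing Modifier Letters",
              "Combining Diacritical Marks", "General Punctuation"},
    "Arabic": {"Arabic", "Arabic Presentation Forms-A", "Arabic Supplement"},
    "Cyrillic": {"Basic Latin", "Cyrillic"},
    "Japanese": {"Hiragana", "Katakana", "CJK Unified Ideographs",
                 "CJK Symbols and Punctuation"},
}


def block_of(ch):
    """Return the block name of a single character, or None if not in the table."""
    cp = ord(ch)
    for first, last, name in BLOCKS:
        if first <= cp <= last:
            return name
    return None


def candidate_scripts(ch):
    """Return the sorted list of scripts whose block list contains ch's block."""
    block = block_of(ch)
    return sorted(s for s, blocks in SCRIPTS.items() if block in blocks)
```

With this table, candidate_scripts("М") (a Cyrillic letter) yields only 
Cyrillic, while candidate_scripts("(") yields both Cyrillic and Latin, 
which is exactly the ambiguity discussed next.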

We now have the problem that one character may be present in more than 
one script and more than one language. If I write an English text and 
use a Russian sentence in this English text, the punctuation marks could 
belong to either language. Or imagine something like “The capital of 
Russia is Moscow (Москва́).” Should the brackets then belong to English 
or Russian?

In this case, Polyglossia could give priority to the main language of 
the text. If the document has English as its main language and Russian 
as its second language, these brackets would belong to English. But if 
a bracket stands between two Russian words, it would be a Russian 
bracket.

Here are some examples, with English written in lowercase letters and 
Russian in capital letters (I do not know Russian, so I cannot create 
real examples).

“the capital of russia is moscow (MOSCOW).” Because there is only one 
Russian word, all punctuation marks would belong to English.

“a famous russian saying is ‘TO BE, OR NOT TO BE’.” (I know this example 
is ... :D) Because the comma is between two Russian words, it would 
also be set in a Cyrillic font. The ‘ and ’ would belong to English, 
though, because the text before ‘ is Latin, and the text after ’ is 
also Latin.
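
The rule used in these two examples can be sketched as follows, again 
in Python and only as an illustration. The function names and the 
simplified classification (Cyrillic block versus Latin letters up to 
U+024F) are assumptions of this sketch, not anything Polyglossia 
provides. A neutral character takes a script only if the nearest 
letters on both sides agree; otherwise, or at the edge of the text, it 
falls back to the main language.

```python
import unicodedata


def strong_script(ch):
    """Return 'Cyrillic' or 'Latin' for letters, None for neutral characters.

    Simplified assumption: the Cyrillic block counts as Cyrillic, and
    alphabetic characters up to U+024F count as Latin.
    """
    cp = ord(ch)
    if 0x0400 <= cp <= 0x04FF:
        return "Cyrillic"
    if ch.isalpha() and cp <= 0x024F:
        return "Latin"
    return None


def assign_scripts(text, main="Latin"):
    """Return one script name per character of text."""
    scripts = [strong_script(ch) for ch in text]

    # Combining marks (e.g. the stress accent in Москва́) follow their base.
    for i, ch in enumerate(text):
        if i > 0 and unicodedata.combining(ch):
            scripts[i] = scripts[i - 1]

    # Nearest letter script to the left and to the right of each position.
    left, last = [], None
    for s in scripts:
        last = s if s is not None else last
        left.append(last)
    right, last = [None] * len(scripts), None
    for i in range(len(scripts) - 1, -1, -1):
        last = scripts[i] if scripts[i] is not None else last
        right[i] = last

    result = []
    for s, l, r in zip(scripts, left, right):
        if s is not None:
            result.append(s)        # a letter keeps its own script
        elif l is not None and l == r:
            result.append(l)        # neutral between two words of one script
        else:
            result.append(main)     # mixed or missing neighbours: main language
    return result
```

For “The capital of Russia is Moscow (Москва́).” this assigns both 
brackets and the final full stop to Latin. For a quoted Russian phrase 
containing a comma, the comma and the spaces inside the phrase come out 
as Cyrillic, while the surrounding quotation marks come out as Latin.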

OK, there may be other situations where the selection could be decided 
automatically. In some cases, explicit marking would be necessary, but 
this would be no different from today. A warning could be issued when 
rendering the document, saying that a character is ambiguous. For other 
languages, like Chinese or Japanese, the punctuation marks differ from 
the Latin ones, so not that many rules would be necessary.

In my opinion, this would make it easier to write multiple languages in 
one text. In most cases, the languages would only need to be declared 
in the header. Based on that declaration, Polyglossia could detect the 
individual script and language. In cases where this is not possible, 
manual declaration would be necessary.

Gerrit