[XeTeX] Polyglossia: Automatic script detection
z0idberg at gmx.de
Sun Mar 6 17:46:23 CET 2011
I have been thinking about the following (relatively simple) method for
script detection (and thus font selection) in Polyglossia. It would
drastically reduce the need to explicitly define the language for
Basically, a list that maps each script to its Unicode blocks is needed. For
Latin: Basic Latin, Latin-1 Supplement, Spacing Modifier Letters,
Combining Diacritical Marks, General Punctuation etc.
Arabic: Arabic, Arabic Presentation Forms-A, Arabic Supplement etc.
Cyrillic: Basic Latin, Cyrillic, etc.
Japanese: Hiragana, Katakana, CJK Unified Ideographs, CJK Symbols and
I am not quite sure whether XeTeX offers a way to find the code point of a
given character in the text, but if it does, this would not be hard to
implement: just check which block the character is in, and then select
the corresponding script.
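
The block check described above can be sketched outside TeX. The following Python snippet is only an illustration, not Polyglossia or XeTeX code: the names BLOCKS, SCRIPTS, block_of and candidate_scripts are made up for this example, the block ranges are taken from the Unicode standard, and the script table contains only the blocks named in full in the list above.

```python
# Hypothetical lookup table: (first code point, last code point, block name).
# Ranges follow the Unicode standard; only blocks named in the post are listed.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x02B0, 0x02FF, "Spacing Modifier Letters"),
    (0x0300, 0x036F, "Combining Diacritical Marks"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x0600, 0x06FF, "Arabic"),
    (0x0750, 0x077F, "Arabic Supplement"),
    (0x2000, 0x206F, "General Punctuation"),
    (0x3040, 0x309F, "Hiragana"),
    (0x30A0, 0x30FF, "Katakana"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0xFB50, 0xFDFF, "Arabic Presentation Forms-A"),
]

# Script-to-block lists as given in the post (the "etc." entries are left out).
SCRIPTS = {
    "Latin": {"Basic Latin", "Latin-1 Supplement", "Spacing Modifier Letters",
              "Combining Diacritical Marks", "General Punctuation"},
    "Arabic": {"Arabic", "Arabic Presentation Forms-A", "Arabic Supplement"},
    "Cyrillic": {"Basic Latin", "Cyrillic"},
    "Japanese": {"Hiragana", "Katakana", "CJK Unified Ideographs"},
}


def block_of(ch):
    """Return the name of the block containing ch, or None if it is not listed."""
    cp = ord(ch)
    for first, last, name in BLOCKS:
        if first <= cp <= last:
            return name
    return None


def candidate_scripts(ch):
    """Return every script whose block list contains the block of ch."""
    block = block_of(ch)
    return {script for script, blocks in SCRIPTS.items() if block in blocks}
```

With this table, candidate_scripts("М") yields only Cyrillic, while candidate_scripts("(") yields both Latin and Cyrillic, because Basic Latin is listed under both scripts. A character with more than one candidate is exactly the ambiguous case discussed next.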
The problem is that one character may belong to more than one script
and more than one language. If I write an English text and include a
Russian sentence in it, a punctuation mark could belong to either
language. Or imagine something like “The capital of Russia is Moscow
(Москва́).” Should the brackets then belong to English
In this case, Polyglossia could give priority to the main language of
the document. If the main language is English and the second language
is Russian, these brackets would belong to English.
But if the bracket is between two Russian words, it would be a Russian
Here are some examples, with English written in lowercase and
Russian in capital letters (I do not know Russian, so I cannot create
“the capital of russia is moscow (MOSCOW).” Because there is only one
Russian word, all punctuation marks would belong to English.
“a famous russian saying is ‘TO BE, OR NOT TO BE’.” (I know this example
is ... :D) Because the comma is between two Russian words, it would
also be set in a Cyrillic font. The ‘ and ’ would be English,
though, because Latin text comes before the ‘ and Latin text comes after the ’.
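
The neighbour rule from these examples can be written down as a short Python sketch. It is an illustration under simplifying assumptions, not a proposal for the actual implementation: base_script and resolve are hypothetical names, only Latin and Cyrillic are distinguished, every letter outside the Cyrillic block is counted as Latin, and everything that is not a letter is treated as ambiguous. One rule goes beyond the post: a combining accent follows the letter it sits on.

```python
import unicodedata


def base_script(ch):
    """Return the script of an unambiguous character, or None if ambiguous."""
    if 0x0400 <= ord(ch) <= 0x04FF:      # Cyrillic block
        return "Cyrillic"
    if ch.isalpha():                     # simplification: other letters are Latin
        return "Latin"
    return None                          # punctuation, spaces, accents, digits


def resolve(text, main="Latin"):
    """Assign a script to every character of text.

    An ambiguous character takes the script of its nearest unambiguous
    neighbours if both sides agree; otherwise it falls back to the main
    script of the document.
    """
    known = [base_script(ch) for ch in text]
    result = []
    for i, ch in enumerate(text):
        if known[i] is not None:
            result.append(known[i])
            continue
        left = next((s for s in reversed(known[:i]) if s), None)
        right = next((s for s in known[i + 1:] if s), None)
        if unicodedata.combining(ch) and left:
            result.append(left)          # assumption: accent follows its base letter
        elif left is not None and left == right:
            result.append(left)          # enclosed by one script on both sides
        else:
            result.append(main)          # mixed or missing neighbours
    return result
```

Applied to “saying is ‘БЫТЬ, ИЛИ НЕ БЫТЬ’.”, the comma and the spaces inside the quotation come out as Cyrillic, and both quotation marks come out as Latin. Note that the closing mark is Latin here only through the fallback to the main script, since no letter follows it.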
OK, there may be other situations where the selection could be decided
automatically. In some cases, explicit marking would be necessary, but
this would be no different from today. A warning could be issued when
rendering the document to say that a character is ambiguous. For other
languages, like Chinese or Japanese, the punctuation marks differ from
the Latin ones, so not that many rules would be necessary.
In my opinion, this would make it easier to write multiple languages in
one text. In most cases, the only language declaration needed would be
in the preamble. Based on that declaration, Polyglossia could detect the
individual script and language. In cases where this is not possible,
manual declaration would be necessary.