[XeTeX] Polyglossia: Automatic script detection

Avi Wollman avi.wollman at gmail.com
Tue Mar 8 11:03:54 CET 2011

A small note: in RTL languages the paired brackets () {} [] are mirrored. So I
don't know how you would implement this when they come from the "wrong"
Unicode block.
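
To illustrate the point, here is a minimal Python sketch (for illustration
only, not XeTeX or Polyglossia code) using the standard unicodedata module.
The helper name is_mirrored is made up for this example. It shows that the
brackets sit in Basic Latin yet carry the Unicode Bidi_Mirrored property, so
the block alone does not tell you which script they belong to:

```python
import unicodedata


def is_mirrored(ch):
    """Return True if the character has the Bidi_Mirrored property."""
    return unicodedata.mirrored(ch) == 1


# All of these are Basic Latin code points (below U+0080), but each one is
# flagged as mirrored, i.e. its glyph flips in right-to-left text.
for ch in "()[]{}":
    print(ch, hex(ord(ch)), is_mirrored(ch))
```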


On Sun, Mar 6, 2011 at 6:46 PM, Gerrit <z0idberg at gmx.de> wrote:

> Hello,
> I thought about the following (relatively simple) method for script
> detection (and thus font selection) in Polyglossia. It would drastically
> reduce the need to explicitly declare the language for non-Latin scripts.
> Basically, a list mapping scripts to Unicode blocks is needed. For
> example:
> Latin: Basic Latin, Latin-1 Supplement, Spacing Modifier Letters, Combining
> Diacritical Marks, General Punctuation etc.
> Arabic: Arabic, Arabic Presentation Forms-A, Arabic Supplement etc.
> Cyrillic: Basic Latin, Cyrillic, etc.
> Japanese: Hiragana, Katakana, CJK Unified Ideographs, CJK Symbols and
> Punctuation, etc.
> I am not quite sure whether XeTeX offers a way to find the code point of a
> given character in the text, but if it does, this would not be that hard to
> implement: just check which block the character is in, and then select the
> corresponding script.
> The problem now is that one character may belong to more than one
> script and more than one language. If I write an English text and use a
> Russian sentence in it, the punctuation marks could belong to
> either language. Or imagine something like “The capital of Russia is Moscow
> (Москва́).” Should the brackets then belong to English or Russian?
> In this case, Polyglossia could give priority to the main language of the
> text. If the document has English as its main language and Russian as its
> second language, these brackets would belong to English. But if a bracket
> stands between two Russian words, it would be a Russian bracket.
> Following are some examples, with English written in small letters and
> Russian in capital letters (I do not know Russian, so I cannot create real
> examples).
> “the capital of russia is moscow (MOSCOW).” Because there is only one
> Russian word, all punctuation marks would belong to English.
> “a famous russian saying is ‘TO BE, OR NOT TO BE’.” (I know this example is
> ... :D) Because the comma is between two Russian words, it would also be
> set in a Cyrillic font. The ‘ and ’ would be in English, though, because
> the text before ‘ is Latin, and the text after ’ is also Latin.
> OK, there may be other situations where the selection could be decided
> automatically. In some cases explicit marking would still be necessary, but
> that would be no different from today. A warning could be issued when
> typesetting the document that a character is ambiguous. For other languages,
> like Chinese or Japanese, punctuation marks differ from the Latin ones, so
> not that many rules would be necessary.
> In my opinion, this would make it easier to write multiple languages in one
> text. Mostly, the only language declaration needed would be in the
> header. Based on that declaration, Polyglossia could detect the individual
> script and language. In cases where this is not possible, manual declaration
> would be necessary.
> Gerrit
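
The block lookup described in the quoted message could be sketched roughly as
follows. This is Python for illustration only, not XeTeX or Polyglossia code.
The block ranges are a small subset of the Unicode block list, the script
lists follow the ones in the message, and the names BLOCKS, SCRIPT_BLOCKS,
block_of and candidate_scripts are made up for this example.

```python
# Illustrative subset of Unicode blocks: (first, last, block name).
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x02B0, 0x02FF, "Spacing Modifier Letters"),
    (0x0300, 0x036F, "Combining Diacritical Marks"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x0600, 0x06FF, "Arabic"),
    (0x0750, 0x077F, "Arabic Supplement"),
    (0x2000, 0x206F, "General Punctuation"),
    (0x3000, 0x303F, "CJK Symbols and Punctuation"),
    (0x3040, 0x309F, "Hiragana"),
    (0x30A0, 0x30FF, "Katakana"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0xFB50, 0xFDFF, "Arabic Presentation Forms-A"),
]

# Script -> blocks, following the lists in the quoted message (assumption:
# a real table would be much longer).
SCRIPT_BLOCKS = {
    "Latin": {"Basic Latin", "Latin-1 Supplement", "Spacing Modifier Letters",
              "Combining Diacritical Marks", "General Punctuation"},
    "Arabic": {"Arabic", "Arabic Presentation Forms-A", "Arabic Supplement"},
    "Cyrillic": {"Basic Latin", "Cyrillic"},
    "Japanese": {"Hiragana", "Katakana", "CJK Unified Ideographs",
                 "CJK Symbols and Punctuation"},
}


def block_of(ch):
    """Return the name of the block containing ch, or None if not listed."""
    cp = ord(ch)
    for first, last, name in BLOCKS:
        if first <= cp <= last:
            return name
    return None


def candidate_scripts(ch):
    """Return every script whose block list contains the block of ch."""
    block = block_of(ch)
    return {s for s, blocks in SCRIPT_BLOCKS.items() if block in blocks}
```

A Cyrillic letter yields exactly one candidate, while a bracket from Basic
Latin yields both Latin and Cyrillic, which is the ambiguity the rest of the
message deals with.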
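
The neighbour rule from the quoted message (a neutral character takes the
script of the letters around it when both sides agree, otherwise the main
language wins) might look like this. Again a Python sketch for illustration
only, reduced to Latin and Cyrillic; script_of and resolve_scripts are
hypothetical names, and the Russian sample text is only there to exercise the
rule.

```python
def script_of(ch):
    """Return 'cyrillic' or 'latin' for letters, None for neutral characters."""
    if not ch.isalpha():
        return None  # punctuation, spaces, digits, combining marks
    return "cyrillic" if 0x0400 <= ord(ch) <= 0x04FF else "latin"


def resolve_scripts(text, main="latin"):
    """Assign a script to every character of text.

    Letters keep their own script. A neutral character takes the script of
    the nearest letter on each side if both sides agree; otherwise it falls
    back to the main language's script.
    """
    # Script of the nearest letter at or before each position.
    left, last = [], None
    for ch in text:
        last = script_of(ch) or last
        left.append(last)

    # Script of the nearest letter at or after each position.
    right, nxt = [None] * len(text), None
    for i in range(len(text) - 1, -1, -1):
        nxt = script_of(text[i]) or nxt
        right[i] = nxt

    result = []
    for i, ch in enumerate(text):
        script = script_of(ch)
        if script is None:
            agree = left[i] is not None and left[i] == right[i]
            script = left[i] if agree else main
        result.append(script)
    return result


sample = "a saying is ‘БЫТЬ, ИЛИ НЕ БЫТЬ’."
for ch, script in zip(sample, resolve_scripts(sample)):
    if not ch.isalnum() and not ch.isspace():
        print(repr(ch), script)
```

With English as the main language, the comma between the two Russian words
resolves to Cyrillic, while the quotation marks and the final full stop fall
back to Latin, matching the examples in the message. Cases where the two
sides disagree are the ones where a warning about an ambiguous character
could be issued.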

Avi Wollman אבי וולמן