[XeTeX] Polyglossia: Automatic script detection

Tobias Schoel liesdiedatei at googlemail.com
Sun Mar 6 22:13:46 CET 2011

It seems to me that the approach is flawed. The language a text is
written in is an important structural aspect of the text and thus
should not be left to a machine to guess. The machine should do the
formal work that follows from the structural decisions made by the user.
Thus, the user should say "this text is in that language", just as
polyglossia lets them do with the \text<lang> commands.

The formal things the computer should then do:
- choose the font (a parameter the user should be able to change)
- change language-specific behaviour
- some internal stuff.

A sensible default for the language shouldn't come from the package, but
from some system locale (not necessarily the OS locale, but maybe a
xe(la)tex-specific locale or a locale passed as a parameter to xe(la)tex).
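
A rough sketch of what this explicit markup looks like with polyglossia.
The font name is only an assumption about what is installed; any font
covering both Latin and Cyrillic would do.

```latex
% Compile with xelatex. The font name is a placeholder, not a recommendation.
\documentclass{article}
\usepackage{fontspec}
\usepackage{polyglossia}

% Structural decisions made by the user: main and secondary language.
\setdefaultlanguage{english}
\setotherlanguage{russian}

% Font choice is a parameter the user can change per language.
\setmainfont{Linux Libertine O}
\newfontfamily\russianfont[Script=Cyrillic]{Linux Libertine O}

\begin{document}
An English article with \textrussian{русский текст} in it.
\end{document}
```

Here the user only declares which text is in which language; the font
switch and the language-specific behaviour (hyphenation and so on) are
left to the package.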



On 06.03.2011 19:33, mskala at ansuz.sooke.bc.ca wrote:
> On Sun, 6 Mar 2011, Gerrit wrote:
>>>> the text. If the document has main language English and second language
>>> What is the "main language written in the text"?
>> Sorry, my English... I meant the main language, the text is written in. E.g.
>> an English article with some Russian text in it.
> What you wrote was acceptable English, but I don't understand how
> Polyglossia is supposed to detect the main language.  When there are
> several candidate languages that may be in use in the document, which one
> is the "main" one?  Is it the first one used?  Or the one that has the
> largest number of characters over the entire document (which necessitates
> an extra pass through the document)?
> It seems like you are saying, in order to detect the language, first we
> must detect the main language.  That's just changing the name of the
> problem.
>> Similarly, for Chinese words in a Japanese text: The Chinese words will
>> then be written in a Japanese font (and Japanese simplified characters, when
>> necessary). e.g. 广东 (Guangdong) will become 広東 in Japanese, even though it
>> would be 廣東 in traditional Chinese.
> I haven't read enough Chinese/Japanese mixed documents to know whether
> that's commonly done in practice, but I wonder whether users would really
> consider it satisfactory.
>> I did not mean this method for determining the overall language of the document.
>> This is indeed much more complicated. But if we define German as the main
>> language (\setdefaultlanguage{german}) of the document, "table of contents"
> Okay - so "main" language is declared by the user?
>> Mixing text in the same script always poses a problem, but more in the field
>> of hyphenation, not so much in that of font changing. I guess we do not want
>> to select a different Latin font for French written inside an English
>> document? This would not look good.
> You might want to select a different Latin font for Polish.
> Instead of basing it on language, I'd rather allow the user to specify an
> automatic font switch for Unicode ranges: "use this font most of the time,
> but use this other font for U+3000 to U+30FF..."  Then if they are mixing
> languages that use different code points (such as English/Russian), they
> can get behaviour such as you describe; and if they are mixing languages
> that share character codes (such as Chinese/Japanese), then they have to
> use some other mechanism to mark up which is which, but they would have to
> anyway, so nothing has been lost.  Calling it "character code range"
> rather than "language" or "script" avoids making the false promise that we
> can correctly distinguish between languages or scripts.  I think some
> feature similar to this already exists, and expansions of it have
> certainly been proposed.
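
The range-based switch described above can be approximated with XeTeX's
inter-character token mechanism. This is only a minimal sketch under some
assumptions: the font names are placeholders, \kanafont, \kanaclass and
\rangecp are names made up for this example, and the number of the
boundary class depends on the XeTeX version.

```latex
% Compile with xelatex. Font names are placeholders.
\documentclass{article}
\usepackage{fontspec}
\setmainfont{Latin Modern Roman}    % "use this font most of the time"
\newfontfamily\kanafont{IPAMincho}  % "this other font for U+3000 to U+30FF"

\XeTeXinterchartokenstate=1         % switch inter-character tokens on
\newXeTeXintercharclass\kanaclass   % allocate a fresh character class

% Put every code point from U+3000 to U+30FF into the new class.
\newcount\rangecp
\rangecp="3000
\loop
  \XeTeXcharclass\rangecp=\kanaclass
  \advance\rangecp by 1
\ifnum\rangecp<"3100
\repeat

% Class 0 holds ordinary characters. The boundary class (spaces, paragraph
% start and end) is assumed to be 4095 here; older XeTeX releases used 255.
\XeTeXinterchartoks 0 \kanaclass = {\kanafont}
\XeTeXinterchartoks 4095 \kanaclass = {\kanafont}
\XeTeXinterchartoks \kanaclass 0 = {\normalfont}
\XeTeXinterchartoks \kanaclass 4095 = {\normalfont}

\begin{document}
English text with カタカナ and ひらがな mixed in.
\end{document}
```

The switch is keyed on code points only, so it says nothing about
language: hyphenation and other language-specific behaviour still need
explicit markup. Reassigning the class of these characters may also
override defaults that the format sets up for CJK line breaking, and
\normalfont resets any font change that was active before the switch.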
