[XeTeX] Polyglossia: Automatic script detection

mskala at ansuz.sooke.bc.ca
Sun Mar 6 19:33:05 CET 2011


On Sun, 6 Mar 2011, Gerrit wrote:
> > > the text. If the document has main language English and second language
> > What is the "main language written in the text"?
> >
> Sorry, my English... I meant the main language, the text is written in. E.g.
> an English article with some Russian text in it.

What you wrote was acceptable English, but I don't understand how
Polyglossia is supposed to detect the main language.  When there are
several candidate languages that may be in use in the document, which one
is the "main" one?  Is it the first one used?  Or the one that has the
largest number of characters over the entire document (which necessitates
an extra pass through the document)?

It seems like you are saying that, in order to detect the language, first
we must detect the main language.  That's just changing the name of the
problem.

> Similarly, for Chinese words in a Japanese text: The Chinese words will
> then be written in a Japanese font (and Japanese simplified characters, when
> necessary). e.g. ?? (Guangdong) will become ?? in Japanese, even though it
> would be ?? in traditional Chinese.

I haven't read enough Chinese/Japanese mixed documents to know whether
that's commonly done in practice, but I wonder whether users would really
consider it satisfactory.

> I did not mean this method for determining the overall language of the document.
> This is indeed much more complicated. But if we define German as the main
> language (\setdefaultlanguage{german}) of the document, "table of contents"

Okay - so "main" language is declared by the user?

> Mixing text in the same script always poses a problem, but more in the field
> of hyphenation, not so much in that of font changing. I guess we do not want
> to select a different Latin font for French written inside an English
> document? This would not look good.

You might want to select a different Latin font for Polish.

Instead of basing it on language, I'd rather allow the user to specify an
automatic font switch for Unicode ranges: "use this font most of the time,
but use this other font for U+3000 to U+30FF..."  Then if they are mixing
languages that use different code points (such as English/Russian), they
can get behaviour such as you describe; and if they are mixing languages
that share character codes (such as Chinese/Japanese), then they have to
use some other mechanism to mark up which is which, but they would have to
anyway, so nothing has been lost.  Calling it "character code range"
rather than "language" or "script" avoids making the false promise that we
can correctly distinguish between languages or scripts.  I think some
feature similar to this already exists, and expansions of it have
certainly been proposed.
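
As a rough illustration of the character-code-range idea, here is a
minimal sketch in Python (not TeX, and not a description of how XeTeX or
Polyglossia implements anything).  The font names and the Cyrillic range
are made-up placeholders; only the U+3000 to U+30FF range comes from the
example above.

```python
# Hypothetical range table: (first code point, last code point, font name).
# The font names are placeholders, not real fonts.
FONT_RANGES = [
    (0x3000, 0x30FF, "JapaneseFont"),  # the range from the example above
    (0x0400, 0x04FF, "CyrillicFont"),  # assumed extra range (Cyrillic block)
]
DEFAULT_FONT = "MainFont"  # "use this font most of the time"


def font_for(ch):
    """Return the font assigned to one character by its code point."""
    cp = ord(ch)
    for first, last, font in FONT_RANGES:
        if first <= cp <= last:
            return font
    return DEFAULT_FONT


def split_runs(text):
    """Split text into (font, substring) runs of consecutive characters
    that map to the same font."""
    runs = []
    for ch in text:
        font = font_for(ch)
        if runs and runs[-1][0] == font:
            # Same font as the previous character: extend the current run.
            runs[-1] = (font, runs[-1][1] + ch)
        else:
            runs.append((font, ch))
    return runs
```

Characters outside every listed range, including Han characters shared
by Chinese and Japanese, simply fall through to the default font here, so
they would still need explicit markup to say which is which.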
-- 
Matthew Skala
mskala at ansuz.sooke.bc.ca                 People before principles.
http://ansuz.sooke.bc.ca/

