[XeTeX] Polyglossia: Automatic script detection

mskala at ansuz.sooke.bc.ca
Sun Mar 6 18:28:46 CET 2011


On Sun, 6 Mar 2011, Gerrit wrote:
> In this case, Polyglossia could give priority to the main language written in
> the text. If the document has main language English and second language

What is the "main language written in the text"?

> the document, that the character is ambiguous. For other languages, like
> Chinese or Japanese, punctuation marks are different to Latin, so not that
> many rules would be necessary.

Chinese and Japanese are easy to distinguish from English, but not so easy
to distinguish from each other.  A Chinese text will usually contain
characters that couldn't be Japanese, and a Japanese text will almost
always contain characters that couldn't be Chinese, but it's possible to
construct nontrivial text fragments in either of those languages using
only characters common to the two.  Similar issues exist between all pairs
of languages that are written in very similar scripts - such as
English/French, Russian/Ukrainian, and so on.  I don't know if Russian and
Ukrainian might be similar enough that we could get away with lumping them
together, but it may be necessary to distinguish Japanese from Chinese
because of different character forms, English from French to produce the
right punctuation spacing, Romanian from others because of the
cedilla/comma accent issue, Czech and Polish from others because of hacek
and kreska, and almost every language from almost every other to choose
the right translations for words like "Section" and "Figure."
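
As a rough illustration of the Chinese/Japanese case, here is a minimal
Python sketch.  It is not anything Polyglossia or XeTeX actually does, and
the function names scripts_used and guess_cjk_language are hypothetical.
It classifies letters by the prefix of their Unicode character name, which
is only a crude stand-in for the real Unicode Script property.

```python
import unicodedata


def scripts_used(text):
    """Return a set of coarse script labels for the letters in text.

    Assumption: the prefix of the Unicode character name is a good enough
    approximation of the script for this illustration.
    """
    found = set()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, spaces, and punctuation
        name = unicodedata.name(ch, "")
        if name.startswith("CJK UNIFIED IDEOGRAPH"):
            found.add("Han")
        elif name.startswith(("HIRAGANA", "KATAKANA")):
            found.add("Kana")
        elif name.startswith("CYRILLIC"):
            found.add("Cyrillic")
        elif name.startswith("LATIN"):
            found.add("Latin")
        else:
            found.add("Other")
    return found


def guess_cjk_language(text):
    """Guess 'japanese' when kana is present; otherwise admit ambiguity."""
    scripts = scripts_used(text)
    if "Kana" in scripts:
        return "japanese"
    if "Han" in scripts:
        # Han-only text could be Chinese or kana-free Japanese.
        return "ambiguous"
    return None
```

With this sketch, "日本語の文章" is reported as Japanese because of the
kana の, but "東京大学" consists only of Han characters and stays
ambiguous.  Likewise scripts_used("привет") and scripts_used("привіт")
both return just {"Cyrillic"}, so a script-level check cannot separate
Russian from Ukrainian at all.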

It seems to me that this kind of auto-detection based on character usage
can only ever *sometimes* work.  Smarter ways of guessing language based
on bigger units than individual characters (for instance, looking for the
presence or absence of common words) do exist, but those will break on
some texts too.  Perhaps you only intended to distinguish general script
families (like Latin from Cyrillic), not languages (like English from
Russian), but I think Polyglossia needs to distinguish languages, even
when it's limited to font selection only.  I don't really object to
autodetection as long as it's only a deprecated default to make things
easier for users who don't know any better, but anybody who actually cares
what language they are writing *must* specify it manually or the system
will, inevitably, make a wrong guess eventually.
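
For what it's worth, the common-word approach can be sketched in a few
lines of Python.  This is only an illustration under stated assumptions:
the word lists are tiny and hand-picked, and COMMON_WORDS and
guess_language are hypothetical names, not part of any real package.

```python
import re

# Assumption: a handful of very frequent words per language.  A real
# detector would need far larger lists or statistical models.
COMMON_WORDS = {
    "english": {"the", "and", "of", "is", "in"},
    "french": {"le", "la", "et", "les", "des", "est"},
}


def guess_language(text):
    """Return the language whose common words occur most often, or None.

    None means there is no evidence at all, or the top score is a tie.
    """
    words = re.findall(r"\w+", text.lower())
    scores = {
        lang: sum(word in vocab for word in words)
        for lang, vocab in COMMON_WORDS.items()
    }
    best = max(scores.values())
    winners = [lang for lang, score in scores.items() if score == best]
    if best == 0 or len(winners) > 1:
        return None
    return winners[0]
```

This gets "The cat is in the garden" and "Le chat est dans le jardin"
right, but it returns None for a heading like "Section 2: Introduction",
which reads the same in English and French.  Short fragments such as
headings and captions are exactly where the guess has nothing to work
with.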
-- 
Matthew Skala
mskala at ansuz.sooke.bc.ca                 People before principles.
http://ansuz.sooke.bc.ca/

