# [XeTeX] Polyglossia: Automatic script detection

Gerrit z0idberg at gmx.de
Sun Mar 6 19:07:25 CET 2011

On 06.03.2011 18:28, mskala at ansuz.sooke.bc.ca wrote:
> On Sun, 6 Mar 2011, Gerrit wrote:
>> In this case, Polyglossia could give priority to the main language written in
>> the text. If the document has main language English and second language
> What is the "main language written in the text"?
>
Sorry, my English... I meant the main language the text is written in,
e.g. an English article with some Russian text in it.

>> the document, that the character is ambiguous. For other languages, like
>> Chinese or Japanese, punctuation marks are different from Latin, so not that
>> many rules would be necessary.
> Chinese and Japanese are easy to distinguish from English, but not so easy
> to distinguish from each other.  A Chinese text will usually contain
> characters that couldn't be Japanese, and a Japanese text will almost
> always contain characters that couldn't be Chinese, but it's possible to
> construct nontrivial text fragments in either of those languages using
> only characters common to the two.  Similar issues exist between all pairs
> of languages that are written in very similar scripts - such as
> English/French, Russian/Ukrainian, and so on.  I don't know if Russian and
> Ukrainian might be similar enough we could get away with lumping them
> together, but it may be necessary to distinguish Japanese from Chinese
> because of different character forms, English from French to produce the
> right punctuation spacing, Romanian from others because of the
> cedilla/comma accent issue, Czech and Polish from others because of hacek
> and kreska, and almost every language from almost every other to choose
> the right translations for words like "Section" and "Figure."
I was mainly thinking of font-changing reasons. If we have English and
French mixed, the font does not need to be changed. Hyphenation is
different, though.
Similarly, for Chinese words in a Japanese text: the Chinese words
will then be written in a Japanese font (and with Japanese simplified
characters, where necessary). E.g. 广东 (Guangdong) will become 広東 in
Japanese, even though it would be 廣東 in traditional Chinese.
In this case, for font reasons, it is not so much the language that
matters, but rather the script.

I did not mean this method for determining the overall language of the
document. That is indeed much more complicated. But if we define German
as the main language of the document (\setdefaultlanguage{german})
and then define Japanese as a secondary
language (\setotherlanguages{japanese}), we can write
"Tokio (\textjapanese{東京}) ist die Hauptstadt Japans."
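For reference, the setup described above can be sketched as a complete document; the font name here is a placeholder, and any installed CJK font would do:

```latex
% minimal sketch of the German/Japanese setup; "Noto Sans CJK JP" is an
% arbitrary example font, substitute any CJK font available on your system
\documentclass{article}
\usepackage{fontspec}
\usepackage{polyglossia}
\setdefaultlanguage{german}
\setotherlanguage{japanese}
% polyglossia picks up the font via the \japanesefont family
\newfontfamily\japanesefont{Noto Sans CJK JP}
\begin{document}
Tokio (\textjapanese{東京}) ist die Hauptstadt Japans.
\end{document}
```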

Of course, we then have a problem if we want to write
"Tokio (東京) ist die Hauptstadt Japans und Peking (北京) die Chinas",
because Polyglossia then does not know whether 北京 is Japanese or Chinese
(of course we need to define both Japanese and Chinese as other
languages). Polyglossia could use the order of the other languages (first
Japanese, then Chinese), so that we only have to write:
"Tokio (東京) ist die Hauptstadt Japans und Peking (\textchinese{北京}) die
Chinas."
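Under the proposed convention (which is not current Polyglossia behaviour, just the idea being suggested), the declaration order would resolve the ambiguity, and only the exceptions would need explicit markup:

```latex
% sketch of the proposal: ambiguous CJK defaults to the first-listed
% other language; explicit commands override it where needed
\setotherlanguages{japanese,chinese}
% ...
Tokio (東京) ist die Hauptstadt Japans  % ambiguous -> japanese (listed first)
und Peking (\textchinese{北京}) die Chinas. % explicit override for Chinese
```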

Mixing text in the same script always poses a problem, but more in the
field of hyphenation than in that of font changing. I guess we do not
want to select a different Latin font for French written inside an
English document? That would not look good.

> It seems to me that this kind of auto-detection based on character usage
> can only ever *sometimes* work.  Smarter ways of guessing language based
> on bigger units than individual characters (for instance, looking for the
> presence or absence of common words) do exist, but those will break on
> some texts too.  Perhaps you only intended to distinguish general script
> families (like Latin from Cyrillic), not languages (like English from
> Russian), but I think Polyglossia needs to distinguish languages, even
> when it's limited to font selection only.  I don't really object to
> autodetection as long as it's only a deprecated default to make things
> easier for users who don't know any better, but anybody who actually cares
> what language they are writing *must* specify it manually or the system
> will, inevitably, make a wrong guess eventually.

Well, of course, contextual recognition is much better, especially
for languages using the same script. But it is really hard to
implement, and even more error-prone. Is "also" English or German? Is
中国 the Chinese word for China (Zhōngguó), or is it rather the Chūgoku
region in Japan? My goal was to have this autodetection for texts in
which occasionally one or two words in another language appear, not
entire sentences. If we have a complete, longer quote, maybe an entire
paragraph, it should still be no problem to write \begin{french} ...
\end{french}.
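For such longer passages the explicit environment remains the natural choice, since it also activates the right hyphenation patterns, not just the font:

```latex
% a longer quotation marked up explicitly; French hyphenation and
% punctuation spacing apply inside the environment
\begin{french}
Une citation plus longue, peut-être un paragraphe entier.
\end{french}
```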

Autodetection is really hard, and I think it is too fuzzy for something
like LaTeX, which to some degree tries not to be that fuzzy. If we had
autodetection based on words or sentences, we would need a large
dictionary, which would also have to be kept up to date. That is quite
some task. Therefore, I thought that implementing this relatively easy
script autodetection would simply be more realistic for the time being.
The problem arises if we have French in an English text, or Japanese
/and/ Chinese in an English text; but if we only have Japanese or Arabic
in an English text, the implementation may be relatively easy. If I
write an English text and insert 中国 into it, the system has no chance
of making a wrong guess (provided I specified English as the main
language, and /only/ Chinese as a second language).
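For what it's worth, XeTeX's interchar mechanism could serve as the basis for such script-based switching. The following is only a rough sketch, not a working implementation: the class number 4 and the restriction to the CJK Unified Ideographs block are arbitrary choices, and a real solution would have to cover more blocks and handle group nesting carefully (I believe the ucharclasses package builds on exactly this mechanism):

```latex
% rough sketch: give CJK ideographs their own interchar class, then let
% XeTeX insert the language switch at the script boundary automatically
\XeTeXinterchartokenstate=1
% assign class 4 (an otherwise unused class) to U+4E00..U+9FFF
\count255="4E00
\loop
  \XeTeXcharclass\count255=4
  \advance\count255 by 1
\ifnum\count255<"A000\repeat
\XeTeXinterchartoks 0 4 = {\begin{chinese}} % Latin -> CJK: enter Chinese
\XeTeXinterchartoks 4 0 = {\end{chinese}}   % CJK -> Latin: leave it again
```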

Gerrit
