[XeTeX] Polyglossia: Automatic script detection

Sun Mar 6 19:07:25 CET 2011

On 06.03.2011 18:28, mskala at ansuz.sooke.bc.ca wrote:
> On Sun, 6 Mar 2011, Gerrit wrote:
>> In this case, Polyglossia could give priority to the main language written in
>> the text. If the document has main language English and second language
> What is the "main language written in the text"?
>
Sorry, my English... I meant the main language in which the text is
written, e.g. an English article with some Russian text in it.

>> the document, that the character is ambiguous. For other languages, like
>> Chinese or Japanese, punctuation marks are different from Latin, so not that
>> many rules would be necessary.
> Chinese and Japanese are easy to distinguish from English, but not so easy
> to distinguish from each other.  A Chinese text will usually contain
> characters that couldn't be Japanese, and a Japanese text will almost
> always contain characters that couldn't be Chinese, but it's possible to
> construct nontrivial text fragments in either of those languages using
> only characters common to the two.  Similar issues exist between all pairs
> of languages that are written in very similar scripts - such as
> English/French, Russian/Ukrainian, and so on.  I don't know if Russian and
> Ukrainian might be similar enough we could get away with lumping them
> together, but it may be necessary to distinguish Japanese from Chinese
> because of different character forms, English from French to produce the
> right punctuation spacing, Romanian from others because of the
> cedilla/comma accent issue, Czech and Polish from others because of hacek
> and kreska, and almost every language from almost every other to choose
> the right translations for words like "Section" and "Figure."
Yes, this is what I meant. I primarily thought about this for font 
changing reasons. If we have English and French mixed, the font does not 
need to be changed. Hyphenation is different, though.
Similarly, for Chinese words in a Japanese text: the Chinese words
will then be written in a Japanese font (and Japanese simplified
characters, when necessary). E.g. 广东 (Guangdong) will become 広東 in
Japanese, even though it would be 廣東 in traditional Chinese.
In this case, for font reasons, the language is not that important, but 
rather the script.

I did not mean this method for determining the overall language of the
document. That is indeed much more complicated. But if we define German
as the main language (\setdefaultlanguage{german}) of the document,
"table of contents" will automatically become "Inhaltsverzeichnis". If
we then define Japanese as a secondary
language (\setotherlanguages{japanese}), we can write
"Tokio (東京) ist die Hauptstadt Japans", instead of having to write
"Tokio (\textjapanese{東京}) ist die Hauptstadt Japans."

Of course, we then have a problem if we want to write
"Tokio (東京) ist die Hauptstadt Japans und Peking (北京) die Chinas",
because Polyglossia then does not know if ?? is Japanese or Chinese (of 
course we need to define Japanese and Chinese as other languages). 
Polyglossia could use the order of the other languages (first Japanese, 
then Chinese), so that we only have to write:
"Tokio (東京) ist die Hauptstadt Japans und Peking (\textchinese{北京}) die
Chinas."
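
The priority rule described above can be sketched roughly as follows. This is only an illustration in Python, not Polyglossia code: the table SCRIPTS_OF and the function resolve_language are hypothetical names, and the script assignments in the table are simplified assumptions.

```python
# Hypothetical table: the scripts each declared language may be written in.
# The entries are simplified assumptions made for this illustration.
SCRIPTS_OF = {
    "german": {"Latin"},
    "english": {"Latin"},
    "french": {"Latin"},
    "russian": {"Cyrillic"},
    "japanese": {"Han", "Hiragana", "Katakana"},
    "chinese": {"Han"},
}


def resolve_language(script, main_language, other_languages):
    """Return the first declared language that uses `script`.

    The main language is tried first, then the other languages in the
    order in which they were declared. An ambiguous script such as Han
    therefore goes to whichever language was listed earlier.
    """
    for language in [main_language, *other_languages]:
        if script in SCRIPTS_OF.get(language, set()):
            return language
    return None  # no declared language uses this script


# German main language, Japanese declared before Chinese:
print(resolve_language("Han", "german", ["japanese", "chinese"]))
```

With Japanese listed first, Han characters resolve to "japanese", so only the Chinese word would still need an explicit \textchinese{...}. Swapping the declaration order swaps the default.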

Mixing text in the same script always poses a problem, but more in the
field of hyphenation, not so much in that of font changing. I guess we
do not want to select a different Latin font for French written inside
an English document? This would not look good.

> It seems to me that this kind of auto-detection based on character usage
> can only ever *sometimes* work.  Smarter ways of guessing language based
> on bigger units than individual characters (for instance, looking for the
> presence or absence of common words) do exist, but those will break on
> some texts too.  Perhaps you only intended to distinguish general script
> families (like Latin from Cyrillic), not languages (like English from
> Russian), but I think Polyglossia needs to distinguish languages, even
> when it's limited to font selection only.  I don't really object to
> autodetection as long as it's only a deprecated default to make things
> easier for users who don't know any better, but anybody who actually cares
> what language they are writing *must* specify it manually or the system
> will, inevitably, make a wrong guess eventually.

Well, of course, contextual recognition is much better, especially
for languages using the same script. But this is really hard to
implement, and is even more error-prone. Is "also" English or German? Is
中国 the Chinese word for China (Zhōngguó), or is it rather the Chūgoku
region in Japan? My goal was to have this autodetection for texts in
which occasionally one or two words in another language appear, not
entire sentences. If we have a complete, longer quote, maybe an entire
paragraph, it should still be no problem to write \begin{french} ...
\end{french}.

Autodetection is really hard, and I think it is too fuzzy for something
like LaTeX, which to some degree tries not to be that fuzzy. If we had
autodetection based on words or sentences, we would need a large
dictionary, which would also have to be kept up to date. This is quite a
task. Therefore, I thought that implementing this relatively easy
script autodetection would be more realistic for the time being.
The problem arises if we have French in an English text, or Japanese
*and* Chinese in an English text, but if we only have Japanese or Arabic
in an English text, the implementation may be relatively easy. If I
write an English text and insert ?? into it, the system will have
no chance to make a wrong guess (provided I specified English as the
main language, and *only* Chinese as a second language).
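
To make the simple case concrete, here is a rough Python sketch of script-based tagging with a single secondary language. It is not how Polyglossia or XeTeX works internally. The helper names script_of and tag_runs are made up for this illustration, and guessing the script from the Unicode character name is an assumed shortcut (Python's standard library has no direct script property).

```python
import unicodedata
from itertools import groupby

# Assumed mapping from Unicode character-name prefixes to script labels.
NAME_PREFIX_TO_SCRIPT = {
    "CJK UNIFIED IDEOGRAPH": "Han",
    "HIRAGANA": "Hiragana",
    "KATAKANA": "Katakana",
    "CYRILLIC": "Cyrillic",
    "ARABIC": "Arabic",
    "LATIN": "Latin",
}


def script_of(char):
    """Guess the script of one character from its Unicode name."""
    name = unicodedata.name(char, "")
    for prefix, script in NAME_PREFIX_TO_SCRIPT.items():
        if name.startswith(prefix):
            return script
    return None  # spaces, digits, punctuation and anything unlisted


def tag_runs(text, script_to_language):
    """Wrap each run of a mapped script in \\text<language>{...}.

    Runs whose script is not in `script_to_language` (here: the main
    language's script and neutral characters) are copied unchanged.
    """
    pieces = []
    for script, run in groupby(text, key=script_of):
        run = "".join(run)
        language = script_to_language.get(script)
        if language is None:
            pieces.append(run)
        else:
            pieces.append("\\text%s{%s}" % (language, run))
    return "".join(pieces)


# English main text, Chinese as the only secondary language:
print(tag_runs("China (中国) is large.", {"Han": "chinese"}))
```

Because Han is mapped to exactly one language here, there is nothing to guess. The ambiguity only returns once two declared languages share a script.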

Gerrit
