<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
<title></title>
</head>
<body bgcolor="#ffffff" text="#000000">
On 06.03.2011 18:28, <a class="moz-txt-link-abbreviated" href="mailto:mskala@ansuz.sooke.bc.ca">mskala@ansuz.sooke.bc.ca</a> wrote:
<blockquote
cite="mid:alpine.LNX.2.00.1103061049050.388@tetsu.ansuz.sooke.bc.ca"
type="cite">
<pre wrap="">On Sun, 6 Mar 2011, Gerrit wrote:
</pre>
<blockquote type="cite">
<pre wrap="">In this case, Polyglossia could give priority to the main language written in
the text. If the document has main language English and second language
</pre>
</blockquote>
<pre wrap="">
What is the "main language written in the text"?
</pre>
</blockquote>
Sorry, my English... I meant the main language in which the text is
written, e.g. an English article with some Russian text in it.<br>
<br>
<blockquote
cite="mid:alpine.LNX.2.00.1103061049050.388@tetsu.ansuz.sooke.bc.ca"
type="cite">
<blockquote type="cite">
<pre wrap="">the document, that the character is ambigunous. For other languages, like
Chinese or Japanese, punctuation marks are different to latin, so not that
many rules would be necessary.
</pre>
</blockquote>
<pre wrap="">
Chinese and Japanese are easy to distinguish from English, but not so easy
to distinguish from each other. A Chinese text will usually contain
characters that couldn't be Japanese, and a Japanese text will almost
always contain characters that couldn't be Chinese, but it's possible to
construct nontrivial text fragments in either of those languages using
only characters common to the two. Similar issues exist between all pairs
of languages that are written in very similar scripts - such as
English/French, Russian/Ukrainian, and so on.  I don't know if Russian and
Ukrainian might be similar enough we could get away with lumping them
together, but it may be necessary to distinguish Japanese from Chinese
because of different character forms, English from French to produce the
right punctuation spacing, Romanian from others because of the
cedilla/comma accent issue, Czech and Polish from others because of hacek
and kreska, and almost every language from almost every other to choose
the right translations for words like "Section" and "Figure."
</pre>
</blockquote>
Yes, this is what I meant. I primarily thought about this for font
changing reasons. If we have English and French mixed, the font does
not need to be changed. Hyphenation is different, though. <br>
Similarly, for Chinese words in a Japanese text: the Chinese
words will then be written in a Japanese font (and with Japanese
simplified characters, where necessary). E.g. 广东 (Guangdong) will
become 広東 in Japanese, even though it would be 廣東 in traditional
Chinese. <br>
In this case, for font reasons, the language is not that important,
but rather the script.<br>
<br>
I did not mean this method for determining the overall language of the
document. That is indeed much more complicated. But if we define
German as the main language <tt>(\setdefaultlanguage{german})</tt>
of the document, “table of contents” will automatically become
“Inhaltsverzeichnis”. If we then define Japanese as a secondary
language<tt> (\setotherlanguages{japanese})</tt>, we can write<br>
“Tokio (東京) ist die Hauptstadt Japans”, instead of having to write<br>
“Tokio (\textjapanese{東京}) ist die Hauptstadt Japans.”<br>
<br>
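For reference, here is a minimal sketch of the explicit markup that
such detection would make unnecessary. It assumes XeLaTeX or LuaLaTeX
and a Polyglossia installation that provides a <tt>japanese</tt>
language definition. The font name <tt>IPAMincho</tt> is only a
placeholder for whatever Japanese font is installed:<br>
<pre wrap="">\documentclass{article}
\usepackage{fontspec}
\usepackage{polyglossia}
\setdefaultlanguage{german}
\setotherlanguages{japanese}
% Assumption: IPAMincho is a placeholder; use any installed Japanese font.
\newfontfamily\japanesefont{IPAMincho}

\begin{document}
\tableofcontents % heading is typeset as "Inhaltsverzeichnis"
\section{Japan}
% Explicit language markup, as required without script detection:
Tokio (\textjapanese{東京}) ist die Hauptstadt Japans.
\end{document}
</pre>
<br>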
Of course, we then have a problem if we want to write<br>
“Tokio (東京) ist die Hauptstadt Japans und Peking (北京) die Chinas”,
because Polyglossia then does not know if 東京 is Japanese or Chinese
(of course we need to define Japanese and Chinese as other
languages). Polyglossia could use the order of the other languages
(first Japanese, then Chinese), so that we only have to write:<br>
“Tokio (東京) ist die Hauptstadt Japans und Peking (\textchinese{北京})
die Chinas.”<br>
<br>
Mixing text in the same script always poses a problem, but more in
the field of hyphenation, not so much in that of font changing. I
guess we do not want to select a different Latin font for French
written inside an English document? This would not look good.<br>
<br>
<blockquote
cite="mid:alpine.LNX.2.00.1103061049050.388@tetsu.ansuz.sooke.bc.ca"
type="cite">
<pre wrap="">
It seems to me that this kind of auto-detection based on character usage
can only ever *sometimes* work. Smarter ways of guessing language based
on bigger units than individual characters (for instance, looking for the
presence or absence of common words) do exist, but those will break on
some texts too. Perhaps you only intended to distinguish general script
families (like Latin from Cyrillic), not languages (like English from
Russian), but I think Polyglossia needs to distinguish languages, even
when it's limited to font selection only. I don't really object to
autodetection as long as it's only a deprecated default to make things
easier for users who don't know any better, but anybody who actually cares
what language they are writing *must* specify it manually or the system
will, inevitably, make a wrong guess eventually.
</pre>
</blockquote>
<br>
Well, of course, contextual recognition is much better,
especially for languages using the same script. But it is really
hard to implement, and is even more error-prone. Is “also” English
or German? Is 中国 the Chinese word for China (Zhōngguó), or is it
rather the Chūgoku region in Japan? My goal was to have this
autodetection for texts in which one or two words in another
language occasionally appear, not entire sentences. If we have a
complete, longer quote, maybe an entire paragraph, it should still
be no problem to write <tt>\begin{french} ... \end{french}</tt><br>
<br>
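As a rough sketch of that explicit route for a longer quotation
(the French sentence is only an invented example, and English as the
main language is an assumption for illustration):<br>
<pre wrap="">\documentclass{article}
\usepackage{polyglossia}
\setdefaultlanguage{english}
\setotherlanguage{french}

\begin{document}
The following paragraph is set with French hyphenation patterns:

\begin{french}
Ceci est une citation plus longue, entièrement rédigée en français.
\end{french}
\end{document}
</pre>
Here the font stays the same and only the language-specific rules
change, which is why script detection alone cannot replace this
markup for two languages written in the same script.<br>
<br>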
Autodetection is really hard, and I think it is too fuzzy for
something like LaTeX, which to some degree tries not to be that
fuzzy. If we had autodetection based on words or sentences, we would
need a large dictionary, which would also have to be kept up to date.
This is quite some task. Therefore, I thought that implementing this
relatively easy script autodetection would just be more realistic
for the time being. The problem arises if we have French in an
English text, or Japanese <i>and </i>Chinese in an English text,
but if we only have Japanese or Arabic in an English text, the
implementation may be relatively easy. If I write an English text
and insert 中国 into it, the system will have no chance to make a
wrong guess (provided I specified English as the main language, and
<i>only </i>Chinese as a second language). <br>
<br>
Gerrit<br>
<br>
</body>
</html>