[XeTeX] Japanese, Chinese, Korean support for Polyglossia

Fri Jul 23 17:15:53 CEST 2010

Hello!

I will try to gather some information about Japanese, Chinese and Korean 
support for Polyglossia in the next days.

Because I do not understand TeX programming at all, I can only give some
information here. I will try to write it in as much detail as possible,
so that the implementation should not be that hard :)

What I understand so far about what is possible and what is too
different is the following:

For all three languages:

1. Line spacing needs to be increased. All characters from these three 
scripts are written in a square, which would be like writing in capitals 
all the time in Latin fonts. Because of this, the line spacing would be 
too narrow with the default setting.
I do not yet know how much the line spacing actually should be, but I 
will try to figure that out.
Also, the line spacing should depend on the surrounding text. If the
default language of the document is a western language, the line spacing
for e.g. \textkorean{} should not be increased. This is because one
would use this command to insert some Korean text into a western text,
where it is not desirable to increase the line spacing (you would not do
that if you entered an abbreviation in all caps, either).
If a CJK language is chosen with \setdefaultlanguage or \begin{korean},
the line spacing should be adjusted, though.

2. A date would be in this format: 2010 [word for year] 7 [word for
month] 23 [word for day].
In Chinese and Japanese, this would be: 2010年7月23日
In Korean it would be 2010년7월23일
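
To make the format concrete, here is a rough sketch in Python (not TeX,
since I cannot write that). The function name and the language keys are
made up for illustration; only the three suffix characters come from the
examples above.

```python
from datetime import date

# Year / month / day words, taken from the examples above.
DATE_WORDS = {
    "japanese": ("年", "月", "日"),
    "chinese": ("年", "月", "日"),
    "korean": ("년", "월", "일"),
}


def format_cjk_date(d: date, language: str) -> str:
    """Format a date as <year><year word><month><month word><day><day word>."""
    year_word, month_word, day_word = DATE_WORDS[language]
    return f"{d.year}{year_word}{d.month}{month_word}{d.day}{day_word}"


print(format_cjk_date(date(2010, 7, 23), "japanese"))
print(format_cjk_date(date(2010, 7, 23), "korean"))
```

Month and day are written without leading zeros here, as in the examples
above.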

3. Chapter names etc. are written with the number between two words:
ordinal prefix + number + “chapter”
e.g., “chapter 1”: 第1章 in Japanese or Chinese.

4. “table of contents” etc. needs to be translated

-----

For Chinese and Japanese:

1. Japan and Taiwan have their own calendar systems, which count the
years since the founding of the Republic of China (Taiwan) or since the
accession of the current emperor (Japan).
In Taiwan, one simply subtracts 1911 from the Gregorian year to get the
current year.
Also, one needs to write 民國 (Mínguó = “Republic”) in front of the year.
E.g.: 2010-07-23 -> 民國99年7月23日
In Japan, the year depends on the current emperor.
 From 1868 to 1911: subtract 1867 and add 明治 (Meiji) before the number.
e.g.: 1905 -> 明治38年
 From 1912 to 1925: subtract 1911, add 大正 (Taishō)
 From 1926 to 1988: subtract 1925, add 昭和 (Shōwa)
 From 1989: subtract 1988, add 平成 (Heisei)
If it is the first year of the emperor, don’t write 1年, but write 元年,
e.g. 昭和元年.
I think only the current era, Heisei, is of practical relevance. It
would be nice to include the other ones, though.
Before 1868 it is too hard, because they still used the lunar calendar
at that time. I think nobody needs a calculation for that, though.
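
The rules above could be sketched like this in Python. The names are
hypothetical, and the table only contains the eras listed above. Note
that the sketch works on whole years, exactly like the list above; an
era actually changes in the middle of a year, so a real implementation
would have to look at the full date.

```python
# (first Gregorian year, era name), newest first, as listed above.
JAPANESE_ERAS = [
    (1989, "平成"),  # Heisei
    (1926, "昭和"),  # Shōwa
    (1912, "大正"),  # Taishō
    (1868, "明治"),  # Meiji
]


def minguo_year(year: int) -> str:
    """Taiwanese year: subtract 1911 and put 民國 in front."""
    if year < 1912:
        raise ValueError("the Republic of China calendar starts in 1912")
    return f"民國{year - 1911}年"


def japanese_era_year(year: int) -> str:
    """Japanese era year, counted from the first year of the era."""
    for first_year, era in JAPANESE_ERAS:
        if year >= first_year:
            n = year - first_year + 1
            # The first year of an era is written 元年, not 1年.
            return f"{era}{'元' if n == 1 else n}年"
    raise ValueError("years before 1868 are not handled")


print(minguo_year(2010))
print(japanese_era_year(1905))
print(japanese_era_year(1926))
```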

2. Both languages still use Chinese numerals, although to different
degrees.
They need to be converted from Arabic digits. The conversion method
depends on the kind of number.
For year numbers and page numbers (seldom): Just replace every Arabic
digit with the appropriate Chinese digit (一二三四五六七八九〇). E.g.
page 354 = 三五四. Year 1980 = 一九八〇年. But: 民國九十八年 (十 = 10;
not sure about this), not 民國九八年.
For other numbers: e.g. 1324 = 一千三百二十四
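
Both methods could be sketched in Python like this (hypothetical function
names). For the second method I assume the Chinese conventions as far as
I know them: 零 is written for a gap of zeros inside the number, and 10
to 19 start with 十 rather than 一十. Japanese differs in some of these
details, so this is only a starting point, and it stops at 9999.

```python
DIGITS = "〇一二三四五六七八九"
UNITS = ["", "十", "百", "千"]  # ones, tens, hundreds, thousands


def digits_to_chinese(n: int) -> str:
    """Method 1: replace every Arabic digit with its Chinese digit."""
    return "".join(DIGITS[int(c)] for c in str(n))


def number_to_chinese(n: int) -> str:
    """Method 2: positional reading with 十, 百, 千 (1 to 9999 only)."""
    if not 1 <= n <= 9999:
        raise ValueError("this sketch only covers 1 to 9999")
    text = str(n)
    parts = []
    zero_pending = False
    for position, char in enumerate(text):
        digit = int(char)
        power = len(text) - 1 - position
        if digit == 0:
            zero_pending = True
            continue
        if zero_pending:
            # Assumed Chinese convention: one 零 for a gap of zeros.
            parts.append("零")
            zero_pending = False
        parts.append(DIGITS[digit] + UNITS[power])
    result = "".join(parts)
    if 10 <= n <= 19:
        result = result[1:]  # 十五 rather than 一十五
    return result


print(digits_to_chinese(354))
print(digits_to_chinese(1980))
print(number_to_chinese(1324))
```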

3. Another option: If Arabic digits are used, they may need to be
converted to full-width digits. e.g. 3 = ３
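
This one is a plain character-for-character replacement. A minimal
sketch in Python (the function name is made up):

```python
# Map ASCII digits to the full-width digits U+FF10 to U+FF19.
FULLWIDTH_DIGITS = str.maketrans("0123456789", "０１２３４５６７８９")


def to_fullwidth_digits(text: str) -> str:
    """Replace ASCII digits with full-width digits; leave the rest alone."""
    return text.translate(FULLWIDTH_DIGITS)


print(to_fullwidth_digits("3"))
print(to_fullwidth_digits("2010年7月23日"))
```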

--------

For Japanese:

1. kinsoku shori (line breaking rules). In Japanese, a line cannot be
broken at every character (as it can be in Chinese). Some
punctuation marks are not allowed to start or end a line (e.g. 。、「),
just like in western languages. Also, some kana are not allowed to start
a line (ょ、－、っ etc.).
There are different levels of strictness. A break in front of
punctuation marks like 。 is never allowed, but for e.g. ょ, the
situation may be different.
There could e.g. be four settings: off (break everywhere), low
(break everywhere except in front of 。 etc.), medium (don’t break in
front of ょ, but do break in front of －), high (don’t break in front
of ょ, － or any other similar character).
Because Japanese is written without spaces, there is no space between
words that could be stretched or shrunk, so this can be a little
difficult to achieve. Characters like 。、 are simply written at the end
of the line, so that the line becomes a little bit longer. In other
cases, it may be necessary to shorten or lengthen the spacing.
Usually, the only place where this is possible is before/after 。、「
and similar characters. Also, in some fonts, the characters are not
actually all the same size, so it may be possible to do that there (not
sure about that).
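
The strictness settings could be sketched like this in Python. Everything
here is an assumption for illustration: the function name, the setting
names, and above all the character sets, which only contain the few
characters mentioned above (I treat － as the long vowel mark ー and
list both). A real implementation needs complete character lists, and
this sketch only answers “may I break here?”; it does not do the
spacing adjustment.

```python
# Incomplete example sets, built from the characters mentioned above.
NO_LINE_START = set("。、")    # punctuation that must not start a line
NO_LINE_END = set("「")        # opening bracket that must not end a line
SMALL_KANA = set("ょっ")       # protected from "medium" upwards
LONG_MARKS = set("ー－")       # protected only at "high" (assumed meaning)

LEVELS = ("off", "low", "medium", "high")


def may_break_before(text: str, i: int, level: str = "medium") -> bool:
    """Return True if a line break is allowed between text[i-1] and text[i]."""
    if level not in LEVELS:
        raise ValueError(f"unknown strictness level: {level}")
    if not 0 < i < len(text):
        return False  # no break before the first or after the last character
    if level == "off":
        return True
    prev_char, next_char = text[i - 1], text[i]
    if next_char in NO_LINE_START or prev_char in NO_LINE_END:
        return False
    if level in ("medium", "high") and next_char in SMALL_KANA:
        return False
    if level == "high" and next_char in LONG_MARKS:
        return False
    return True


text = "きょうは、"
print([i for i in range(1, len(text)) if may_break_before(text, i, "low")])
print([i for i in range(1, len(text)) if may_break_before(text, i, "medium")])
```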

For Chinese:

1. They still use the lunar calendar (I don’t yet quite understand the
calculation). But this is very optional. I don’t think that this is ever
used in academic writing. Even if it were, you could just write it by
hand. Would be a nice feature, though.

2. Support for simplified and traditional Chinese is needed. This would 
change the translations of table of contents etc., and may also have 
some other, typographic effects.

Features that may not be easy to achieve:

1. Vertical writing. Absolutely necessary, but I think extremely hard.
May need some drastic changes in XeTeX, if it should not be a dirty
hack (“put every character in a box and then put all the boxes under
each other”). Maybe not as necessary for academic writing, though. This
depends on the subject. In subjects where mathematics is used, vertical
writing is not useful. But I think it is still extensively used in
subjects like history etc.

2. Ruby characters. They are also extremely necessary (for Japanese).
They are smaller characters put on top of (or below) the Chinese
character to indicate the reading. Basically, they are put between the
lines (in the line spacing), with no change in the line spacing. There
are different kinds of ruby annotation, e.g. mono ruby (every character
has its own pronunciation) and group ruby (a complete word, consisting
of multiple Chinese characters, has the reading put on top). Also, the
ruby characters can overlap the characters next to the word. (Ruby
characters are printed at half the size of the base text, which gives
every Chinese character room for two ruby characters. There may be words
where the reading is longer than that, e.g. 承る with the ruby
characters うけたまわ.) Alternatively, space can be inserted within the
word in compounds. E.g. the reading of 躊躇 (ちゅうちょ) would be too
long, so the word may be stretched like 躊 躇.
In vertical writing, the ruby characters go on the right side of the line.
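
The “room for two ruby characters” rule can be turned into a small
calculation. This Python sketch (hypothetical function name) assumes
that every base character is exactly one em wide and every ruby
character exactly half an em, which is a simplification for real fonts.
It only reports how much wider the reading is than the base text; whether
that excess overlaps the neighbours or stretches the word is a separate
decision.

```python
def ruby_excess(base: str, ruby: str, ruby_scale: float = 0.5) -> float:
    """Return how much wider the ruby text is than the base text, in em.

    Assumes one em per base character and ruby_scale em per ruby
    character. Returns 0.0 if the ruby text fits over the base text.
    """
    return max(0.0, len(ruby) * ruby_scale - len(base))


print(ruby_excess("承", "うけたまわ"))
print(ruby_excess("躊躇", "ちゅうちょ"))
print(ruby_excess("漢字", "かんじ"))
```

For 承 the reading needs 2.5 em over a 1 em character, so 1.5 em have to
go somewhere; for 躊躇 the excess is only 0.5 em.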

There are also ruby characters (Zhuyin Fuhao) in Taiwan, which are more
complicated. In vertical writing, they are written like Japanese on the
right side of the line. In horizontal writing, they are, unlike
Japanese, written on the right side of the character. It is more
difficult, because the characters forming a syllable themselves need to
be stacked vertically, even in horizontal writing, but the tone mark
goes on the right side of the syllable. It may be better to let an
OpenType font handle the composition of the syllables (for example via
ligatures), because I guess that XeTeX would not achieve a visually
pleasing result. The problem is that there are no OpenType fonts that
do that, as far as I know.

I think there is a ruby package for the old CJK package, but I don’t
know if that still works with XeTeX.

3. Emphasis. There is no italic writing in Chinese characters. In 
Japanese, emphasis is done by putting 、 on top of every character (as a 
ruby character). This method is quite easily achieved if ruby characters 
are supported. I am not sure about Chinese, but I think they do that 
with a dot, similarly to Japanese.

4. Footnotes: In Japanese, they are also done like the emphasis mark, as 
a ruby character.

OK, that is all that comes to my mind right now. I will gather more
information.

I wonder if polyglossia is the right approach for everything? Of course, 
translations of “table of contents” and e.g. kinsoku shori are good for 
polyglossia, but what about ruby characters?
I think, it may be nice to have a CJK package which offers support for 
vertical writing, ruby, maybe calculation of the calendars etc.
They are extremely necessary for these languages, but may not be needed 
for other languages. Maybe it would be good if polyglossia loaded this 
package if it detects one of these three languages. This would then make 
it easy to actually use for example Japanese, because it is not 
necessary to know which packages you need to load.

e.g. just load polyglossia and set Japanese, and it will automatically 
load packages for vertical writing and ruby characters, without the need 
to load these packages on your own.

Because some of the most basic LaTeX features (like footnotes or
emphasis) would require this special package, I think it would be best
if polyglossia then also loaded it. But I’m not sure if the design of
polyglossia allows this.

Gerrit