[XeTeX] How to use intercharclasses (was "Issue with CJK in pdf build")
Michiel Kamermans
pomax at nihongoresources.com
Wed Nov 18 19:50:46 CET 2009
Scott Kohler wrote:
> As a "lurker" on this list, your explanation was very helpful to me. And yes, please, I'd like to read your explanation on how to use intercharclasses - seems like a feature you'd always want to use when typesetting multi-language documents.
Fair enough, I shall explain it to the best of my understanding, and
Jonathan can jump in if I misrepresent the concept =)
The idea of intercharclasses is that characters can be assigned a class
number, and that XeTeX will automatically insert specific TeX code
between characters from one class, and characters from another class.
This behaviour can be turned on with the command:
\XeTeXinterchartokenstate = 1
Classes are numbered 0 through 255, but by default some characters are
already assigned classes; all Latin characters are class 0 in XeTeX, all
CJK characters are captured by classes 1, 2 and 3, class 254 has been
tentatively reserved for a "wild card" class (i.e., every class; this is
useful when you have lots of uniform transition rules), and class 255 is
used by the "boundary" characters (spaces and the like).
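If you want to check which class a character currently belongs to, the
class can be read back with \the, like other TeX integer tables. A small
sketch (the two characters are just my picks for illustration, and the
numbers that end up in the log depend on how your format sets up the
defaults):

% write the class of "A" and of U+4E00 to the terminal and log
\typeout{class of A: \the\XeTeXcharclass`\A}
\typeout{class of U+4E00: \the\XeTeXcharclass"4E00 }

This is handy for confirming that a character really is in the class you
think it is before you start writing transition rules for it.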
So how do we use it?
Say we have a document that mixes some ASCII text with CJK text, and we
want to automatically switch fonts because the font that looks nice for
ASCII doesn't support CJK, and the CJK font that we like has horribly
disfigured and weirdly spaced ASCII characters. In order to make this
happen, we need a few things: 1) fontspec, to conveniently load fonts,
2) some font family definitions, so we can conveniently change fonts,
and 3) transition rules for the different character classes, so that the
right change command is issued at the right time. And we'll do all this
in our preamble (although we could also do it in a separate
package/style file):
1: \documentclass{article}
2: % turn on intercharclass behaviour
3: \XeTeXinterchartokenstate = 1
4: % require fontspec
5: \usepackage{fontspec}
6: % set up two fonts, one for latin, and one for CJK
7: \newfontfamily{\latinfont}{Times New Roman}
8: \newfontfamily{\cjkfont}{Ume Mincho}
9: % set up the transition rules - first from Latin (or boundary) to CJK
10: \XeTeXinterchartoks 0 1 = {\cjkfont}
11: \XeTeXinterchartoks 0 2 = {\cjkfont}
12: \XeTeXinterchartoks 0 3 = {\cjkfont}
13: \XeTeXinterchartoks 255 3 = {\cjkfont}
14: % then, from CJK (or boundary) to Latin
15: \XeTeXinterchartoks 1 0 = {\latinfont}
16: \XeTeXinterchartoks 2 0 = {\latinfont}
17: \XeTeXinterchartoks 3 0 = {\latinfont}
18: \XeTeXinterchartoks 255 0 = {\latinfont}
19: % finally, be strict with TeX so that it does not resort to some unknown default font - force the issue:
20: \setmainfont{Times New Roman}
21: \begin{document}
22: ...our document text goes here. こんな風に。 Things should just work.
23: \end{document}
It is rather important not to forget lines 13, 18 and 20. If you do, a
world of unpredictable results is yours.
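One thing to watch with the rules above, assuming the default class
assignments (ideographs and kana in class 1, opening punctuation in
class 2, closing punctuation in class 3): line 13 only covers a boundary
followed by a class 3 character. If your CJK runs start right after a
space, as in the sample sentence, you will probably also want boundary
rules for classes 1 and 2. A sketch of the extra preamble lines:

% boundary (e.g. a space) followed by an ideograph, kana or opening bracket
\XeTeXinterchartoks 255 1 = {\cjkfont}
\XeTeXinterchartoks 255 2 = {\cjkfont}

If boundary rules seem to have no effect at all on a recent XeTeX, check
the boundary class number; as far as I know, later releases enlarged the
class range and moved the boundary class from 255 to 4095.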
This is the basic use, and it is already quite useful! But what if you
want a language that isn't latin or CJK? The solution is to write your
own "class definition", which means assigning characters to a specific
class. For instance, say that in addition to Latin and CJK you also have
a lot of box drawing to do, and for this you use the mighty convenient
series of characters located between Unicode code points U+2500 and U+257F.
We'll extend the previous bit of code to allow for this:
1: \documentclass{article}
2: % turn on intercharclass behaviour
3: \XeTeXinterchartokenstate = 1
4: % require fontspec
5: \usepackage{fontspec}
6: % set up three fonts, one for latin, one for CJK and one for box art
7: \newfontfamily{\latinfont}{Times New Roman}
8: \newfontfamily{\cjkfont}{Ume Mincho}
9: \newfontfamily{\boxfont}{Courier New}
10: % define a new character class for box drawing
11: \XeTeXcharclass `\┌ 4
12: \XeTeXcharclass `\┐ 4
13: \XeTeXcharclass `\└ 4
14: \XeTeXcharclass `\┘ 4
15: % ...
16: % extended transition rules... To CJK:
17: \XeTeXinterchartoks 0 1 = {\cjkfont}
18: \XeTeXinterchartoks 0 2 = {\cjkfont}
19: \XeTeXinterchartoks 0 3 = {\cjkfont}
20: \XeTeXinterchartoks 4 3 = {\cjkfont}
21: \XeTeXinterchartoks 255 3 = {\cjkfont}
22: % To Latin:
23: \XeTeXinterchartoks 1 0 = {\latinfont}
24: \XeTeXinterchartoks 2 0 = {\latinfont}
25: \XeTeXinterchartoks 3 0 = {\latinfont}
26: \XeTeXinterchartoks 4 0 = {\latinfont}
27: \XeTeXinterchartoks 255 0 = {\latinfont}
28: % To box shapes:
29: \XeTeXinterchartoks 0 4 = {\boxfont}
30: \XeTeXinterchartoks 1 4 = {\boxfont}
31: \XeTeXinterchartoks 2 4 = {\boxfont}
32: \XeTeXinterchartoks 3 4 = {\boxfont}
33: \XeTeXinterchartoks 255 4 = {\boxfont}
34: % and force the main font to Times New Roman again
35: \setmainfont{Times New Roman}
36: \begin{document}
37: ...our document text goes here. こんな風に。 Things (┌┐) should just
(└┘) work.
38: \end{document}
Note that we had to define a new font to use for our boxes, we had to
say that all the characters that we use for box drawing are class 4, and
we had to make sure to specify the transition rules, including making
sure we add a "from class 4 to ..." to the latin and cjk rules (lines 20
and 26)!
This gets dangerous the more classes you need, because magic numbers are
bad (what does '4' mean, for instance? How do we know we can use it? If
some other package already uses class 4, we'd be messing up that
package's functionality!). We can rewrite the previous example into
something a bit more robust, although this may not work on older
versions of XeTeX because it relies on a relatively new feature:
automatically getting the next free class number. For this we replace
lines 10 through 15 with:
..: % define a new character class for box drawing, called boxclass
..: \newXeTeXintercharclass\boxclass
..: % build the entire block, from U+2500 to U+257F, using this new class
..: \XeTeXcharclass `\┌ \boxclass
..: \XeTeXcharclass `\┐ \boxclass
..: \XeTeXcharclass `\└ \boxclass
..: \XeTeXcharclass `\┘ \boxclass
..: \XeTeXcharclass `\... \boxclass
...etc
And lines 20, 26, and 29-33 will look like:
20: \XeTeXinterchartoks \boxclass 3 = {\cjkfont}
26: \XeTeXinterchartoks \boxclass 0 = {\latinfont}
29: \XeTeXinterchartoks 0 \boxclass = {\boxfont}
30: \XeTeXinterchartoks 1 \boxclass = {\boxfont}
31: \XeTeXinterchartoks 2 \boxclass = {\boxfont}
32: \XeTeXinterchartoks 3 \boxclass = {\boxfont}
33: \XeTeXinterchartoks 255 \boxclass = {\boxfont}
Already much better: rather than a magic number we now have something
that clearly indicates what it's being used for, and we don't have to
worry about what the underlying class number is. Things will just work.
Of course this is a hassle for a class with lots of characters (good luck
assigning all the CJK Unified Ideographs Extension A characters to a new
class character by character, for instance: there are 6582 of them!) so
you may end up wanting to write a command that runs through a for-loop of
some kind to bind lots of characters to a class, rather than specifying
them one by one, as well as a command that can deal with setting all the
to-and-from transitions when you have lots of classes.
However, I won't bother giving examples of the code this would need,
since I'm already writing a package that's supposed to do that; I'm just
still in the process of refining it for general release (this package
exploits the intercharclass system to bind fonts to Unicode blocks,
although Jonathan kindly pointed out that binding them to scripts
instead of blocks is probably more sensible... which it is, I've just
not had a great deal of time to make the switch).
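That said, for the impatient, here is a rough sketch of the two helper
ideas mentioned above. This is not the package code, just a minimal
illustration: the names \setclassrange, \switchinto and \classrangecount
are made up for this example, and it assumes \boxclass has already been
allocated with \newXeTeXintercharclass and that \boxfont is defined as
before.

% hypothetical helper: \setclassrange{first}{last}{class}
% assigns every code point from first to last (inclusive) to the class
\newcount\classrangecount
\newcommand\setclassrange[3]{%
  \classrangecount=#1\relax
  \loop
    \XeTeXcharclass\classrangecount=#3\relax
    \ifnum\classrangecount<#2\relax
    \advance\classrangecount by 1
  \repeat}

% hypothetical helper: \switchinto{class}{font command}
% issues the font command on entering the class from classes 0-3
% or from the boundary class; only meant for a freshly allocated
% class, since for classes 0-3 it would also set a rule from the
% class to itself
\newcommand\switchinto[2]{%
  \XeTeXinterchartoks 0 #1 = {#2}%
  \XeTeXinterchartoks 1 #1 = {#2}%
  \XeTeXinterchartoks 2 #1 = {#2}%
  \XeTeXinterchartoks 3 #1 = {#2}%
  \XeTeXinterchartoks 255 #1 = {#2}}

% usage: the whole box drawing block, U+2500 to U+257F
\setclassrange{"2500}{"257F}{\boxclass}
\switchinto{\boxclass}{\boxfont}

The rules for leaving the new class again (back to \latinfont or
\cjkfont) still have to be written out as in the earlier listing; a
fuller version would loop over every pair of classes instead.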
So hopefully this was useful as well, and you'll be going "oh, so it's
actually pretty simple" now... basically as long as this explanation
keeps you from going back to those Office applications just because you
miss font switching, I'll feel like I made the world that tiny bit better =P
- Mike