[XeTeX] How to use intercharclasses (was "Issue with CJK in pdf build")

Michiel Kamermans pomax at nihongoresources.com
Wed Nov 18 19:50:46 CET 2009

Scott Kohler wrote:
> As a "lurker" on this list, your explanation was very helpful to me. And yes, please, I'd like to read your explanation on how to use intercharclasses - seems like a feature you'd always want to use when typesetting multi-language documents.

Fair enough, I shall explain it to the best of my understanding, and 
Jonathan can jump in if I misrepresent the concept =)

The idea of intercharclasses is that characters can be assigned a class 
number, and that XeTeX will automatically insert specific TeX code 
between characters from one class, and characters from another class.

This behaviour can be turned on with the command:

\XeTeXinterchartokenstate = 1
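
Since \XeTeXinterchartokenstate is an ordinary integer parameter, the setting should obey TeX's usual grouping, so it can also be switched off for just part of a document. A minimal sketch (the sample text is only filler, and no transition rules are defined here, so nothing visible changes):

```latex
\documentclass{article}
% on for the whole document
\XeTeXinterchartokenstate = 1
\begin{document}
Transition rules are active here.
% the assignment is local, so it ends with the group
{\XeTeXinterchartokenstate = 0 No tokens are inserted inside this group.}
Transition rules are active again here.
\end{document}
```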

Classes are numbered 0 through 255, but by default some characters are 
already assigned classes: all Latin characters are class 0 in XeTeX, all 
CJK characters are captured by classes 1, 2 and 3, class 254 has been 
tentatively reserved for a "wild card" class (i.e., every class; this is 
useful when you have lots of uniform transition rules), and class 255 is 
used by the "boundary" characters (spaces and the like).
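
If you want to check which class a character currently has, \XeTeXcharclass can be read back with \the, much like \catcode. A small sketch, assuming the three sample characters are of interest (they are just picked for illustration); the numbers you get depend on the defaults your format loads:

```latex
\documentclass{article}
\begin{document}
% print the class currently assigned to a few characters
Class of A: \the\XeTeXcharclass`\A

Class of U+4E00 (a CJK ideograph): \the\XeTeXcharclass"4E00

Class of U+3002 (the ideographic full stop): \the\XeTeXcharclass"3002
\end{document}
```

If the defaults described above hold, the first should report 0 and the other two something in the 1 to 3 range.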

So how do we use it?

Say we have a document that mixes some ASCII text with CJK text, and we 
want to automatically switch fonts because the font that looks nice for 
ASCII doesn't support CJK, and the CJK font that we like has horribly 
disfigured and weirdly spaced ASCII characters. In order to make this 
happen, we need a few things: 1) fontspec, to conveniently load fonts, 
2) some font family definitions, so we can conveniently change fonts, 
and 3) transition rules for the different character classes, so that the 
right change command is issued at the right time. And we'll do all this 
in our preamble (although we could also do it in a separate 
package/style file):

1: \documentclass{article}
2: % turn on intercharclass behaviour
3: \XeTeXinterchartokenstate = 1
4: % require fontspec
5: \usepackage{fontspec}
6: % set up two fonts, one for latin, and one for CJK
7: \newfontfamily{\latinfont}{Times New Roman}
8: \newfontfamily{\cjkfont}{Ume Mincho}
9: % set up the transition rules - first from Latin (or boundary) to CJK
10: \XeTeXinterchartoks 0 1 = {\cjkfont}
11: \XeTeXinterchartoks 0 2 = {\cjkfont}
12: \XeTeXinterchartoks 0 3 = {\cjkfont}
13: \XeTeXinterchartoks 255 3 = {\cjkfont}
14: % then, from CJK (or boundary) to Latin
15: \XeTeXinterchartoks 1 0 = {\latinfont}
16: \XeTeXinterchartoks 2 0 = {\latinfont}
17: \XeTeXinterchartoks 3 0 = {\latinfont}
18: \XeTeXinterchartoks 255 0 = {\latinfont}
19: % finally, we need to be strict with TeX, since it should not resort to some unknown default font - force the issue:
20: \setmainfont{Times New Roman}
21: \begin{document}
22: ...our document text goes here. こんな風に。 Things should just work.
23: \end{document}

It is rather important not to forget lines 13, 18 and 20. If you do, a 
world of unpredictable results is yours.
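
One thing to watch: the listing only has a boundary rule into class 3 (line 13). If a CJK run that follows a space starts with a character from class 1 or 2, the rules as given would not trigger a font switch, so it may be worth adding the matching boundary rules. This is an assumption based on the class layout described above, not something the listing includes; a sketch:

```latex
\documentclass{article}
\XeTeXinterchartokenstate = 1
\usepackage{fontspec}
\newfontfamily{\latinfont}{Times New Roman}
\newfontfamily{\cjkfont}{Ume Mincho}
% Latin into each CJK class, as in the listing
\XeTeXinterchartoks 0 1 = {\cjkfont}
\XeTeXinterchartoks 0 2 = {\cjkfont}
\XeTeXinterchartoks 0 3 = {\cjkfont}
% boundary into each CJK class: the 255 1 and 255 2 rules are the
% assumed additions, 255 3 is line 13 of the listing
\XeTeXinterchartoks 255 1 = {\cjkfont}
\XeTeXinterchartoks 255 2 = {\cjkfont}
\XeTeXinterchartoks 255 3 = {\cjkfont}
% CJK (or boundary) back to Latin, unchanged
\XeTeXinterchartoks 1 0 = {\latinfont}
\XeTeXinterchartoks 2 0 = {\latinfont}
\XeTeXinterchartoks 3 0 = {\latinfont}
\XeTeXinterchartoks 255 0 = {\latinfont}
\setmainfont{Times New Roman}
\begin{document}
Some text. こんな風に。 Things should just work.
\end{document}
```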

This is the basic use, and it is already quite useful! But what if you 
want a language that isn't Latin or CJK? The solution is to write your 
own "class definition", which means assigning characters to a specific 
class. For instance, say that in addition to Latin and CJK you also have 
a lot of box drawing to do, and for this you use the mighty convenient 
series of characters located between Unicode points U+2500 and U+257F. 
We'll extend the previous bit of code to allow for this:

1: \documentclass{article}
2: % turn on intercharclass behaviour
3: \XeTeXinterchartokenstate = 1
4: % require fontspec
5: \usepackage{fontspec}
6: % set up three fonts, one for latin, one for CJK and one for box art
7: \newfontfamily{\latinfont}{Times New Roman}
8: \newfontfamily{\cjkfont}{Ume Mincho}
9: \newfontfamily{\boxfont}{Courier New}
10: % define a new character class for box drawing
11: \XeTeXcharclass `\┌ 4
12: \XeTeXcharclass `\┐ 4
13: \XeTeXcharclass `\└ 4
14: \XeTeXcharclass `\┘ 4
15: % ...
16: % extended transition rules... To CJK:
17: \XeTeXinterchartoks 0 1 = {\cjkfont}
18: \XeTeXinterchartoks 0 2 = {\cjkfont}
19: \XeTeXinterchartoks 0 3 = {\cjkfont}
20: \XeTeXinterchartoks 4 3 = {\cjkfont}
21: \XeTeXinterchartoks 255 3 = {\cjkfont}
22: % To latin:
23: \XeTeXinterchartoks 1 0 = {\latinfont}
24: \XeTeXinterchartoks 2 0 = {\latinfont}
25: \XeTeXinterchartoks 3 0 = {\latinfont}
26: \XeTeXinterchartoks 4 0 = {\latinfont}
27: \XeTeXinterchartoks 255 0 = {\latinfont}
28: % To box shapes:
29: \XeTeXinterchartoks 0 4 = {\boxfont}
30: \XeTeXinterchartoks 1 4 = {\boxfont}
31: \XeTeXinterchartoks 2 4 = {\boxfont}
32: \XeTeXinterchartoks 3 4 = {\boxfont}
33: \XeTeXinterchartoks 255 4 = {\boxfont}
34: % and force the main font to Times New Roman again
35: \setmainfont{Times New Roman}
36: \begin{document}
37: ...our document text goes here. こんな風に。 Things (┌┐) should just (└┘) work.
38: \end{document}

Note that we had to define a new font to use for our boxes, we had to 
say that all the characters that we use for box drawing are class 4, and 
we had to make sure to specify the transition rules, including adding 
a "from class 4 to ..." rule to the CJK and Latin rules (lines 20 
and 26)!

This gets dangerous the more classes you need, because magic numbers are 
bad (what does '4' mean, for instance? how do we know we can use it? 
what if some other package already used class 4, we'd be messing up that 
package's functionality!). We can rewrite the previous example to 
something a bit more useful, although this may not work on older 
versions of XeTeX because it uses a relatively new functionality: 
automatically getting the next free class number. For this we replace 
lines 10 through 15 with:

..: % define a new character class for box drawing, called boxclass
..: \newXeTeXintercharclass\boxclass
..: % build the entire block, from U+2500 to U+257F, using this new class
..: \XeTeXcharclass `\┌ \boxclass
..: \XeTeXcharclass `\┐ \boxclass
..: \XeTeXcharclass `\└ \boxclass
..: \XeTeXcharclass `\┘ \boxclass
..: \XeTeXcharclass `\... \boxclass

And lines 20, 26, and 29-33 will look like:

20: \XeTeXinterchartoks \boxclass 3 = {\cjkfont}
26: \XeTeXinterchartoks \boxclass 0 = {\latinfont}
29: \XeTeXinterchartoks 0 \boxclass = {\boxfont}
30: \XeTeXinterchartoks 1 \boxclass = {\boxfont}
31: \XeTeXinterchartoks 2 \boxclass = {\boxfont}
32: \XeTeXinterchartoks 3 \boxclass = {\boxfont}
33: \XeTeXinterchartoks 255 \boxclass = {\boxfont}

Already much better: rather than a magical number we now have something 
that clearly indicates what it's being used for, and we don't have to 
worry about what the underlying class number is. Things will just work.

Of course this is a hassle for classes with lots of characters (good luck 
assigning all the CJK ideographs extension A characters to a new class 
character by character, for instance: there are 6582 of them!) so you 
may end up wanting to write a command that runs through a for-loop of 
some kind to bind lots of characters to a class, rather than specifying 
them one by one, as well as a command that can deal with setting all the 
to-and-from transitions when you have lots of classes.

However, I won't bother giving examples of the code this would need, 
since I'm already writing a package that's supposed to do that; I'm just 
still in the process of refining it for general release. This package 
exploits the intercharclass system to bind fonts to Unicode blocks, 
although Jonathan kindly pointed out that binding them to scripts 
instead of blocks is probably more sensible... which it is, I've just 
not had a great deal of time to make the switch.
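
Until that package is out, here is a rough sketch of what such helper commands might look like. The names \assignclassrange, \switchinto, \rangecounter and \sourceclass are made up for this illustration and do not come from any package, \ttfamily is only a stand-in for a real box font, and the list of source classes assumes the default layout described earlier (0 for Latin, 1 to 3 for CJK, 255 for boundary):

```latex
\documentclass{article}
\XeTeXinterchartokenstate = 1
\newXeTeXintercharclass\boxclass

% Hypothetical helper, not taken from any package:
% \assignclassrange{<first>}{<last>}{<class>} puts every code point in
% the inclusive range <first>..<last> into <class>.
\newcount\rangecounter
\newcommand{\assignclassrange}[3]{%
  \rangecounter=#1\relax
  \loop
    \XeTeXcharclass\rangecounter=#3\relax
  \ifnum\rangecounter<#2\relax
    \advance\rangecounter by 1
  \repeat
}

% Hypothetical helper: \switchinto{<class>}{<tokens>} inserts <tokens>
% on every transition into <class> from classes 0, 1, 2, 3 and 255.
\makeatletter
\newcommand{\switchinto}[2]{%
  \@for\sourceclass:=0,1,2,3,255\do{%
    \XeTeXinterchartoks \sourceclass\space #1 = {#2}%
  }%
}
\makeatother

% the whole box drawing block in one go, plus the "to box" rules
\assignclassrange{"2500}{"257F}{\boxclass}
\switchinto{\boxclass}{\ttfamily}

\begin{document}
% both lines should report the class number allocated to \boxclass
U+2500 is in class \the\XeTeXcharclass"2500

U+257F is in class \the\XeTeXcharclass"257F
\end{document}
```

The rules going the other way (from \boxclass back to Latin and CJK) would still have to be written out, or handled by a similar helper.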

So hopefully this was useful as well, and you'll be going "oh, so it's 
actually pretty simple" now... basically as long as this explanation 
keeps you from going back to those Office applications just because you 
miss font switching, I'll feel like I made the world that tiny bit better =P

- Mike
