[XeTeX] On cross-language font selection

Fri Feb 23 17:54:13 CET 2007

This is a topic that has come up several times in the past few years.  
My view has been (as Will suggested) that it is not possible to come  
up with a comprehensive and general scheme, independent of some kind  
of language markup in the source. You cannot tell purely from the  
Unicode values of the characters in the text what language they  
represent, or what font or other typographic features should be used.  
For example, in a document that combines both Chinese and Japanese,  
the same Han character could be used in both languages, but different  
fonts would probably be wanted.
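
To make the point concrete, here is a small Python sketch (not
anything XeTeX does). It shows that a Han character carries no
language in its Unicode data, so the font has to come from markup.
The language tags and font names are placeholders I made up.

```python
import unicodedata

# Hypothetical font table keyed by a language tag supplied by markup.
# The font names are placeholders, not real fonts.
FONT_FOR_LANG = {"zh": "ExampleSongti", "ja": "ExampleMincho"}

han = "\u76f4"  # a Han character used in both Chinese and Japanese

# The character name identifies the code point but says nothing
# about which language the text is in.
name = unicodedata.name(han)

# The same character gets a different font depending on the tag.
fonts = {lang: FONT_FOR_LANG[lang] for lang in ("zh", "ja")}
```

Here `name` is "CJK UNIFIED IDEOGRAPH-76F4" in either language; only
the tag distinguishes the two cases.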

At another level, there is the problem of punctuation characters that  
are common to many scripts and languages. For example, suppose I have  
a document that mixes Hindi and English. It's easy to say "use a  
Latin font for the English letters, and a Devanagari one for the  
Hindi letters". But what about punctuation such as parentheses, quote  
marks, question marks, etc.? The design of these will differ between a  
typical Latin and Devanagari font, being harmonized with the style of  
the letters. But it may not always be possible to reliably guess  
which script a given character should be associated with. In many  
cases, "the script of the preceding letter" would be a reasonable  
guide, but it may not always be correct -- and there may not always  
be any preceding letter at all!
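
A rough Python sketch of the "script of the preceding letter"
heuristic shows both failure modes. This is only an illustration:
the function names are hypothetical, and guessing the script from
the Unicode character name is a simplification I have assumed here,
not a proper script lookup.

```python
import unicodedata

def script_of(ch):
    """Rough script guess from the Unicode character name."""
    # Punctuation, digits and spaces get no script of their own;
    # letters and combining marks (category M*) do.
    if not ch.isalpha() and unicodedata.category(ch)[0] != "M":
        return None
    name = unicodedata.name(ch, "")
    if name.startswith("DEVANAGARI"):
        return "Devanagari"
    if name.startswith("LATIN"):
        return "Latin"
    return None

def assign_scripts(text):
    """Give each neutral character the script of the preceding letter."""
    result = []
    current = None  # nothing is known until the first letter is seen
    for ch in text:
        current = script_of(ch) or current
        result.append((ch, current))
    return result
```

For the input "(hi)" the opening parenthesis gets no script at all,
because nothing precedes it. For a Devanagari letter followed by
" (a)", the opening parenthesis is assigned Devanagari while its
closing partner is assigned Latin, so the pair would come from two
different fonts.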

For a web browser displaying arbitrary pages, font fallbacks are a  
good thing; it's better to find a font that makes the text legible,  
even if it sometimes makes choices that are typographically less than  
ideal. But in the context of a professional typesetting system, I  
don't want the computer guessing which font to pick for certain  
ambiguous characters in my document; I want to be sure that I will  
get exactly the fonts I have asked for. With this comes the  
requirement that my markup must always, in some way, provide  
sufficient information to unambiguously specify which font to use,  
not "pick one of this collection, based on some complex heuristic for  
guessing the current script".

However, I also have some good news! :) A new feature planned for  
XeTeX 0.997 will make it easy to implement automatic font switching  
for many simple situations, such as a mixture of Chinese and English,  
with no need for embedded markup. This is *not* a general-purpose  
"font collections" model, or a universal solution to multi-script  
text, but should be a big help for many of the common cases people  
try to implement with active characters, etc. Basically, it allows  
you to "hook in" extra code (such as glue, penalties, or font  
changes) between characters of the text, based on character classes  
assigned to each Unicode value.
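
The idea can be modelled in a few lines of Python. To be clear, this
is a sketch of the concept only, not the XeTeX interface: the class
numbers, the `HOOKS` table and the `\cjkfont` and `\latinfont` macro
names are all hypothetical.

```python
import unicodedata

# Hypothetical class numbers; 0 is the default for everything else.
DEFAULT, HAN = 0, 1

def char_class(ch):
    """Assign a class to a character: Han ideographs vs. the rest."""
    name = unicodedata.name(ch, "")
    return HAN if name.startswith("CJK UNIFIED IDEOGRAPH") else DEFAULT

# Code to insert at a transition between classes, keyed by
# (class of left character, class of right character).
# The macro names are placeholders.
HOOKS = {
    (DEFAULT, HAN): "\\cjkfont ",
    (HAN, DEFAULT): "\\latinfont ",
}

def insert_hooks(text):
    """Insert the hooked code wherever adjacent classes call for it."""
    out = []
    prev = DEFAULT  # assume the text starts in the default class
    for ch in text:
        cls = char_class(ch)
        out.append(HOOKS.get((prev, cls), ""))
        out.append(ch)
        prev = cls
    return "".join(out)
```

Given "TeX" followed by two Han characters and then "ok", this puts
`\cjkfont ` before the first Han character and `\latinfont ` before
the "o", with no markup needed in the source text itself.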

In view of this, I would suggest that people not spend a lot of time  
and effort on perfecting macro-level solutions just now, but wait and  
see what can be done with the new facilities that 0.997 will provide.

JK