[XeTeX] XeTeX Digest, Vol 48, Issue 13

Mike Maxwell maxwell at umiacs.umd.edu
Sun Mar 9 15:40:28 CET 2008


Barry MacKichan wrote:
>> Would it be worth considering an enhancement to fontspec so that we
>> could write, e.g.,
>> \setmainfont[UprightFont={Hebrew={Adobe Hebrew},CJK={NSimSun}}]{Adobe
>> Garamond Pro}
>> to choose 'Adobe Garamond Pro' as the main font, except in the Hebrew
>> unicode area, in which case use 'Adobe Hebrew', and the CJK area, in
>> which case use 'NSimSun'?

Jonathan Kew replied:
> This is a suggestion/request that has come up several times, and I  
> can certainly understand the attraction. Essentially, you're asking  
> for a model with several simultaneous "current fonts" for different  
> scripts, and an engine that chooses the appropriate one on a
> per-character basis.
> 
> However, a general solution to this is trickier than people think,  
> IMO. The main problem I see is how to (reliably) deal with the  
> characters classified as "script=Common" in Unicode...
> ...
> In many cases, fairly simple heuristics could be used to choose a  
> font for "common" characters based on the script of neighboring  
> script-specific letters, but this will not always be right (and  
> sometimes there may not be any "neighboring" letters available to the  
> engine at the point where it needs to choose a font). So we could  
> find ourselves in a situation where people assume they can mix  
> scripts without providing markup, and the engine guesses right much  
> of the time -- but sometimes makes inappropriate choices. 

Let me try the other side of this--realizing as I do so that I'm being 
an armchair general; I won't be able to do any of the work, only 
complain :-).

Computer programming has a long history of people saying "X is too 
difficult for computers to do, it requires human intervention", followed 
by someone else getting the computer to do X--sometimes nearly as well 
as humans, and sometimes better.  TeX is itself an example of this, I 
suspect: I bet typographers didn't think it would work.

There are several areas where LaTeX (and by extension, XeTeX) seems like
it could do things better or more automatically; long tables are one
area I've complained about.  And I'm going to suggest in this email that
font choice based on script is another.

I'm sure I'm missing some cases here, but at the risk of saying
something stupid, let me suggest the following set of cases for font
choice.  Consider a maximal sequence X of one or more characters whose
Unicode script is "Common" (including space characters; by assumption,
X can be rendered in the font of any of the relevant scripts).  Here
are the cases:

0) All the characters of X are space characters.

If at least one character of X is not a space, then let L be the
characters to the left of X in the same block of text (where a block is
a paragraph, a table cell, etc.; either "etc." or "non-etc." probably
needs to be defined), and let R be the characters to the right of X in
the same block.  Then:

1) The characters in L and R belong to a single script.
2) The last character in L belongs to script W, and the first character 
in R belongs to script Y.
3a) The last characters in L belong to script W, and R is empty.
3b) Like (3a), but reversed left-and-right.
4) L and R are both empty.
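
To make the case split concrete, here is a rough Python sketch of it.
This is only an illustration, not anything XeTeX or fontspec does.  The
names script_of, common_runs and classify are made up for this example,
and script_of is a toy stand-in for a lookup of the real Unicode Script
property.  It reads case (1) literally (every script letter in L and R
is of one script), so anything else with letters on both sides lands in
case (2), even when W and Y happen to be the same.

```python
def script_of(ch):
    """Toy script lookup (an assumption for this sketch); a real
    implementation would consult the Unicode Script property."""
    cp = ord(ch)
    if 0x0590 <= cp <= 0x05FF:
        return "Hebrew"
    if 0x4E00 <= cp <= 0x9FFF:
        return "Han"
    if ch.isalpha() and cp < 0x0250:
        return "Latin"
    return "Common"


def common_runs(block):
    """Yield (start, end) for each maximal run X of Common characters."""
    start = None
    for i, ch in enumerate(block):
        if script_of(ch) == "Common":
            if start is None:
                start = i
        elif start is not None:
            yield (start, i)
            start = None
    if start is not None:
        yield (start, len(block))


def classify(block, start, end):
    """Return the case label ("0", "1", "2", "3a", "3b" or "4") for the
    run X = block[start:end]."""
    if block[start:end].isspace():
        return "0"
    # Scripts seen to the left (L) and right (R) of X, ignoring Common.
    left = {script_of(c) for c in block[:start]} - {"Common"}
    right = {script_of(c) for c in block[end:]} - {"Common"}
    if not left and not right:
        return "4"
    if not right:
        return "3a"
    if not left:
        return "3b"
    if len(left | right) == 1:
        return "1"
    return "2"


# Latin letters, then " (123) ", then three Hebrew letters.
block = "abc (123) \u05d0\u05d1\u05d2"
for start, end in common_runs(block):
    print(repr(block[start:end]), "-> case", classify(block, start, end))
# prints: ' (123) ' -> case 2
```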

Case (0) is probably irrelevant (although I may be wrong; I think I 
heard somewhere that space chars in French are sometimes rendered oddly.)

Case (1) is pretty clear, I think; you render X in the font used for L 
and R.

Case (2) can be broken down on the basis of whether the left-most
non-space character in X is a left-attaching character (like a close
quote or close paren), and/or whether the right-most non-space character
in X is a right-attaching character (like an open quote or open paren).
If either or both are true, then those characters are rendered in the
font of the adjacent script on that side, and the algorithm (heuristic!)
is re-applied to the rest of X.  If neither is true, I guess the correct
solution is less clear.  My guess would be to use the font of the
characters on the left, but I suspect it would be best to see examples.

Case (3) seems reasonably clear; use the font of whatever characters you 
have.

As for case (4), my first idea was to use the default font.  However, it 
might be more appropriate (particularly inside a table, for example) to 
iteratively expand the block in both directions and then re-apply the 
above algorithm.
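
Putting the rules together, a minimal Python sketch of the heuristic
might look like the following.  Everything in it is an assumption made
for illustration: script_of is a toy replacement for the Unicode Script
property, CLOSERS and OPENERS are guessed lists of left- and
right-attaching characters, and resolve_scripts is a made-up name.  It
looks only at the characters immediately adjacent to X, treats runs of
spaces like any other run, uses the "font on the left" guess for the
unclear part of case (2), and falls back to a default for case (4)
rather than expanding the block.

```python
CLOSERS = set(")]}\u201d\u2019,.;:!?")  # assumed left-attaching characters
OPENERS = set("([{\u201c\u2018")         # assumed right-attaching characters


def script_of(ch):
    """Toy script lookup (an assumption for this sketch)."""
    cp = ord(ch)
    if 0x0590 <= cp <= 0x05FF:
        return "Hebrew"
    if 0x4E00 <= cp <= 0x9FFF:
        return "Han"
    if ch.isalpha() and cp < 0x0250:
        return "Latin"
    return "Common"


def resolve_scripts(block, default="Latin"):
    """Return one script name per character, with Common runs resolved."""
    out = [script_of(c) for c in block]
    n, i = len(block), 0
    while i < n:
        if out[i] != "Common":
            i += 1
            continue
        j = i
        while j < n and out[j] == "Common":
            j += 1                       # X is block[i:j]
        w = out[i - 1] if i > 0 else None    # script W just left of X
        y = out[j] if j < n else None        # script Y just right of X
        lo, hi = i, j
        if w and y and w != y:           # case (2)
            while True:                  # peel left-attaching chars onto W
                k = lo
                while k < hi and block[k].isspace():
                    k += 1
                if k == hi or block[k] not in CLOSERS:
                    break
                out[lo:k + 1] = [w] * (k + 1 - lo)
                lo = k + 1
            while True:                  # peel right-attaching chars onto Y
                k = hi - 1
                while k >= lo and block[k].isspace():
                    k -= 1
                if k < lo or block[k] not in OPENERS:
                    break
                out[k:hi] = [y] * (hi - k)
                hi = k
        # Cases (1), (3), (4), and whatever is left over from case (2).
        out[lo:hi] = [w or y or default] * (hi - lo)
        i = j
    return out


# "abc) (" followed by two Hebrew letters: the ")" and the space go with
# the Latin text, the "(" goes with the Hebrew text.
print(resolve_scripts("abc) (\u05d0\u05d1"))
```

A font would then be picked per character from a script-to-font table,
along the lines of the Hebrew={Adobe Hebrew}, CJK={NSimSun} mapping in
the original proposal.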

In cases that aren't clear, one could issue an advisory msg, and those 
of us who are obsessive could check the unclear cases.  (It would sure 
be nice if msgs like this had a way to hyperlink to the corresponding 
point in a PDF...but I dream!)

Coming back to the question of whether the computer can be made smart 
enough, I think the real question is what "smart enough" means.  You 
can't expect it to be smarter than the smartest human typographers, and 
it may in fact turn out that they don't agree among themselves.  In 
natural language processing, where people annotate text for various 
linguistic properties and a computer program attempts to "learn" 
generalizations that will allow it to duplicate those judgments, the 
upper limit on what the program can do (in most cases) is called 
"inter-annotator agreement."  That is, you can't do better than the 
annotators.  (Well, perhaps you could, but you don't have any way of 
measuring that.)  Coming back to typography, building in whatever 
heuristics good human typographers use won't let you get better than 
them, but that may be a lot better than what the average user (like me) 
could do by hand.
-- 
    Mike Maxwell
    What good is a universe without somebody around to look at it?
    --Robert Dicke, Princeton physicist

