[XeTeX] XeTeX Digest, Vol 48, Issue 13
Mike Maxwell
maxwell at umiacs.umd.edu
Sun Mar 9 15:40:28 CET 2008
Barry MacKichan wrote:
>> Would it be worth considering an enhancement to fontspec so that we
>> could write, e.g.,
>> \setmainfont[UprightFont={Hebrew={Adobe Hebrew},CJK={NSimSun}}]{Adobe
>> Garamond Pro}
>> to choose 'Adobe Garamond Pro' as the main font, except in the Hebrew
>> unicode area, in which case use 'Adobe Hebrew', and the CJK area, in
>> which case use 'NSimSun'?
Jonathan Kew replied:
> This is a suggestion/request that has come up several times, and I
> can certainly understand the attraction. Essentially, you're asking
> for a model with several simultaneous "current fonts" for different
> scripts, and an engine that chooses the appropriate one on a per-
> character basis.
>
> However, a general solution to this is trickier than people think,
> IMO. The main problem I see is how to (reliably) deal with the
> characters classified as "script=Common" in Unicode...
> ...
> In many cases, fairly simple heuristics could be used to choose a
> font for "common" characters based on the script of neighboring
> script-specific letters, but this will not always be right (and
> sometimes there may not be any "neighboring" letters available to the
> engine at the point where it needs to choose a font). So we could
> find ourselves in a situation where people assume they can mix
> scripts without providing markup, and the engine guesses right much
> of the time -- but sometimes makes inappropriate choices.
Let me try the other side of this--realizing as I do so that I'm being
an armchair general; I won't be able to do any of the work, only
complain :-).
Computer programming has a long history of people saying "X is too
difficult for computers to do, it requires human intervention", followed
by someone else getting the computer to do X--sometimes nearly as well
as humans, and sometimes better. TeX is itself an example of this, I
suspect: I bet typographers didn't think it would work.
There are several areas where LaTeX (and by extension, XeTeX) seems like
it could do things better/more automatically; long tables are one area
I've complained about. And I'm going to suggest in this email that font
choice based on script is another.
I'm sure I'm missing some cases here, but at the risk of saying
something stupid, let me suggest the following set of cases for font
choice. Consider a maximal sequence of one or more common script
characters X (including space characters; by assumption, X can be
rendered in any of the relevant scripts). Here are the cases:
0) All the characters of X are space characters.
If one or more of the characters of X is a non-space char, then let L be
the characters to the left of X in the same block of text (where block =
paragraph, cell of a table, etc.--where "etc." probably needs to be
defined, or else "non-etc." needs to be defined), and R the characters
to the right of X in the same block. Then:
1) The characters in L and R belong to a single script.
2) The last character in L belongs to script W, and the first character
in R belongs to script Y.
3a) The last characters in L belong to script W, and R is empty.
3b) Like (3a), but reversed left-and-right.
4) L and R are both empty.
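To make "maximal sequence of common script characters" concrete, here is a rough Python sketch of how a block might be cut into runs. It is only an illustration, not anything XeTeX does: the names script_of and runs are made up, and since Python's standard library does not expose the Unicode Script property, the script lookup is a crude stand-in that takes the first word of a letter's Unicode name and treats every non-letter as "Common".

```python
import unicodedata
from itertools import groupby

def script_of(ch):
    """Crude stand-in for the Unicode Script property (an assumption):
    letters take the first word of their Unicode name, and everything
    else (spaces, digits, punctuation) counts as "Common"."""
    if unicodedata.category(ch).startswith("L"):
        return unicodedata.name(ch, "UNKNOWN").split()[0]
    return "Common"

def runs(block):
    """Split one block of text into maximal runs of the same script."""
    return [(script, "".join(chars))
            for script, chars in groupby(block, key=script_of)]

if __name__ == "__main__":
    # Latin text, then a parenthesised Hebrew word (written with escapes).
    for script, text in runs("abc (\u05e9\u05dc\u05d5\u05dd)"):
        print(script, repr(text))
```

Each "Common" run in the result is an X in the sense above; the runs on either side of it within the same block supply L and R.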
Case (0) is probably irrelevant (although I may be wrong; I think I
heard somewhere that space chars in French are sometimes rendered oddly.)
Case (1) is pretty clear, I think; you render X in the font used for L
and R.
Case (2) can be broken down on the basis of whether the left-most
non-space character in X is a left-attaching character (like a close
quote or close paren), and/or whether the right-most non-space character
in X is a right-attaching character. If either or both are true, then
those characters are rendered in the font of the adjacent script on that
side, and the algorithm (heuristic!) is re-applied to what remains of X.
If neither is true, I guess the correct solution is less clear. My guess
would be to use the font of the chars on the left, but I suspect it
would be best to see examples.
Case (3) seems reasonably clear; use the font of whatever characters you
have.
As for case (4), my first idea was to use the default font. However, it
might be more appropriate (particularly inside a table, for example) to
iteratively expand the block in both directions and then re-apply the
above algorithm.
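For what it's worth, the cases above might be sketched in Python roughly as follows. This is a sketch under several assumptions, not a worked-out design: the function and set names are made up, the lists of attaching characters are just a guess at a starting point, script_of is a crude stand-in for the real Unicode Script property, only the characters immediately adjacent to X are examined (so case (1) is reduced to "both neighbors have the same script"), case (0) simply falls through the same rules, and case (4) uses the default rather than expanding the block.

```python
import unicodedata

# Assumed, incomplete sets of attaching punctuation.
LEFT_ATTACHING = set(")]}\u201d\u2019.,;:!?")   # closers, trailing punctuation
RIGHT_ATTACHING = set("([{\u201c\u2018")        # openers

def script_of(ch):
    """Crude stand-in for the Unicode Script property (an assumption)."""
    if unicodedata.category(ch).startswith("L"):
        return unicodedata.name(ch, "UNKNOWN").split()[0]
    return "Common"

def scripts_for_run(left, x, right, default="DEFAULT"):
    """Return one script label per character of the common-script run x.

    Assumes x is maximal, so left[-1] and right[0] (when present)
    are script-specific characters."""
    w = script_of(left[-1]) if left else None
    y = script_of(right[0]) if right else None
    if w is None and y is None:             # case (4): no context at all
        return [default] * len(x)
    if w is None or y is None or w == y:    # cases (3a), (3b), (1)
        return [w or y] * len(x)
    # Case (2): peel attaching characters off each end, repeatedly.
    out = [None] * len(x)
    i, j = 0, len(x)
    while i < j:                            # left end attaches to W
        k = i
        while k < j and x[k].isspace():
            k += 1
        if k < j and x[k] in LEFT_ATTACHING:
            out[i:k + 1] = [w] * (k + 1 - i)
            i = k + 1
        else:
            break
    while j > i:                            # right end attaches to Y
        k = j - 1
        while k >= i and x[k].isspace():
            k -= 1
        if k >= i and x[k] in RIGHT_ATTACHING:
            out[k:j] = [y] * (j - k)
            j = k
        else:
            break
    out[i:j] = [w] * (j - i)                # unclear remainder: guess left
    return out

if __name__ == "__main__":
    # Latin, then ") (", then a Hebrew letter.
    print(scripts_for_run("abc", ") (", "\u05e9"))
```

An implementation that wanted to flag the unclear cases could record whenever the final "guess left" line assigns any non-space character.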
In cases that aren't clear, one could issue an advisory msg, and those
of us who are obsessive could check the unclear cases. (It would sure
be nice if msgs like this had a way to hyperlink to the corresponding
point in a PDF...but I dream!)
Coming back to the question of whether the computer can be made smart
enough, I think the real question is what "smart enough" means. You
can't expect it to be smarter than the smartest human typographers, and
it may in fact turn out that they don't agree among themselves. In
natural language processing, where people annotate text for various
linguistic properties and a computer program attempts to "learn"
generalizations that will allow it to duplicate those judgments, the
upper limit on what the program can do (in most cases) is called
"inter-annotator agreement." That is, you can't do better than the
annotators. (Well, perhaps you could, but you don't have any way of
measuring that.) Coming back to typography, building in whatever
heuristics good human typographers use won't let you get better than
them, but that may be a lot better than what the average user (like me)
could do by hand.
--
Mike Maxwell
What good is a universe without somebody around to look at it?
--Robert Dicke, Princeton physicist