[XeTeX] Microtypography?

Jonathan Kew jonathan_kew at sil.org
Thu May 11 00:29:56 CEST 2006

On 10 May 2006, at 10:09 pm, Bruno Voisin wrote:

> Le 10 mai 06 à 21:48, Jonathan Kew a écrit :
>> So the whole paragraphing algorithm
>> will have to be much more aware of the two-level character/glyph
>> model than it is at present; right now, XeTeX works in terms of
>> characters, and the details of individual glyphs are largely hidden
>> by the rendering technology (ATSUI or ICU/OpenType).
> In case that can be expressed in not too technical terms: what's the
> difference between character and glyph? Character a logical unit (a
> Unicode code point, maybe) and a glyph a physical unit (ie a set of
> pen strokes)? It's not that important probably, but I'm just
> interested in understanding what you mean there.

Yes, a 'character' is a logical unit in the encoded representation of  
textual data; in this context, a Unicode code value such as <U+0061  

A 'glyph' is a visual representation of a character (or not  
necessarily "a character"....see below). So for the same Unicode  
*character* U+0061, there may be many quite different *glyphs*,  
depending on the type design (compare Times Roman 'a' and Avant Garde  

In Latin script, there is a (largely) one-to-one mapping from  
character to glyph in any given font, which explains the lack of a  
clear character/glyph model in TeX (and other software of that era).  
We have instances such as "fi", where the two characters <f, i> may  
be represented by a single <fi> glyph (ligature), but they are only a  
few; TeX essentially treats the ligature as another character, and  
performs the ligature processing in "character space".

A cursive font such as Zapfino stretches this much further; there may  
be dozens of 'a' glyphs in Zapfino, all representing the *same*  
character. The selection of alternate *glyphs* depending on context  
in the word, or on stylistic preferences ("I want lots of  
flourishes", or "I want small capitals") should not involve changes  
to the *characters* of the underlying encoded text; they represent  
replacements of the *glyphs* used to render those characters.

In a script such as Arabic, a letter such as BEH will have very
different forms depending on whether it occurs at the beginning,
middle, or end of a word. But in all cases, it is the same
*character* <U+0628>; choosing the proper contextual *glyph* is the
responsibility
of the font rendering subsystem, when presented with a stream of  
character codes to be rendered in a particular font.
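
As a small illustration of the BEH case (not a description of how ATSUI or ICU select glyphs): Unicode carries legacy "presentation form" code points for the contextual shapes of BEH, and each one has a compatibility mapping back to the single character U+0628. The sketch below uses only Python's standard unicodedata module; the dictionary and helper names are made up for the example.

```python
import unicodedata

BEH = "\u0628"  # ARABIC LETTER BEH: the one encoded character

# Legacy presentation-form code points, one per contextual shape.
# They exist for compatibility with older encodings; normal text
# stores U+0628 and leaves the shape to the rendering system.
PRESENTATION_FORMS = {
    "isolated": "\uFE8F",
    "final": "\uFE90",
    "initial": "\uFE91",
    "medial": "\uFE92",
}

def underlying_character(form: str) -> str:
    """Fold a presentation form back to the character it depicts."""
    return unicodedata.normalize("NFKC", form)

for shape, form in PRESENTATION_FORMS.items():
    char = underlying_character(form)
    print(f"{shape:8} U+{ord(form):04X} -> U+{ord(char):04X} {unicodedata.name(char)}")
```

All four shapes fold back to the same character, which is the point: the shape is a property of the rendered glyph, not of the encoded text.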

So text input, and text-processing tasks such as locating possible  
line-break positions, hyphenation points, etc., are all performed in  
terms of *characters*. But characters have no visual form or metrics;  
only *glyphs* have these. And the mapping between a character  
sequence and a glyph sequence may be a very complex one, especially  
in some Asian scripts. And so, although XeTeX is dealing with  
typesetting text (a sequence of encoded characters), it cannot do  
anything that involves metrics on a per-character basis.
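
To make the last point concrete, here is a toy sketch of a character-to-glyph mapping. Everything in it is hypothetical: the glyph names, the advance widths, and the shape() function are invented for illustration and have nothing to do with XeTeX's internals or with any real font. It only shows that widths belong to glyphs, and that one glyph may cover several characters.

```python
from dataclasses import dataclass

@dataclass
class Glyph:
    name: str      # glyph name in an imaginary font
    advance: int   # advance width in font units (invented values)
    chars: range   # indices of the characters this glyph renders

# Imaginary font data; longer ligatures are listed first so they win.
LIGATURES = {"ffi": ("f_f_i", 780), "fi": ("f_i", 560)}
ADVANCES = {"o": 500, "f": 330, "i": 280, "c": 440, "e": 450}

def shape(text: str) -> list[Glyph]:
    """Map a character string to a glyph run (toy ligature substitution)."""
    glyphs = []
    i = 0
    while i < len(text):
        for seq, (name, advance) in LIGATURES.items():
            if text.startswith(seq, i):
                glyphs.append(Glyph(name, advance, range(i, i + len(seq))))
                i += len(seq)
                break
        else:
            # No ligature applies here: one character, one glyph.
            glyphs.append(Glyph(text[i], ADVANCES[text[i]], range(i, i + 1)))
            i += 1
    return glyphs

def text_width(text: str) -> int:
    """Width is only defined once the text has been turned into glyphs."""
    return sum(g.advance for g in shape(text))

for g in shape("office"):
    print(g.name, g.advance, list(g.chars))
```

In this toy font, "office" has six characters but four glyphs, and the ligature glyph's width (780) is not the sum of the widths of separate f, f, i glyphs (940). Asking for "the width of the second f" has no answer, which is why metrics cannot be handled character by character.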

(Look up the "character/glyph model" online if you want to read lots  

