[XeTeX] Contextual shaping

Wed Nov 27 13:30:18 CET 2013

"Simon Cozens" <simon at simon-cozens.org>:
> This is possibly a daft question, but...
> 
> In traditional TeX, character tokens are processed and put into boxes
> individually, with fairly primitive ligature tables. Obviously XeTeX doesn't
> do this, using Harfbuzz (or ICU or whatever) to do the shaping and layout.
> 
> My question is, if you're not "showing" individual characters to the shaping
> engine for it to consider, what defines how big a string of characters to
> shape at a time? Does XeTeX break at the "word" level and then shape a word,
> and if so what defines a word? (Chinese has no word breaks!) Or does it shape
> an entire paragraph of text at a time (!) and then box up the glyphs
> individually? Or...?
> 
> (I've tried starting at layoutChars in XeTeXLayoutInterface.cpp and working
> backwards but I can't understand where I end up: measure_native_node shapes a
> node, but what's a node?)

I don't know how Harfbuzz and/or ICU work exactly, but:

- Characters are never put into individual boxes;
- Whatever shaping must take place is defined by sequences of characters; so
you look at each character, see if it must be processed (possibly as a part
of a larger sequence), move to the next character (unless it has been
processed as part of a sequence), and so on. Most of the rules you must follow
to process glyphs are explained here:
    http://www.microsoft.com/typography/otspec/
So your question (as I understand it) is really about processing OpenType
fonts. The sequences of characters I have mentioned (your strings of
characters) are defined in the font itself (for complex sequences, see e.g.
contextual lookups).
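
To make the scanning idea above concrete, here is a toy sketch in Python. It is
not how HarfBuzz, ICU or XeTeX actually work: the RULES table and the shape
function are hypothetical stand-ins for a font's substitution lookups, and the
longest-match-first policy is an assumption made for the example (in a real
font the order in which sequences are tried is set by the font itself).

```python
# Toy rule table standing in for a font's substitution lookups.
# Keys are input sequences, values are replacement glyph names.
# These rules are hypothetical; a real font stores its own.
RULES = {
    ("f", "f", "i"): "f_f_i",
    ("f", "i"): "f_i",
    ("f", "f"): "f_f",
}


def shape(chars, rules=RULES):
    """Scan left to right, replacing the longest matching sequence."""
    max_len = max((len(k) for k in rules), default=1)
    out = []
    i = 0
    while i < len(chars):
        # Try the longest candidate sequence starting here first.
        for n in range(min(max_len, len(chars) - i), 1, -1):
            key = tuple(chars[i:i + n])
            if key in rules:
                out.append(rules[key])
                i += n  # skip everything consumed by the sequence
                break
        else:
            out.append(chars[i])  # no rule applies; keep the character
            i += 1
    return out


print(shape("office"))
```

The point is only that the length of each sequence comes from the rule table
(that is, from the font), not from the loop that walks the text.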

As for a node, it is whatever TeX processes internally to build a page: it can
be a character, a kern, a whatsit, a box...
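
As a rough picture of what a list of nodes looks like, here is a minimal Python
sketch. The class names (Char, Kern, Whatsit, HBox) and their fields are
invented for illustration and do not mirror TeX's internal memory layout; the
only thing taken from the description above is that a list mixes several kinds
of items.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Char:
    char: str          # a single character


@dataclass
class Kern:
    width_pt: float    # fixed horizontal adjustment, in points


@dataclass
class Whatsit:
    kind: str          # extension node; its meaning depends on the engine


@dataclass
class HBox:
    contents: List["Node"] = field(default_factory=list)  # nested list


Node = Union[Char, Kern, Whatsit, HBox]


def describe(nodes):
    """Return the kind of each node in a list, in order."""
    return [type(n).__name__ for n in nodes]


# A hypothetical horizontal list mixing the node kinds named above.
hlist = [Char("A"), Kern(-0.5), Char("V"), Whatsit("special"), HBox([Char("x")])]
print(describe(hlist))
```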

Best,
Paul