[XeTeX] Hoefler italics and diacritics oddity

Jonathan Kew jonathan_kew at sil.org
Tue Jun 22 10:18:24 CEST 2004

Hi Ross -

On 22 Jun 2004, at 12:49 am, Ross Moore wrote:

> Hi Jonathan,
> On 22/06/2004, at 8:21 AM, Jonathan Kew wrote:
>> However, you may still find that you get line-final swashes in places 
>> you don't want them, such as before a word marked with your 
>> "transliterated" macro, as the commands you're inserting will still 
>> have the effect of breaking text runs that are handed to ATSUI for 
>> rendering. But at least it won't be happening mid-word.
> Can I presume from this that words entered in the source such as:
> f\"ur
> where the \"u expands in macros to a Unicode glyph ^^^^????
> (whatever the number is)
> are treated by XeTeX as a single word?

Yes, provided this happens entirely in TeX's "mouth" (i.e., at the 
macro expansion level). IIRC, even \char"XXXX could be used without 
breaking the "word".
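
For illustration, here is a minimal sketch of the two cases. The 
one-letter \" definition is hypothetical and written only for this 
example (a real accent macro would cover far more letters), and the 
font name is an assumption; any native font should do.

```tex
% Assumption: "Hoefler Text" is installed; substitute any native font.
\font\body="Hoefler Text" at 12pt \body

% Hypothetical, fully expandable \" that handles only "u".
% ^^^^00fc is XeTeX's four-caret notation for U+00FC (u with diaeresis).
\def\"#1{\ifx u#1^^^^00fc\else#1\fi}

f\"ur          % expands to the characters f, U+00FC, r: one text run
f\char"00FC r  % \char form; as far as I recall this also stays one word
\bye
```

Because \ifx is expandable, the accent is resolved before any 
characters reach the typesetting stage, which is the condition 
described above.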

> Put another way, is it true that ...
>   1.  XeTeX handles the token stream *after* macro expansions?


>   2.  This is done by replacing TeX's paragraph formatting structures,
>       i.e. line- and page-breaking algorithms.

Page-breaking is untouched. Actually, so is line-breaking, in a sense.

What XeTeX does is to create a box-like item for each "word" 
(contiguous sequence of characters in a "native" font, after macro 
expansion), and hands these off to either AAT or OpenType engines for 
layout/measurement. So the paragraph becomes a list of such 
"native-font word" boxes, interspersed with glue, penalties, etc.; 
TeX's original line-breaking algorithm is applied to this list. (If 
hyphenation is required, these "boxes" will be inspected, broken up, 
and reassembled as necessary during the hyphenation phase.)
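
One way to look at this list yourself is to set a short piece of text 
in a box and ask TeX to display it. This is only a sketch: the font 
name is an assumption, and I have not reproduced the exact log output 
here, so treat the comments as what I would expect rather than a 
guarantee.

```tex
% Assumption: "Hoefler Text" is installed; any native (AAT/OpenType) font works.
\font\body="Hoefler Text" at 12pt

% Show enough of the box contents to see every item in the list.
\showboxbreadth=100 \showboxdepth=5 \tracingonline=1

\setbox0=\hbox{\body f^^^^00fcr die Welt}
\showbox0   % writes the box contents to the terminal and the log file
\bye
```

Each word should appear as a single item in the listing, with glue 
between the items, rather than as a sequence of individual character 
nodes.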

> I cannot see how it could be otherwise, but please confirm this.

I think you have the idea.

> Of course, what this means is that with the LaTeX encoding-based method
> of handling accents, as I described in an earlier message, then it is
> possible to get searchable Unicode output of words containing accents
> and (Unicode-supported) diacritics, using just the old-style 7-bit LaTeX
> input source files.
> (This is something that has been requested, if I recall correctly.)
> Presumably alternate forms of accented characters are also possible
>  --- provided the font supports it.

Yes, all this is correct.

>> Whether the end result is appropriate, or whether you're better off 
>> disabling the line-edge swashes altogether, is for you to judge when 
>> you see how it looks.
>> Jonathan
>>> On 22 Jun 2004, at 00:40, Jonathan Kew wrote:
>>>> The \d macro ends up breaking up the text into separate runs, and 
>>>> so you get line-final swash forms (and potentially line-initial 
>>>> swashes afterwards). You can disable these by adding "Smart 
>>>> Swashes=!Line Final Swashes,!Line Initial Swashes" to the font 
>>>> definition.
> Are dot-under letters directly supported in Unicode ?
> (Maybe just some of them, not all.)

Some are directly encoded as precomposed characters, but not every 
conceivable usage.

> Or will these diacritics always break up words, due to
> the box constructions otherwise required to produce them?

Where precomposed characters don't exist, the correct Unicode 
construction would be to use U+0323 COMBINING DOT BELOW following the 
base letter. Then rendering the sequence becomes purely a matter for 
the font.
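
As a rough sketch of that approach, \d can be redefined to append the 
combining character instead of building boxes. The font name here is 
an assumption, standing in for whatever font you have that positions 
combining marks acceptably.

```tex
% Assumption: "Gentium" stands in for a font that positions combining marks.
\font\body="Gentium" at 12pt \body

% Redefine \d to append U+0323 COMBINING DOT BELOW after its argument,
% so the word stays a plain sequence of Unicode characters.
\def\d#1{#1^^^^0323}

Vi\d{s}\d{n}u
\bye
```

The same definition works whether or not a precomposed form exists for 
the letter; how well the dot is placed depends entirely on the font.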

The problem you face is that currently, very few fonts fully support 
dynamic placement of combining marks. This is particularly an issue for 
AAT, which lacks a proper attachment point mechanism; it's easier to 
handle in OpenType. (But even there, few foundries are doing it yet.)

In the absence of font-level support, you end up having to use 
constructions of boxes and glue to put the dot where you want it, thus 
breaking up the "pure" Unicode character stream.
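
For comparison, here is a sketch of the box-based fallback, combined 
with the feature setting quoted earlier in this thread to suppress the 
line-edge swashes that the broken runs would otherwise trigger. The 
\dotunder name is hypothetical, and the "Hoefler Text/I" font 
specification is an assumption about how the italic is selected on 
your system.

```tex
% Font name is an assumption; the feature string is the one quoted earlier.
\font\hoefleritalic="Hoefler Text/I:Smart Swashes=!Line Final Swashes,!Line Initial Swashes" at 12pt
\hoefleritalic

% Hypothetical \dotunder: stacks a period beneath #1 using plain TeX's
% \oalign, the same technique plain TeX uses for its under-accents.
\def\dotunder#1{\oalign{#1\crcr\hidewidth.\hidewidth}}

sa\dotunder{m}sk\dotunder{r}ta
\bye
```

Each \dotunder inserts a box into the paragraph, so the text on either 
side of it is handed to the font engine as a separate run.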

