[XeTeX] Hyphenation of "--" with tex-text mapping on

Tue Nov 29 11:43:07 CET 2005

On 29 Nov 2005, at 2:44 am, Will Robertson wrote:

> On 25/11/2005, at 6pm, Jonathan Kew wrote:
>
>> On 25 Nov 2005, at 12:34 am, Will Robertson wrote:
>>
>>> Sorry to be dense, but if font mappings aren't involved with line  
>>> breaking, how on earth does justification still work correctly?  
>>> Won't the line break be determined based on the width of "--",  
>>> which will later change after conversion to "–"?
>>
>> No, because the font mapping will be applied when the width of  
>> "--" is measured, just as it is applied when the "--" is rendered  
>> to the output.
>>
>> But TeX looks for line-break positions in the underlying text  
>> (sequence of characters and glue, mainly), and the font mapping  
>> doesn't change that. (I considered implementing this differently,  
>> but that would lead to other issues....)
>
> I suppose it is implemented this way so that hyphenation works for
> scripts where the font mapping is performing contextual
> re-arrangement and so on; it wouldn't be very useful to have only
> either the correct word shapes OR hyphenation!

Or imagine a mapping that implements fi -> U+FB01, etc., for old  
fonts that include the f-ligature glyphs but lack the AAT/OpenType  
tables to automatically use them. If the font mapping affected the  
text as seen by the hyphenation routine, then this would interfere  
with hyphenation of English words with f-ligatures.
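
A toy illustration of that interference, as a sketch only: the mapping table and the tiny hyphenation list below are made up for the example and are not XeTeX's or TeX's real data structures.

```python
# Hypothetical font mapping for an old font that has f-ligature glyphs.
FONT_MAPPING = {"fi": "\ufb01", "fl": "\ufb02"}

# Toy stand-in for the hyphenation patterns: known words and their pieces.
HYPHENATION = {"office": ["of", "fice"], "reflect": ["re", "flect"]}


def apply_mapping(word):
    """Replace character pairs with ligature characters, as the mapping would."""
    for src, dst in FONT_MAPPING.items():
        word = word.replace(src, dst)
    return word


def hyphenate(word):
    """Return the pieces of a known word, or the word unbroken if unknown."""
    return HYPHENATION.get(word, [word])


# The hyphenation routine sees the underlying text: break points are found.
print(hyphenate("office"))
# If it saw the mapped text instead, the word would no longer be recognised.
print(hyphenate(apply_mapping("office")))
```

The first call yields the pieces "of" and "fice"; the second yields a single unbroken piece, because the word now contains U+FB01 and no longer matches anything the hyphenation data knows about.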

> It is unfortunate that an em-dash can be hyphenated in this manner,  
> though:

Definitely!

> do you consider this a bug that might one day be fixed, or is it a  
> design decision that cannot be changed? If the latter, I'll make a  
> note in the fontspec documentation telling people to avoid using it  
> except for legacy documents...

It's a bug -- no, a regrettable feature! -- that is the consequence  
of a design decision that might be changed.

Actually, as I've been reflecting on this, I don't think the  
"hyphenation of em-dash" is occurring during the line-break process  
at all; there's no hyphenation rule that would do that. It happens  
because TeX introduces an empty \discretionary node after the current  
font's \hyphenchar, during the input. That wouldn't happen with CM  
fonts because the lig/kern program runs right there in the input loop  
too, and the hyphens get replaced (and a \discretionary gets  
introduced after the ligature instead, if the final component was the  
\hyphenchar).
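
The difference can be sketched with a toy model of the input loop. This is a simplification under stated assumptions, not the actual tex.web logic: the function name, the node representation, and the longest-match ligature table are all invented for illustration (the real lig/kern program builds "---" step by step).

```python
HYPHENCHAR = "-"
DISC = "<disc>"  # stands for an empty \discretionary node


def build_nodes(text, ligatures=None):
    """Collect characters into a node list, adding an empty discretionary
    after anything whose final component was the hyphenchar."""
    ligatures = ligatures or {}
    keys = sorted(ligatures, key=len, reverse=True)  # longest match first
    nodes = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                nodes.append(ligatures[key])  # ligature replaces its components
                last = key[-1]
                i += len(key)
                break
        else:
            nodes.append(text[i])  # plain character node
            last = text[i]
            i += 1
        if last == HYPHENCHAR:
            nodes.append(DISC)
    return nodes


# No lig/kern program in the input loop: a break point lands between the hyphens.
print(build_nodes("a--b"))
# CM-style ligatures in the input loop: one dash node, one discretionary after it.
print(build_nodes("a--b", {"--": "\u2013", "---": "\u2014"}))
```

Without the ligature table the result is a, "-", disc, "-", disc, b, so a line break can fall between the two hyphens. With it, the two hyphens become a single en-dash node followed by one discretionary.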

So really this illustrates a couple of things. First, it illustrates  
how tightly DEK coupled all the various components of his system --  
input conventions, font layout and ligature rules, line-break  
positions, etc. This enabled him to achieve a great deal within the  
limitations of the systems he was using; but it also makes it very  
tricky to start modifying any part of the system, because there are  
so many interdependencies.

Second, it illustrates that the replacement of "---" with an em-dash
is something that doesn't really belong at the font/glyph level,
which is what an AAT/OT ligature rule or (equivalently) a XeTeX
font-mapping achieves. Instead, it belongs at the text input level. The
goal is to replace the *characters* "---" in the user's input with  
the *character* em-dash for XeTeX to process and render. This is  
different from simply replacing a sequence of three hyphen *glyphs*  
with an em-dash *glyph* at rendering time.

In TeX, this distinction is lost because there is no clean separation  
of character and glyph; but the Unicode and AAT/OT model presupposes it.

An "obvious" solution would be to provide an "input mapping"
mechanism; in effect, "TeX text" is a new encoding, slightly
different from ASCII, in that certain multi-character sequences
actually represent single Unicode characters. XeTeX already supports
various input encodings, though currently only codepages known to the
ICU text conversion library. But this could be extended to custom
mappings written with TECkit.

However, a "global" input mapping that converted "---" to em-dash is  
probably *not* really the answer. For example, multiple hyphens might  
occur in macro code in situations where mapping them to dashes would  
disrupt things badly. And of course in \tt style, you don't want this  
to happen. What is really happening is that the encoding of the input
data varies implicitly, depending on whether it is destined to be
typeset in a "normal" TeX roman font, is to be typeset in a
different style such as \tt, or is actually macro code, not text to
be typeset at all.

So we do need this "input mapping" to be linked with the "current  
font", which is our best guide as to the kind of encoding that was  
intended. And we need it to be applied at an entirely different level  
of the processing from the current "font mapping" -- as text is being  
collected from the input into the list of text/glue/penalty/etc  
nodes, rather than as text runs are being measured/drawn.
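
Roughly what I have in mind, as a sketch only: nothing like this exists in XeTeX at present, and the font names, the table, and the function are hypothetical.

```python
# Hypothetical "TeX text" input mapping: character sequences -> characters.
TEX_TEXT = {"---": "\u2014", "--": "\u2013"}

# Hypothetical association of an input mapping with each font.
FONT_INPUT_MAPPING = {
    "roman": TEX_TEXT,  # a "normal" text font: TeX input conventions apply
    "tt": {},           # typewriter style: hyphens stay as typed
}


def collect(text, font):
    """Turn input text into character nodes, applying the input mapping
    of the current font while the text is being collected."""
    mapping = FONT_INPUT_MAPPING.get(font, {})
    keys = sorted(mapping, key=len, reverse=True)  # try "---" before "--"
    nodes = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                nodes.append(mapping[key])  # one character replaces the sequence
                i += len(key)
                break
        else:
            nodes.append(text[i])
            i += 1
    return nodes


print(collect("a---b", "roman"))  # three nodes: a, U+2014, b
print(collect("a---b", "tt"))     # five nodes: a, -, -, -, b
```

Because the replacement happens as the node list is built, the later stages (hyphenation, line breaking, measuring, drawing) all see the same single dash character; there are no separate hyphen characters left for a discretionary to attach to.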

As you can see, I'm thinking about it.... no promises at this point,  
but don't assume the current situation is how it will always be.

JK