[XeTeX] Hyphenation of "--" with tex-text mapping on
jonathan_kew at sil.org
Tue Nov 29 11:43:07 CET 2005
On 29 Nov 2005, at 2:44 am, Will Robertson wrote:
> On 25/11/2005, at 6pm, Jonathan Kew wrote:
>> On 25 Nov 2005, at 12:34 am, Will Robertson wrote:
>>> Sorry to be dense, but if font mappings aren't involved with line
>>> breaking, how on earth does justification still work correctly?
>>> Won't the line break be determined based on the width of "--",
>>> which will later change after conversion to "–"?
>> No, because the font mapping will be applied when the width of
>> "--" is measured, just as it is applied when the "--" is rendered
>> to the output.
>> But TeX looks for line-break positions in the underlying text
>> (sequence of characters and glue, mainly), and the font mapping
>> doesn't change that. (I considered implementing this differently,
>> but that would lead to other issues....)
> I suppose it is implemented this way so that hyphenation works for
> scripts when the font mapping is performing contextual rearrangement
> and so on; it wouldn't be very useful to have only either the
> correct word shapes OR hyphenation!
Or imagine a mapping that implements fi -> U+FB01, etc., for old
fonts that include the f-ligature glyphs but lack the AAT/OpenType
tables to automatically use them. If the font mapping affected the
text as seen by the hyphenation routine, then this would interfere
with hyphenation of English words with f-ligatures.
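As a rough illustration of that point, here is a toy Python sketch (not XeTeX code). The word list, the mapping table and the function names are all assumptions made up for the example; a real engine uses hyphenation patterns rather than a lookup table. It only shows that a hyphenation routine fed the mapped text no longer recognises the word.

```python
# Toy model only: stands in for TeX's pattern-based hyphenation.
# HYPHENATION and FONT_MAPPING are hypothetical tables for illustration.
HYPHENATION = {"benefit": "ben-e-fit", "define": "de-fine"}

# Assumed mapping for an old font that has f-ligature glyphs at
# U+FB01 / U+FB02 but no AAT/OpenType tables to select them.
FONT_MAPPING = {"fi": "\uFB01", "fl": "\uFB02"}


def apply_font_mapping(text):
    """Replace character sequences with the ligature code points."""
    for src, dst in FONT_MAPPING.items():
        text = text.replace(src, dst)
    return text


def hyphenate(word):
    """Return the word with hyphenation points, or unchanged if unknown."""
    return HYPHENATION.get(word, word)


# Hyphenation sees the underlying text: the word is recognised.
plain = hyphenate("benefit")

# Hyphenation sees the mapped text: the word now contains U+FB01,
# so the lookup fails and no break points are found.
mapped = hyphenate(apply_font_mapping("benefit"))
```

Here `plain` carries the break points while `mapped` comes back without any, which is the interference described above.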
> It is unfortunate that an em-dash can be hyphenated in this manner.
> Do you consider this a bug that might one day be fixed, or is it a
> design decision that cannot be changed? If the latter, I'll make a
> note in the fontspec documentation telling people to avoid using it
> except for legacy documents...
It's a bug -- no, a regrettable feature! -- that is the consequence
of a design decision that might be changed.
Actually, as I've been reflecting on this, I don't think the
"hyphenation of em-dash" is occurring during the line-break process
at all; there's no hyphenation rule that would do that. It happens
because TeX introduces an empty \discretionary node after the current
font's \hyphenchar, during the input. That wouldn't happen with CM
fonts because the lig/kern program runs right there in the input loop
too, and the hyphens get replaced (and a \discretionary gets
introduced after the ligature instead, if the final component was the
\hyphenchar).
So really this illustrates a couple of things. First, it illustrates
how tightly DEK coupled all the various components of his system --
input conventions, font layout and ligature rules, line-break
positions, etc. This enabled him to achieve a great deal within the
limitations of the systems he was using; but it also makes it very
tricky to start modifying any part of the system, because there are
so many interdependencies.
Second, it illustrates that the replacement of "---" with an em-dash
is something that doesn't really belong at the font/glyph level,
which is what an AAT/OT ligature rule or (equivalently) a XeTeX font
mapping achieves. Instead, it belongs at the text input level. The
goal is to replace the *characters* "---" in the user's input with
the *character* em-dash for XeTeX to process and render. This is
different from simply replacing a sequence of three hyphen *glyphs*
with an em-dash *glyph* at rendering time.
In TeX, this distinction is lost because there is no clean separation
of character and glyph; but the Unicode and AAT/OT model presupposes it.
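The difference between the two levels can be sketched with a small Python toy model. It is a deliberate simplification, not how XeTeX is implemented: the function names are hypothetical, and the only break opportunities modelled are the empty discretionaries inserted after each \hyphenchar as described earlier in this message.

```python
HYPHENCHAR = "-"
EM_DASH = "\u2014"


def discretionary_positions(chars):
    """Offsets at which an empty discretionary would be inserted,
    i.e. directly after every occurrence of the hyphen character."""
    return [i + 1 for i, c in enumerate(chars) if c == HYPHENCHAR]


def glyph_level(text):
    """Replacement at rendering time: break opportunities are found in
    the underlying characters, and only the drawn output changes."""
    breaks = discretionary_positions(text)
    rendered = text.replace("---", EM_DASH)
    return breaks, rendered


def input_level(text):
    """Replacement at input time: later stages only ever see the
    em-dash character, so no hyphen remains to attract a break."""
    chars = text.replace("---", EM_DASH)
    return discretionary_positions(chars), chars


glyph_breaks, glyph_out = glyph_level("a---b")
input_breaks, input_out = input_level("a---b")
```

Both routes draw the same dash, but only the glyph-level route leaves break opportunities inside it (`glyph_breaks` is non-empty, `input_breaks` is empty).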
An "obvious" solution would be to provide an "input mapping"
mechanism; in effect, "TeX text" is a new encoding, slightly
different from ASCII, in that certain multi-byte sequences actually
represent unique Unicode characters. XeTeX already supports various
input encodings, though currently only codepages known to the ICU
text conversion library. But this could be extended to custom
mappings written with TECkit.
However, a "global" input mapping that converted "---" to em-dash is
probably *not* really the answer. For example, multiple hyphens might
occur in macro code in situations where mapping them to dashes would
disrupt things badly. And of course in \tt style, you don't want this
to happen. What is really happening is that the encoding of the input
data varies implicitly, depending on whether it is destined to be
typeset in a "normal" TeX roman font or is to be typeset in a
different style such as \tt or is actually macro code, not text to be
typeset at all.
So we do need this "input mapping" to be linked with the "current
font", which is our best guide as to the kind of encoding that was
intended. And we need it to be applied at an entirely different level
of the processing from the current "font mapping" -- as text is being
collected from the input into the list of text/glue/penalty/etc
nodes, rather than as text runs are being measured/drawn.
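A minimal Python sketch of that idea follows. Everything in it is an assumption for illustration: the font names, the mapping table and the `collect_text` function are hypothetical, and nothing like this exists in XeTeX as of this message. It only shows a mapping selected by the current font and applied while text is collected, before any measuring or drawing.

```python
# Hypothetical per-font input mappings. Longer sequences come first so
# that "---" is consumed before "--" can match part of it.
INPUT_MAPPINGS = {
    "roman": [("---", "\u2014"), ("--", "\u2013")],
    "tt": [],  # typewriter text keeps its hyphens untouched
}


def collect_text(runs):
    """Turn (font, text) runs into (font, characters) nodes, applying
    the input mapping associated with each run's current font."""
    nodes = []
    for font, text in runs:
        for src, dst in INPUT_MAPPINGS.get(font, []):
            text = text.replace(src, dst)
        nodes.append((font, text))
    return nodes


nodes = collect_text([
    ("roman", "pages 1--9 --- see"),
    ("tt", "ls --all"),
])
```

The roman run ends up with real en-dash and em-dash characters, while the \tt run still contains its two hyphens.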
As you can see, I'm thinking about it.... no promises at this point,
but don't assume the current situation is how it will always be.