[tex-hyphen] String preparation

Arthur Reutenauer arthur.reutenauer at normalesup.org
Wed May 25 19:59:45 CEST 2016

> The old TeX did not support combining characters in any way. XeTeX
> does some "black magic" in the background (I believe it does some
> Unicode normalization, but I don't know the details).

  All XeTeX does is at the input level; it has no effect on the
hyphenation algorithm.  Every document will still see the exact same
patterns, which themselves may not be invariant under different Unicode
normalisation forms – and usually aren’t: we work hard to avoid that
inconsistency for some languages, for example Greek, but these are

> We currently have "œ + combining acute" in one of the patterns and
> that one is also "a bit problematic" because it should be treated as a
> single glyph (and probably isn't). So we also added "do not hyphenate
> before combining acute" which is a bit of a strange rule.

  It’s a perfectly sensible rule and we use the same type of rule for
many languages: Thai and Lao as you point out, and all the Indic-script
languages that have far more such characters, because they’re an
essential part of the encoding; and again, the engine really doesn’t
care that some patterns contain combining characters, since that
property is simply ignored at the moment.
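
  To illustrate the effect of a "do not break before a combining mark"
rule, here is a minimal Python sketch.  It is only an illustration: it
works as a post-filter on candidate break positions, which is not how
the patterns themselves encode the rule, and the function name and the
sample string are made up for the example.

```python
import unicodedata

def drop_breaks_before_combining(word, breaks):
    """Keep only break positions that are not directly followed by a mark.

    `breaks` holds indices i meaning "a break between word[i-1] and word[i]".
    Any character whose Unicode general category starts with "M" (Mn, Mc, Me)
    is treated as a combining mark here; that choice is an assumption of
    this sketch.
    """
    kept = []
    for i in breaks:
        follows_mark = (
            i < len(word) and unicodedata.category(word[i]).startswith("M")
        )
        if not follows_mark:
            kept.append(i)
    return kept

# Hypothetical sample: n, œ (U+0153), combining acute (U+0301), u, d
sample = "n\u0153\u0301ud"
print(drop_breaks_before_combining(sample, [1, 2, 3]))
```

The candidate break at index 2 would separate œ from its accent, so it
is the one the filter removes.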

  I think part of the fascination that is felt for this particular
character sequence stems from the expectation that Unicode should be a
repository of all possible combinations of base letters and diacritics,
that can be used in isolation and presented to the font as a single
character.  The sequence <œ, combining acute> is seen as weird because
it’s the first one in a Latin-script language that cannot be input as a
single Unicode character.  But there’s nothing strange or even unusual
about that; it’s perfectly standard for many languages in other parts of
the world and it is pretty well supported by XeTeX and LuaTeX.
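
  The difference can be checked with Python's standard unicodedata
module, which is independent of any TeX engine.  This is just a small
sketch; the variable names are arbitrary.

```python
import unicodedata

# e + combining acute has a precomposed form, so NFC yields one code point.
e_acute = unicodedata.normalize("NFC", "e\u0301")

# œ (U+0153) + combining acute has no precomposed form, so NFC leaves
# the two code points as they are.
oe_acute = unicodedata.normalize("NFC", "\u0153\u0301")

print(len(e_acute), [hex(ord(c)) for c in e_acute])
print(len(oe_acute), [hex(ord(c)) for c in oe_acute])
```

The first sequence collapses to U+00E9, while the second stays a
two-character sequence under every normalisation form.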

> And some hyphenation patterns (mostly for Indic languages) include
> rules for non-breaking space etc.

  You mean zero-width non-joiner and zero-width joiner; these are part
of the standard encoding for Indic-script languages.  There are no
patterns with NBSP that I can see.
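
  A check of this kind is easy to script.  The sketch below counts the
three characters in a string of pattern text; the function name and the
sample string are invented for the example and are not taken from any
actual pattern file.

```python
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200d"   # ZERO WIDTH JOINER
NBSP = "\u00a0"  # NO-BREAK SPACE

def count_special(patterns_text):
    """Return how often ZWNJ, ZWJ and NBSP occur in the given text."""
    return {
        "ZWNJ": patterns_text.count(ZWNJ),
        "ZWJ": patterns_text.count(ZWJ),
        "NBSP": patterns_text.count(NBSP),
    }

# Hypothetical pattern text containing one ZWJ and one ZWNJ, no NBSP.
sample = "a\u200d1b c\u200c2d"
print(count_special(sample))
```

To run it on a real file, read the file as UTF-8 and pass its contents
to the function.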
