[tex-hyphen] hyphenation with ligatures in input

Mojca Miklavec mojca.miklavec.lists at gmail.com
Wed Jan 15 19:02:08 CET 2014


On Wed, Jan 15, 2014 at 6:26 PM, Stephan Hennig wrote:
> [CC: lualatex-dev at tug.org
> Please reply to tex-hyphen at tug.org]
>
> Hi,
>
> in the following, I'm only considering LuaTeX with UTF-8 encoded input.
>
> When a ligature character, e.g., ﬁ, is already present in the input
> stream, LuaTeX won't hyphenate that word correctly.
>
> \showhyphens{financial ﬁnancial}
> \bye
>
>> This is LuaTeX, Version beta-0.76.0-2013120414 (rev 4627)  (format=luatex 2013.12.11)  15 JAN 2014 18:17
>> [...]
>> [][] \tenrm fi-nan-cial ﬁnan-cial
>
>
> The same is true for LuaLaTeX, by default, or when activating US
> hyphenation patterns with either Babel or Polyglossia.
>
> However, when activating UK hyphenation patterns the word containing the
> ligature is also hyphenated (code attached at the end).
>
>> This is LuaTeX, Version beta-0.76.0-2013120414 (rev 4627)  (format=lualatex 2013.12.11)  15 JAN 2014 18:20
>> [...]
>> [][] \EU2/lmr/m/n/10 fin-an-cial ﬁn-an-cial
>
>
> Why is that?  I can't find an fi ligature character in either the UK
> hyphenation patterns or the exception list (hyph-en-gb.pat.txt,
> hyph-en-gb.hyp.txt).

It's not feasible to provide ligatures in hyphenation patterns. This
is something that the engine needs to handle properly; otherwise the
number of patterns needed could grow exponentially. You can consider
this equivalent to trying to provide all possible combinations of
lowercase and uppercase letters for each pattern. On top of that,
imagine a language with a word in which hyphenation between "f" and
"i" is allowed. If an unsuspecting user provides text with ligatures,
there is no way to hyphenate that word properly.
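
To give a feeling for the growth, here is a toy sketch in Python (not
part of any hyphenation tooling). The function name spellings and the
table LIGATURE_FOR are made up for this illustration, and it works on
whole words rather than on real patterns. It enumerates every way to
write a word when each "ff", "fi", "fl", "ffi" or "ffl" may also be
stored as a single ligature character:

```python
# Hypothetical table for this sketch: letter groups and the Unicode
# ligature characters (U+FB00..U+FB04) that could replace them.
LIGATURE_FOR = {
    "ff": "\ufb00",
    "fi": "\ufb01",
    "fl": "\ufb02",
    "ffi": "\ufb03",
    "ffl": "\ufb04",
}


def spellings(word):
    """Return all ways to write `word` with some letter groups
    replaced by ligature characters."""
    if not word:
        return {""}
    # Option 1: keep the first letter as a plain letter.
    result = {word[0] + rest for rest in spellings(word[1:])}
    # Option 2: replace a leading letter group by its ligature.
    for letters, lig in LIGATURE_FOR.items():
        if word.startswith(letters):
            tail = spellings(word[len(letters):])
            result |= {lig + rest for rest in tail}
    return result


print(len(spellings("financial")))  # 2
print(len(spellings("difficult")))  # 4
print(len(spellings("fluffiest")))  # 8
```

The counts multiply for every additional letter group, which is why
the patterns would have to be duplicated many times over if the
pattern files had to cover ligature characters themselves.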

> What set of (ligature) characters is handled such a special way?

First of all, in my opinion the input should not contain any
ligatures. (And Unicode shouldn't provide code points for them either.)

Other than that ...

I'm aware that XeTeX can do some text normalization (mostly
composition-decomposition of accented characters), so that it doesn't
really matter which form of input you use. I don't know how it treats
ligatures, but without any additional functionality turned on, it
seems that XeTeX has more or less the same problems as LuaTeX.

> Does the handling depend on the fonts used?

If LuaTeX doesn't do or support any "normalization"
(auto-decomposition of the fi ligature into regular f+i), then you
will most probably even end up with "nancial" instead of "financial"
being typeset if you use the wrong font (one that lacks the ligature
glyph). Or you could end up with funny-looking words if you use a
monospaced font, letterspacing in titles, and so on. So it's not just
a problem of wrong hyphenation.

In any case ... this problem needs to be solved at the engine level
(or at the "user level" by fixing the input ;). We cannot and
should not change the hyphenation patterns.
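
As an illustration of fixing the input at the "user level", here is a
minimal sketch of a preprocessing step in Python, run on the source
text before it reaches TeX. The names LIGATURE_CHARS and
expand_ligatures are invented for this example. It relies on the
compatibility decomposition (NFKD) from the standard unicodedata
module, applied only to the Latin ligature characters U+FB00..U+FB06:

```python
import unicodedata

# Latin ligature characters in the Alphabetic Presentation Forms
# block: U+FB00..U+FB06 (ff, fi, fl, ffi, ffl, long s + t, st).
LIGATURE_CHARS = {chr(cp) for cp in range(0xFB00, 0xFB07)}


def expand_ligatures(text):
    """Replace Latin ligature characters by their component letters
    and leave every other character untouched."""
    return "".join(
        unicodedata.normalize("NFKD", ch) if ch in LIGATURE_CHARS else ch
        for ch in text
    )


print(expand_ligatures("\ufb01nancial"))  # financial
```

Normalizing the whole text with NFKD would also decompose accented
letters and other compatibility characters, so this sketch only
touches the ligatures and keeps the rest of the input as it was.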

Mojca


