[tex-hyphen] String preparation

Wed May 25 22:42:18 CEST 2016

> So let's take an example : let's say I want to consider œ as just one
> unit in order to simplify the problem. What I want in the general case
> is left/righthyphenmin=2, so what I want to get is
> 
>     œ́di-pus
>     œdi-pus
> 
> and not
> 
>     œ́-di-pus
> 
> How can I achieve that? These cases are rare enough to be handled by
> hand so I could just do something like
> 
>     .œ́8
> 
> but then the patterns wouldn't work anymore for left/righthyphenmin=1...

  They work if you add 8́  (8, acute accent), which you should have
anyway.

> They are. Again, let's take an example : now I want æ to be considered
> as two units, so I want
> 
>     æ-ter-nam
> 
> but I still want left/righthyphenmin=2 in the general case, like
> 
>     aba-cus
> 
> and not
> 
>     a-ba-cus
> 
> How can I achieve that?
> 
> What I understand is that you're proposing to set left/righthyphenmin=1
> and to add all possible patterns like
> 
> .a8b
> .a8c
> .a8d
> .a8e
> ...
> .b8a
> .b8c
> .b8d
> 
> all 26*(26-1) combinations, plus those at the end... Is it what you're
> proposing?

  Yes, that’s 1300 patterns, in other words nothing.
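
  For the record, a set like that is easy to generate.  Here is a minimal
Python sketch, assuming the plain a-z alphabet and assuming that the
word-final patterns take the form x8y. (neither is spelled out above, and
the function name is made up for illustration):

```python
import string


def boundary_patterns(alphabet=string.ascii_lowercase):
    """Build patterns that forbid a break one letter away from a word edge.

    Doubled letters (.a8a, a8a.) are skipped, following the 26*(26-1)
    count in the quoted message.
    """
    pairs = [(a, b) for a in alphabet for b in alphabet if a != b]
    at_start = [f".{a}8{b}" for a, b in pairs]  # e.g. .a8b
    at_end = [f"{a}8{b}." for a, b in pairs]    # e.g. a8b. (assumed form)
    return at_start + at_end


patterns = boundary_patterns()
print(len(patterns))  # 2 * 26 * 25 = 1300
```

  Characters such as æ and œ are simply left out of the alphabet here, so
no inhibiting pattern is generated for them.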

>            They make the patterns specific to left/righthyphenmin=2
> again, and I'd like to avoid that...

  \lefthyphenmin and \righthyphenmin are only conveniences to avoid
specifying certain patterns, so in a way all pattern sets are tied to
some value.  In practice patterns don’t often get used with values
other than the ones they were generated for, and for good typography
you always need a new pattern set if you want different values.  Hans does
sometimes mention that educational publishers want much higher values
such as 5 or 6, and at that point it doesn’t make much difference what
happens near the beginning of the word.

  The issue here is that you have at least three different problems that
have to be addressed by different methods:

  1. You want some characters to be ignored for hyphenation, in this
     case the combining acute accent – or equivalently you want some
     character sequences to be considered equivalent to other sequences.
     There is a general mechanism for that (\savinghyphcodes) which, as
     far as I know, has never been explored seriously, and if you’re not
     going to use it you need to capture this equivalence in the
     patterns.

  2. You want some characters to be considered two characters with
     respect to \lefthyphenmin (æ, œ).  This is a requirement specific
     to your particular language and style, so you need to capture it in
     the patterns.

  3. You have the problem that one particular character sequence really
     is one element from the user’s point of view (œ́).  There is no
     general mechanism for that in TeX, but there is a concept in
     Unicode, that of grapheme cluster (http://unicode.org/glossary/#grapheme_cluster),
     defined informally as “what the user thinks of as one character”,
     which is exactly what we need here.  For languages using the Latin
     alphabet, and to a great extent Greek and Cyrillic too, most
     grapheme clusters can be represented as a single Unicode character,
     but that’s absolutely not the case for many other languages and
     scripts; all of the major writing systems from South and South-East
     Asia, to start with.

     Clearly that’s the direction we need to go if we want to improve
     the situation; there could be an additional set of parameters for
     hyphenation, or a switch to change the interpretation of
     \lefthyphenmin and \righthyphenmin.  Obviously this would only have
     a marginal effect on Latin-based languages (and Greek, Cyrillic),
     but it would solve the current problem.  Until then, you’ll have to
     tweak the patterns to make the characters behave as grapheme
     clusters.
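
  To make the grapheme cluster idea in 3. concrete, here is a rough
Python sketch.  It assumes a much simplified notion of cluster (a base
character plus the combining marks that follow it) rather than the full
Unicode segmentation rules, so it would not be adequate for the South
and South-East Asian scripts mentioned above; the function name is
hypothetical:

```python
import unicodedata


def approx_clusters(word):
    """Group each base character with the combining marks that follow it.

    Simplified stand-in for grapheme cluster segmentation: only characters
    with a non-zero canonical combining class are attached to the
    preceding base character.
    """
    clusters = []
    for ch in unicodedata.normalize("NFC", word):
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch  # attach the mark to the previous base
        else:
            clusters.append(ch)
    return clusters


word = "\u0153\u0301dipus"  # œ + combining acute accent + "dipus"
print(len(unicodedata.normalize("NFC", word)))  # 7: no precomposed œ with acute
print(len(approx_clusters(word)))               # 6: œ́ counts as one unit
```

  Counting \lefthyphenmin in such units rather than in characters is what
would keep œ́di-pus and reject œ́-di-pus with a value of 2.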

  To sum up, points 1. and 2. are specific to your particular language
and style, and are not a problem of TeX (which even has additional
capabilities that may prove useful for 1.); 3. *is* a problem of TeX,
and Unicode has all the provisions one needs to address it.

	Best,

		Arthur