[tex-hyphen] hyphenation (what else ;-)

Mojca Miklavec mojca.miklavec.lists at gmail.com
Mon May 17 00:55:41 CEST 2010


Dear Mathias,

[After removing all the secret messages] I posted this to our small
mailing list since there are some parts that might be of more general
interest (I hope that's fine for you). Any other dirty details or
trivia may be discussed off-list.

On Sun, May 16, 2010 at 17:29, Mathias Nater wrote:
>
>> Did the authors send those hyphenmin values to you? It's quite
>> possible that some of our values are wrong.
>
> Yes and yes.
> I'll ask them and report back to you.
> Maybe people are confusing "theoretically allowed" hyphenation points and
> "typographically nice" hyphenation points.

I'm thinking about the idea that it might be clever to include a
special chapter in the docs, describing those values for each
language, in particular the "minimal" and "nice" values as you say.
Another issue is Sanskrit and Indic scripts where Yves Codet argued
that hyphenmin can be as little as 1 and he tried to eliminate other
options with suitable patterns.

> My dictionary for French tells me that hyphenation is a function of spoken
> syllables in French.
> So in the French word bagatelle the last '-le' isn't a spoken syllable and
> therefore should not be broken onto a new line.

But such rules (forbidding hyphenation before -le) should probably
already be part of the patterns, not necessarily handled with hyphenmin
values.

> But in 'lavabo' you can put -bo on a new line.
> I tested how LaTeX behaves: \showhyphens{lavabo} gives la-vabo even though
> the patterns (.1la 1la 1va 1bo) are allowing '-bo'.
> So maybe the rules changed, or LaTeX makes an error, or it's intentional.

It's not an error. It's just that the current settings for French are
2/3 (\lefthyphenmin=2, \righthyphenmin=3). (Those values have never
been changed anywhere, so we didn't change them in hyph-utf8 either.)
If you set \righthyphenmin=1, you'll get the -bo hyphenated as well.
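
To make the interplay concrete, here is a minimal Python sketch of
Liang-style pattern matching followed by the hyphenmin filter. It uses
only the four patterns quoted above, not the real French pattern set,
and the names (PATTERNS, parse_pattern, hyphenate) are made up for
illustration; it is not how TeX implements this internally.

```python
import re

# Only the four patterns quoted in the message, not the full French set.
PATTERNS = [".1la", "1la", "1va", "1bo"]

def parse_pattern(pattern):
    """Split a pattern such as '1la' into its letters and inter-letter values."""
    letters = re.sub(r"\d", "", pattern)
    values = [0] * (len(letters) + 1)
    pos = 0
    for ch in pattern:
        if ch.isdigit():
            values[pos] = int(ch)   # value sits before letters[pos]
        else:
            pos += 1
    return letters, values

def hyphenate(word, patterns, left_min=2, right_min=3):
    """Insert '-' at every odd-valued position that respects both minima."""
    padded = "." + word.lower() + "."
    scores = [0] * (len(padded) + 1)  # scores[k] = value before padded[k]
    for pattern in patterns:
        letters, values = parse_pattern(pattern)
        start = padded.find(letters)
        while start != -1:
            for offset, value in enumerate(values):
                scores[start + offset] = max(scores[start + offset], value)
            start = padded.find(letters, start + 1)
    pieces, last = [], 0
    for i in range(1, len(word)):
        allowed = scores[i + 1] % 2 == 1            # gap before word[i]
        long_enough = i >= left_min and len(word) - i >= right_min
        if allowed and long_enough:
            pieces.append(word[last:i])
            last = i
    pieces.append(word[last:])
    return "-".join(pieces)

print(hyphenate("lavabo", PATTERNS))               # la-vabo
print(hyphenate("lavabo", PATTERNS, right_min=1))  # la-va-bo
```

The patterns allow both break points; only the right-hand minimum of 3
removes the break before -bo.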

But there has already been a pretty lengthy discussion about French.
Maybe I can try to find it - all I remember is that those numbers are
almost arbitrary.

> From a readability point of view 'lava-bo' is better for me since one can
> guess the rest of the word (whereas you can't guess the rest of la-)

<not-to-be-taken-seriously>
Oh, and yes ... I was already wondering when somebody would come up
with the idea to extend TeX with tolerances for preferable breaking
points in addition to the allowed ones :) :) :)
</not-to-be-taken-seriously>

>> I also noticed that you have Armenian patterns which don't exist
>> anywhere else (or at least I was unable to find them in a quick
>> search).
>
> I got them from Sahak Petrosyan. He wrote them by himself.
> Maybe you could contact him directly.

Thanks a lot. I have contacted him and included the patterns. They are
still in an initial phase, but that's fine.

> I am donating them to your
> project under LGPL."

Oh, licence, nice. I think that I forgot to add that. So here are the patterns:

http://tug.org/svn/texhyphen/branches/luatex/hyph-utf8/source/generic/hyph-utf8/languages/hy/generate_patterns_hy.rb
http://tug.org/svn/texhyphen/branches/luatex/hyph-utf8/tex/generic/hyph-utf8/patterns/tex/hyph-hy.tex

We'll add them to TL once we do the big update, though I was
considering holding off on creating a new TL package until at least
one user requests them.

> I put some work into the "APOSTROPHE" question. I noticed some strange,
> redundant patterns in my French pattern file. So I'll have to change that.

If there's anything that needs to be fixed in our patterns (= if TeX
is where you got them from in some funny form), please let me know.

> But more important:
> for most patterns with an apostrophe, there's a word-border equivalent:
> 'in2i3t and _in2i3t
> So depending on how the algorithm sees words, the apostrophe version is
> unnecessary.

Nice & interesting :) I have never noticed or checked that.
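
A check like that could be sketched in a few lines of Python. This is
only an illustration under assumptions: the function name is invented,
the word-border marker is "_" as in the quoted message (TeX pattern
files use "."), the pattern 'a4 in the sample is made up, and only
patterns that begin directly with the apostrophe are considered.

```python
def redundant_apostrophe_patterns(patterns, border="_"):
    """Return apostrophe-initial patterns whose word-border twin also exists."""
    known = set(patterns)
    return sorted(
        p for p in patterns
        if p.startswith("'") and border + p[1:] in known
    )

# 'in2i3t and _in2i3t come from the message; 'a4 is a made-up example.
sample = ["'in2i3t", "_in2i3t", "'a4", "1bo"]
print(redundant_apostrophe_patterns(sample))  # ["'in2i3t"]
```

Whether such patterns are really unnecessary depends on the hyphenator
splitting words at the apostrophe before the patterns are applied.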

> If you hyphenate "l'initiation" in hyphenator.js the program
> parses the word and just sees "initiation".
> Do you know how this works in TeX?

You don't want to know. If the apostrophe is not defined to be a
letter, then it behaves like this:

\showhyphens{l'initiation}\showhyphens{initiation}

Underfull \hbox (badness 10000) in paragraph at lines 1--1
[] \tenrm l'initiation

Underfull \hbox (badness 10000) in paragraph at lines 1--1
[] \tenrm ini-tia-tion

It's kind of a flaw in TeX - nothing gets hyphenated at all since
Knuth didn't foresee that situation (and it's definitely not something
that you would want to mimic). The same is true for compound words,
which don't get hyphenated at all unless some extra patterns are added.
On the other hand, if one sets \lefthyphenmin to 2 or 3 and sets the
\lccode of the apostrophe ... and TeX then determines that it's OK to
break at i3ni, TeX will happily hyphenate l'i-ni-tia-tion even though
breaking after the first i in i-ni-tia-tion is forbidden.
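
The counting effect can be shown with a small Python sketch. It is a
simplification under stated assumptions: the candidate positions are
given by hand (including a hypothetical pattern that allows "i-ni"),
the function name is invented, and the apostrophe is simply counted as
a letter; real TeX would also match patterns against the whole string.

```python
def surviving_breaks(word, candidates, left_min, right_min):
    """Keep candidate break positions (characters from the start of the word)
    that leave at least left_min before and right_min after the break."""
    return [p for p in candidates if left_min <= p <= len(word) - right_min]

bare = "initiation"
bare_candidates = [1, 3, 6]   # i-ni-tia-tion, assuming a pattern allows "i-ni"
print(surviving_breaks(bare, bare_candidates, 2, 3))   # [3, 6] -> ini-tia-tion

full = "l'initiation"
shifted = [p + 2 for p in bare_candidates]   # "l'" counts as two letters
print(surviving_breaks(full, shifted, 2, 3))  # [3, 5, 8] -> l'i-ni-tia-tion
```

The break after the first i is filtered out for the bare word, but the
two extra characters "l'" push it past the left-hand minimum.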

So maybe this issue is of minor importance to you when compared to TeX.

> Are you planning to put the patterns for hyphenator online? I couldn't find
> them on the server.

svn co svn://tug.org/texhyphen/branches/luatex/collaboration/hyphenator/repo
http://tug.org/svn/texhyphen/branches/luatex/collaboration/hyphenator/repo/

The "branches/luatex" part will be changed to "trunk" in a few days.

I removed the code that does the sorting since most of the patterns
seem to be different anyway and everything needs a lot more checking
than I had imagined.

Once we get to the webpage update, I was thinking of asking you for
some JS hints. (My idea was to use independent pattern files with a
one-to-one mapping from the TeX files together with your code, plus a
drop-down box to select a language and a text box, so that the user
could type text and get it "hyphenated" [visible hyphens] on the fly
... but <some-kind-of> webpage update comes first ...)

Thanks a lot,
     Mojca


