[tex-hyphen] Procedure for adding alternative patterns

Mojca Miklavec mojca.miklavec.lists at gmail.com
Mon Sep 25 17:27:16 CEST 2017


Dear Stojan,

On 25 September 2017 at 11:09, Стоян Димитров wrote:
> Greetings,
>
> I'd like to propose adding an alternative set of hyphenation patterns
> for a language that already has one. What procedure should I
> follow? Is it possible?

I'm not saying that it's not possible, but it's something we've been
"sweeping under the rug" for the past 9 years (since Arthur and I
started with the patterns cleanup).

So if someone wants to address the issue, it would help to come up
with some reasonable solution as well. Another language with a similar
problem is Russian. I believe there are roughly 6 alternative patterns
for that.

> What about the license, author permissions?

That depends on the patterns / the files you intend to use and has
nothing to do with the rest of the technical problems.

In the ideal case the author would agree to some permissive licence
(for us the ideal seems to be MIT, which has worked for all projects
involved so far).

> Where should they be hosted?

That depends on the solution you come up with. To start with, the
patterns should be *somewhere* where one could fetch them. What is the
"upstream" source?

> Are there any restrictions I should take into account?
>
> The language in question is Bulgarian. In the wild there are two sets of
> patterns: the one officially listed here and one that is not. As far as
> I can tell both of them are used by the Bulgarian community, though there
> are no figures I can present to you. There are no publicly available quality
> checks or audits for either of them, so this cannot be used as a
> factor either.

By far the best solution would be to get a couple of local linguists
to run both pattern sets through a long list of words, do some
extensive analysis, and decide which pattern set works best. Or
perhaps come up with a third set that works better than either of the
existing ones (see also https://xkcd.com/927/). It would be
super beneficial if someone did a quality analysis of the existing
patterns.
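
As a rough illustration of what such a comparison could look like,
here is a minimal Python sketch. It implements Liang-style pattern
matching (the scheme TeX hyphenation patterns are written for) in
plain Python and lists the words on which two pattern sets disagree.
The function names, the toy patterns and the default minimum of two
letters on each side of a break are assumptions made for illustration
only; real patterns would be read from the actual pattern files, and
the sketch ignores hyphenation exceptions.

```python
import re


def parse_patterns(text):
    """Parse whitespace-separated Liang patterns such as 'a1b 2ba'.

    Returns a dict mapping the letters of each pattern to a list of
    weights, where weights[j] is the weight in front of letter j.
    """
    patterns = {}
    for pat in text.split():
        letters = re.sub(r"\d", "", pat)
        weights = [0] * (len(letters) + 1)
        pos = 0
        for ch in pat:
            if ch.isdigit():
                weights[pos] = int(ch)
            else:
                pos += 1
        patterns[letters] = weights
    return patterns


def hyphenate(word, patterns, left_min=2, right_min=2):
    """Return `word` (lowercased) with '-' at every permitted break.

    left_min/right_min are illustrative defaults (an assumption), not
    values taken from any particular pattern file.
    """
    word = word.lower()
    padded = "." + word + "."  # '.' marks the word boundaries
    scores = [0] * (len(padded) + 1)
    # Apply every pattern that matches a substring; keep the maximum
    # weight seen at each inter-letter position.
    for start in range(len(padded)):
        for end in range(start + 1, len(padded) + 1):
            weights = patterns.get(padded[start:end])
            if weights:
                for offset, w in enumerate(weights):
                    idx = start + offset
                    scores[idx] = max(scores[idx], w)
    # scores[k + 1] is the weight in front of word[k]; odd means "break".
    pieces, last = [], 0
    for k in range(left_min, len(word) - right_min + 1):
        if scores[k + 1] % 2 == 1:
            pieces.append(word[last:k])
            last = k
    pieces.append(word[last:])
    return "-".join(pieces)


def disagreements(words, patterns_a, patterns_b):
    """List (word, result_a, result_b) wherever the two sets differ."""
    diffs = []
    for w in words:
        a, b = hyphenate(w, patterns_a), hyphenate(w, patterns_b)
        if a != b:
            diffs.append((w, a, b))
    return diffs


if __name__ == "__main__":
    # Toy patterns, purely hypothetical; not real Bulgarian patterns.
    set_a = parse_patterns("a1b")
    set_b = parse_patterns("a1b 2ba")
    for row in disagreements(["xabab", "xyz"], set_a, set_b):
        print(row)
```

The words on which the two sets disagree are the interesting ones:
that is the (hopefully short) list a linguist would have to review by
hand, together with any words that both sets get wrong.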

Now, there are numerous different possibilities to address the
problem. If the first and best solution is not an option, you can make
it work by also collaborating with babel and polyglossia to support
those new patterns in some consistent way. But then you also need to
educate local TeX users to make use of those options. Germans, for
example, have a package that replaces the default set of patterns with
alternative ones. My biggest fear is that after doing all the work you
might still end up with just 5 users or even fewer actually using those
alternative patterns. (In all those 9 years this was, for example, the
first question about alternative patterns for Bulgarian. If the other
alternative were in high demand, I would have expected the question to
pop up earlier. My guess is not that the current set is superior in any
way, but simply that users don't care enough to explicitly switch to
another set.)

But the problem will remain elsewhere even if you address the problem
inside TeX. The patterns may be used to hyphenate websites, to
hyphenate documents in (Open/Libre/Whatever)Office etc. I'm pretty
sure that you cannot convince all the web browser developers to
support multiple sets of hyphenation patterns per language (and then
all the website content contributors to specify which set of patterns
should be used when hyphenating Bulgarian?) unless there's in fact
some fundamental difference in the grammar (rather than just different
quality of the patterns). From that perspective it would make more
sense to agree on a single good quality set of patterns.

For example, there are three sets of hyphenation patterns for German:
one set for traditional Swiss German, one set for traditional German
and one set for modern German. If someone wants to explicitly follow
the rules from more than 20 years ago (for example to reproduce an old
book), they explicitly switch. Yet even though that "language variant"
has an officially registered tag in the standard, I'm still pretty
sure that no browser supports it (I would be glad to be proven wrong
though :).

I see three fundamentally different approaches:
- patterns end up in hyph-utf8
- patterns end up in some new repository "hyph-utf8-alternatives"
- you or someone else creates a new package with alternative patterns,
similar to what Germans are shipping(*)

I have some "problems" going for the first option without a damn good
justification, as that only introduces additional mess and
handling of special cases. We could theoretically do the second. I
would need slightly less justification for that, but still at least
a somewhat good reason to do it. And then we would also need sufficient
support from babel & polyglossia, else this hardly makes any
sense anyway. Doing the third is always an option that any user is
free to take, and we can help if needed.

Again ... any chance that the linguists would come together and
provide a definitive answer about getting a single set of high
quality patterns?

Mojca

(*) Germans actually have 5 sets of patterns right now, plus three
additional ones loaded by an additional package. So 8 sets in total.
Two sets correspond to "traditional" and "modern" German; they are
super old and are only ever used in TeX, pdfTeX and other 8-bit-only
engines/formats. LuaTeX, XeTeX, pTeX take the patterns from the
new effort (http://projekte.dante.de/Trennmuster) which provides three
sets of patterns: "traditional", "traditional Swiss" and "modern"
German. But the Germans are afraid to break backward
compatibility of older documents, which is why we never got rid of the
old patterns (yet). And because some users want to use the new
patterns with pdfTeX, all these three sets are duplicated again in an
external package (https://www.ctan.org/pkg/dehyph-exptl). I find this
"mess" somewhat hard to justify and would much prefer to stick to just
the three sets of patterns from the Trennmuster project (three
pattern sets per language should be complex enough :). Then we have some
further mess with some other languages from the Balkans where people
cannot even decide which language they speak and thus which
hyphenation patterns to use :) :) :)

I would be really grateful not to introduce additional mess with other
languages.


