[tex-hyphen] language tag for Serbian (Serbo-Croatian?) patterns

Mojca Miklavec mojca.miklavec.lists at gmail.com
Wed Jul 20 18:08:20 CEST 2011


On Wed, Jul 20, 2011 at 17:30, Jonathan Kew wrote:
> I notice that hyph-utf8 currently includes patterns labelled as sh-Cyrl and sh-Latn. I'm wondering whether "sh" is really the most appropriate language code here; my understanding is that this is a deprecated code for "Serbo-Croatian", not recommended for current use. Shouldn't these patterns rather be labelled as sr-Cyrl and sr-Latn?

Arthur and I probably spent a whole afternoon arguing about the most
appropriate name. A few facts:

- sh patterns should work OK for Serbian, Croatian, Bosnian, ... and
were originally created for Latin script only (at a time when this
was still considered a single language);

- in 2008 the original author insisted that the patterns should work fine
for any of the given languages and that there was no reason to
call these patterns "Serbian" or "Bosnian" or "Croatian" merely
because of the political fact that "Serbo-Croatian" more or less no
longer exists;

- at the same time the author also provided us with exactly the same
patterns in Cyrillic script;

- for some unknown reason Croatians wanted and created their own
patterns (they are different, but it is not exactly clear what the
differences are and it is not always exactly clear to me which
hyphenation points are better; we might investigate more closely one
day; there is no clear reason at the moment why Croatian would need
their own patterns);

- after creation of "sh-latn" and before creation of "sh-cyrl" a group
of people took "sh-latn", converted them into Cyrillic script, added a
bunch of extra patterns (more about that later) and named these
patterns Serbian;

- consequently we were left with two "identical" pattern files (sh)
in Latn+Cyrl, one pattern file (sr) in Cyrl and one pattern file
(hr) in Latn;


If we wanted to call the "sh" patterns Serbian:
- we would have to drop at least one set of patterns from our
package completely, or add other weird qualifiers like
sr-cyrl-x-<sometag>;
- we would introduce some artificial and not really necessary renaming;
- we would discourage any Bosnian user from using the patterns.

We noticed that "sh" has been deprecated, but an interesting fact
(Arthur may correct me if I'm wrong, I'm speaking from memory)
is that "sh" was first deleted from the tags and later introduced
again.

Our primary reference is
    http://www.iana.org/assignments/language-subtag-registry
which does list
    Type: language
    Subtag: sh
    Description: Serbo-Croatian
    Added: 2005-10-16
    Scope: macrolanguage
    Comments: sr, hr, bs are preferred for most modern uses

so from that point of view I have no bad conscience about using the tag
"sh". It says "preferred for most modern uses" and "macrolanguage".
This doesn't really convince me that we should not use it. We do a
"nasty" thing in that we then link "serbian" to "sh-latn" and "sh-cyrl",
which complicates the build scripts a lot, but
I don't care too much about that.
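
For anyone who wants to check such a record mechanically, here is a
minimal sketch in Python. It assumes the record text has simply been
pasted in as a string, as quoted above; the names RECORD and
parse_record are made up for illustration. The real registry file also
has folded continuation lines and repeated fields, which this sketch
deliberately ignores. As far as I know, deprecated records in the
registry carry a "Deprecated" field, and the record quoted above has
none.

```python
# Record text copied from the message above, one "Field: value" pair per line.
RECORD = """\
Type: language
Subtag: sh
Description: Serbo-Croatian
Added: 2005-10-16
Scope: macrolanguage
Comments: sr, hr, bs are preferred for most modern uses
"""


def parse_record(text):
    """Parse one simple registry record into a dict of field name -> value.

    Simplification: no folded lines, and a repeated field would overwrite
    the earlier value.
    """
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, _, value = line.partition(":")  # split on the first colon only
        fields[name.strip()] = value.strip()
    return fields


record = parse_record(RECORD)

# The quoted record has no "Deprecated" field, and its scope is "macrolanguage".
is_deprecated = "Deprecated" in record
is_macrolanguage = record.get("Scope") == "macrolanguage"
```

With the record as quoted, is_deprecated comes out False and
is_macrolanguage comes out True, which is the point made above.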

We are also using "no" for Norwegian (only to some extent), "mul-ethi"
for Ethiopic scripts etc. None of those tags really represents a
single language.

Additional note: this year we figured out that the "sr-cyrl" patterns are
of somewhat questionable quality. We haven't removed them yet, but after
some additional testing we might do that. I strongly suspect that
the author just took old patterns generated with patgen and added some
patterns manually (a very, very bad practice!!!), breaking many valid
hyphenation points and introducing new invalid ones.

Mojca


