[tex-hyphen] Naming of Serbo-Croatian patterns

Mojca Miklavec mojca.miklavec.lists at gmail.com
Sun Jun 22 14:00:47 CEST 2014


Dear Dejan,

On Sat, Jun 21, 2014 at 11:20 AM, Dejan Muhamedagic wrote:
>
> Looking at http://tug.org/tex-hyphen/#languages
>
> 1. I'd really prefer that the language listed for the patterns is
> Serbo-croatian (instead of Serbian). After all, that is the
> language the patterns were prepared for.

I'm sorry about the website. It's still buggy and confusing (among
other things, authors of multiple patterns are only listed once). I will
fix it together with a few other things, but let me first explain the
reasons for the many bugs with the Serbian language codes. In all
languages but Serbian we have a one-to-one language <=> code
relationship. So whenever I automatically generate stuff, I tend to
screw it up with Serbian/Serbo-Croatian. I think I even spotted this
problem, but didn't fix it right away and then forgot.

To name all the places where the code/name is used:
- name of the original file with tex patterns (hyph-sh-latn.tex)
- we could in principle create hyph-s[hr].tex with both scripts combined
- names of derived files (hyph-sh-latn.ec.tex, ...)
- names of derived files that are used for LuaTeX and elsewhere
(hyph-sh-latn.pat.txt) which could also be either split or combined
- name of pattern loader (loadhyph-sr.tex)
- name of the language in language.dat/language.def/language.dat.lua
(serbian, serbianc, ...); the way the language is initialized in
Babel
- the way the language is initialized in Polyglossia
- the name of the package that activates the patterns (hyphen-serbian[.tlpsrc])

- on top of that we have another collection of Serbian patterns
hyph-sr-cyrl.tex written by another author (or actually he modified
your original patterns in Latin script to make them work with Cyrillic
script before you released the Cyrillic patterns yourself); it would
also help if there was no need for two sets of patterns

And it's far from obvious which ones should be Serbian and which ones
should be Serbo-Croatian, or where exactly to draw the line. I can
certainly use "Serbo-Croatian" in the first column and then use
"serbian" and "serbianc" in the second column, but that will be
confusing as well, and it is just asking for trouble when some
nationalist comes along and starts complaining that they won't use
the patterns just because they don't speak Serbo-Croatian ;) ;) ;)


(But what I really admire is the bunch of Serbian packages inside
collection-langcyrillic. You should *really* look into them if you
haven't already.)

1.) Naming and combining the pattern files themselves

We are currently using "sh-latn" and "sh-cyrl" for your patterns. We
certainly need this split for the 8-bit TeX engines. But then users
keep asking why we don't ship a file with combined patterns.

So we could actually decide to create hyph-sh.pat.txt or
hyph-sr.pat.txt instead of or in addition to hyph-sh-latn.pat.txt and
hyph-sh-cyrl.pat.txt with both your pattern files combined. Or maybe
even create hyph-sr.tex or hyph-sh.tex.

Other projects outside of TeX such as Mozilla, Hyphenator and the like
probably need/prefer the patterns in both scripts combined. These
projects also expect the language to be called Serbian, not
Serbo-Croatian. I don't think that any Firefox user has set his/her
language preference to Serbo-Croatian.

2.) Name of the language in Babel

I'm almost sure that users of babel expect \usepackage[serbian]{babel}
to work, not \usepackage[serbocroatian]{babel}. I don't know what
people in other neighbouring countries use though.
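
To make the user-facing side concrete, here is a minimal sketch of what
I assume a typical pdfLaTeX user writes today. It assumes that
hyphen-serbian and babel's Serbian support are installed; the sample
sentence and the word passed to \showhyphens are only placeholders.

```latex
% Minimal pdfLaTeX sketch (assumption: babel's "serbian" option is available).
\documentclass{article}
\usepackage[T1]{fontenc}      % EC fonts, matching the EC-encoded patterns
\usepackage[utf8]{inputenc}   % UTF-8 input for 8-bit engines
\usepackage[serbian]{babel}   % selects the "serbian" entry from language.dat
\begin{document}
Ovo je kratak primer teksta na srpskom jeziku.

% Writes the hyphenation points found for this word to the log file.
\showhyphens{srpskohrvatski}
\end{document}
```

The option name given to babel is what ends up being looked up in
language.dat, which is why renaming the language there would be visible
to every user.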

The name of that language is defined in language.dat:

% from hyphen-serbian:
serbian loadhyph-sr-latn.tex
serbianc loadhyph-sr-cyrl.tex

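and then loadhyph-sr-latn.tex uses (excerpt only; the outer test for
pTeX, \ifx\kanjiskip\undefined, and the definition of \secondarg come
before these lines):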

\ifx\secondarg\empty
    % Unicode-aware engine (such as XeTeX or LuaTeX) only sees a single (2-byte) argument
    \message{UTF-8 Serbian hyphenation patterns}
    % We load both scripts at the same time to simplify usage
    \input hyph-sh-latn.tex
    \input hyph-sh-cyrl.tex
\else
    % 8-bit engine (such as TeX or pdfTeX)
    \message{EC Serbian hyphenation patterns in Latin script}
    \input conv-utf8-ec.tex
    \input hyph-sh-latn.tex
\fi\else
    % pTeX
    \message{EC Serbian hyphenation patterns in Latin script}
    \input hyph-sh-latn.ec.tex
\fi
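
For those wondering where \secondarg comes from and why there are two
\else branches: the following plain TeX sketch shows the kind of engine
test I have in mind. It is only a sketch; the macro name \testengine and
the messages are assumptions for illustration, and no patterns are
loaded.

```tex
% Plain TeX sketch of the engine test; it only prints a message.
\begingroup
% \kanjiskip is a primitive in pTeX and undefined elsewhere.
\ifx\kanjiskip\undefined
  % Not pTeX. The character before "!" is Greek Tau, 2 bytes in UTF-8.
  % An 8-bit engine reads it as two tokens, so #2 gets the second byte;
  % a Unicode-aware engine reads it as one token, so #2 is empty.
  \def\testengine#1#2!{\def\secondarg{#2}}\testengine Τ!\relax
  \ifx\secondarg\empty
    \message{Unicode-aware engine: both scripts could be loaded}
  \else
    \message{8-bit engine: only one font encoding can be served}
  \fi
\else
  \message{pTeX: pre-converted EC patterns would be loaded}
\fi
\endgroup
\bye
```

The point for the naming discussion is that a single loader already
serves different pattern files depending on the engine, so the loader
name and the pattern file names do not need to match.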

3.) Name of the TeX Live package

The name of the package enabling your patterns is currently
hyphen-serbian (the patterns themselves are distributed in hyph-utf8).
All this package does is add the following to language.dat (and similar
entries to language.def and language.dat.lua):

% from hyphen-serbian:
serbian loadhyph-sr-latn.tex
serbianc loadhyph-sr-cyrl.tex

The question would then also be if this package should be renamed.

> 2. The link for T2A patterns is not right.

You probably mean the EC patterns. I will fix this.



But I'm still not sure how to name the language on the website. I
clearly agree to keep the language code "sh" for naming the files.
Everything else in the middle is not so clear.

In principle the column on the left refers to the package name in TeX
Live (not 100% exactly, in particular not for Indic scripts). And the
second column refers to the names used in language.dat as synonyms.

It's still pretty confusing to figure out at what point Serbo-Croatian
should "become" Serbian.

I'm posting this to the mailing list. Maybe others have some
suggestions about improvements related to these patterns.

----------------

In the spirit of differences between hyph-hr,
hyph-sh-latn/hyph-sh-cyrl and hyph-sr-cyrl:

A while ago Arthur made a really interesting comparison between
hyph-sl, hyph-hr, hyph-sh-latn. There were a bunch of differences, but
honestly I wouldn't dare to say which were really wrong hyphenations,
which were just a weird interpretation of hyphenation rules and which
were actual differences in grammar rules between the languages. I
admit that I don't fully understand the rules for my own language (and
neither do the linguists). Maybe Arthur is able to send just that
table (it would take me too long to figure out how to compile the
document, but you can find the table in our repository under
tests/wordlists/sl).

PS: I'm also enjoying the current discussion about whether or not the
German patterns should support proper hyphenation of the name "Mitić"
(which means becoming incompatible with Latin 9). I hope that the
Germans know how to properly hyphenate Ajnštajn and the like ;)

Mojca



