[tex-hyphen] luatex and file names

Mon Aug 10 17:32:38 CEST 2015

> - What is the logic behind the idea of preloading some data in the
>   format with luatex?

  You mean as opposed to having them sit in a Lua file for packages to
load on demand?  I suppose the intent was that the formats would contain
enough (meta)data to be self-descriptive - and we wanted to include
hyphen.tex as \language0 anyway.  You’d have to ask the three co-authors
of luatex-hyphen (Khaled, Manuel, and Élie) directly, as I didn’t
contribute to that part of hyph-utf8 much, and I don’t think any of them
reads this list.

> - Is there any convention as how hyphenation files should be named?
>   Apparently most of them follow the pattern (load)hyph-LL (LL
>   = lang iso code) and (load)hyph-LL-SSSS (SSSS = script iso code),
>   but not all. (And of course, the encoding in the form .ec.)

  This part was implemented by Mojca and me.  We follow BCP 47, which
is, to our knowledge, the only standard that allows languages and their
variants to be tagged with the level of precision that we need.  It is
defined by the IETF and consists of several of their RFCs (BCP stands
for “Best Current Practice”), currently RFC 5646 and RFC 4647; see
https://tools.ietf.org/html/bcp47 for the full text.

  A tag can have many elements; to sum up, any of the following can
occur.  Only the first element is mandatory, the rest are optional, and
the order is normative:

  * A language code (2-letter ISO 639-1, or, failing that, 3-letter ISO 639-3)
  * A script code (4-letter ISO 15924)
  * A country code (2-letter ISO 3166-1 or 3-digit UN M.49)
  * Additional elements (variants) defined in the registry (5 to 8
    letters or digits, or 4 characters starting with a digit)
  * Private elements prefixed by -x-

  The registry is maintained by IANA at http://www.iana.org/assignments/language-subtag-registry
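
  To make the ordering rule concrete, here is a minimal Python sketch
that splits a tag into the elements listed above.  It is only an
illustration: the function name split_tag and the simplified regular
expressions are my own assumptions, it ignores the rarer parts of
RFC 5646 (extlang, extensions, grandfathered tags), and it does not
check subtags against the IANA registry.

```python
import re

# Subtag shapes, simplified from the RFC 5646 grammar (assumption: extlang,
# extension and grandfathered tags are deliberately left out of this sketch).
LANGUAGE = re.compile(r"[A-Za-z]{2,3}")
SCRIPT = re.compile(r"[A-Za-z]{4}")
REGION = re.compile(r"[A-Za-z]{2}|[0-9]{3}")
VARIANT = re.compile(r"[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}")


def split_tag(tag):
    """Split a language tag into its elements, enforcing their order."""
    subtags = tag.split("-")
    parts = {"language": None, "script": None, "region": None,
             "variants": [], "private": []}

    # Everything after a singleton "x" is private use.
    lowered = [s.lower() for s in subtags]
    if "x" in lowered:
        i = lowered.index("x")
        parts["private"] = subtags[i + 1:]
        subtags = subtags[:i]
        if not parts["private"]:
            raise ValueError(f"empty private-use part in {tag!r}")

    # The language code is the only mandatory element.
    if not subtags or not LANGUAGE.fullmatch(subtags[0]):
        raise ValueError(f"missing or malformed language code in {tag!r}")
    parts["language"] = subtags[0]

    # Optional elements, accepted only in the normative order.
    rest = subtags[1:]
    if rest and SCRIPT.fullmatch(rest[0]):
        parts["script"] = rest.pop(0)
    if rest and REGION.fullmatch(rest[0]):
        parts["region"] = rest.pop(0)
    while rest and VARIANT.fullmatch(rest[0]):
        parts["variants"].append(rest.pop(0))
    if rest:
        raise ValueError(f"unexpected or misplaced subtag {rest[0]!r} in {tag!r}")
    return parts
```

  With this sketch, split_tag("mn-cyrl-x-lmc") reports a language, a
script and one private element, while a tag with its elements in the
wrong order, such as "en-gb-latn", is rejected.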

  This standard is very useful because, as mentioned, it allows great
precision, but it also encourages tagging in no more detail than is
needed, and we make every effort to follow that - unlike many software
vendors that introduce a flurry of variants of Spanish, Portuguese, or
French with little actual difference between them (tagged as es-ES,
es-MX, es-AR, pt-PT, pt-BR, fr-FR, fr-BE, fr-CA, etc.).  For each of
these three languages we actually have only one set of patterns.
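
  One way to see how an over-specific tag such as es-MX can still find
the single Spanish pattern set is the “lookup” scheme from RFC 4647,
which drops subtags from the end until something matches.  The Python
sketch below is an assumption on my part about how a consumer could do
this; it is not code from hyph-utf8, and the function name lookup and the
list of available tags are made up for the example.

```python
def lookup(requested, available, default=None):
    """Return the closest available tag by progressively truncating `requested`.

    Sketch of the RFC 4647 "lookup" idea; tags are compared case-insensitively.
    """
    known = {tag.lower(): tag for tag in available}
    subtags = requested.lower().split("-")
    while subtags:
        candidate = "-".join(subtags)
        if candidate in known:
            return known[candidate]
        subtags.pop()
        # A trailing single-character subtag (such as "x") cannot stand alone.
        if subtags and len(subtags[-1]) == 1:
            subtags.pop()
    return default


# Hypothetical set of tags for which patterns exist.
available = ["es", "pt", "fr", "en-gb", "en-us", "de-1901", "de-1996"]
print(lookup("es-MX", available))
print(lookup("en-AU", available))
```

  Here es-MX falls back to es, whereas en-AU finds nothing because only
en-gb and en-us are listed; the caller then has to decide on a default.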

  The language tags that we actually use include examples of every kind
of tag element listed above:

  * Many languages are identified by their 2-letter ISO 639-1 code
    alone; but some of them don’t have an ISO 639-1 code and we thus use
    the (3-letter) ISO 639-3 code: Friulan [fur], Ancient Greek [grc],
    Piedmontese [pms], and ... Mojca, where have the Ottoman Turkish
    patterns gone?  Anyway.  Moving on:

  * Script tags are used for languages of the Bosnian-Croatian-Serbian
    diasystem: sh-cyrl, sh-latn and sr-cyrl (see the thread
    starting at http://tug.org/pipermail/tex-hyphen/2011-July/000805.html
    for a discussion of the [sh] and [sr] parts)

  * Country codes are used for English: en-gb and en-us

  * Subtags from the registry are used for German and Greek:
    de-1901 and de-1996 (“old” and “new” spelling, first discussed in
    1996 but only finalised in 2006), and el-monoton and el-polyton
    (sadly not “monotonic” and “polytonic” because of the 8-character limit)

  * Finally, for some languages we had to make up a private tag;
    fortunately there are only two of them: la-x-classic for “Classical”
    Latin -- a bit of a misnomer as what it implements is the spelling
    of Latin where ‘v’ is not used (only ‘u’ is); apart from that it’s no
    closer to Classical Latin than the original set of patterns, so we
    could probably find a better name and tag.  The other private tag is
    for “Mongolian LMC”, tagged mn-cyrl-x-lmc as a matter of pure
    convenience: these patterns were once the only set of patterns for
    Mongolian, and had been created by Oliver Corff for his specialist
    needs (typesetting an 18th-century pentaglot dictionary).  When new
    patterns were produced by Mongolian users for use in current
    documents, it seemed an obvious choice to change to these
    (incidentally the only change we’ve ever made when unifying all
    patterns into hyph-utf8), while of course keeping the old patterns
    for Oliver to use.  LMC was the name of the font encoding he devised
    for this purpose.
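
  To tie the tags back to the file names asked about, here is a small
Python sketch that builds names on the (load)hyph-LL pattern quoted in
the question.  Treat the details as assumptions: the helper name
pattern_file_names, the .tex extension and the exact place of the
encoding part (.ec.) are illustrative and not a statement of what the
package guarantees.

```python
def pattern_file_names(tag, encoding=None, extension=".tex"):
    """Build candidate file names for a language tag.

    Assumptions: names are lowercase, use the "hyph-"/"loadhyph-" prefixes
    from the question, and place the font encoding before the extension.
    """
    base = tag.lower()
    names = {
        "patterns": f"hyph-{base}{extension}",
        "loader": f"loadhyph-{base}{extension}",
    }
    if encoding:
        # Hypothetical name for an 8-bit variant, e.g. with encoding "ec".
        names["converted"] = f"hyph-{base}.{encoding}{extension}"
    return names


for tag in ["fur", "sh-latn", "en-gb", "de-1996", "mn-cyrl-x-lmc"]:
    print(tag, pattern_file_names(tag, encoding="ec"))
```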

  Does that answer your question?

	Best,

		Arthur