[tex-hyphen] luatex and file names

Mon Aug 10 18:42:13 CEST 2015

As the maintainer of Latin (waiting for a true latinist to take over) I 
can say that the variants, used as modifiers with babel and as options 
in polyglossia, are three: modern, medieval and classic.
Modern and medieval hyphenation were accommodated in the same pattern 
file; it is not only a question of u and v, but also of the ligatures æ 
and œ; but, Arthur, you are right when you say that they are almost 
equal; the hyphenation is mostly phonetic in both variants.

The la-x-classic pattern set is very different: besides the u and v
question, the hyphenation is mostly etymological; it is therefore very
different from the phonetic one and much more difficult to create, since
it is also necessary to take into account the case endings of declension
and the diathesis endings of conjugation. The x-classic qualifier is not
simply a tag with no or negligible contents behind its name, as you
remark for the three French variants, the three Spanish variants, and
the two Portuguese variants. Surprisingly enough, Italian variants such
as it-IT and it-CH do not exist, although there exist different spelling
dictionaries for these two variants (well, actually I am not surprised
at all, because I also maintain the Italian patterns and I wouldn't do
such a silly thing as to distinguish Italian-Italian from Swiss-Italian
hyphenation).

Hans says there is no advantage in preloading the patterns into the
format file. Maybe he is right for Lua(La)TeX, but one of the reasons
why I make little use of LuaLaTeX is the "long" time luatex takes to
load the patterns. Once they are loaded, its speed is close to that of
XeLaTeX; its performance is better as far as microtype is concerned;
and the interaction between the pdf engine and the Lua interpreter is
exceptional. But if one does not need the latter functionality, it is
not worth waiting that "long" time when one has to typeset a text that
uses half a dozen languages: besides English, which is preloaded,
luatex must load the other five pattern files and create the suitable
hash structures so as to use those patterns efficiently. I might be
completely wrong, but I consider this a glitch, not an advantage. For
certain applications it certainly is an advantage because, for example,
it is possible to modify/correct the patterns for special needs. But
this is not so frequent.

Claudio

On 10/08/2015 17:32, Arthur Reutenauer wrote:
>> - What is the logic behind the idea of preloading some data in the
>>    format with luatex?
>    You mean as opposed to having them sit in a Lua file for packages to
> load on demand?  I suppose the intent was that the formats would contain
> enough (meta)data to be self-descriptive - and we wanted to include
> hyphen.tex as \language0 anyway.  You’d have to ask the three co-authors
> of luatex-hyphen (Khaled, Manuel, and Élie) directly, as I didn’t
> contribute to that part of hyph-utf8 much, and I don’t think any of them
> reads this list.
>
>> - Is there any convention as how hyphenation files should be named?
>>    Apparently most of them follow the pattern (load)hyph-LL (LL
>>    = lang iso code) and (load)hyph-LL-SSSS (SSSS = script iso code),
>>    but not all. (And of course, the encoding in the form .ec.)
>    This part has been implemented by Mojca and me.  We follow BCP 47, which
> is, to our knowledge, the only standard that allows tagging languages and
> their variants to the level of precision that we need.  It is defined by
> the IETF and consists of several of their RFCs (BCP stands for “Best
> Current Practice”), currently RFC 5646 and 4647; see https://tools.ietf.org/html/bcp47
> for the full text.
>
>    It can have many elements; to sum up, any of the following can occur -
> only the first element is mandatory, the rest is optional, and the order
> is normative:
>
>    * A language code (2-letter ISO 639-1, or, failing that, 3-letter ISO 639-3)
>    * A script code (4-letter ISO 15924)
>    * A country code (2-letter ISO 3166-1 or 3-digit UN M.49)
>    * Additional elements defined in the registry (5 to 8 letters or digits,
>      or 4 characters starting with a digit)
>    * Private elements prefixed by -x-
>
>    The registry is maintained by IANA at http://www.iana.org/assignments/language-subtag-registry
>
>    This standard is very useful because, as mentioned, it allows great
> precision, but it also encourages us not to go into more detail than is
> needed, and we make every effort to follow that - unlike many software
> vendors that introduce a flurry of variants of Spanish, Portuguese, or
> French with little actual differences (tagged as es-ES, es-MX, es-AR,
> pt-PT, pt-BR, fr-FR, fr-BE, fr-CA, etc.).  For each of these three
> languages we actually have only one set of patterns.
>
>    The language tags that we do actually use show examples of all the
> different tag elements above, as for example:
>
>    * Many languages are identified by their 2-letter ISO 639-1 code
>      alone; but some of them don’t have an ISO 639-1 code and we thus use
>      the (3-letter) ISO 639-3 code: Friulan [fur], Ancient Greek [grc],
>      Piedmontese [pms], and ... Mojca, where have the Ottoman Turkish
>      patterns gone?  Anyway.  Moving on:
>
>    * Script tags are used for languages of the Bosnian-Croatian-Serbian
>      diasystem: sh-cyrl, sh-latn and sr-cyrl (see the thread
>      starting at http://tug.org/pipermail/tex-hyphen/2011-July/000805.html
>      for a discussion of the [sh] and [sr] parts)
>
>    * Country codes are used for English: en-gb and en-us
>
>    * Subtags from the registry are used for German and Greek:
>      de-1901 and de-1996 (“old” and “new” spelling, first discussed in
>      1996 but only finalised in 2006), and el-monoton and el-polyton
>      (sadly not “monotonic” and “polytonic” because of the 8-character limit)
>
>    * Finally, for some languages we had to make up a private tag;
>      fortunately there are only two of them: la-x-classic for “Classical”
>      Latin -- a bit of a misnomer as what it implements is the spelling
>      of Latin where ‘v’ is not used (only ‘u’ is); apart from that it’s no
>      closer to Classical Latin than the original set of patterns, so we
>      could probably find a better name and tag.  The other private tag is
>      for “Mongolian LMC”, tagged mn-cyrl-x-lmc as a matter of pure
>      convenience: these patterns were once the only set of patterns for
>      Mongolian, and had been created by Oliver Corff for his specialist
>      needs (typesetting an 18th century pentaglot dictionary).  When new
>      patterns were produced by Mongolian users for use in current
>      documents, it seemed an obvious choice to change to these
>      (incidentally the only change we’ve ever made when unifying all
>      patterns into hyph-utf8), while of course keeping the old patterns
>      for Oliver to use.  LMC was the name of the font encoding he devised
>      for this purpose.
>
>    Does that answer your question?
>
> 	Best,
>
> 		Arthur
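
For illustration only: the tag structure Arthur lists above (language,
then optional script, country, registered variants, and private -x-
elements, in that order) can be sketched as a small Python parser. This
is a simplified reading of BCP 47, not the code used by hyph-utf8 or
luatex-hyphen; the names TAG_RE and parse_tag are hypothetical, and
extension and grandfathered tags are deliberately ignored.

```python
import re

# Simplified BCP 47 shape (an assumption for this sketch):
#   language[-script][-region][-variant...][-x-private...]
TAG_RE = re.compile(
    r"^(?P<language>[a-z]{2,3})"                  # ISO 639-1 or 639-3
    r"(?:-(?P<script>[a-z]{4}))?"                 # ISO 15924
    r"(?:-(?P<region>[a-z]{2}|[0-9]{3}))?"        # ISO 3166-1 or UN M.49
    # registered variants: 5-8 alphanumerics, or a digit plus 3 alphanumerics
    r"(?P<variants>(?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*)"
    r"(?:-x(?P<private>(?:-[a-z0-9]{1,8})+))?$"   # private-use elements
)


def parse_tag(tag):
    """Split a pattern tag such as 'mn-cyrl-x-lmc' into its elements."""
    m = TAG_RE.match(tag.lower())
    if m is None:
        raise ValueError("not a tag this sketch understands: %r" % tag)
    return {
        "language": m.group("language"),
        "script": m.group("script"),
        "region": m.group("region"),
        "variants": [v for v in m.group("variants").split("-") if v],
        "private": [p for p in (m.group("private") or "").split("-") if p],
    }


if __name__ == "__main__":
    for tag in ("grc", "sh-latn", "en-gb", "de-1996", "el-monoton",
                "la-x-classic", "mn-cyrl-x-lmc"):
        print(tag, parse_tag(tag))
```

The tags in the loop are the ones mentioned in this thread; anything
outside the simplified shape raises ValueError rather than being guessed.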