[tex-hyphen] Brainstorming about lualatex-hyphen in TeX Live

Manuel Pégourié-Gonnard mpg at elzevir.fr
Wed May 26 17:23:07 CEST 2010

On 26/05/2010 15:47, Mojca Miklavec wrote:
> On Tue, May 25, 2010 at 18:50, Manuel Pégourié-Gonnard wrote:
>> - the main item is about generating a Xetex-specific language.dat,
> 1.) The question: do you agree with that at all? (Maybe Karl wants to
> have a word at it as well.)

As far as I am concerned, I'm not /a priori/ opposed to that, but I'm not
convinced it's the way to go at the moment.

> (and yes: we currently load zerohyph or maybe don't load anything at
> all - it's not really important)
The more I think about it, the more I think the best way to handle the 8-bit vs
UTF-8 distinction is in the loadhyph-XX.tex files, since you are already testing
for Unicode support in the engine there. Inputting zerohyph under 8-bit engines
when only Unicode patterns are available, and conversely under Unicode engines
when only 8-bit patterns are available, looks like the right thing to do.
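
A rough sketch of that decision, written in Python rather than TeX purely for
illustration. The function name, the encoding labels and the return values are
hypothetical; the real loadhyph-XX.tex files would do this with TeX
conditionals.

```python
def loader_choice(engine_is_unicode: bool, patterns_encoding: str) -> str:
    """Decide what a loader file would input for a given engine.

    patterns_encoding is one of "utf8", "8bit" or "both" (hypothetical
    labels for which pattern files exist for the language).
    """
    if patterns_encoding == "both":
        usable = True
    elif engine_is_unicode:
        usable = patterns_encoding == "utf8"
    else:
        usable = patterns_encoding == "8bit"
    # Fall back to zerohyph when engine and patterns do not match.
    return "patterns" if usable else "zerohyph"
```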

> 2.) I have tried to
>     touch texlive/2010/texmf-var/tex/xetex/config/language.dat
> but I don't understand why
>     kpsewhich --engine xetex language.dat
> still returns the original one.
Did you run mktexlsr??

> 4.) Also keep in mind that one might want to have a more "advanced"
> version of patterns (for example Hungarian) in LuaTeX than in XeTeX.
> How would that fit into your scheme?
In the current scheme, it fits very well *if* the advanced patterns for LuaTeX
are available as plain text. Otherwise there is no support for it.

> %
> % ushyphmax.tex, on the other hand, includes Gerard Kuiken's additional
> % patterns; it is not frozen.
> usenglishmax;  ushyphmax.tex
> %
> % FYI, ushyph.tex is Dr. Kuiken's smaller set of patterns; with today's
> % large memories, there is no reason to use it, and we don't list it here.
> % ushyph1.tex is another (historical) name for hyphen.tex.
> % ushyph2.tex is another (historical) name for ushyph.tex.
> % --karl
> %
> We would need to remove the comments completely unless some comment
> field will exist.
This is part of language.us, which is manually edited, hence irrelevant here.

> 1.b.) I would like to have a comment field in the final lua file for
>    disabled:reason
> since reason might be an "arbitrary complex string". What if you'll
> ever need two specials and you'll accidentally have some commas or
> colons inside "reason"? I would much prefer to have
>     special="disabled"
>     comment="Disabled due to blablabla."
> in the language.dat.lua.
We can never need another special combined with "disabled"; that wouldn't make sense.

> Though, from my point of view, I would not care about such languages
> at all.

What do you mean, not care at all? Currently, not providing an entry in
language.dat.lua means dumping them in the format, something you don't want to do...

> You only need to know that the language is disabled since you
> want to prevent loading it at format-generation time, right? Who cares
> if the language simply remains undefined like any other nonexistent
> language? You don't really need to explain why it's not defined; or
> rather, it should be enough to have just a normal lua comment; you
> should not need to print out a reason when user requests the language
> "ibycus". It simply won't exist.

It will exist. Currently the list of languages is defined from language.dat.
Doing otherwise would require more invasive changes to the code from babel,
something I didn't want to do without a reason.

So, the state of most languages right after the format is loaded is: allocated
(that is, \languageX reserved, associated \l@<name> macro defined,
(left|right)hyphenmin remembered) but no patterns or exceptions defined. When
the user requests that the parameters for this language be activated, the
patterns and exceptions are read if they are available.
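
To make the "allocated but not yet loaded" state concrete, here is a toy model
in Python. It is only a sketch: the class, its attributes and the
pattern_source callback are hypothetical names, not part of babel or
luatex-hyphen.

```python
class Language:
    """Toy model of a language right after the format is loaded."""

    def __init__(self, name, number, lefthyphenmin, righthyphenmin,
                 pattern_source=None):
        self.name = name
        self.number = number                      # the reserved language number
        self.hyphenmins = (lefthyphenmin, righthyphenmin)
        # Callable returning (patterns, exceptions), or None if unavailable.
        self.pattern_source = pattern_source
        self.loaded = False
        self.patterns = []
        self.exceptions = []

    def activate(self):
        """Read patterns and exceptions on first activation, if available."""
        if not self.loaded and self.pattern_source is not None:
            self.patterns, self.exceptions = self.pattern_source()
            self.loaded = True
        return self.loaded
```

Activating the same language twice reads the files only once, and a language
without a pattern source stays allocated but empty.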

> Do you think anyone will ever care?

Let's remember the context. We are almost silently (that is, silently except for
people who read the log) replacing some code of babel with custom code. In
general, exactly the same result will be achieved, so I think it's right to
silently change the technical details. For disabled languages, we change the
final result. Arguably, in some cases (ibycus, mongolian-mlc), the original
result was already not usable, so changing it is not a problem. In other cases
(maybe german-x temporarily), the original result was usable, while the new one
isn't. This is a big difference, and I just don't want it to be done silently,
period.

Put differently, the error message support is already implemented and working,
and may be useful at least in some cases. Do you see any good reason to undo it?

> At format-generation time
> =============
> 1.) special="language0", -- should be dumped in the format
> What do you do when you want another language to be dumped into format
> (for whatever weird reason you might think of, see also nr. 2) and
> don't want it to have the number 0 :)

I just do nothing :-) The system is completely backward-compatible: if you
provide patterns for a language, without a plain-text version or a special entry
in its tlpsrc file, it will *just work* and be dumped in the format.

> Or if some person that has no
> overview will come and name another language "language0"?
This can mean only one thing: the person has hand-edited language.dat or
language.dat.us, disregarding the very explicit warnings in those files, and
already broke language compatibility in pdfTeX- or XeTeX-based formats. We don't
want to worry about that.

> Of course I didn't test anything, so I'm not sure about how this
> works. Do you want/need a modification in hyphen-base or did you
> "hardcode" that language into TL tools? (It would be nice to have a
> statement in hyphen-base instead of hardcoding its generation.)
I don't understand this paragraph, sorry.

> 2.) Let's assume that the group of extra German hyphenation patterns
> (or some third person) won't be willing to release a new version with
> "plain patterns" by TL 2010 release date and that you will still want
> the pattern to be loaded at format-generation time. What do you do in
> such a case?
I just don't change anything in dehyph-exptl.tlpsrc.

> You could have:
>    mode="disabled" -- completely disabled (like ibycus)
>    mode="format" -- dump into format at format-generation time

We don't need mode="format". Just no entry in language.dat.lua does the trick.

>    mode="enabled" -- most languages
>    (mode="empty" -- you could have that for arabic unless you prefer
> some other method)
There already is special='null', we don't need mode='empty'. Also,
mode="enabled" is redundant: a language is enabled if and only if there are
appropriate files for it in the database.
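
As a sketch of how these cases fit together (Python used only for illustration;
the function name, the returned labels and the dictionary layout are
assumptions, not the actual luatex-hyphen code):

```python
def classify(name, lua_entries):
    """Return (treatment, reason) for a language.

    lua_entries maps language names to dicts of string fields, standing in
    for the contents of language.dat.lua (hypothetical layout).
    """
    entry = lua_entries.get(name)
    if entry is None:
        # No entry at all: the old way, patterns dumped in the format.
        return ("dump-in-format", None)
    special = entry.get("special", "")
    if special.startswith("disabled:"):
        # Everything after the first colon is the reason shown to the user.
        return ("disabled", special.split(":", 1)[1])
    if special == "null":
        return ("null", None)
    # Ordinary entry: patterns are read from plain-text files when needed.
    return ("load-on-demand", None)
```

Because "disabled" is never combined with another special, splitting at the
first colon is enough even if the reason itself contains colons or commas.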

> 3.) What happens if somebody modifies language.dat.lua without
> remaking the format? Where are the languages stored/do you store data
> into format or read language.dat.lua at every luatex run?
It depends on how it is modified:
- modifying an existing entry: no problem.
- deleting or adding an entry: the format needs to be rebuilt, since the presence
of an entry in language.dat.lua determines whether a language is dumped into the
format or not.
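
The rule can be stated as a one-line check. This is only an illustrative
Python sketch with a hypothetical function name; it compares the set of
language names before and after the edit and ignores the field values.

```python
def format_needs_rebuild(old_entries, new_entries):
    """True if entries were added or deleted, False if only modified.

    Both arguments map language names to their fields (hypothetical
    stand-ins for two versions of language.dat.lua).
    """
    return set(old_entries) != set(new_entries)
```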

> ======
> 1.) txtpatt, txthyph
> What do you think about
>     file => file_loader (or it may also stay at the old name)
>     txtpatt => file_patterns
>     txthyph => file_exceptions
> ?
Do we actually care?! Most tlpsrc files (those from hyph-utf8) will be
machine-generated. The few others will be edited by the TL team. Anyway, I
prefer not changing "file" since it already exists.

> 2.) In case that there are no hyphenation exceptions, I can use
>     file_exceptions=nil
> Would that make sense to you or do you prefer loading an empty file?
Both are already supported, so please do whatever you prefer.

Note, however, that the correct way to obtain this is to put no txtpatt entry
in the corresponding tlpsrc line. Alternatively, you can put
txtpatt= or txtpatt='' (but *not* txtpatt=nil) if you want to make it more
explicit. The reason is simple: all fields in the generated language.dat.lua are
of type string.

> 3.) For farsi/arabic one could also have
>     file_loader=zerohyph.tex
>     file_patterns=nil (or file_patterns="empty" if you want :)
>     file_exceptions=nil
> to signal that there are no exceptions and no patterns.
Sure, it is supported, see luatex-hyphen.pdf (but see the remark above
concerning tlpsrc syntax). The following bits of tlpsrc lines are equivalent:

file=foobar.tex txthyph= txtpatt=
file=foobar.tex txthyph='' txtpatt=''
file=foobar.tex txthyph=
file=foobar.tex txtpatt=
file=foobar.tex databases=def,dat,lua

Note that in the last version, you need to explicitly force the language to have
an entry in language.dat.lua, otherwise it won't have one. The following line
excerpt is almost equivalent:

file=foobar.tex luaspecial=null

the only difference being that, the first time the language is activated, the
log will read

Loading (null) hyphenation and exceptions for ...

instead of the standard

Loading hyphenation and exceptions for ...

so it is slightly preferred. But I can make all these cases exactly equivalent
if you prefer.
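
To illustrate why the five tlpsrc fragments listed earlier come out the same,
here is a small Python sketch. It is not the TL tools' parser: the function
name is hypothetical, it looks only at file, txtpatt and txthyph, and it
ignores the question of whether an entry is generated at all (which is what
databases=def,dat,lua forces in the last fragment).

```python
import shlex


def normalize(fragment):
    """Reduce a tlpsrc line fragment to (file, txtpatt, txthyph).

    Missing fields and empty values (txtpatt= or txtpatt='') both become
    the empty string, since every generated field is a string.
    """
    fields = {}
    for token in shlex.split(fragment):   # shlex drops the '' quotes
        key, _, value = token.partition("=")
        fields[key] = value
    return (fields.get("file", ""),
            fields.get("txtpatt", ""),
            fields.get("txthyph", ""))


fragments = [
    "file=foobar.tex txthyph= txtpatt=",
    "file=foobar.tex txthyph='' txtpatt=''",
    "file=foobar.tex txthyph=",
    "file=foobar.tex txtpatt=",
    "file=foobar.tex databases=def,dat,lua",
]
normalized = {normalize(fragment) for fragment in fragments}
```

With these inputs, normalized contains a single tuple, which is the sense in
which the fragments are equivalent.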

> 4.a.) Now that you have modified TL tools that are controlling
> generation of language.dat.lua you can finally be sure that
> language.dat should not contain any other language but the ones in
> language.dat.lua, so why do you still want to read language.dat and
> check if "maybe there was someone else who has added something to
> language.dat"?
For two reasons:
- one is technical: not reading language.dat at all means more changes to the
babel code than I'd like to do right now (or ever) (not to mention etex.src and
- the other reason is that it allows us to very easily continue to support the
"old" way of loading languages, thus keeping full backward compatibility both in
TeX code and in tlpsrc format.

> 4.b.) You still need to think about Akira (W32TeX) and CS (MikTeX),
> even though MikTeX doesn't support LuaTeX yet. I would find it a
> nightmare if you would require from MikTeX to mimic TL's on-the-fly
> generation of language.dat.lua, if nothing else because there's no way
> that you could control that.
We don't force anything. MikTeX can choose to use a monolithic language.dat.lua
if they want to: the current code allows that.

(Anyway, MikTeX probably has code for generating language.dat and language.def
(imposed by babel and etex.src), so I doubt generating one more similar file
would be a big problem for Christian. But that's not my point. My point is,
nothing in the current state of things forces him to do so.)

> 5.) ibycus
> luaspecial="disabled:only usable in 8bit engines"
> We may keep luaspecial, but once all the other aspects are considered:
> do we really need one with ibycus? My order of preference:
> a) don't include it in language.dat.lua file at all (controlled with
> engines= or enable_8bit=...,enable_utf/luatex/xetex)

Are you aware it means dumping it in the format???

> Yes, I probably missed other points, but that should be enough for the
> first round (I did not look into TeX code at all nor do I plan to do
> so).
You don't need to look into TeX code. I'll try to improve the documentation for
the various parts if you think it's not explicit enough.

