[tex-hyphen] zerohyph.tex and hyph-utf8

Stephan Hennig mailing_list at arcor.de
Sun Aug 24 21:04:29 CEST 2014


On 23.08.2014 at 00:49, Mojca Miklavec wrote:

>> But I couldn't find one in hyph-utf8.  I've looked at the
>> various null.* files in the texmf tree, but all of them contain text
>> comments.  Is there any file in TeX that is guaranteed to be empty?

> If the Lua code for loading hyphenation patterns as shipped by
> hyph-utf8 (the part written by Manuel, Élie and Khaled) is not
> flexible enough, we should fix *that*.

There's nothing wrong with hyph-utf8.  My association of an empty file
with package hyph-utf8 is by pure analogy of file zerohyph.tex only.


> What exactly do you want to do?

I'm exploring the idea of pattern driven node list manipulations.
Patterns are applied to the words in a node list and a user can then
manipulate or augment the node list according to the result of the
pattern matching.  Applications are, e.g., non-standard hyphenation,
breaking of wrong ligatures, automatic round-/long-s conversion for
black letter types, etc.

The node list scanning function does two things at once: i) it
recognizes words of a certain language in a node list and ii) it applies
Liang patterns to the words.  Both tasks are done in parallel, which I
think makes sense, because applying patterns to node lists is what the
package I'm writing is about :-) and it saves one iteration over the
nodes in the list.

Now, I've come across one application (so far) where patterns should be
applied to only parts of a word.  The result of the initial pattern
matching can therefore be discarded and patterns need to be applied
afterwards a second time by the user.  Therefore, in this case, the node
list scanning function could as well receive an empty pattern file as
argument.  I asked here for an empty file because I had found file
zerohyph.tex, but didn't know it was considered a hack.  I'll solve the
problem in my code.

BTW, my code is publicly available at
<URL:https://github.com/sh2d/padrinoma/>.  There are some examples in
directory examples/.  You might be interested in these:

  examples/luatex/non-standard-hyphenation-german
  examples/luatex/hyphenate-with-explicit-hyphen
  examples/lua/patternize

To run the examples, please see
<URL:https://github.com/sh2d/padrinoma/blob/master/examples/README>.  An
appetizer:

> $ echo demonstration |texlua patternize.lua -p hyph-en-us.pat.txt -v
> spot mins: 2 2
> special characters: '- = .'
> pattern file: c:/texlive/2014/texmf-dist/tex/generic/hyph-utf8/patterns/txt/hyph-en-us.pat.txt
> 4938 patterns read.
> 
>  . d e m o n s t r a t i o n .
>    d4e m
>       1m o
>          o2n
>           2n1s2
>          o n3s
>    d e4m o n s
>              s t4r
>               1t r a
>                       2i o
>                     1t i o
>                          o2n
>                  r a t i o n4
>  .0d0e4m0o2n3s2t4r0a1t2i0o0n0.
> demon-stra-tion
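
For readers unfamiliar with Liang's algorithm, here is a rough sketch of
how the rows of the trace above combine into its last two lines.  It is
written in Python rather than the Lua used by padrinoma.  The names
parse_pattern, hyphenate and PATTERNS are made up for illustration and
are not padrinoma's API, and PATTERNS contains only the patterns visible
in the trace, not the full hyph-en-us set.  Each pattern contributes its
digits at the inter-letter positions where its letters match, the
maximum wins at each position, and odd values inside the spot mins mark
break points.

```python
import re

# Only the patterns that appear in the trace above (an assumption made
# for brevity; the real pattern file holds 4938 patterns).
PATTERNS = ["d4em", "1mo", "o2n", "2n1s2", "on3s", "de4mons",
            "st4r", "1tra", "2io", "1tio", "ration4"]

def parse_pattern(pat):
    """Split a pattern such as '2n1s2' into letters 'ns' and levels [2, 1, 2]."""
    letters = re.sub(r"\d", "", pat)
    levels = [0] * (len(letters) + 1)
    pos = 0
    for ch in pat:
        if ch.isdigit():
            levels[pos] = int(ch)   # digit sits in the gap before letter `pos`
        else:
            pos += 1
    return letters, levels

def hyphenate(word, patterns, left_min=2, right_min=2):
    """Return `word` with '-' inserted at the break points the patterns allow."""
    padded = "." + word.lower() + "."
    # levels[j] is the value of the gap just before padded[j].
    levels = [0] * (len(padded) + 1)
    for pat in patterns:
        letters, pat_levels = parse_pattern(pat)
        start = padded.find(letters)
        while start != -1:
            for offset, lvl in enumerate(pat_levels):
                levels[start + offset] = max(levels[start + offset], lvl)
            start = padded.find(letters, start + 1)
    parts, last = [], 0
    # Only positions that respect the left and right spot mins are considered.
    for i in range(left_min, len(word) - right_min + 1):
        if levels[i + 1] % 2 == 1:  # odd value: a break is permitted here
            parts.append(word[last:i])
            last = i
    parts.append(word[last:])
    return "-".join(parts)

print(hyphenate("demonstration", PATTERNS))
```

With the default spot mins of 2 and 2 (as in the trace), the values from
d4em before the e and from ration4 after the final n are ignored, which
is why they show up as 0 in the summary line of the trace.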


If you run the hyphenate-with-explicit-hyphen example, you'll notice a
delay during compilation much larger than when you compile the
non-standard-hyphenation-german example.  This is because the former is
the example where patterns have to be applied to parts of a word and
results from the initial pattern matching are discarded.  User code loads
another pattern set independently of module code.  Currently, I'm
loading the same pattern set in both places, that is, I'm loading it
twice.  To make matters worse, in this example, word parts are not
hyphenated with regular German hyphenation patterns, but with custom
compound word hyphenation patterns, which indicate word boundaries,
e.g., not

  Zwei=Drit-tel=Ab-stim-mungs-mehr-heit

but

  Zwei=Drittel=Abstimmungs-mehrheit

(= indicates an explicit hyphen, - is a hyphen resulting from the
pattern matching.)
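
To make "applying patterns to parts of a word" concrete, here is a
minimal Python sketch (not the Lua code from padrinoma).  It assumes
that explicit hyphens appear as '-' in the input and rewrites them as
'=' to match the notation above.  The function hyphenate_compound and
the lookup table TOY_BREAKS are hypothetical stand-ins; a real
implementation would call a pattern matcher instead of a table.

```python
def hyphenate_compound(word, hyphenate_part):
    """Apply hyphenate_part to each part between explicit hyphens.

    Explicit hyphens ('-' in the input, an assumption of this sketch)
    are written as '=' in the output; hyphens added by hyphenate_part
    stay as '-'.
    """
    parts = word.split("-")
    return "=".join(hyphenate_part(part) for part in parts)

# Stand-in for real pattern matching: a table with one known break.
TOY_BREAKS = {"Abstimmungsmehrheit": "Abstimmungs-mehrheit"}

def toy_hyphenate(part):
    """Look the part up in the toy table; unknown parts are left unbroken."""
    return TOY_BREAKS.get(part, part)

result = hyphenate_compound("Zwei-Drittel-Abstimmungsmehrheit", toy_hyphenate)
print(result)
```

Because each part is matched on its own, whatever the scanning function
found for the whole word is not needed, which is the reason the result
of the initial matching can be thrown away in this example.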

Compound word hyphenation patterns are extremely big (30,000 patterns,
220 kB file size).  Loading these patterns twice causes the delay
during compilation.  This is what I wanted to get around when asking for
an empty file.  In fact, any tiny pattern set serves my purpose.  But a
file name of a completely unrelated language doesn't look too
self-descriptive in code, and how does a user know what file name to
choose?  As I've said above, I'll fix it in my code (using '' as file
name for empty patterns or something like that).
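
One possible shape of such a fix, sketched in Python rather than Lua and
not taken from padrinoma: a loader that treats '' as the empty pattern
set and remembers files it has already read, so that asking for the same
file from two places costs only one read.  The function name
load_patterns and the one-pattern-per-line file format are assumptions
of this sketch.

```python
import os
import tempfile
from functools import lru_cache

@lru_cache(maxsize=None)
def load_patterns(file_name):
    """Return the patterns in file_name as a tuple; '' means no patterns.

    Assumes one pattern per line.  Results are cached per file name, so
    a second request for the same file does not read it again.
    """
    if file_name == "":
        return ()
    with open(file_name, encoding="utf-8") as f:
        return tuple(line.strip() for line in f if line.strip())

# Small demonstration with a throwaway pattern file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tiny.pat.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("o2n\n1tio\n")
    first = load_patterns(path)
    second = load_patterns(path)   # served from the cache

print(load_patterns(""), first, first is second)
```

The empty string is self-descriptive in calling code and does not depend
on any particular file being present in the TeX tree.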

Best regards,
Stephan Hennig


