[tex-hyphen] UTF-ization of hyphenation patterns

Mojca Miklavec mojca.miklavec.lists at gmail.com
Thu May 15 19:24:47 CEST 2008


Hello,

to revive this mailing list a bit ...

In ConTeXt, changes happened earlier and independently, but in the main
TeX distributions (TL, MiKTeX) a major change in hyphenation pattern
loading happened approximately 1.5 years ago, when Jonathan Kew had
to adapt the patterns for the new "Unicode world".

Currently LaTeX loads xu-XXhyphen.tex, where XX stands for some
language. xu-XXhyphen checks whether the patterns are being loaded by XeTeX:
a) if not, XXhyphen is loaded in the same way as before;
b) if XeTeX is reading the patterns, some clever tricks are used
to convert the cryptic codes from the original files ("a, "o, "u, /3, /s, /x,
^^ba, ...) into UTF-8, which XeTeX is then able to understand.
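
To make direction b) concrete, here is a rough sketch in Python (not the
actual TeX macros, which work quite differently). The two mapping tables
are assumptions for illustration only: a few German-style shorthands, and
a small slice of the EC encoding as I remember it. The function name is
made up, and notations such as /3 or /s are deliberately left out.

```python
import re

# Assumed, partial mapping for German-style shorthand; the real pattern
# files use more notations than shown here.
UMLAUT_SHORTHAND = {'"a': "ä", '"o': "ö", '"u': "ü"}

# Assumed slice of the EC (T1) encoding, used to resolve ^^xx codes.
EC_BYTES = {0xA3: "č", 0xB2: "š", 0xBA: "ž"}


def legacy_to_utf8(pattern: str) -> str:
    """Rewrite one legacy pattern into plain Unicode text (hypothetical helper)."""
    for shorthand, char in UMLAUT_SHORTHAND.items():
        pattern = pattern.replace(shorthand, char)

    # ^^xx is TeX's notation for the character with hexadecimal code xx;
    # codes missing from the table are left untouched.
    def resolve(match):
        code = int(match.group(1), 16)
        return EC_BYTES.get(code, match.group(0))

    return re.sub(r"\^\^([0-9a-f]{2})", resolve, pattern)
```

The patterns one would feed into it (for example '.k"u3' or '1^^bae')
are invented here and not taken from any real pattern file.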

This year, LuaTeX is also on its way. Even though it is not yet that
powerful when used with LaTeX, it might need proper Unicode hyphenation
patterns at some point as well.

To clean up some mess in loading the patterns, it would be nice to go
the other way around, and start from proper Unicode (UTF-8) patterns
and let (pdf)TeX interpret UTF-8 patterns in its own way instead of
both XeTeX and LuaTeX having to deal with some really weird encodings
in patterns.

Karl Berry has set up a new repository at
    svn://tug.org/texhyphen
to start the transition to a new, UTF-8 based solution. The macros
were mostly written by Jonathan and Taco, while I have converted some
patterns to UTF-8 and am planning to convert some more (anyone is
welcome to help). Wrappers still need to be written for most languages
(there is only one example there at the moment), but I might write a
short script to auto-generate them, since all the ugly code that is
now part of those XXhyph.tex files has been moved outside.


The main idea was:

- do not modify any file with the old patterns - leave all the old
files intact on CTAN and keep all those files on TeX Live as well;
some tools might still depend on them and it's nice to have some
historic evidence; however, start moving them to "obsolete"

- generate new UTF-8 patterns out of those files with a consistent naming scheme

- create new wrappers that operate in the opposite direction to
Jonathan's xu-XXhyph.tex; once done, drop those xu-XXhyph.tex files

- once done, change the entries in language.dat to use the new wrappers, but
this year only modify those entries for which we can be sure that they
will work (German is not one of them, but I can sacrifice Slovenian
patterns, and some more are ready); for next TeX Live, we can complete
the list, but this year we can use only a subset to make sure that
even in the worst scenario of breaking something, we don't break all
the patterns at once

- once done, promote the changes, notify authors, CTAN, magazines, etc.

- the change should not influence line breaking in pdfTeX in any way;
the approach keeps all the functionality, it only cleans up the mess

Some will probably ask: what about licensing? I'm not an expert, but
we're not going to modify the original files in any way; we are only
doing the same thing OpenOffice did years ago when they
converted the patterns from TeX into their own format and added a
README file next to each file. The idea is to generate new files and
add credits to the original authors. The next step would be to promote
the changes and explain to both the authors of the patterns and the CTAN
team that new patterns are only accepted in UTF-8 format, so anyone who
would like to upload a new version of the patterns (based on a new
dictionary) should submit them in this new format.

The new approach uses:
- pattern files in UTF-8, clean, without any TeX macros, one pattern per line
- possibly the same for exceptions
- a converter between UTF-8 and ec/texnansi/qx/t5/...
- a pattern loader that recognises whether the engine is UTF-8-capable, and
either preloads the converter (for pdfTeX) or loads the raw patterns
(XeTeX, LuaTeX)
- a file for each language that calls the pattern loader with the proper
parameters - language and encoding (in the future it should be possible to
get rid of those files as well - language.dat with some macros could
take care of proper loading)
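
The loader logic from the list above can be sketched roughly like this.
It is written in Python only to show the idea; the real loader and
converter are TeX macros. The table is an assumed slice of the EC
encoding, the function names are made up, and the engine check is
reduced to a boolean parameter.

```python
# Assumed slice of the EC (T1) encoding; a real converter would need the
# full table, plus separate tables for texnansi, qx, t5, ...
UTF8_TO_EC = {"č": 0xA3, "š": 0xB2, "ž": 0xBA}


def encode_pattern(pattern: str, table: dict) -> bytes:
    """Convert one UTF-8 pattern to an 8-bit encoding (hypothetical helper)."""
    out = bytearray()
    for ch in pattern:
        if ord(ch) < 128:
            out.append(ord(ch))      # ASCII letters and digits pass through
        elif ch in table:
            out.append(table[ch])    # accented letter: look up its slot
        else:
            raise ValueError(f"{ch!r} is not available in this encoding")
    return bytes(out)


def load_patterns(lines, engine_is_utf8: bool, table: dict) -> list:
    """Return the patterns as bytes, one pattern per non-empty input line."""
    patterns = [line.strip() for line in lines if line.strip()]
    if engine_is_utf8:
        # XeTeX, LuaTeX: hand over the raw UTF-8 patterns
        return [p.encode("utf-8") for p in patterns]
    # pdfTeX: run every pattern through the converter first
    return [encode_pattern(p, table) for p in patterns]
```

With an invented pattern such as "1že", the pdfTeX branch would produce
the single byte 0xBA for the accented letter, while the UTF-8 branch
keeps its two-byte UTF-8 form.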

The naming scheme and macros are far from being set in stone, but they
already work.
I would like to hear some suggestions concerning LuaTeX, which is
capable of handling some more advanced patterns.

Comments and suggestions welcome.

Mojca

