[tex-hyphen] UTF-ization of hyphenation patterns

Mojca Miklavec mojca.miklavec.lists at gmail.com
Thu May 15 22:39:27 CEST 2008

On Thu, May 15, 2008 at 9:35 PM, Hans Hagen wrote:
> Mojca Miklavec wrote:
>> - create new wrappers that operate in the opposite direction as
>> Jonathan's xu-XXhyph.tex; once done, drop those xu-XXhyph.tex files
> and forget about generic wrapper ... my experience is that all macro
> packages do things slightly different; so you need wrappers for plain
> and for latex; context has its own code

Sure, the wrapper is for LaTeX and plain TeX only. It makes no sense
to use it in ConTeXt - xu-XXhyph.tex files (the ones that need to be
replaced) have never been used in ConTeXt either. The wrappers are
there to help users of LaTeX and plain, but ConTeXt can still profit
from being able to forget about lines like:

            when 'de', 'deo' then
                demap(/\\c\{/, "\\delete{")
                demap(/\\n\{/, "\\keep{")
                remap(/\\3/, "[ssharp]")
                remap(/\\9/, "[ssharp]")
                remap(/\"a/, "[adiaeresis]")
                remap(/\"o/, "[odiaeresis]")
                remap(/\"u/, "[udiaeresis]")
            when 'agr' then
                # bug fix
                remap("a2|", "[greekalphaiotasub]")
                remap("h2|", "[greeketaiotasub]")
                remap("w2|", "[greekomegaiotasub]")
                remap(">2r1<2r", "[2ῤ1ῥ]")
                remap(">a2n1wdu'", "[ἀ2ν1ωδύ]")
                remap(">e3s2ou'", "[ἐ3σ2ού]")
                # main conversion
                remap(/\<\'a\|/, "[greekalphaiotasubdasiatonos]")

and being able to use any other patterns out of the box without having
to decrypt the funny conventions for each language separately. Once
done, it's done for everyone: for XeTeX & LuaTeX, for ConTeXt, and
purely theoretically, OpenOfficers would have one step less to do as well.
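
To illustrate what "out of the box" could mean, here is a minimal sketch
that maps the German escapes from the snippet above straight to UTF-8
characters instead of named placeholders. The table name GERMAN_ESCAPES,
the helper utf8ize and the sample pattern are made up for illustration;
this is not the code used in ConTeXt or in the converted pattern files.

```ruby
# Hypothetical table: the TeX-style escapes from the 'de'/'deo' branch
# above, mapped directly to UTF-8 characters.
GERMAN_ESCAPES = {
  '\\3' => 'ß',
  '\\9' => 'ß',
  '"a'  => 'ä',
  '"o'  => 'ö',
  '"u'  => 'ü'
}.freeze

# Replace every known escape in one pass; everything else is left untouched.
def utf8ize(pattern, table = GERMAN_ESCAPES)
  regexp = Regexp.union(table.keys)
  pattern.gsub(regexp) { |escape| table[escape] }
end

# Made-up pattern, only to show the effect.
puts utf8ize('.k"a1se')   # prints ".kä1se"
```

Once the pattern files themselves are stored in UTF-8, a table like this
has to be applied only once, at conversion time, rather than by every
consumer of the patterns.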

Also, it makes no sense for ConTeXt to drop the shipped patterns. I'm
much happier that I do not need to worry about providing
the proper version of the patterns in the distribution (taking care of
providing the proper binary and proper fonts is already fun :) And if
you take into account that users update their ConTeXt in other
distributions - no way to drop your files. In LaTeX, patterns are part
of the distribution. In ConTeXt, they need to be part of your package
since you never know what will be available in the distribution, esp.
if file names are changed again.

>> - a converter between UTF-8 and ec/texnansi/qx/t5/...
> afaik most patterns assume ec; before adding numerous files, first figure
> out what is needed (probably only latex since plain is 7 bit)

Of the first ten patterns that have been converted, one uses qx
(Polish) and two are ASCII (British English and Italian - they decided
that it's not worth the trouble to include a few accented characters).
French and German are theoretically EC, but duplicate some patterns so
that ß and œ also work with OT1 (do not ask me why) and texnansi (in a
slightly ugly way). Some are EC but also work with texnansi, since the
few special characters share the same slots in both encodings (Finnish,
Swedish); some are pure EC (Croatian, Slovenian). The rest are on the
waiting list.

I have not looked at Cyrillic, Greek, Mongolian, ... at least not yet.

I have written a simple script which automatically generates the
converters from UTF-8 into other encodings (given an enc file), so no
worries. The important Latin encodings are done, and once I come to
anything else, my understanding of the content will stop very soon
anyway. The plan is to include only some patterns this year. For
exotic and likely-to-break ones, there will be enough time until the
next TeX Live.
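
For readers wondering what such a generator might look like, here is a
rough sketch, not the actual script. It assumes the enc file is a
PostScript-style encoding vector (glyph names between [ and ], with %
comments), that ASCII characters keep their usual slots in the target
encoding, and that a glyph-name table is available. GLYPH_TO_CHAR,
parse_enc, utf8_to_slot, convert and the toy SAMPLE_ENC vector are all
hypothetical names invented for this example.

```ruby
# Tiny hand-picked glyph-name table; a real script would need a full
# glyph list. (Assumption: not the table used by the actual script.)
GLYPH_TO_CHAR = {
  'adieresis'  => 'ä',
  'odieresis'  => 'ö',
  'germandbls' => 'ß'
}.freeze

# Return the glyph names of the encoding vector; the array index is the slot.
def parse_enc(source)
  without_comments = source.gsub(/%.*$/, '')
  vector = without_comments[/\[(.*?)\]/m, 1] ||
           raise(ArgumentError, 'no encoding vector found')
  vector.scan(%r{/([^\s/\[\]]+)}).flatten
end

# Build a map from UTF-8 character to slot number.
def utf8_to_slot(enc_source, glyphs = GLYPH_TO_CHAR)
  map = {}
  parse_enc(enc_source).each_with_index do |glyph, slot|
    char = glyphs[glyph]
    map[char] ||= slot if char   # keep the first slot if a glyph repeats
  end
  map
end

# Convert UTF-8 text to a byte string in the target encoding.
# ASCII is passed through unchanged; unknown characters raise KeyError.
def convert(text, map)
  text.each_char.map { |c| c.ascii_only? ? c.ord : map.fetch(c) }.pack('C*')
end

# Toy vector with three slots, NOT a real encoding.
SAMPLE_ENC = <<~ENC
  % toy vector for illustration only
  /ToyEncoding [
    /.notdef /adieresis
    /germandbls
  ] def
ENC

map = utf8_to_slot(SAMPLE_ENC)
p convert('ä1ß', map).bytes
```

Raising on unknown characters is a deliberate choice in this sketch: a
pattern that cannot be represented in the target encoding is better
reported than silently dropped.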

