[tex-live] Hyphenation patterns, Unicode, XeTeX, and language.dat

Jonathan Kew jonathan_kew at sil.org
Thu Aug 17 14:51:35 CEST 2006


(Sorry, long message! See end for specific changes proposed in TL.)

I'd like to explore solutions for the problem of loading all the  
various hyphenation patterns in TeX Live when running the XeTeX  
engine (and using Unicode-compliant fonts). This relates primarily to  
LaTeX, though the techniques here could be used by other formats  
(Plain-based or other) too, and may also be helpful as other engines  
move towards greater Unicode support.


First, what exactly are the issues? There are a couple of reasons why  
a working "xelatex" format cannot be built using the existing  
language.dat and pattern files found in TL today:

(1) Patterns are loaded according to a specific font encoding. This  
is how TeX works: the hyphenation rules are applied to sequences of  
font-specific character codes. In XeTeX, we focus on Unicode as the  
current standard for character encoding, but the patterns found in TL  
are designed for various 8-bit font encodings used in the traditional  
TeX world. Therefore, for correct hyphenation of Unicode text, it  
will be necessary to re-encode the patterns to Unicode character  
codes (except in cases, such as English, where the 8-bit character  
codes used already correspond to Unicode values).

(2) Some of the pattern files are stored in pure 7-bit ASCII, using
escape sequences where it is necessary to represent non-ASCII
characters; but others are stored in 8-bit encodings such as TeX T1,
T2A, etc. Because XeTeX defaults to reading input text as UTF-8
Unicode, byte values >=128 in such files will be interpreted as parts
of UTF-8 sequences (usually invalid ones) rather than as the
intended characters, so special care is needed to read such files
correctly.
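
To make point (2) concrete, here is a minimal sketch of how a wrapper
might read an 8-bit pattern file under XeTeX. It assumes a
hypothetical file "xxhyph.tex" stored in T1 encoding that uses the
bytes "A3, "B2 and "BA for c-caron, s-caron and z-caron; the file
name and the choice of these three characters are illustrative only.
\XeTeXdefaultencoding and the "bytes" encoding name are XeTeX
features; the rest is ordinary TeX.

   --------------------------------------
   % Sketch only: XeTeX-specific, hypothetical file name xxhyph.tex
   \begingroup
   % Files opened from here on are read byte by byte, not as UTF-8
   \XeTeXdefaultencoding "bytes"
   % Make the T1 byte codes active and map each to a Unicode character
   \catcode"A3=13
   \def^^a3{^^^^010d}% T1 "A3 -> U+010D (c with caron)
   \catcode"B2=13
   \def^^b2{^^^^0161}% T1 "B2 -> U+0161 (s with caron)
   \catcode"BA=13
   \def^^ba{^^^^017e}% T1 "BA -> U+017E (z with caron)
   % \patterns needs a nonzero \lccode for every character it sees
   \lccode"010D="010D \lccode"0161="0161 \lccode"017E="017E
   \input xxhyph.tex
   % Restore the default for files opened later
   \XeTeXdefaultencoding "utf8"
   \endgroup
   --------------------------------------

This only works as written if the pattern file does not itself
reassign the catcodes of those bytes; a real wrapper has to cope with
whatever catcode and lccode settings the pattern file makes, as the
Slovenian wrapper later in this message does.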


While a "global" clean-up/harmonization of pattern files, looking at  
how they manage encodings, loading techniques, catcode and lccode  
usage, etc., would be a Good Thing (IMHO), this would clearly be a  
long-term project, involving interaction with numerous original  
authors or maintainers (some of whom may be difficult to track down,  
or have little current interest). I'd like to see this addressed, but  
at this time I want to tackle the more immediate problem of making  
things work in TeX Live, given the collection of pattern files we  
have today.

My current plan, therefore, is to leave the actual pattern files  
untouched, and provide "wrapper" files that can load them with  
appropriate settings for XeTeX, setting the input text encoding and  
remapping character codes to Unicode as needed.

As an example, consider the file "xu-sihyph.tex". (The "xu-" prefix  
is intended to suggest XeTeX and Unicode, though as other Unicode  
engines become available, this may be extended to support them.)  
Details vary for other wrappers, depending on exactly how the pattern  
file is written and what character coding it assumes, but the general  
idea remains the same.

   --------------------------------------
   % xu-sihyph.tex
   % Wrapper for XeTeX to read sihyph.tex
   % Jonathan Kew, 2006-08-17

   \begingroup

   \input ifxetex.sty
   \ifxetex
     % Define the accent macro " to expand to the required Unicode characters
     \catcode`\"=13
     \def"#1{\ifx#1c^^^^010d\else \ifx#1s^^^^0161\else \ifx#1z^^^^017e\else
         \errmessage{Hyphenation pattern file corrupted!}%
       \fi\fi\fi}
     \catcode`\"=12 % reset catcode so we can read \lccode etc in sihyph.tex
     %
     \let\PATTERNS=\patterns
     \def\patterns{% at the \patterns command in sihyph.tex...
       \endgroup % end group to discard definitions from sihyph
       \begingroup % and start our own (to match \endgroup in sihyph)
       \lefthyphenmin=2 \righthyphenmin=3 % settings from sihyph.tex
       \catcode`\"=13 % activate our definition of " from above
       \PATTERNS % and then load the real patterns
     }
   \fi

   \input sihyph.tex

   \endgroup
   \endinput
   --------------------------------------

This allows the existing Slovenian patterns to be loaded in XeTeX and  
applied to Unicode text. So when creating the xelatex format, we need  
to use a version of language.dat that refers to "xu-sihyph.tex"  
instead of the original "sihyph.tex", and similarly for many of the  
other languages.


However, I want to avoid actually maintaining a second copy of  
language.dat for XeTeX (and figuring out where to put it, so that  
each engine will load the right one); this seems like a recipe for  
confusion, as well as complicating things for texconfig or other  
tools. Users should be able to set a *single* collection of language  
choices for LaTeX (or other formats), regardless of which TeX engine  
they're using at a particular moment.

To allow this, the wrapper file uses ifxetex.sty (from
texmf-dist/tex/generic/ifxetex/) to check whether it is being
processed by XeTeX. If so, it remaps characters to Unicode as needed,
and discards unneeded definitions from the pattern file; but if read
by a standard TeX engine, it will simply load the old pattern file
without changing anything.
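
For readers who want to see what such an engine test amounts to, the
following sketch does a comparable check without loading any package.
It is not the code of ifxetex.sty; it simply assumes that the
primitive \XeTeXrevision is defined under XeTeX and undefined in
other engines.

   --------------------------------------
   % Sketch: detect XeTeX without loading any package
   \begingroup
   \expandafter\ifx\csname XeTeXrevision\endcsname\relax
     % \XeTeXrevision is undefined: a standard 8-bit engine.
     % Nothing to set up; the pattern file is loaded unchanged.
   \else
     % XeTeX: the Unicode remapping definitions would go here
     \message{(loading patterns for Unicode)}%
   \fi
   \input sihyph.tex
   \endgroup
   --------------------------------------

Because the test runs inside a group, the \relax meaning that
\csname gives to an undefined name is discarded again at \endgroup.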

Therefore, it is valid for language.dat to refer to the "xu-" wrapper  
file *in all cases*, and the patterns will be loaded in "legacy" mode  
(for whatever font encodings they happen to support) by [pdf]tex  
engines, and as Unicode by xetex.


** Proposal **

I have begun to write "xu-___.tex" wrapper files for the patterns
currently available in TL (most are trivial), to allow xetex to load
the existing (non-Unicode) files. I suggest that these wrappers go
into texmf/tex/generic/xu-hyphen (as a sibling directory to
texmf/tex/generic/hyphen).

Then we modify the "language.__.dat" files in
texmf/tex/generic/config to refer to the xu- wrapper files (in the
cases where one is necessary), and the pre-built "language.dat" will
change similarly.
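
As a rough illustration of the change, a fragment of language.dat
might then look like the lines below. The language names shown are
assumptions for the sake of the example and may not match the
entries in the actual TL files; only the switch from "sihyph.tex" to
"xu-sihyph.tex" is the point.

   --------------------------------------
   % language.dat fragment (illustrative only)
   english   hyphen.tex      % pure ASCII: no wrapper needed
   slovene   xu-sihyph.tex   % previously: sihyph.tex
   --------------------------------------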

The net result will be that standard 8-bit TeX loads exactly the
same patterns as it does today (at the cost of a few extra \input
operations during format creation, which is insignificant), and
XeTeX loads the same set of patterns, but encoded for use with
Unicode text.


Before actually making changes to something as central as  
language.dat, however, I'd like to hear any concerns or objections to  
this proposed strategy, or alternative suggestions that could make  
things simpler for us all.

Thanks in advance for any and all feedback!

-- JK


