[luatex] Support for Thai in LuaTeX

Mojca Miklavec mojca.miklavec.lists at gmail.com
Tue May 14 16:16:31 CEST 2013


I've recently added Thai hyphenation patterns to hyph-utf8. I have a
few questions though (I was already discussing it a bit with Taco with
respect to Lao or some Ethiopic script on some field trip a long time
ago).


Words in Thai aren't separated with spaces, so one can end up with
potentially infinite strings that TeX considers to be a single word.
According to my understanding there are two problems to be solved:
a) splitting sentences into words
b) syllabification of words
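
To make step (a) concrete, here is a minimal sketch of dictionary-based
word splitting using greedy longest match. The three-word lexicon and the
function name segment are made up for illustration only; this is not how
swath or ICU are implemented, and a usable segmenter needs a large
dictionary plus some way to resolve ambiguous splits.

```python
# Toy lexicon, purely illustrative (an assumption, not taken from any tool).
LEXICON = {"ภาษา", "ไทย", "สวัสดี"}


def segment(text, lexicon=LEXICON):
    """Greedy longest-match split; unknown characters become one-character tokens."""
    max_len = max(len(w) for w in lexicon)
    words, i = [], 0
    while i < len(text):
        limit = min(max_len, len(text) - i)
        # Try the longest candidate first, fall back to a single character.
        n = next((k for k in range(limit, 0, -1) if text[i:i + k] in lexicon), 1)
        words.append(text[i:i + n])
        i += n
    return words


print(segment("สวัสดีภาษาไทย"))
```

Step (b) would then run on each resulting word separately, which keeps
every "word" that the hyphenator sees short.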

Hyphenation patterns could in principle do both simultaneously, but at
one point (at 64 or 256 characters in LuaTeX) TeX runs into a problem
of "too long word to hyphenate" and simply stops. (I still believe
that the hyphenation algorithm should be able to work on infinite
strings as long as hyphenation patterns are of finite length, but I'm
not comfortable working with TeX sources, and this is a bit off-topic.)
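
The reasoning behind that belief can be sketched as follows. Liang-style
pattern matching only ever looks at windows no longer than the longest
pattern, so nothing in the algorithm itself depends on the length of the
word. The code below is a simplified sketch with made-up toy patterns and
hypothetical function names; it ignores \lefthyphenmin and
\righthyphenmin and says nothing about how TeX or LuaTeX store words
internally.

```python
def parse_pattern(pat):
    """Split a pattern such as 'a1b' into letters 'ab' and weights [0, 1, 0]."""
    letters, weights = "", [0]
    for ch in pat:
        if ch.isdigit():
            weights[-1] = int(ch)      # weight for the gap before the next letter
        else:
            letters += ch
            weights.append(0)
    return letters, weights


def break_points(word, patterns):
    """Return indices i such that a break is allowed before word[i]."""
    table = dict(parse_pattern(p) for p in patterns)
    max_len = max(len(k) for k in table)
    text = "." + word + "."            # dots mark the word boundaries
    scores = [0] * (len(text) + 1)     # scores[j] is the gap before text[j]
    for start in range(len(text)):
        # Only windows up to max_len are inspected, whatever len(word) is.
        for n in range(1, min(max_len, len(text) - start) + 1):
            weights = table.get(text[start:start + n])
            if weights:
                for k, w in enumerate(weights):
                    scores[start + k] = max(scores[start + k], w)
    # Odd scores allow a break; word[i] sits at text[i + 1].
    return [i for i in range(1, len(word)) if scores[i + 1] % 2 == 1]


long_word = "ab" * 500                 # 1000 characters, far beyond 64 or 256
print(len(break_points(long_word, ["a1b"])))
```

The work done is proportional to the word length times the maximum
pattern length, so a length limit looks like a property of the
implementation rather than of the method.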

I thought at first that the ICU library in XeTeX does both, but I was told
that it only does word-splitting, so hyphenation still remains to be
done. (Honestly, I don't see why it couldn't do syllabification in
addition to word splitting since determining boundaries of syllables
must be an easier problem, but I might be wrong.)

In pdfTeX the problem is solved by running a special program "swath":
- http://linux.thai.net/projects/swath (currently broken site)
- http://www.cs.cmu.edu/~paisarn/software.html (broken link)
which reads the input file and creates an output tex file with a
command sequence \wbr inserted between words. After that TeX can do
its job to hyphenate separate words easily, but that requires an
external preprocessor and "latex" as such cannot be run


My question is: are there any plans or visions about how the problem
should be tackled in LuaTeX in the most elegant way, given the absence
of the ICU library?

I could also ask differently: suppose that a motivated Thai programmer
were willing to work on solving the problem properly. What would
be the suggested solution?

Thank you,
