[XeTeX] Word wrapping in Lao

François Charette firmicus at ankabut.net
Wed Apr 21 16:05:41 CEST 2010

> For computational linguistic applications, where the wrong word boundary
> results in a mis-parse, I believe that finding "correct" word boundaries is
> still a research problem, and cannot be solved by dictionary lookup alone.
> For Thai, which is (I believe) similar to Lao in this respect, you might
> have a look at this:
>     http://www.cs.cmu.edu/~paisarn/software.html
> which implements three algorithms: Longest Matching, Maximal Matching and
> Part-of-Speech Bigram. That's a bit old, but it gives some idea of the
> depth of the problem.  Or there's a comparison of different approaches for
> Thai (which I believe dates from 2008) here:
>     http://www.cs.ait.ac.th/~mdailey/papers/Choochart-Wordseg.pdf
> If you want more, try googling 'word segmentation thai' (you can google
> for Lao too, but it appears there has been much more research on word
> segmentation for Thai).
Thanks for these interesting links.

I am also aware of this:
http://linux.thai.net/pub/thailinux/cvs/software/cttex/ (which is also 
packaged in Debian). It is another dictionary-based tool for finding 
Thai wordbreaks. I have actually used it to generate wordbreak macros in 
the file example-thai.tex that comes with polyglossia. I don't know 
which algorithm it relies upon (probably "longest matching"). However, 
the approach suggested by Jonathan (namely the ICU implementation via 
\XeTeXlinebreaklocale "th") may actually be superior to the above. It is 
in any case the most convenient one for XeTeX users, as it removes the 
need for a preprocessor.
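
For readers who have not met "longest matching" before, here is a minimal 
Python sketch of the idea. It is only an illustration, not the code of cttex 
(whose algorithm I have not verified) or of ICU. The function name and the 
toy Latin-script dictionary are made up for the example and stand in for a 
real Thai word list.

```python
def longest_match_segment(text, dictionary):
    """Greedy left-to-right segmentation: at each position, take the
    longest dictionary word that starts there."""
    max_len = max((len(w) for w in dictionary), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a word matches.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                words.append(text[i:j])
                i = j
                break
        else:
            # No dictionary word starts here: emit one character and move on.
            words.append(text[i])
            i += 1
    return words


# Hypothetical toy dictionary, standing in for a Thai word list.
toy_dict = {"the", "them", "me", "men", "end", "mend"}
print(longest_match_segment("themend", toy_dict))
```

On this input the greedy choice yields "them" + "end", although "the" + 
"mend" is an equally valid reading that the algorithm never considers. That 
is the kind of ambiguity a dictionary alone cannot resolve.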

BTW, I just checked the latest sources of ICU4C: there is indeed no such 
implementation for Lao yet (nor for Khmer or Myanmar afaics). I am 
however puzzled by the fact that the ICU source tarball does not appear 
to provide a Thai dictionary for word-breaking purposes, even though the 
engine implies the availability of such a dictionary (I expected a file 
like "thaidict.brk" somewhere, which is mentioned in 
source/tools/genrb/genrb.c). Or did I miss something?

