[XeTeX] Word wrapping in Lao

Sam Putman atmanistan at gmail.com
Wed Apr 21 18:26:36 CEST 2010

2010/4/21 François Charette <firmicus at ankabut.net>:
>> For computational linguistic applications, where the wrong word boundary
>> results in a mis-parse, I believe that finding "correct" word boundaries is
>> still a research problem, and cannot be solved by dictionary lookup alone.
>> For Thai, which is (I believe) similar to Lao in this respect, you might
>> have a look at this:
>>    http://www.cs.cmu.edu/~paisarn/software.html
>> which implements three algorithms: Longest Matching, Maximal Matching and
>> Part-of-Speech Bigram. That's a bit old, but it gives some idea of the
>> depth of the problem.  Or there's a comparison of different approaches for
>> Thai (which I believe dates from 2008) here:
>>    http://www.cs.ait.ac.th/~mdailey/papers/Choochart-Wordseg.pdf
>> If you want more, try googling 'word segmentation thai' (you can google
>> for Lao too, but it appears there has been much more research on word
>> segmentation for Thai).
> Thanks for these interesting links.
> I am also aware of this:
> http://linux.thai.net/pub/thailinux/cvs/software/cttex/ (which is also
> packaged in Debian). It is another dictionary-based tool for finding Thai
> wordbreaks. I have actually used it to generate wordbreak macros in the file
> example-thai.tex that comes with polyglossia. I don't know which algorithm
> it relies upon (probably "longest matching"). However the approach suggested
> by Jonathan (namely the ICU implementation via \XeTeXlinebreaklocale "th")
> may actually be superior to the above. It is in any case the most convenient
> one for XeTeX users, as it removes the need for a preprocessor.
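
For readers unfamiliar with the "longest matching" approach mentioned in the quoted text, here is a minimal Python sketch of the general idea. It is not the code used by cttex, ICU, or the CMU tools. The function name and the four-word toy dictionary are assumptions made for illustration only; a real segmenter needs a full word list and a better fallback for unknown text.

```python
def longest_matching(text, dictionary):
    """Greedy left-to-right segmentation.

    At each position, take the longest dictionary word that starts there.
    Characters not covered by any dictionary word become one-character tokens.
    """
    max_len = max((len(w) for w in dictionary), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a word is found.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                break
        else:
            # No dictionary word starts here: emit a single character.
            j = i + 1
        words.append(text[i:j])
        i = j
    return words


# Hypothetical toy dictionary (an assumption, not taken from any real tool):
# "eye", "to expose", "round", "wind".
TOY_DICT = {"ตา", "ตาก", "กลม", "ลม"}

print(longest_matching("ตากลม", TOY_DICT))
```

With this toy dictionary the string "ตากลม" comes out as "ตาก" + "ลม", although "ตา" + "กลม" is an equally valid dictionary split. A greedy lookup has no way to choose between the two readings, which is the kind of wrong boundary the quoted text says dictionary lookup alone cannot resolve.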

This document makes some interesting contributions to this discussion:


In short, there are other people interested in solving this problem to
provide proper internationalization in the major FLOSS applications.

Here is a research paper on Lao linebreaking in particular:



