[XeTeX] Word wrapping in Lao

Andrew Cunningham lang.support at gmail.com
Fri Apr 16 14:34:20 CEST 2010


The South-East Asian scripts I tend to work with at the moment break at:

* punctuation
* phrase boundaries (indicated by white space)
* word boundaries (no spaces at word boundaries, except when a word
boundary is a phrase boundary)

Word segmentation would need a dictionary lookup, probably using
longest word matching.
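
A minimal sketch of greedy longest-word matching, in Python. The
function name segment_longest_match and the toy Latin-letter lexicon
are illustrative assumptions, not part of any existing tool; a real
Lao or Khmer segmenter would need a proper word list and should keep a
base character together with its combining marks.

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE, marks an invisible break opportunity


def segment_longest_match(text, lexicon):
    """Split text greedily, left to right, preferring the longest lexicon word."""
    max_len = max(map(len, lexicon), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the lexicon matches.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                break
        else:
            j = i + 1  # nothing matched: emit one character on its own
        words.append(text[i:j])
        i = j
    return words


# Toy stand-in lexicon (assumption): a real one would hold Lao words.
lexicon = {"the", "them", "me", "men", "end"}
print(segment_longest_match("themend", lexicon))  # ['them', 'end']
print(ZWSP.join(segment_longest_match("themend", lexicon)) == "them\u200bend")
```

Greedy matching can pick the wrong split: with the same lexicon,
"themen" comes out as them + e + n rather than the + men, which is one
reason the quality of the dictionary (and any fallback rule) matters.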

A last-resort approach for some languages is syllable boundary segmentation.

You need line breaking rather than hyphenation per se.

For some input systems in Lao and Khmer, the input software will
insert ZWSP (zero width space, U+200B) into the text. Use of ZWSP has
been problematic in the past with justified text, though: some
software breaks complex rendering or inserts a visible space when
justifying text.

For a S'gaw Karen (Myanmar script) project we're working on at the
moment, for both web and print publications, we've developed a keyboard
layout which will automatically insert ZWSP at syllable boundaries,
under certain circumstances.

A regex statement could be used, although getting the exceptions right
would be important.
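
As a rough illustration of the regex idea, here is a Python sketch
that applies a single rule only: Lao vowels that are written before
the consonant (U+0EC0 to U+0EC4) begin a syllable, so a ZWSP can go in
front of them when they follow another Lao character. The function
name is made up for this example, and the rule is far from a complete
syllable grammar; syllables that start with a plain consonant are not
detected at all, and the exceptions mentioned above are not handled.

```python
import re

ZWSP = "\u200b"  # ZERO WIDTH SPACE

# Zero-width match: preceded by any character from the Lao block,
# followed by one of the vowels written before the consonant.
PREVOWEL_BOUNDARY = re.compile(r"(?<=[\u0E80-\u0EFF])(?=[\u0EC0-\u0EC4])")


def insert_zwsp_before_prevowels(text):
    """Insert ZWSP before a Lao prevowel that directly follows Lao text."""
    return PREVOWEL_BOUNDARY.sub(ZWSP, text)


# Arbitrary test sequence (assumption): three Lao letters, then a
# prevowel plus consonant.
sample = "\u0ea5\u0eb2\u0ea7\u0ec3\u0e88"
marked = insert_zwsp_before_prevowels(sample)
print([hex(ord(c)) for c in marked])
```

Because the lookbehind only accepts Lao characters, text that already
has a ZWSP or a space before the vowel is left alone, so running the
function twice gives the same result as running it once.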

An alternative approach would be to loosely use the Chinese and
Japanese models, i.e. characters that can't start/end a line. But this
approach would be simpler for some languages than others.
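
The "can't start/end a line" model can be sketched in Python as
follows. The function name and the two character sets are assumptions
for illustration; the sets for Lao, Khmer or Myanmar script would have
to be worked out per language, and for some of them this model alone
will allow breaks in the wrong places.

```python
def break_opportunities(text, no_start, no_end):
    """Return each index i where a line may break between text[i-1] and text[i].

    no_start: characters that must not begin a line.
    no_end:   characters that must not end a line.
    """
    return [
        i
        for i in range(1, len(text))
        if text[i] not in no_start and text[i - 1] not in no_end
    ]


# Toy sets (assumption): a closing bracket may not start a line,
# an opening bracket may not end one.
no_start = {")"}
no_end = {"("}
print(break_opportunities("a(b)c", no_start, no_end))  # [1, 4]
```

Every position is a break opportunity unless one of the two sets
forbids it, which is why the approach is simple to apply but only as
good as the sets chosen for the script in question.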

Andrew

On 16 April 2010 19:31, Philip TAYLOR <P.Taylor at rhul.ac.uk> wrote:
>
>
> Arthur Reutenauer wrote:
>
>>   This is exactly the problem that TeX's hyphenation algorithm was
>> developed for.  It's exactly as you write: you give a list of rules
>> describing where you can and you can't break words ("hyphenation
>> patterns") and TeX does the job of finding the "nicest" authorized break
>> for you.
>>
>>   I'm responsible with Mojca Miklavec for maintaining the hyphenation
>> patterns in TeX Live; if you can describe the rules more precisely we
>> can add patterns for Lao, Thai and Khmer to the set of patterns we
>> already have (and it's already quite big, coming from several dozens of
>> contributors all over the world).  Mojca added patterns for all the
>> major languages of India last month but we have no languages from
>> South-East Asia yet. I've always understood the word-breaking rules were
>> very different from other languages but I suppose the same mechanism
>> could be adapted; you only need to bring the linguistic knowledge!
>
> I agree with your analysis (and thought much the same), but
> there is a complication : TeX breaks lines only at spaces unless
> it hyphenates a word (default behaviour); what I understand from
> Brian's original message (Brian : please correct me if I am wrong)
> is that Lao breaks between character pairs rather than at spaces,
> and that no hyphenation occurs.  Which made it a fascinating
> challenge and well worthy of attention :-)
>
> ** Phil.
>
>
> --------------------------------------------------
> Subscriptions, Archive, and List information, etc.:
>  http://tug.org/mailman/listinfo/xetex
>



-- 
Andrew Cunningham
Senior Project Manager, Research and Development
Vicnet
State Library of Victoria
Australia

andrewc at vicnet.net.au
lang.support at gmail.com
