[tex-hyphen] Help with UTF-8 Language

Philip Taylor P.Taylor at Rhul.Ac.Uk
Thu Oct 9 14:41:39 CEST 2014

Nathan Wells asked :

>> I am not sure if this is the right place to ask, but I am trying
>> to create hyphenation rules for a UTF-8 language (Khmer). I've
>> tried patgen, but I can't get it to work (some have said it
>> doesn't support UTF-8?).

to which Werner Lemberg replied via Stack Exchange.  Since not all
subscribers to these lists will necessarily also read Stack Exchange (I
don't, for example), I have repeated some parts of his answer here, to
which I have added related questions of my own.  I have also opened up
the distribution to the XeTeX list, since it seems extremely relevant.

> First of all, whatever you are going to achieve, it won't work with
> ‘classical’ TeX. This is due to a design decision of Knuth – today we
> know that this was unfortunate, but at the time of writing TeX this
> was far less obvious: Hyphenation patterns are applied to glyph
> indices and not to input character codes. Since there are more than
> 256 Khmer ligature glyphs, the standard hyphenation algorithm can't
> be applied.
> Today, this design problem can be circumvented natively by luatex
> only, 

Does XeTeX also address this issue (open question, not one to which I
claim to already know the answer) ?

> Now back to your problem. The patgen program is completely agnostic
> of what it processes; the only limitation is that it cannot handle
> more than 243 entities: The 8bit range of 256 characters minus the
> digits 0-9 and characters ‘.’, ‘-’, and ‘*’ (which can be mapped to
> different characters if necessary). Since the number of Khmer
> characters is less than 128, patgen can be used to create patterns.
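
The arithmetic behind the 243-entity limit, and one way a UTF-8 word
list might be squeezed into single bytes before being handed to patgen,
can be sketched as follows.  This is only an illustration under stated
assumptions: the names RESERVED_CHARS, PATGEN_LIMIT, build_byte_map and
encode are hypothetical, and the sketch does not reproduce patgen's own
input conventions.  It follows the count given in the quoted message; a
practical mapping would presumably also avoid whitespace and control
bytes.

```python
# Hypothetical pre-processing sketch; not part of patgen itself.

# Characters the quoted message says are reserved: digits, '.', '-', '*'.
RESERVED_CHARS = "0123456789.-*"
RESERVED_BYTES = {ord(c) for c in RESERVED_CHARS}

# 256 possible 8-bit values minus the 13 reserved ones.
PATGEN_LIMIT = 256 - len(RESERVED_BYTES)


def build_byte_map(words):
    """Assign each distinct non-reserved character a free 8-bit code."""
    chars = sorted({ch for w in words for ch in w if ch not in RESERVED_CHARS})
    if len(chars) > PATGEN_LIMIT:
        raise ValueError(
            f"{len(chars)} distinct characters exceed the limit of {PATGEN_LIMIT}"
        )
    free = [b for b in range(256) if b not in RESERVED_BYTES]
    return dict(zip(chars, free))


def encode(word, byte_map):
    """Encode a word as bytes; reserved characters pass through unchanged."""
    return bytes(
        ord(ch) if ch in RESERVED_CHARS else byte_map[ch] for ch in word
    )


if __name__ == "__main__":
    sample = ["កខគ", "ខ-គ"]          # toy Khmer strings, not real dictionary data
    mapping = build_byte_map(sample)
    for w in sample:
        print(w, list(encode(w, mapping)))
```

With fewer than 128 Khmer characters the mapping always fits; for an
alphabet of more than 243 characters build_byte_map raises an error,
which is exactly the case the question below is about.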

OK, so let's open up the question from just Khmer :  if I were to want
to build patterns for a language that had more than 243 characters, is
there a variant of Patgen that can correctly handle such a task ?

Philip Taylor
