[XeTeX] [tex-hyphen] Help with UTF-8 Language

Fri Oct 10 07:51:53 CEST 2014

> My programming abilities are quite limited and I realize there
> aren't many people who need to make hyphenation dictionaries, hence
> the lack of good Unicode support.  But would someone be willing to
> provide a little more step-by-step help?  I am a little confused
> as to how best to map the Khmer Unicode characters to 8-bit values.

Unfortunately I don't have time to write a Perl or Python script for
you, but it should be straightforward to program a small filter that

  (a) converts from UTF-8 to UTF-16
  (b) converts from UTF-16 to the ad-hoc 8-bit encoding by stripping
      off the high byte (0x17 for the Khmer block, U+1780..U+17FF)

Ditto for another filter that does exactly the opposite.  Both Perl
and Python come with routines to do the UTF-8 <-> UTF-16 conversions,
BTW.
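
A minimal Python sketch of the two filters, under a few assumptions
that are not part of the description above: the function names are
made up for illustration, ASCII characters are passed through
unchanged, only the Khmer block U+1780..U+17FF is mapped (to the
bytes 0x80..0xFF), and anything else is rejected rather than silently
mangled.

```python
# Assumption: all non-ASCII input lies in the Khmer block U+1780..U+17FF,
# so every such character has the same UTF-16 high byte.
KHMER_HIGH_BYTE = 0x17


def utf8_to_adhoc(data: bytes) -> bytes:
    """Convert UTF-8 bytes to the ad-hoc 8-bit encoding."""
    # Step (a): UTF-8 -> UTF-16 (big-endian, no byte-order mark).
    utf16 = data.decode("utf-8").encode("utf-16-be")
    out = bytearray()
    # Step (b): walk the 16-bit units and keep only the low byte.
    for i in range(0, len(utf16), 2):
        high, low = utf16[i], utf16[i + 1]
        if high == 0x00 and low < 0x80:
            out.append(low)          # plain ASCII stays as it is
        elif high == KHMER_HIGH_BYTE and low >= 0x80:
            out.append(low)          # Khmer: strip the high byte
        else:
            raise ValueError(
                "cannot map U+%02X%02X to the ad-hoc encoding" % (high, low)
            )
    return bytes(out)


def adhoc_to_utf8(data: bytes) -> bytes:
    """Convert the ad-hoc 8-bit encoding back to UTF-8."""
    utf16 = bytearray()
    for byte in data:
        # Bytes >= 0x80 get the Khmer high byte back; ASCII gets 0x00.
        utf16.append(KHMER_HIGH_BYTE if byte >= 0x80 else 0x00)
        utf16.append(byte)
    return bytes(utf16).decode("utf-16-be").encode("utf-8")
```

To use this as a filter, something like
`sys.stdout.buffer.write(utf8_to_adhoc(sys.stdin.buffer.read()))`
should do; the reverse direction works the same way with
`adhoc_to_utf8`.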

> I think it would be quite useful to post a tutorial of the process
> once I am done so others can more easily create hyphenation
> dictionaries for languages that don't have them yet (I have yet to
> find a good tutorial anywhere).

As mentioned earlier, you should examine the stuff in our `wortliste'
project.  The only difference is that we are converting back and forth
between UTF-8 and Latin-9; at that very place, you should convert
back and forth between UTF-8 and the ad-hoc 8-bit encoding.  Your
`khmer.tr' file should also be converted from UTF-8 to the ad-hoc
encoding, BTW.

    Werner