[XeTeX] [tex-hyphen] Help with UTF-8 Language
Werner LEMBERG
wl at gnu.org
Fri Oct 10 07:51:53 CEST 2014
> My programming abilities are quite limited and I realize there
> aren't many people who need to make hyphenation dictionaries, hence
> the lack of good Unicode support. But would someone be willing to
> give a little more step-by-step guidance? I am a little confused
> as to how best to map the Khmer Unicode characters to 8-bit values.
Unfortunately I don't have time to write a Perl or Python script for
you, but it should be straightforward to program a small filter that
  (a) converts from UTF-8 to UTF-16
  (b) converts from UTF-16 to the ad-hoc 8-bit encoding by stripping
      off the higher byte
A second filter would do exactly the opposite. Both Perl and Python
come with routines to do the UTF-8 <-> UTF-16 conversions, BTW.
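For illustration only, here is a minimal Python sketch of the two
filters described above. It is not taken from any existing project.
It assumes that every non-ASCII character in the input lies in the
Unicode Khmer block (U+1780..U+17FF), so that the high byte of each
UTF-16 code unit is always 0x17 and the low byte is 0x80 or above,
which keeps it distinct from ASCII. The function names and the
constant are made up for this example; anything outside ASCII and
the Khmer block is rejected rather than silently mangled.

```python
# Hypothetical helper names; a sketch, not a tested production filter.
# Assumption: all non-ASCII input is in the Khmer block U+1780..U+17FF.
KHMER_HIGH_BYTE = 0x17


def utf8_to_adhoc(data: bytes) -> bytes:
    """UTF-8 -> UTF-16 -> ad-hoc 8-bit (drop the high byte)."""
    utf16 = data.decode("utf-8").encode("utf-16-be")  # big-endian, no BOM
    out = bytearray()
    for i in range(0, len(utf16), 2):
        high, low = utf16[i], utf16[i + 1]
        if high == 0x00 and low < 0x80:
            out.append(low)  # plain ASCII passes through unchanged
        elif high == KHMER_HIGH_BYTE and low >= 0x80:
            out.append(low)  # Khmer: keep only the low byte
        else:
            raise ValueError(
                "character U+%04X is outside ASCII and the Khmer block"
                % ((high << 8) | low)
            )
    return bytes(out)


def adhoc_to_utf8(data: bytes) -> bytes:
    """Ad-hoc 8-bit -> UTF-16 (restore the high byte) -> UTF-8."""
    utf16 = bytearray()
    for b in data:
        utf16.append(KHMER_HIGH_BYTE if b >= 0x80 else 0x00)
        utf16.append(b)
    return bytes(utf16).decode("utf-16-be").encode("utf-8")
```

Reading a file in binary mode, passing its contents through
utf8_to_adhoc(), and writing the result back in binary mode would
give the first filter; adhoc_to_utf8() gives the reverse one.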
> I think it would be quite useful to post a tutorial of the process
> once I am done so others can more easily create hyphenation
> dictionaries for languages that don't have them yet (I have yet to
> find a good tutorial anywhere).
As mentioned earlier, you should examine the stuff in our `wortliste'
project. The only difference is that we are converting back and forth
between UTF-8 and Latin-9; at that very place, you should convert
back and forth between UTF-8 and the ad-hoc 8-bit encoding. Your
`khmer.tr' file should also be converted from UTF-8 to the ad-hoc
encoding, BTW.
Werner