[tex-hyphen] Help with UTF-8 Language
Jonathan Kew
jfkthame at gmail.com
Thu Oct 9 21:54:35 CEST 2014
On 9/10/14 13:41, Philip Taylor wrote:
>
>
> Nathan Wells asked :
>
>>> I am not sure if this is the right place to ask, but I am trying
>>> to create hyphenation rules for a UTF-8 language (Khmer). I've
>>> tried patgen, but I can't get it to work (some have said it
>>> doesn't support UTF-8?).
>
> to which Werner Lemberg replied via Stack Exchange. Since not all
> subscribed to these lists will necessarily also read Stack Exchange (I
> don't, for example), I have repeated some parts of his answer here, to
> which I have added related questions of my own. I have also opened up
> the distribution to the XeTeX list, since it seems extremely relevant
> thereto.
>
>> First of all, whatever you are going to achieve, it won't work with
>> ‘classical’ TeX. This is due to a design decision of Knuth – today we
>> know that this was unfortunate, but at the time of writing TeX this
>> was far less obvious: Hyphenation patterns are applied to glyph
>> indices and not to input character codes. Since there are more than
>> 256 Khmer ligature glyphs, the standard hyphenation algorithm can't
>> be applied.
>>
>> Today, this design problem can be circumvented natively by luatex
>> only,
>
> Does XeTeX also address this issue (open question, not one to which I
> claim to already know the answer) ?
When working with Unicode and OpenType fonts, XeTeX applies hyphenation
to the characters of the text, not to glyph indices in a font. So the
number of *characters* (not glyphs) involved in Khmer should not be a
problem for creating XeTeX-compatible Unicode patterns with patgen. One
possible approach is the trick of mapping the Khmer Unicode characters
to 8-bit values, generating the patterns, and then mapping the result
back to real Unicode.
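
The mapping trick mentioned above could be sketched roughly as follows.
This is only an illustration, not a tested workflow: the function names
are made up, the reserved set simply follows the quoted description of
patgen's limits (digits, '.', '-' and '*'), and skipping control
characters, space and DEL is an extra assumption on my part. patgen's
translate file would presumably still have to declare the chosen codes
as letters; that step is not shown.

```python
# Bytes that patgen uses for its own purposes (per the quoted description).
RESERVED = set(b"0123456789.-*")


def build_mapping(words):
    """Assign each distinct non-reserved character a free 8-bit code."""
    chars = sorted({ch for w in words for ch in w if ord(ch) not in RESERVED})
    # Assumption: avoid control characters, space and DEL as well.
    free = [b for b in range(0x21, 0x100) if b not in RESERVED and b != 0x7F]
    if len(chars) > len(free):
        raise ValueError("too many distinct characters for an 8-bit mapping")
    to_byte = dict(zip(chars, free))
    to_char = {b: ch for ch, b in to_byte.items()}
    return to_byte, to_char


def encode(text, to_byte):
    """Unicode word list entry -> 8-bit bytes; hyphen marks pass through."""
    return bytes(ord(ch) if ord(ch) in RESERVED else to_byte[ch] for ch in text)


def decode(data, to_char):
    """8-bit pattern -> Unicode; digits and '.' pass through unchanged."""
    return "".join(chr(b) if b in RESERVED else to_char[b] for b in data)
```

The idea would be to run encode() over the hyphenated word list, feed
the 8-bit file to patgen, and run decode() over each generated pattern
to get patterns in real Unicode again.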
>
>> Now back to your problem. The patgen program is completely agnostic
>> of what it processes; the only limitation is that it cannot handle
>> more than 243 entities: The 8bit range of 256 characters minus the
>> digits 0-9 and characters ‘.’, ‘-’, and ‘*’ (which can be mapped to
>> different characters if necessary). Since the number of Khmer
>> characters is less than 128, patgen can be used to create patterns.
>
> OK, so let's open up the question from just Khmer : if I were to want
> to build patterns for a language that had more than 243 characters, is
> there a variant of Patgen that can correctly handle such a task ?
That would presumably be opatgen, but some work may be needed to get it
to compile and run on current systems. (Presumably it used to work on
some system or other, at some point in the past.)
JK