[tex-hyphen] tex-hyphen Digest, Vol 58, Issue 3

Fri Oct 10 12:58:16 CEST 2014

> 
> Message: 4
> Date: Fri, 10 Oct 2014 08:54:04 +0700
> From: Nathan Wells <sungkhum at gmail.com>
> To: "About TeX hyphenation patterns." <tex-hyphen at tug.org>
> Cc: Unicode-based TeX for Mac OS X and other platforms <xetex at tug.org>
> Subject: Re: [tex-hyphen] Help with UTF-8 Language
> Message-ID:
> 	<CAFSe7HTPaagZyr4OcP5Bn3davp+MW30_qi-TmK70XkfoPN0HbQ at mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
> 
> Thank you all for your replies!
> My programming abilities are quite limited and I realize there aren't many
> people who need to make hyphenation dictionaries, hence the lack of good
> Unicode support. But would someone be willing to help with a little more
> step-by-step help? I am a little confused as how best to map the Khmer
> Unicode characters to 8-bit values.
> I think it would be quite useful to post a tutorial of the process once I
> am done so others can more easily create hyphenation dictionaries for
> languages that don't have them yet (I have yet to find a good tutorial
> anywhere).
> Thanks again for your help,
> Nathan
Hi Nathan

Step 1.
First you need a word database (word list) for your language.
I wrote a small program for my case which accepts a text file (the text can contain mixed scripts),
extracts the words written in a given script ("Lang"), sorts them and writes them to a file.
Another program merges these word lists.
In the end you need something like this:

Aggressive
Animal
Alphabet
Dosimeter
Guard

If you can make such a list some other way, that's fine.
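
Step 1 could be sketched in Python roughly as below. This is not my actual program, only an illustration: the function names, the sample text and the Latin code point ranges are made up for the example. For Khmer the range would be U+1780 to U+17FF, but note that Khmer is written without spaces between words, so a run of Khmer characters is a phrase rather than a word and some extra segmentation would be needed.

```python
import re

def extract_words(text, ranges):
    """Return the sorted, de-duplicated runs of characters that fall
    inside the given (low, high) code point ranges."""
    # Build a character class such as [A-Za-z] from the ranges.
    char_class = "".join(
        f"{re.escape(chr(lo))}-{re.escape(chr(hi))}" for lo, hi in ranges
    )
    pattern = re.compile(f"[{char_class}]+")
    return sorted(set(pattern.findall(text)))

def merge_wordlists(*wordlists):
    """Merge several word lists into one sorted list without duplicates."""
    merged = set()
    for words in wordlists:
        merged.update(words)
    return sorted(merged)

# Hypothetical ranges for basic Latin letters (illustration only).
LATIN = [(0x41, 0x5A), (0x61, 0x7A)]

sample = "Animal, Guard! 123 Животное Animal"
print(extract_words(sample, LATIN))
```

The ranges are passed in as a parameter, so the same function can be reused for another script by changing only the list of code point ranges.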

Step 2.
Next you need to know the hyphenation rules.
These differ from language to language.
For example, in my case a word can be hyphenated after a vowel;
if two or more consonants follow the vowel, one stays on the same line and the others
go to the next line, but there are some consonant pairs which cannot be split.

After applying the rules to your word list you get something like this
(splitted_word_list.txt):
Ag-gres-si-ve
A-ni-mal
Al-pha-bet
Do-si-me-ter
Gu-ard
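
The rules from step 2 could be sketched in Python roughly as below. Again this is only an illustration and not my actual program: the vowel set and the unsplittable pairs are toy values for the Latin examples above, and the handling of an unsplittable pair (break directly after the vowel so the pair stays together) is an assumption. The rules for Khmer will be different.

```python
VOWELS = set("aeiou")        # assumption: toy vowel set for the Latin examples
UNSPLITTABLE = {"ph", "th"}  # assumption: example pairs that must stay together

def hyphenate(word, vowels=VOWELS, unsplittable=UNSPLITTABLE):
    """Insert '-' after a vowel; with two or more consonants after the
    vowel, the first consonant stays and the others move on."""
    lower = word.lower()
    n = len(word)
    breaks = []  # indices in `word` before which a hyphen is inserted
    i = 0
    while i < n:
        if lower[i] not in vowels:
            i += 1
            continue
        # Find the next vowel after position i.
        j = i + 1
        while j < n and lower[j] not in vowels:
            j += 1
        if j == n:
            break  # no later vowel: the rest of the word stays together
        run = j - (i + 1)  # number of consonants between the two vowels
        if run >= 2 and lower[i + 1:i + 3] not in unsplittable:
            breaks.append(i + 2)  # one consonant stays, the others move on
        else:
            breaks.append(i + 1)  # break directly after the vowel
        i = j
    parts, start = [], 0
    for b in breaks:
        parts.append(word[start:b])
        start = b
    parts.append(word[start:])
    return "-".join(parts)

for w in ["Aggressive", "Animal", "Alphabet", "Dosimeter", "Guard"]:
    print(hyphenate(w))
```

With these toy rules the five example words come out the same as in the list above.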

Step 3.
After this comes patgen: you pass splitted_word_list.txt as the dictionary file.
For 'hyph_start' and 'hyph_finish' and for the left and right hyphen minimums you can use 3*N (N being the value counted in characters). "3" because
a Khmer character is 3 bytes long in UTF-8. Using this trick I made patgen work with UTF-8.
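
The factor 3 can be checked with a few lines of Python. This sketch does not run patgen; it only shows how a value counted in characters turns into a value counted in bytes. The function names are made up for the illustration.

```python
def utf8_width(sample_chars):
    """Return the UTF-8 length in bytes shared by all sample characters."""
    widths = {len(ch.encode("utf-8")) for ch in sample_chars}
    if len(widths) != 1:
        raise ValueError("sample characters do not all have the same UTF-8 length")
    return widths.pop()

def scale_for_bytes(n_chars, sample_chars):
    """Convert a value counted in characters into one counted in bytes."""
    return n_chars * utf8_width(sample_chars)

# Three characters from the Khmer block: KA, KHA and vowel sign AA.
khmer_sample = "\u1780\u1781\u17b6"

print(utf8_width(khmer_sample))          # bytes per Khmer character
print(scale_for_bytes(2, khmer_sample))  # N = 2 characters, in bytes
```

If the word list contains characters of different byte lengths (for example ASCII mixed with Khmer), a single factor no longer fits, which is why the sketch raises an error in that case.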

To test the hyphenation, I used the word list from step 1 and the patterns generated in step 3 with hyph-utf8 and LuaTeX, and
compared the result with the split word list from step 2.
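
The comparison itself could be sketched as below, assuming both results are available as lists of words hyphenated with "-". How the hyphenated words are obtained from LuaTeX is not shown, and the function name is made up for the illustration.

```python
def compare_hyphenation(reference, produced):
    """Return (word, expected, got) for every word whose produced
    hyphenation differs from the reference hyphenation."""
    ref_by_word = {h.replace("-", ""): h for h in reference}
    mismatches = []
    for got in produced:
        word = got.replace("-", "")
        expected = ref_by_word.get(word)
        # Words missing from the reference list are skipped.
        if expected is not None and expected != got:
            mismatches.append((word, expected, got))
    return mismatches

reference = ["A-ni-mal", "Gu-ard"]  # split word list from step 2
produced = ["A-ni-mal", "Guard"]    # hypothetical output of the patterns
print(compare_hyphenation(reference, produced))
```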

For steps 1-2 I have written a program which does all the work. Unfortunately the script-specific parts (language script detection, hyphenation rules, vowels) are hardwired in the code.

I can send you the code and you can modify it, or you can send me text files in your language (the text can be mixed with other languages or with HTML markup, but no Word files please :) ), a list of vowels (as Unicode code points) and the hyphenation rule set.
I'll try to modify my code so that the program accepts: script_code_ranges, vowels_set, consonant_pairs_which_cannotbesplited
