[tex-hyphen] Using UTF-8 Word Lists to Create Hyphenation Pattern Files

Petr Sojka sojka at fi.muni.cz
Wed May 23 19:42:15 CEST 2012


On Wed, May 23, 2012 at 04:48:03PM +0100, Jonathan Kew wrote:
> On 22/5/12 22:57, Steven Dickson wrote:
Hello,

> >I work for a religious organization that produces publications in
> >several languages spoken by our members throughout the world. Over the
> >years, we have developed UTF-8 encoded hyphenated word lists for 91
> >different languages. We use these word lists to create proprietary
> >hyphenation software. We would like to use these lists to create
> >hyphenation pattern files that can be used with more traditional
> >software such as TeX and OpenOffice applications.
> 
> I think that at present the state of pattern-generating tools is
> pretty woeful, but that could in principle be changed by some
> motivated developers.
> 
> Are you happy to distribute these hyphenated word lists in some way?
> If you were to make them available (under a simple, non-restrictive
> license such as BSD), it might be more likely that people in the
> free software community would be inspired to tackle the work that's
> needed to derive TeX- and OpenOffice-compatible (or other) resources
> from them.

I second that (as a patgen user and advisor of David Antos's thesis).

> >It appears that hyphenation pattern files are being created by patgen
> >using tokenized word lists then converting the final output to UTF-8.
> >Unfortunately, we are dealing with some complex languages that will
> >exceed the 256 character limit of patgen.

Well, the limit of patgen is actually 256 different character
_clusters_ (a cluster groups characters that play the same
role during hyphenation, e.g. the lowercase and uppercase
forms of a letter). Which language do you have in mind that
needs more than 256 character clusters (English has 26 clusters,
Czech 32, ...)?
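
To check a word list against that limit, one could count the distinct
letters after folding case. The following is only a rough sketch: the
function name, the sample words and the use of "-" as the hyphen marker
are assumptions, and str.lower() only approximates a cluster, since the
real clusters are whatever the patgen translate file declares.

```python
import unicodedata


def count_clusters(words, hyphen="-"):
    """Return the set of case-folded letters used in a hyphenated word list."""
    clusters = set()
    for word in words:
        # Normalise to NFC so precomposed and decomposed forms compare equal.
        for ch in unicodedata.normalize("NFC", word):
            if ch == hyphen or ch.isspace():
                continue
            # Upper- and lowercase forms are treated as one cluster.
            clusters.add(ch.lower())
    return clusters


# Hypothetical sample; a real list would be read from a UTF-8 file.
sample = ["hy-phen-ation", "Hy-phen", "pří-klad"]
letters = count_clusters(sample)
print(len(letters), "clusters; within the 256 limit:", len(letters) <= 256)
```

Languages whose scripts use combining marks may need a different notion
of "letter" than a single code point, so the count is a first estimate
rather than a verdict.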

> >Like others, I have unsuccessfully tried to build opatgen with the
> >current version of gcc. Trying to find gcc version 2.96 in hopes that it
> >will work doesn't make sense, especially when there are reports that
> >opatgen has some serious reliability and performance issues. I applaud
> >David Antos for his research and development of opatgen and find it
> >fascinating that his work has not been adopted and enhanced by the open
> >source community.
> >
> >Is using patgen with tokenized word lists and converting the output to
> >UTF-8 really the only viable way to create pattern files?
It is the quickest way for most languages, unless someone takes
on further maintenance of opatgen.
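
The tokenizing round trip can be sketched roughly as follows. This is
not an existing tool: the function names and the placeholder alphabet
(a-z plus Latin-1 code points 0xA1..0xFF) are assumptions made for
illustration, and the patgen run itself, together with its translate
file, is left out.

```python
def build_token_maps(words, hyphen="-"):
    """Map each distinct lowercase letter to a one-byte placeholder."""
    letters = sorted({ch for w in words for ch in w.lower() if ch != hyphen})
    # Hypothetical placeholder alphabet; it avoids digits, "." and "-",
    # which carry meaning in word lists and pattern files.
    codes = list(range(ord("a"), ord("z") + 1)) + list(range(0xA1, 0x100))
    placeholders = [chr(c) for c in codes]
    if len(letters) > len(placeholders):
        raise ValueError("more letters than available placeholders")
    to_token = dict(zip(letters, placeholders))
    from_token = {tok: ch for ch, tok in to_token.items()}
    return to_token, from_token


def tokenize(text, to_token):
    """Replace letters by placeholders; other characters pass through."""
    return "".join(to_token.get(ch, ch) for ch in text.lower())


def detokenize(text, from_token):
    """Map placeholders in patgen output back to the original letters."""
    return "".join(from_token.get(ch, ch) for ch in text)


# Hypothetical sample words in two scripts.
words = ["pří-klad", "сло-во"]
to_token, from_token = build_token_maps(words)
encoded = tokenize("pří-klad", to_token)
# Every placeholder fits in one byte when written as Latin-1.
assert len(encoded.encode("latin-1")) == len(encoded)
print(detokenize(encoded, from_token))
```

The same detokenize step would be applied to the pattern file that
patgen produces, since digits and dots pass through unchanged.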

Petr Sojka

> >Steve Dickson
> >The Church of Jesus Christ of Latter-day Saints
> >Publishing Services Department
> >50 East North Temple Street
> >Salt Lake City, Utah 84150
> >Email: DicksonSK at ldschurch.org

