[tex-hyphen] how to add hyphenation for a new language?

Mojca Miklavec mojca.miklavec.lists at gmail.com
Sun Feb 24 12:55:22 CET 2013


On Fri, Feb 22, 2013 at 1:32 PM, levan shoshiashvili wrote:
>
> Yes, there will be support for Polyglossia. XeTeX/XeLaTeX work fine. I had
> problems with Polyglossia a couple of years ago,

If you have problems, you can always contact the author.

Working with XeTeX/LuaTeX should be the preferred way to typeset
documents, since one can use any OTF font that supports Georgian
(without having to go through the complex process of creating TFM
and other files).

>> For the hyph-utf8:
>> If you point me to the files or if you send the files to me, I will
>> add the patterns to hyph-utf8.
>> (I would be very grateful if you could also send a list of words that
>> you used to generate the patterns.)
>>
> Ok, thanks. I looked at how hyphenation patterns are tied to babel and
> noticed that one way is to define the language <--> hyphenation pattern
> file mapping in languages.dat in
> texmf/tex/generic/config/language.def
> and some websites/messages suggest
> texmf/tex/generic/babel/languages.dat, but
> now with MiKTeX 2.9 and TeX Live 2011 there is no
> texmf/tex/generic/babel/language.dat, but not all languages are listed here.
> But in my distros I see this file in the doc directory :)

Actually, the *only* way to be able to use hyphenation patterns is
to define the language in language.dat for LaTeX and language.def for
plain (e)TeX.

The only remaining question is *which* "language.dat". For a couple
of years now TeX Live has auto-generated language.dat, based on a special
directive in TeX Live packages. For example, hyphen-french contains

execute AddHyphen name=french synonyms=patois,francais lefthyphenmin=2
righthyphenmin=3 file=loadhyph-fr.tex file_patterns=hyph-fr.pat.txt
file_exceptions=

If you want to enable French hyphenation patterns, you need to install
the package "hyphen-french" and then that line will automatically end
up in language.dat and language.def.
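
As a rough sketch (written from memory rather than copied from an
installation, so the exact comments and layout are an assumption), the
entries generated from that directive look something like the
following. language.dat gets the name, the pattern loader and the
synonyms; language.def gets \addlanguage lines that also carry
lefthyphenmin and righthyphenmin.

```latex
% --- language.dat (used when building LaTeX formats) ---
% from hyphen-french:
french loadhyph-fr.tex
=patois
=francais

% --- language.def (used when building plain (e)TeX formats) ---
% from hyphen-french:
\addlanguage{french}{loadhyph-fr.tex}{}{2}{3}
\addlanguage{patois}{loadhyph-fr.tex}{}{2}{3}
\addlanguage{francais}{loadhyph-fr.tex}{}{2}{3}
```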

When searching for language.dat on my hard drive, here's the list of
files that have no influence at all (many of them are invalid since
files listed there don't even exist any more):

/usr/local/texlive/2012/texmf-dist/doc/xelatex/xecyr/language.dat.add
/usr/local/texlive/2012/texmf-dist/source/generic/babel/language.dat
/usr/local/texlive/2012/texmf-dist/doc/generic/babel/language.dat
/usr/local/texlive/2012/texmf-dist/tex/lambda/antomega/language.dat.sample
/usr/local/texlive/2012/texmf-dist/tex/lambda/config/language.dat

(Karl, in case you are reading this: can/should any of these
files be removed?)

This file contains all languages in TL:

/usr/local/texlive/2012/texmf/tex/generic/config/language.dat

And this is the auto-generated file that is actually used:

/usr/local/texlive/2012/texmf-var/tex/generic/config/language.dat

Try `kpsewhich language.dat` to see which file is picked up. You can
add a line to the last file, but you need to be aware that your changes
might be overwritten when you update TeX Live. You can also create your
own language.dat and put it in
texmf-local/tex/generic/config/language.dat, but then you need to be
aware of the consequences (installing/uninstalling hyphen-<language>
won't have any effect).
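
For illustration only, a hand-added entry for Georgian could look like
the sketch below. Both the language name "georgian" and the loader
file name "loadhyph-ka.tex" are hypothetical here (the name merely
follows the loadhyph-<code>.tex convention seen above for French); no
such file has been added to hyph-utf8 yet.

```latex
% Hypothetical manual entry in language.dat.
% "georgian" and "loadhyph-ka.tex" are assumed names, not existing files.
georgian loadhyph-ka.tex
```

Patterns are read when a format is built, so after editing language.dat
the formats have to be regenerated (for example with
`fmtutil-sys --all`) before the new entry has any effect.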

I'm not sure exactly how this works in MiKTeX, but in any case you
should find the right language.dat there as well.

> another is hyphen.cfg .

These are just macros that you shouldn't need to worry about.

>> Where can I find the encoding files for TeX, like
>> texmf-dist/tex/latex/base/t1enc.def
>> texmf-dist/tex/latex/base/t1enc.dfu
>> ?
> Don't understand. If you mean for Georgian
> here
> http://ctan.org/tex-archive/language/georgian/texmf-local/tex/latex/georgian
> but there are some mistakes as I wrote. Better version is here
>  http://tex.tsu.ge/files/  geotex-0.6.tgz

I was looking for tex/latex/georgian/t8menc.def, but at first failed
to see it in the tgz file. Sorry, I see it now.

>> Do you also have any support for writing in XeTeX?
> A minimal example with XeLaTeX works fine with Georgian.
> I'll add Georgian support for Polyglossia.

Thank you.

> \documentclass[12pt]{article}
> \usepackage{fontspec}
> \usepackage{xunicode}
> \usepackage{xltxtra}
> \setmainfont[Mapping=tex-text]{Sylfaen}\fontsize{16pt}{16pt}
> \begin{document}
> \title{Georgian and XeLatex \\ ქართული და XeLatex }
> \maketitle
> \begin{abstract}
> ტესტი
> \end{abstract}
> \section{ნაწილი 1. Section 1}
>
> \fontspec{Sylfaen}\fontsize{12pt}{12pt}\selectfont
>
>  გამარჯობათ
> Hello
>
> \end{document}
>
>
>> Now, on a more serious part of it, if you would really like the
>> patterns to work for 8-bit: to make the patterns work in both UTF-8
>> and 8-bit I would also need the mapping from your encoding to UTF-8 as
>> plain text. Here's an example:
>>
>> http://tug.org/svn/texhyphen/trunk/hyph-utf8/source/generic/hyph-utf8/data/encodings/ec.dat?view=markup
>> but any other format would do as long as I can write a simple script
>> to get the desired format.
> Ok..you can see in enc files and in document how utf-8 is mapped to
> T8M, T8K encoding, I'll send you
> utf8_sequence 8bit_code utf8code info

I also realized that there are enc files with uniXXXX names, so that
should explain a lot as well.

>> Also, before the 8-bit patterns are included it would help a lot to
>> have at least one font in TeX Live that supports that encoding.
> The site above has those fonts... or do you mean viewing those 8-bit
> TeX-encoded text files on the screen?

I meant that it would be nice to have the fonts not just on CTAN, but
also in TeX Live.

It seems reasonable to me to have at least one working font in T8M/T8K
encoding and working babel/polyglossia support included in TeX Live,
else hyphenation patterns in T8M included in TeX Live are not as
helpful to users as they could be.

On the one hand, I would prefer to push Unicode-proof solutions into TeX
Live (= support for XeTeX/LuaTeX), but on the other, it doesn't help
users much if they have just the patterns in TeX Live, but not a single
font to support those patterns.

>> One thing is definitely not clear to me though. You created two
>> encodings, and there's the third(?) encoding used by those older
>> metafont fonts.
> Yes, I know there are two packages in Metafont. I don't use them.
> The T8M and T8K encodings that I have introduced follow the rules
> mentioned in the LaTeX font encodings guide.
> I had a discussion with Werner Lemberg, author of CJK and the
> Russian package. There was a choice to have two encodings
> or one big encoding, like Cyrillic T2A, T2B and X2.
> I decided that the T8M, T8K approach with virtual fonts is better (the PDF
> file is searchable and hyphenation also works "easily").
> See georgianencodings-en.pdf here
> http://tex.tsu.ge/files/
>
> There are 3 scripts in Georgian: old letters, old capitals and modern.
> In T8M and T8K the old letters are in the uppercase positions, and modern
> is in the same positions (the lowercase positions) in both encodings.
> This means that the same hyphenation patterns (which are in lowercase
> script) will work for both encodings.

OK, that's a good start then. If typesetting documents required a
mixture of both, it would be more problematic.

>>Hyphenation patterns cannot be used with two encodings
>> as you probably know already, so I have a question: what exactly is
>> the relation between the two encodings (in the Georgian part)? You
>> mention that the patterns should be using T8M, but what about texts in
>> T8K?
> same patterns will work with T8K. Just old letters which are in t8k
> uppercase positions are rarely used, mostly in church documents.
>
>
>> Or are the corresponding glyphs hyphenated in the same way? How did
>> those who were using Georgian in LaTeX deal with hyphenation issues
>> until now and which fonts were usually used?
> There were no hyphenation issues until now :) , because there was no
> Georgian input/output in TeX, neither in 8-bit nor in UTF-8.

(That's why I suggested to push the usage of XeTeX/LuaTeX from the beginning ;)

> with metafont packages, which are in latex you can input in latin alphabet
> like "gamarjoba--Hello"
>
> Is hyph-utf8 support enough to have hyphenation working in the T8M and T8K
> encodings?
> It will work for the XeTeX/XeLaTeX/Lua engines, if I understand correctly?
> This is great.

Support for Xe(La)TeX/Lua(La)TeX comes out-of-the-box. Support for
8-bit engines like pdfTeX is more demanding, but doable.
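
To check whether a given language actually made it into the format, a
small test document along the following lines may help. It is only a
sketch: \checkpatterns is a made-up helper name, and the test relies on
the babel convention that each language listed in language.dat is
stored as \l@<name>. It is meant for pdfLaTeX or XeLaTeX; LuaLaTeX
loads patterns on demand, so the result there can differ.

```latex
\documentclass{article}
\makeatletter
% \checkpatterns{<name>} (hypothetical helper) writes to the log whether
% the format defines \l@<name>, i.e. whether <name> was in language.dat
% when the format was built.
\newcommand*\checkpatterns[1]{%
  \@ifundefined{l@#1}%
    {\typeout{No patterns preloaded for '#1'}}%
    {\typeout{'#1' is language number \number\csname l@#1\endcsname}}}
\makeatother
\begin{document}
\checkpatterns{french}
\checkpatterns{georgian}
Nothing to typeset; see the log.
\end{document}
```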

> I'll send you hyphenated wordlist and patterns.

Thank you, I'm looking forward to it.

I'm also playing with Thai now, so I'll try to release both at the
same time. Adapting both needs some time (the current source code isn't
sufficient for supporting Thai in the same way as we did in the past, at
least not for 8-bit engines; the same might be true for Georgian, but
I need to see the patterns first).

Mojca


