[tex-hyphen] Unicode Turkish Hyphenation Pattern

Mojca Miklavec mojca.miklavec.lists at gmail.com
Wed Jun 25 17:02:51 CEST 2008


On Wed, Jun 25, 2008 at 7:37 AM, S. Ekin Kocabas wrote:
> Hi,
>
> I recently converted the Turkish hyphenation pattern "trhyph.tex" into
> unicode. The file is attached. This pattern set includes only those
> characters used in the current, modern Turkish alphabet (
> http://en.wikipedia.org/wiki/Turkish_alphabet ). The pattern file at
> http://www.ctan.org/get/language/hyph-utf8/tex/generic/hyph-utf8/patterns/hyph-tr.tex
> has patterns for those characters which are not being used in current
> Turkish. The older characters may still be useful to those who deal with
> historical texts, but not really for the current day Turkish writer.
>
> I hope this file can be turned into one which may be included in the
> hyph-utf8 package. Let me know if I can help in any way.

Wow!

This is the first time the initiative has come from the actual
pattern authors :)
And almost before we released or announced anything.
(Usually we have to ask or beg the authors to make any changes or updates.)

I have already written quite a few notes about the Turkish patterns and
about the funny conversion that has been done.

% The original patterns need special fonts to be usable,
% and rely on several tricks that only work properly with some hacks.
%
% I did not dive into details of conversion, but modern Turkish only uses:
% - 4 "special" vowels: ıiöü
% - 3 additional consonants: çğş
% while the converted patterns also use:
% - acircumflex: â
% - ocircumflex: ô
% - ntilde: ñ
% One could argue that these are used just because they can,
% but on the other hand there is no zdotaccent (ż).
%
% The conversion misencoded dotlessi into ^^11 (should be ^^19).
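
As a rough illustration of that last note, here is a minimal Python
sketch of how the ^^xx notation could be mapped to Unicode while
undoing the misencoding. The names SLOT_TO_UNICODE and hat_to_unicode
are made up for this example, and the table assumes that slot 0x19 is
dotlessi in the EC encoding and that ^^11 never means anything else in
these patterns; neither the table nor the approach is taken from the
real conversion.

```python
import re

DOTLESS_I = "\u0131"  # LATIN SMALL LETTER DOTLESS I

# Hypothetical table: font slot -> Unicode character.
# 0x19 is assumed to be dotlessi; 0x11 is listed only to undo the
# misencoding described in the notes above.
SLOT_TO_UNICODE = {0x19: DOTLESS_I, 0x11: DOTLESS_I}


def hat_to_unicode(pattern):
    """Replace known ^^xx escapes (two lowercase hex digits) in one pattern."""
    def replace(match):
        slot = int(match.group(1), 16)
        # Escapes that are not in the table are left untouched.
        return SLOT_TO_UNICODE.get(slot, match.group(0))

    return re.sub(r"\^\^([0-9a-f]{2})", replace, pattern)
```

With this table, hat_to_unicode("a1^^11") and hat_to_unicode("a1^^19")
both give "a1ı", while an escape such as ^^e7 passes through unchanged.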

And then additional comments like

% s
% s!  ṣ  -- U+1E63   LATIN SMALL LETTER S WITH DOT BELOW
% s=     --    s with line below - not even in unicode
% s:  ş  B3 U+015F   LATIN SMALL LETTER S WITH CEDILLA
% t
% t!  ṭ  -- U+1E6D   LATIN SMALL LETTER T WITH DOT BELOW
% t=  ṯ  -- U+1E6F   LATIN SMALL LETTER T WITH LINE BELOW

The situation is as follows:
1.) The original file was generated from
% - http://www.ctan.org/tex-archive/language/turkish/hyphen/turk_hyf.c
% - http://www.tug.org/TUGboat/Articles/tb09-1/tb20mackay.pdf

2.) The patterns are useless with anything but a special version of
fonts that are nowhere to be found. (It's similar to the ibycus encoding,
but ibycus has some packages that know how to cope with it, while
Turkish doesn't.)

3.) A somewhat faulty conversion was done in 1996, claiming
"conversion into Modern Turkish", but it left three old characters
in place and misencoded one.

4.) The three additional characters do absolutely no harm - it's no
problem if they are left there, but I agree to drop them.

5.) I would vote for "renaming" the old file to hyph-ota.tex, leaving
all the original characters there (not only the EC ones), but not
adding it to language.dat. If anyone needs the file, they are free to
use it, but it won't be assigned a code/name until people request it.

Type: language
Subtag: ota
Description: Turkish, Ottoman (1500-1928)

One argument against the change is that one can still use some subset
of Ottoman Turkish with the current patterns with no harm to modern
Turkish. But for proper support, one needs either all the patterns or
none. Partial solutions are always bad since one pretends to have
support, but the support is full of holes. Sometimes it's better not
to pretend that the support is there at all.

6.) It would be nice to fix the source for pattern generation or
rewrite it from scratch, but that is not a top priority.

7.) Can someone explain why "s with line below" has no Unicode code point?
(Yes, I know that it won't be added, ...)

Ekin - do you by any chance know a programming language well enough
to update the source for generating the patterns?

Unless someone objects, I would rename the old patterns and remove the
unneeded characters from the Turkish ones. We also need to remove all
the \lccode and \catcode commands.
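
Just to make the cleanup concrete, here is a rough Python sketch of
the two steps. The function names are invented for this example, it
assumes the patterns are already UTF-8 with precomposed characters,
one pattern per list item, and it is not the script that will
actually be used.

```python
# The 29 letters of the modern Turkish alphabet (lowercase, precomposed).
MODERN_TURKISH = set("abcçdefgğhıijklmnoöprsştuüvyz")
# Digits and the word-boundary dot are part of pattern syntax, not letters.
PATTERN_MARKS = set("0123456789.")


def foreign_letters(pattern):
    """Return the characters of a pattern that are not modern Turkish."""
    return {ch for ch in pattern
            if ch not in MODERN_TURKISH and ch not in PATTERN_MARKS}


def keep_modern(patterns):
    """Keep only the patterns made entirely of modern Turkish letters."""
    return [p for p in patterns if not foreign_letters(p)]


def drop_code_assignments(lines):
    r"""Drop the lines that start with \lccode or \catcode."""
    return [line for line in lines
            if not line.lstrip().startswith(("\\lccode", "\\catcode"))]
```

For example, keep_modern(["a1â", ".ç2", "1ñ", "ı1l"]) keeps only ".ç2"
and "ı1l". Dropping whole patterns rather than editing them is one
possible choice here; the old file would still contain everything.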

Thanks a lot for the reminder,
    Mojca

