[tex-hyphen] A strange special case in Turkish hyphenation patterns for TeX

Şükrü Ekin Kocabaş ekocabas at ku.edu.tr
Sat Sep 23 16:23:10 CEST 2017


This email is a rather belated response to the point raised by Alex
Kapranoff on Oct 7, 2015 (sorry Alex!). Please see the end for a reminder
copy of the relevant email.

I will first share my thoughts on the point raised by Alex. I will then
raise a few other points that I have encountered while using the Turkish
hyphenation patterns.

I agree with Alex that if "2e2cek." is included, it makes sense to also
include "2a2cak." so that the rule applies uniformly to words such as
"ge-le-cek", "ka-la-cak", "su-na-cak", "se-çe-cek" etc. Note that, for
these four words, applying the "2e2cek." and "2a2cak." rules suppresses
the last hyphen (hyphenation occurs only at odd values); that is, they
lead to "ge-lecek", "ka-lacak", "su-nacak", "se-çecek". In Turkish the
correct hyphenation would have both hyphens, but I think MacKay makes a
judgement call when he says that "...nor do we really want the cek of
-ecek broken off if it is at the end of a word". I looked at the
recommendations of the Turkish Language Association [Türk Dil Kurumu = TDK]
on the issue [link to page (in Turkish):
<http://tdk.org.tr/index.php?option=com_content&view=article&id=208:Hece-Yapisi-ve-Satir-Sonunda-Kelimelerin-Bolunmesi&catid=50:yazm-kurallar&Itemid=132>]
and they only specify that hyphenation should not result in a
single letter at the beginning or at the end of a line. According to this
rule, there is nothing wrong with breaking before -cek and leaving the -cek
at the beginning of the next line. As a result, I think that either
"2a2cak." should be added or "2e2cek." should be deleted to have a
consistent set of rules.
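
To make the effect of the special case concrete, here is a minimal Python
sketch of Liang-style pattern matching. The pattern list is a toy subset
that I made up for this illustration (one "1x1" pattern per consonant and
one "2x1" pattern per vowel); it is not the real Turkish pattern file and
only covers consonant-vowel words like the four above. The function names
and the minimum of two letters on each side of a break are also
assumptions.

```python
# Toy illustration of Liang-style pattern matching.
# BASE is an assumed, simplified subset, NOT the actual hyph-tr patterns.
CONSONANTS = "cçgklns"
VOWELS = "aeu"
BASE = [f"1{c}1" for c in CONSONANTS] + [f"2{v}1" for v in VOWELS]


def parse_pattern(pattern):
    """Split '2e2cek.' into letters 'ecek.' and weights [2, 2, 0, 0, 0, 0]."""
    letters, weights = [], [0]
    for ch in pattern:
        if ch in "0123456789":
            weights[-1] = int(ch)      # digit applies to the current gap
        else:
            letters.append(ch)
            weights.append(0)          # open a new gap after this letter
    return "".join(letters), weights


def hyphenate(word, patterns, left_min=2, right_min=2):
    """Return word with '-' at every gap whose highest weight is odd."""
    dotted = "." + word + "."          # dots mark the word boundaries
    scores = [0] * (len(dotted) + 1)   # scores[i] = gap before dotted[i]
    for pattern in patterns:
        letters, weights = parse_pattern(pattern)
        start = dotted.find(letters)
        while start != -1:
            for k, w in enumerate(weights):
                scores[start + k] = max(scores[start + k], w)
            start = dotted.find(letters, start + 1)
    pieces, last = [], 0
    # The gap before word[j] is scores[j + 1] because of the leading dot.
    for j in range(left_min, len(word) - right_min + 1):
        if scores[j + 1] % 2 == 1:
            pieces.append(word[last:j])
            last = j
    pieces.append(word[last:])
    return "-".join(pieces)


for w in ["gelecek", "kalacak", "sunacak", "seçecek"]:
    print(w,
          hyphenate(w, BASE),
          hyphenate(w, BASE + ["2e2cek."]),
          hyphenate(w, BASE + ["2e2cek.", "2a2cak."]))
```

With only "2e2cek." added, the -ecek words lose their last break while the
-acak words keep it, which is the inconsistency discussed above.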

***

Having read the hyphenation rules of TDK, I realized that the current rules
do not hyphenate at apostrophes. The correct hyphenation of words with
apostrophes in them would be as follows (copied from the link above).

........................................ Edirne’
nin ...

........................................ Ankara’
dan ...

........................................ 1996’
da ...

An important point is that after splitting the word, no hyphen character is
used when the split occurs at the apostrophe. I am not sure whether such a
rule can be implemented with the current hyphenation algorithms. German
seems to have similar requirements as the link below suggests. The use of
the \allowbreak command manually could be a solution.

https://tex.stackexchange.com/questions/26174/allow-line-break-but-without-inserting-a-dash

Also, the fact that there are multiple apostrophe symbols should be taken
into consideration.
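
As a rough sketch of the manual \allowbreak workaround, the Python snippet
below preprocesses text and inserts \allowbreak after an apostrophe that
sits between two letters or digits. The function name and the choice of
apostrophe characters (U+0027 and U+2019) are my assumptions, and I have
not checked how the inserted penalty interacts with hyphenation of the
rest of the word.

```python
import re

# Assumed set of apostrophe characters: ASCII U+0027 and typographic U+2019.
APOSTROPHES = "'\u2019"

# An apostrophe with a letter or digit on both sides (e.g. Edirne'nin, 1996'da).
_INNER = re.compile(r"(?<=\w)([" + APOSTROPHES + r"])(?=\w)")


def add_allowbreak(text):
    """Insert '\\allowbreak ' after word-internal apostrophes.

    The trailing space ends the control word; TeX skips it, so no
    visible space is added to the output.
    """
    return _INNER.sub(lambda m: m.group(1) + "\\allowbreak ", text)


print(add_allowbreak("Edirne\u2019nin, Ankara'dan, 1996\u2019da"))
```

Quotation marks at the start or end of a word are left alone because they
do not have a letter or digit on both sides.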

***

Similar to the apostrophe, hyphenation with the em dash is also
problematic. I do not know what the best approach is for hyphenating words
with an em dash in them, but the following pages suggest some workarounds.

https://tex.stackexchange.com/questions/130687/redefining-the-emdash-so-as-to-allow-hyphenation

https://tex.stackexchange.com/questions/56657/hyphenation-problem-with-versus-textemdash

Would it make sense to add patterns with an em dash in them?

***

In addition to the patterns generated by MacKay, the following paper lists
other patterns for Turkish (take a look at Sec. 5: Syllabus Structure of
Turkish Language).

Güney Gönenç. 1973. Unique decipherability of codes with constraints with
application to syllabification of Turkish words. In Proceedings of the 5th
conference on Computational linguistics - Volume 1 (COLING '73), A.
Zampolli and N. Calzolari (Eds.), Vol. 1. Association for Computational
Linguistics, Stroudsburg, PA, USA, 183-194. DOI:
https://doi.org/10.3115/992532.992549

Therefore, it might make sense to test and expand the current set of
patterns by using a Turkish corpus. Here are some sources for such a
project.

List of Turkish words:
http://deniz.yuret.com/turkish/sozluk-boun.txt.gz?attredirects=0

Orwell's 1984 in Turkish:
https://github.com/bicici/SMTData/blob/master/1984_en-tr_SentenceAligned_ParallelCorpus.zip

One would hyphenate every word using the patterns as well as using the
rules in the TDK link above, compile a list of the words for which the two
results differ, and update the patterns accordingly. This could be a
project for those who work on natural language processing for Turkish. (To
future Turkish pattern authors who read this thread: the TeX rules for
hyphenation are described in detail in Frank Liang's thesis, and in a
shorter, concise form in Appendix H of The TeXbook by Knuth.)
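
The comparison step could look roughly like the Python sketch below. The
rule-based splitter is my simplified reading of Turkish syllabification
(every syllable has one vowel, and a single consonant before a vowel
starts the new syllable), combined with the TDK rule that no single letter
may be left at either end. It assumes lowercase input, ignores apostrophes
and dashes, and the function names are made up. The pattern-based
hyphenator is passed in as a function; the dictionary used here is only a
stand-in for real pattern output.

```python
# Assumed vowel inventory, including the circumflexed vowels.
VOWELS = set("aeıioöuüâîû")


def rule_based(word):
    """Hyphenate a lowercase word with a simplified syllable rule."""
    breaks, seen_vowel = [], False
    for i, ch in enumerate(word):
        if ch in VOWELS:
            if seen_vowel:
                # Break before the consonant that precedes this vowel,
                # or before the vowel itself if two vowels are adjacent.
                b = i if word[i - 1] in VOWELS else i - 1
                # Never leave a single letter on either side of the break.
                if b >= 2 and len(word) - b >= 2:
                    breaks.append(b)
            seen_vowel = True
    pieces, last = [], 0
    for b in breaks:
        pieces.append(word[last:b])
        last = b
    pieces.append(word[last:])
    return "-".join(pieces)


def find_mismatches(words, pattern_hyphenate):
    """List (word, pattern result, rule result) where the two differ."""
    mismatches = []
    for word in words:
        by_pattern = pattern_hyphenate(word)
        by_rule = rule_based(word)
        if by_pattern != by_rule:
            mismatches.append((word, by_pattern, by_rule))
    return mismatches


# Stand-in for a real pattern-based hyphenator (hypothetical data).
stand_in = {"gelecek": "ge-lecek", "kalacak": "ka-la-cak"}.get
print(find_mismatches(["gelecek", "kalacak"], stand_in))
```

Each reported mismatch would then need a manual check, since either the
patterns or the simplified rule could be the one that is wrong.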

I would be interested in the opinions of pattern authors on adding
rules for the apostrophe and the em dash to the Turkish pattern list. If
anyone has parsed a corpus to generate a pattern list, it would be great if
they could comment on whether what I described above is sufficient, or if
there are other points that one needs to consider.

Best wishes,
Ş. Ekin Kocabaş




Original thread below:

http://tug.org/pipermail/tex-hyphen/2015-October/001295.html

07.10.2015, 01:41, "Alex Kapranoff" <kappa at yandex.com>:
> Hello.
> Turkish hyphenation patterns are generated by a simple Ruby script
> available at
>
> http://www.ctan.org/tex-archive/language/hyph-utf8/source/generic/hyph-utf8/languages/tr
> which has this article by Pierre MacKay as its original source:
> http://www.tug.org/TUGboat/Articles/tb09-1/tb20mackay.pdf
> A curious special case is mentioned by professor MacKay and then copied
> through all the incarnations of the algorithm -- that is, the pattern
> "2e2cek." which is supposed to
> prevent splitting the "-ecek" suffix at the very end of a word. MacKay
> writes: "...nor do we really want the cek of -ecek broken off if it is at
> the end of a word" and then "The pattern
> 2e2cek. is added as a special case".
> I am not a native Turkish speaker although my level is high enough to
> notice the omission. In Turkish, many suffixes have variants to satisfy
> vowel and consonant harmony requirements.
> The other variant of "-ecek" is "-acak" which is used in words with wide
> (or back) vowels and there is no sense in adding "2e2cek." without also
> adding "2a2cak.".
> I took the liberty to Cc: S. Ekin Kocabas and H. Turgut Uyar who might not
> be on the list to maybe help and clarify the issue. They are both mentioned
> as people who participated in development of the current version of Turkish
> patterns. Unfortunately, Pierre MacKay passed away earlier this year so
> there is no way to know the original reason of this tiny little
> inconsistency.