[XeTeX] Hyphenated, transliterated Sanskrit.

Arthur Reutenauer arthur.reutenauer at normalesup.org
Mon Nov 22 15:16:41 CET 2010

> 2) That might be a stupid question, but aren't hyphenation patterns
> for most Abugida-scripts more or less the same?

  Yes, more or less.  If you check the actual files you'll see that
there are some differences between languages that use the same script.
There's not much you can do about that, since TeX can only read one list
of patterns per language.  In particular, it's not possible, from within
a TeX document, to create a modified hyphenation trie by deleting from
or inserting into an existing trie.  You need a different language.
(And you need to load the patterns in ini mode anyway.)

  You could also imagine having a master file for each Indic script
that would contain the patterns that are needed for all the languages
written using that script, and a separate file with additional patterns
for each individual language; but that seems hardly worth the effort,
for the reason below.
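
  As a rough illustration of the master-file idea, here is a minimal
Python sketch of a build step that combines a shared, script-level
pattern list with language-specific additions and writes out a single
\patterns block, which could then be loaded under one language in ini
mode.  The function names and the sample patterns are hypothetical
placeholders, not taken from any real pattern file; the merge happens
before TeX reads anything, precisely because TeX itself cannot edit an
existing trie.

```python
def merge_patterns(shared, extra):
    """Return shared patterns followed by language-specific ones.

    Identical pattern strings are kept only once, in first-seen order.
    """
    seen = set()
    merged = []
    for pattern in list(shared) + list(extra):
        if pattern not in seen:
            seen.add(pattern)
            merged.append(pattern)
    return merged


def as_tex_patterns(patterns):
    """Format a pattern list as the body of a TeX \\patterns{...} block."""
    return "\\patterns{\n" + "\n".join(patterns) + "\n}\n"


# Hypothetical placeholder data, for illustration only.
SCRIPT_COMMON = ["1क", "1ख"]      # assumed to be shared by the whole script
LANGUAGE_EXTRA = ["1ख", "1ग"]     # assumed additions for one language

if __name__ == "__main__":
    print(as_tex_patterns(merge_patterns(SCRIPT_COMMON, LANGUAGE_EXTRA)))
```

  The output is still one complete pattern list per language, so this
only saves duplication in the source files, not in what TeX loads.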

>                                                                   Lots
> of hyphenation patterns have to be duplicated, if they are ordered by
> language. While one could have a hyphen-indic.tex instead.

  You will need a separate file for Sanskrit anyway, since it can be
written in many different scripts, and there is not yet a mechanism to
switch patterns when switching scripts (it's tied to a language).
Hence, you're left with the modern Indic languages.  Among those for
which we have patterns, there happen to be only two pairs that are
written in the same script: Hindi and Marathi (in Devanagari), and
Bengali and Assamese (in Bengali); each of those files contains fewer
than 100 patterns.  It does not seem worth the trouble (although the
two files in each pair are actually exactly identical, so that we could
share one file per pair, thereby saving almost 4 kilobytes in TeX
distributions; but I wouldn't know how to name the two common files
anyway...)

  In fact, since the pattern files we have for the different Indic
languages basically list all the Unicode characters relevant for their
script, plus a few consonant clusters, they all contain about 100
patterns and take up less than 2 kilobytes; apart, of course, from
Sanskrit, for which we have patterns in half a dozen Indic scripts, plus
transliteration in Latin (~800 patterns, < 10 kB).  Compare that with the
three different files for German (reformed spelling, old spelling, old
spelling in Switzerland) that each have 14000+ patterns and weigh almost
100 kB; Norwegian (27000 patterns, ~200 kB); and finally Hungarian (>60000
patterns, >500 kB); and you'll see why I'm not eager to develop a
complicated scheme in order to share information between hyphenation
patterns that are "more or less" the same.
complicated scheme in order to share information between hyphenation
patterns that are "more or less" the same.
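
  The size argument can be put side by side with a few lines of Python.
This is only back-of-the-envelope arithmetic: the figures are the
rounded bounds quoted above ("less than 2 kilobytes", "almost 100",
"~200", ">500"), and the variable names are made up for the example.

```python
# Approximate sizes in kB, taken from the rounded figures in the text.
INDIC_FILE_KB = 2        # upper bound for one modern Indic pattern file
IDENTICAL_PAIRS = 2      # Hindi/Marathi and Bengali/Assamese

# Sharing one file per identical pair removes one file from each pair.
saving_kb = IDENTICAL_PAIRS * INDIC_FILE_KB

LARGE_FILES_KB = {
    "German (three files)": 3 * 100,
    "Norwegian": 200,
    "Hungarian": 500,
}
large_total_kb = sum(LARGE_FILES_KB.values())

# How the possible saving compares with the large pattern files.
share = saving_kb / large_total_kb

if __name__ == "__main__":
    print(f"possible saving: about {saving_kb} kB")
    print(f"large pattern files: about {large_total_kb} kB")
    print(f"saving relative to the large files: {share:.1%}")
```

  With these rounded inputs the saving comes out well under one percent
of the space taken by the German, Norwegian and Hungarian files alone.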


