[XeTeX] Bangla font question

maxwell maxwell at umiacs.umd.edu
Fri Mar 11 00:59:51 CET 2011


We're publishing a grammar of Bangla, which uses the Bengali script block
of Unicode.  We're running into a problem with the appearance of certain
vowel characters, which are supposed to appear to the *left* of the
consonant that they're pronounced after.  These include U+09BF, U+09C7 and
U+09C8.  (U+09CB and U+09CC are similar.)  (Those of you who studied
transformational grammar may be reminded of the "affix hop
transformation.") 

Normally this works just fine.  The display rules are somewhat complex,
because the Bangla writing system is one of those that has a default vowel.
Specifically, a consonant letter which is not followed by an overt vowel
sign in the writing is assumed to be followed by the default vowel in
speech.  If a consonant is *not* followed by a vowel in speech, for example
because it is the first consonant in a consonant cluster, then you're
supposed to put a special virama (or hashanta) mark under the consonant, a
diacritic indicating that no vowel follows.

When a consonant + virama appears at the end of a word, the virama
appears overtly.  In the rendering of Unicode text, a consonant + virama +
consonant is often replaced on-screen or in print by a conjunct consonant,
which is a kind of double consonant (analogous to English x = ks, but often
composed of pieces of the two consonant characters in Bangla).  Not all
fonts have all conjunct consonants, and when a font lacks a particular
conjunct, the expected representation on-screen or in print is generally
the underlying representation, i.e. consonant + virama + consonant.

There is one exception to the contraction of consonant + virama +
consonant into conjunct consonant, and that is when there's a morpheme
boundary between the two consonants (i.e. the first consonant is in the
stem, and the second consonant is in a suffix). In this case, the expected
appearance on-screen or in print would be consonant + virama + consonant,
i.e. what you'd get if the font didn't have a conjunct consonant. In order
to force this behavior, Unicode uses a ZWNJ (Zero Width Non-Joiner); the
underlying sequence
   consonant + virama + ZWNJ + consonant
is output as
   consonant + virama + consonant
rather than as a two-consonant conjunct.
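
As a concrete illustration of the two underlying sequences, here is a
minimal Python sketch that builds them and prints the code points.  The
consonants KA (U+0995) and TA (U+09A4) are arbitrary choices made only for
this example, and the helper name describe() is hypothetical; nothing here
says anything about how a particular font or XeTeX will render the result.

```python
import unicodedata

KA = "\u0995"      # BENGALI LETTER KA (arbitrary example consonant)
TA = "\u09A4"      # BENGALI LETTER TA (arbitrary example consonant)
VIRAMA = "\u09CD"  # BENGALI SIGN VIRAMA
ZWNJ = "\u200C"    # ZERO WIDTH NON-JOINER


def describe(text):
    """Return one 'U+XXXX NAME' string per code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


# Without ZWNJ: a font that has the conjunct may substitute it.
conjunct_form = KA + VIRAMA + TA
# With ZWNJ: requests consonant + visible virama + consonant.
explicit_form = KA + VIRAMA + ZWNJ + TA

for line in describe(explicit_form):
    print(line)
```

Printing the code points this way is also a quick check that the ZWNJ
actually survived in the source file, since it is invisible in most editors.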

If one of the vowel signs that hop leftward (U+09BF, U+09C7 or U+09C8) is
preceded by a conjunct consonant (underlyingly a sequence of consonant +
virama + consonant), then the vowel sign hops leftward over the whole
conjunct.

So far, so good.

However, a problem arises when consonant clusters occur across morpheme
boundaries *and* the second consonant is followed by one of the vowel signs
that is supposed to appear to the left of the consonant it's pronounced
after.  In this case, we're told that the vowel sign should appear
*between* the two consonants, rather than to the left of both consonants. 
In other words, the underlying sequence
   consonant + virama + ZWNJ + consonant + vowel
should render as
   consonant + virama + vowel + consonant
when the vowel in question is one of those that shows up to the left. 
(The ZWNJ of course doesn't appear in print.) But instead, we get
   vowel + consonant + virama + consonant
which is said to be more or less unreadable.
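
To see how often this configuration occurs in a manuscript, one could
search the source text for the underlying sequence.  The following is a
rough Python sketch, not a fix for the rendering: the function name
find_affected() is hypothetical, the consonant class is only an
approximation of the Bengali block (U+0995..U+09B9 plus U+09DC, U+09DD and
U+09DF), and only the three left-side vowel signs named above are matched
(the two-part signs U+09CB and U+09CC are left out).

```python
import re

VIRAMA = "\u09CD"                   # BENGALI SIGN VIRAMA
ZWNJ = "\u200C"                     # ZERO WIDTH NON-JOINER
LEFT_VOWELS = "\u09BF\u09C7\u09C8"  # vowel signs I, E, AI
# Approximate consonant class (an assumption made for this sketch).
CONSONANT = "[\u0995-\u09B9\u09DC\u09DD\u09DF]"

# consonant + virama + ZWNJ + consonant + left-side vowel sign
PATTERN = re.compile(
    f"{CONSONANT}{VIRAMA}{ZWNJ}{CONSONANT}[{LEFT_VOWELS}]"
)


def find_affected(text):
    """Return (offset, matched sequence) for every occurrence in text."""
    return [(m.start(), m.group()) for m in PATTERN.finditer(text)]


sample = "\u0995\u09CD\u200C\u09A4\u09BF"  # KA, virama, ZWNJ, TA, sign I
print(find_affected(sample))
```

The matches give a list of words to check by eye in the typeset output.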

I've tried numerous combinations of characters to get this to work, to no
avail.  The one that perhaps came closest was to put an optional (soft)
hyphen (U+00AD) after the virama.  This prevented the vowel from moving too
far left; unfortunately, the Bangla font we're using doesn't have this
character, so the optional hyphen showed up as a box (indicating a missing
character in the font).  I've also tried inserting a Zero Width Space
(U+200B), which was simply ignored (perhaps by XeTeX?).

Suggestions? Is there a way in XeTeX to prevent the vowel sign from
hopping over a ZWNJ?  Or is the problem in the font?  That wouldn't be
surprising, since the virama is usually omitted in text written for native
speakers, so this problem seldom comes up.  We're writing it in our grammar
for the edification of non-native speakers.

   Mike Maxwell

