[XeTeX] Devanagari ASCII to Unicode mapping

Mike Maxwell maxwell at umiacs.umd.edu
Sat Feb 17 18:15:06 CET 2018


On 2/17/2018 11:58 AM, ShreeDevi Kumar wrote:
> Before unicode, devanagari fonts used the ASCII range (legacy fonts) - 
> however AFAIK there is no standardization in the mapping, though various 
> families of fonts had similar mapping.
> 
> see http://hindi-fonts.com/tools for converters from different mappings 
> to unicode.
> 
> So,  ASCII to Unicode mapping for Devanagari will change based on the 
> font used.

Indeed!  In 2003, DARPA held a "surprise language exercise", the goal of 
which was to produce (very basic) machine translation and related tools 
for Hindi in a month's time.  I had been involved in the prep for it, to 
ensure that there would be no roadblocks (at the time, I was working at 
the LDC).  One of the 
things that Bill Poser and I verified was that there was a Unicode 
encoding for Hindi/Devanagari.  There was, but that was the wrong 
question.

The right question was whether any Hindi website used Unicode.  The 
answer to that was that the BBC and Colgate did, but hardly anyone else. 
A few Indian government sites used ISCII, which wouldn't have been 
bad, but most places used proprietary encodings that went along with a 
proprietary font.  Worse, these were not simple code-point-to-character 
encodings; it was as if the Latin letter 'l' had been encoded as 'l', 
but then 'd' had been encoded as 'c' + 'l', 'b' as 'l' + a sort of 
backwards 'c', 'p' as a lowered 'l' + the backwards 'c', etc.  It was a 
mess, and for a while it was unclear whether the exercise would fail 
because most of the data we needed was in these weird proprietary 
encodings.  (It eventually succeeded.)
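
To make the "not simple code-point-to-character" point concrete, here is 
a minimal Python sketch of what a converter for such a font has to do. 
The table below is hypothetical: the ASCII-to-Devanagari pairings are 
invented for illustration and are not taken from any particular font, 
and the function names are mine.  It shows two of the usual 
complications: a letter built from two glyphs that must become a single 
Unicode code point, and the short-i vowel sign, which is typed before 
the consonant in visual order but stored after it in Unicode.

```python
# Hypothetical legacy-font table: ASCII glyph sequences -> Unicode text.
# These pairings are invented for illustration; every real legacy font
# needs its own table.
LEGACY_TO_UNICODE = {
    "d": "\u0915",   # letter KA
    "e": "\u092E",   # letter MA
    "k": "\u093E",   # vowel sign AA
    "f": "\u093F",   # vowel sign I, typed BEFORE its consonant
    "v": "\u0905",   # letter A
    "vk": "\u0906",  # A glyph + AA stroke -> the single letter AA
}

I_MATRA = "\u093F"


def is_consonant(ch):
    """True for a single Devanagari consonant (U+0915..U+0939)."""
    return len(ch) == 1 and "\u0915" <= ch <= "\u0939"


def convert(text, table=LEGACY_TO_UNICODE):
    """Convert legacy-font text to Unicode using longest-match lookup."""
    keys = sorted(table, key=len, reverse=True)  # try "vk" before "v"
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(table[key])
                i += len(key)
                break
        else:
            out.append(text[i])  # unmapped characters pass through
            i += 1

    # Visual order (matra, consonant) -> logical order (consonant, matra).
    j = 0
    while j < len(out) - 1:
        if out[j] == I_MATRA and is_consonant(out[j + 1]):
            out[j], out[j + 1] = out[j + 1], out[j]
            j += 2
        else:
            j += 1
    return "".join(out)
```

This is only a sketch.  A real converter also has to move the short-i 
sign past a whole conjunct cluster rather than a single consonant, and 
to deal with half-forms, the reph, and whatever other glyph pieces the 
font in question happens to use.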

Mark Liberman of the LDC made some notes at the time about some of the 
issues:

http://languagelog.ldc.upenn.edu/myl/ldc/hindi_fonts_and_conversions.html

Most of that material is long out of date (and the links are probably 
broken), and these proprietary encodings have thankfully been replaced 
by Unicode; but if you're dealing with documents from that era, you 
might still run into them.  The LDC *might* still have the encoding 
converters lying around somewhere.
-- 
    Mike Maxwell
    "My definition of an interesting universe is
    one that has the capacity to study itself."
          --Stephen Eastmond

