[XeTeX] Devanagari ASCII to Unicode mapping
Mike Maxwell
maxwell at umiacs.umd.edu
Sat Feb 17 18:15:06 CET 2018
On 2/17/2018 11:58 AM, ShreeDevi Kumar wrote:
> Before unicode, devanagari fonts used the ASCII range (legacy fonts) -
> however AFAIK there is no standardization in the mapping, though various
> families of fonts had similar mapping.
>
> see http://hindi-fonts.com/tools for converters from different mappings
> to unicode.
>
> So, ASCII to Unicode mapping for Devanagari will change based on the
> font used.
Indeed! In 2003, DARPA held a "surprise language exercise", the goal of
which was to produce (very basic) MT etc. tools for Hindi, in a month's
time. I had been involved in the prep for it to ensure that there would
be no roadblocks (at the time, I was working at the LDC). One of the
things that Bill Poser and I verified was that there was a Unicode
encoding for Hindi/Devanagari. There was, but that was the wrong
question.
The right question was whether any Hindi website used Unicode. The
answer to that was that the BBC and Colgate did, but hardly anyone else.
A few Indian government sites used ISCII, which wouldn't have been
bad, but most places used proprietary encodings that went along with a
proprietary font. Worse, these were not simple code-point-to-character
encodings; it was as if the Latin letter 'l' had been encoded as 'l',
but then 'd' had been encoded as 'c' + 'l', 'b' as 'l' + a sort of
backwards 'c', 'p' as a lowered 'l' + the backwards 'c', etc. It was a
mess, and for a while it was unclear whether the exercise would fail
because most of the data we needed was in these weird proprietary
encodings. (It eventually succeeded.)
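To make the "not simple code-point-to-character" point concrete, here is a
minimal Python sketch of what a converter for such a font has to do. The
mapping table is invented for illustration and does not reproduce any real
font's encoding; the function name legacy_to_unicode is likewise hypothetical.
It assumes two things that were typical of these fonts: one character may be
stored as several glyph pieces (so the longest key must be matched first), and
the short-i vowel sign is stored in visual order, before the consonant it
follows in Unicode.

```python
import re

# Hypothetical glyph table, invented for this example (not a real font).
LEGACY_TO_UNICODE = {
    "d": "\u0915",         # KA
    "e": "\u092E",         # MA
    "k": "\u093E",         # vowel sign AA (also the vertical stroke glyph)
    "f": "\u093F",         # vowel sign I, stored BEFORE its consonant
    "[": "\u0916\u094D",   # left piece of KHA alone = half form (KHA + virama)
    "[k": "\u0916",        # left piece + vertical stroke = full KHA
}

# Try longer keys first so "[k" wins over "[" followed by "k".
_KEYS = sorted(LEGACY_TO_UNICODE, key=len, reverse=True)

I_MATRA = "\u093F"
# A consonant cluster: zero or more (consonant + virama), then a consonant.
_CLUSTER = "(?:[\u0915-\u0939]\u094D)*[\u0915-\u0939]"
_PREBASE_I = re.compile(I_MATRA + "(" + _CLUSTER + ")")


def legacy_to_unicode(text):
    """Convert text in the hypothetical legacy encoding to Unicode."""
    out = []
    i = 0
    while i < len(text):
        for key in _KEYS:
            if text.startswith(key, i):
                out.append(LEGACY_TO_UNICODE[key])
                i += len(key)
                break
        else:
            # Unknown byte (space, digit, punctuation): pass it through.
            out.append(text[i])
            i += 1
    converted = "".join(out)
    # Move the visually-first short-i sign to after its consonant cluster,
    # which is where Unicode's logical order expects it.
    return _PREBASE_I.sub(lambda m: m.group(1) + I_MATRA, converted)
```

With this table, legacy_to_unicode("f[k") first yields the vowel sign followed
by KHA, and the reordering step then swaps them. A real converter needs one
such table and rule set per font, which is why no single "ASCII to Unicode"
mapping exists.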
There are some notes here--
http://languagelog.ldc.upenn.edu/myl/ldc/hindi_fonts_and_conversions.html
--that Mark Liberman of the LDC made at the time concerning some of the
issues. Most of it is long out of date (and the links are probably
broken), and these proprietary encodings have thankfully been replaced
by Unicode; but if you're dealing with documents from that era, you
might still run into them. The LDC *might* still have the encoding
converters lying around somewhere.
--
Mike Maxwell
"My definition of an interesting universe is
one that has the capacity to study itself."
--Stephen Eastmond