[XeTeX] Devanagari ASCII to Unicode mapping

ShreeDevi Kumar shreeshrii at gmail.com
Sun Feb 18 10:10:35 CET 2018


Thank you for this info.

There is still a lot of content in Hindi being generated in non-Unicode
fonts (much of the DTP software still used in India does not support
Unicode).

>> The LDC *might* still have the encoding converters lying around
somewhere.

These would be very useful if they could be made available. There is a need
for an easy way to convert legacy documents to Unicode. One of the
applications for which someone was looking for these converters recently
was checking student projects/theses for plagiarism.
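Until such converters surface, the basic idea can be sketched in a few lines. The sketch below assumes a per-font mapping table; the table entries shown are hypothetical placeholders, since (as noted above) every legacy font family maps character sequences to glyphs differently, and a real table would have to be built for each font:

```python
# Minimal sketch of a legacy-font -> Unicode converter using greedy
# longest-match substitution. The table entries are HYPOTHETICAL
# placeholders; a real table must be compiled per font family.

LEGACY_TO_UNICODE = {
    # hypothetical: legacy character sequence -> Unicode text
    "d": "\u0926",                # say, DEVANAGARI LETTER DA
    "iz": "\u092A\u094D\u0930",   # say, a conjunct spelled with two legacy chars
}

def to_unicode(text: str, table: dict) -> str:
    """Convert legacy-encoded text, always preferring the longest match."""
    keys = sorted(table, key=len, reverse=True)  # longest sequences first
    out, i = [], 0
    while i < len(text):
        for k in keys:
            if text.startswith(k, i):
                out.append(table[k])
                i += len(k)
                break
        else:                    # no table entry matched at this position
            out.append(text[i])  # pass the character through unchanged
            i += 1
    return "".join(out)
```

Note that legacy fonts typically place pre-base vowel signs (like the i-matra) *before* the consonant, so a real converter would also need a reordering pass after substitution; that step is omitted here.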

ShreeDevi
____________________________________________________________
Bhajan - Kirtan - Aarti (devotional songs) @ http://bhajans.ramparivar.com

On Sat, Feb 17, 2018 at 10:45 PM, Mike Maxwell <maxwell at umiacs.umd.edu>
wrote:

> On 2/17/2018 11:58 AM, ShreeDevi Kumar wrote:
>
>> Before Unicode, Devanagari fonts used the ASCII range (legacy fonts) -
>> however, AFAIK there is no standardization in the mapping, though various
>> families of fonts had similar mappings.
>>
>> See http://hindi-fonts.com/tools for converters from different mappings
>> to Unicode.
>>
>> So, the ASCII to Unicode mapping for Devanagari will change based on the
>> font used.
>>
>
> Indeed!  In 2003, DARPA held a "surprise language exercise", the goal of
> which was to produce (very basic) MT etc. tools for Hindi, in a month's
> time.  I had been involved in the prep for it to ensure that there would be
> no roadblocks (at the time, I was working at the LDC).  One of the things
> that Bill Poser and I verified was that there was a Unicode encoding for
> Hindi/Devanagari.  There was, but that was the wrong question.
>
> The right question was whether any Hindi website used Unicode.  The answer
> to that was that the BBC and Colgate did, but hardly anyone else.  A few
> Indian government sites used ISCII, which wouldn't have been bad, but most
> places used proprietary encodings that went along with a proprietary font.
> Worse, these were not simple code-point-to-character encodings; it was as
> if the Latin letter 'l' had been encoded as 'l', but then 'd' had been
> encoded as 'c' + 'l', 'b' as 'l' + a sort of backwards 'c', 'p' as a
> lowered 'l' + the backwards 'c', etc.  It was a mess, and for a while it
> was unclear whether the exercise would fail because most of the data we
> needed was in these weird proprietary encodings.  (It eventually succeeded.)
>
> There are some notes here--
>
> http://languagelog.ldc.upenn.edu/myl/ldc/hindi_fonts_and_conversions.html
> --that Mark Liberman of the LDC made at the time concerning some of the
> issues.  Most of it is long out of date (and the links are probably
> broken), and these proprietary encodings have thankfully been replaced by
> Unicode; but if you're dealing with documents from that era, you might
> still run into them.  The LDC *might* still have the encoding converters
> lying around somewhere.
> --
>    Mike Maxwell
>    "My definition of an interesting universe is
>    one that has the capacity to study itself."
>          --Stephen Eastmond
>

