[XeTeX] traditional to simplified Chinese character conversion utility or data base

Thu Oct 20 23:29:45 CEST 2011

I seem to have a working solution now. Yesterday I wrote a C program
to convert the Unihan_Variants.txt file (suggested by Arthur) to an
ASCII TECkit (suggested by Zdenek) map, then used TECkit's
teckit_compile utility to convert that to a binary map, and then used
TECkit's txtconv utility (also suggested by Zdenek) to map the
traditional characters to simplified. The map files contain 12,730
Unicode-to-Unicode mapping relations each. More testing would
definitely be good (no guarantees at this point).
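
For anyone who wants to see the idea of the first step without
opening the zip file, here is a minimal Python sketch. It is not the C
source in the zip file. It assumes that each data line in
Unihan_Variants.txt has three tab-separated fields, and that when
several simplified variants are listed, keeping the first one is
acceptable. The function name and the sample lines are only
illustrative.

```python
import re

# A data line looks like "U+4E1F<TAB>kSimplifiedVariant<TAB>U+4E22".
LINE_RE = re.compile(r"^(U\+[0-9A-F]{4,6})\tkSimplifiedVariant\t(.+)$")


def variants_to_rules(lines):
    """Yield one TECkit-style rule per kSimplifiedVariant line."""
    for line in lines:
        match = LINE_RE.match(line.rstrip("\r\n"))
        if match is None:
            continue  # comment, blank line, or a different field
        source = match.group(1)
        # Assumption: when several variants are listed, keep the first one.
        target = match.group(2).split()[0]
        yield f"{source}\t>\t{target}"


sample = [
    "# illustrative sample lines, not the full file\n",
    "U+4E1F\tkSimplifiedVariant\tU+4E22\n",
    "U+4E22\tkTraditionalVariant\tU+4E1F\n",
    "U+937E\tkSimplifiedVariant\tU+949F U+953A\n",
]

for rule in variants_to_rules(sample):
    print(rule)
```

The rules printed this way still need a TECkit header (names and a
pass declaration) in front of them before teckit_compile will accept
the file.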

If anyone has interest, they can download this zip file:
  http://banyan.cm.nctu.edu.tw/~dgreenhoe/groups/var2map.zip

The zip file includes the C source code, makefile, mapping file, and
tec file, as well as a Windows executable. The included tec file is
based on the Unicode 6.1.0 standard. If a new standard becomes
available, var2map.exe and teckit_compile.exe can be run again to
update the binary mapping file.

Using make, you can change the directory paths in the makefile and enter
  "make all"
on the command line for a kind of demo. The demo maps some Latin and
traditional characters (in trad.tex) to Latin and simplified
characters (in simp.tex).
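
The effect of the demo can be imitated in a few lines of Python. This
is only a sketch and not what txtconv does internally: it uses
Python's str.translate with a hypothetical three-rule table, and the
function name rules_to_table is made up for the illustration.

```python
def rules_to_table(rules):
    """Build a str.translate table from lines like 'U+570B > U+56FD'."""
    table = {}
    for rule in rules:
        source, arrow, target = rule.split()
        if arrow != ">":
            continue  # ignore anything that is not a one-way rule
        table[int(source[2:], 16)] = chr(int(target[2:], 16))
    return table


# Hypothetical three-rule table; the real map has thousands of entries.
rules = [
    "U+4E1F > U+4E22",
    "U+570B > U+56FD",
    "U+5B78 > U+5B66",
]
table = rules_to_table(rules)

trad = "Latin text: \u570b\u5b78"  # Latin plus two traditional characters
simp = trad.translate(table)       # characters not in the table pass through
print(simp)
```

Characters with no entry in the table, such as the Latin letters, are
copied through unchanged.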

On Thu, Oct 20, 2011 at 11:47 PM, BPJ <bpj at melroch.se> wrote:
> I got the thought that this might be done at least approximatively by ...
>  $ grep 'kSimplifiedVariant' Unihan_Variants.txt \
>      |perl -ple's/kSimplifiedVariant/>/' >>tex-chi-sim-trad.map
> tex-text.map, plus some very little manual touching up
> of debris after a comment line in Unihan_Variants.txt and
> adding some descriptive comments.

It looks like this solution from BPJ does essentially the same thing
as the above-mentioned C program. Because it is a Perl script, it also
has the benefit of being cross-platform, with no need to run a C
compiler.

As a follow-up to Andy's suggestion of the Tong Wen code: I did look
into the code. I found what appears to be a good set of databases
for simplified-to-traditional conversion, but I did not find a
traditional-to-simplified solution. I did join a mailing list
for the project and posted a request for assistance, but so far have
not received any reply. Maybe the project has become dormant.

Thank you very much to everyone who gave me help on this --- Zdenek
for the TECkit suggestion, Andy for the Tong Wen suggestion, Arthur
for the Unihan_Variants suggestion, and BPJ for the Perl suggestion. I
appreciate the help very much --- I don't know if I would have ever
arrived at a solution without it.

One of the next tasks is to find quality fonts (preferably OpenType)
for Simplified Chinese, including fonts with Ruby text (Zhu-Yin or
Pin-Yin). If anyone can suggest useful font repositories,
please let me know. Thanks!

Dan

On Thu, Oct 20, 2011 at 11:47 PM, BPJ <bpj at melroch.se> wrote:
> I got the thought that this might be done at least
> approximatively by simply running the following
> command in the terminal:
>
>  $ grep 'kSimplifiedVariant' Unihan_Variants.txt \
>      |perl -ple's/kSimplifiedVariant/>/' >>tex-chi-sim-trad.map
>
> where Unihan_Variants.txt is the file from the Unicode
> Unihan database and tex-chi-sim-trad.map is a copy of
> tex-text.map, plus some very little manual touching up
> of debris after a comment line in Unihan_Variants.txt and
> adding some descriptive comments. The results are attached.
>
> /bpj
>
> On 2011-10-20 00:44, Daniel Greenhoe wrote:
>>
>> Hi Arthur,
>>
>> On Thu, Oct 20, 2011 at 1:02 AM, Arthur Reutenauer
>> <arthur.reutenauer at normalesup.org>  wrote:
>>>
>>>  Unicode has that in the Unihan database:
>>>  look up Unihan_Variants.txt in Unihan.zip
>>> (latest version
>>> http://www.unicode.org/Public/6.1.0/ucd/Unihan-6.1.0d1.zip )
>>
>> It looks like I can extract everything I need from Unihan_Variants.txt.
>> Thank you so much for your help! I appreciate it very much.
>>
>> Dan
>>
>> On Thu, Oct 20, 2011 at 1:02 AM, Arthur Reutenauer
>> <arthur.reutenauer at normalesup.org>  wrote:
>>>
>>> On Tue, Oct 18, 2011 at 05:49:28AM +0800, Daniel Greenhoe wrote:
>>>>
>>>>                                     Does anyone know of any data base
>>>> with a traditional to simplified character mapping such that I could
>>>> maybe write the utility myself?
>>>
>>>  Unicode has that in the Unihan database: look up Unihan_Variants.txt
>>> in Unihan.zip (latest version
>>> http://www.unicode.org/Public/6.1.0/ucd/Unihan-6.1.0d1.zip )
>>>
>>>        Arthur
>>>
>>>
>>> --------------------------------------------------
>>> Subscriptions, Archive, and List information, etc.:
>>>  http://tug.org/mailman/listinfo/xetex
>>>