[l2h] L2H 2002 cannot handle CJK documents correctly!

Ross Moore ross@ics.mq.edu.au
Wed, 24 Apr 2002 21:42:23 +1000 (EST)


> On Wed, Apr 24, 2002, Ross Moore wrote:
> > OK; I've got it, and can reproduce the problem.
> > 
> > The fix is easy, but first a question.
> > Your example HTML files correctly have  charset = text/big5 .
> > Where is this done in the processing, or do you do it yourself
> > after LaTeX2HTML has finished ?
> 
>   It's because (if you hadn't mentioned this, I would almost have forgotten it :)
>   I have a ~/.latex2html-init containing:
> 
> $ADDRESS = "<I>Compiled by Edward G.J. Lee ($address_data[1])</I>";
> $default_language = 'taiwanese';
> $TITLES_LANGUAGE = "taiwanese";
> $charset = "big5";
> $BOTTOM_NAVIGATION = 1;

Ahah; there's the culprit.
 
>   So, I didn't do anything after executing ``latex2html''. The
>   'taiwanese' setting is for testing only.
> 
> > By simply inserting 2 lines into  CJK.perl  the problem
> > is fixed, and this charset is set automatically:
> > 
> > 
> > 	package main;
> > 
> > 	$charset = 'big5'; 	## insert these 2 lines
> > 	$CHARSET = 'big5';	##
> > 
> > 
> > This should be sufficient for documents that have just Big5 characters.
> > 
> > Please advise if you have example documents where this is not sufficient.
> 
>   Thanks, but I guess configuring the rc file may be more convenient,
>   because sometimes we might write HTML in UTF-8 or another charset.

Yes. Werner pointed out the same problem.

I'm going to update the LaTeX2HTML repository with the following
patch to  CJK.perl :

landau.ics.mq.edu.au> cvs diff CJK.perl
Index: CJK.perl
===================================================================
RCS file: /home/latex2ht/cvs/latex2html/user/styles/CJK.perl,v
retrieving revision 1.5
diff -r1.5 CJK.perl
82a83,106
> # possible values for the 1st optional argument to \begin{CJK}
> # and the corresponding charset:
> 
> %CJK_charset = (
>         'Bg5'    , 'big5'
>       , 'Bg5+'   , 'big5+'
>       , 'GB'     , 'gb_2312'
>       , 'GBt'    , 'gbt_12345'
>       , 'GBK'    , 'gbk'
>       , 'JIS'    , 'jisx_0208'
>       , 'SJIS'   , 'sjis'
>       , 'KS'     , 'ks_1001'
>       , 'UTF8'   , 'utf8'
>       , 'EUC-TW' , 'euc-tw'
>       , 'EUC-JP' , 'euc-jp'
> );
> 
> # Use 'Bg5' => 'big5' as default charset, for both input and output,
> # unless it is set already with a value for  $CJK_AUTO_CHARSET
> 
> $CJK_AUTO_CHARSET = '' unless (defined $CJK_AUTO_CHARSET);
> $charset = $CHARSET = $CJK_AUTO_CHARSET || $CJK_charset{'Bg5'};
> 
> 
118c142,155
<     &get_next_optional_argument;
---
>     my ($cjk_enc) = &get_next_optional_argument;
>     $cjk_enc =~ s/^\s+|\s+$//g;
>     if ($cjk_enc) {
>       if (!defined $CJK_charset{$cjk_enc}) {
>           &write_warning ( "unknown charset code: $cjk_enc in CJK environment.");
>       } elsif (!$CJK_AUTO_CHARSET) {
>           $CJK_AUTO_CHARSET = $charset = $CHARSET = $CJK_charset{$cjk_enc};
>       } elsif ($CHARSET eq $CJK_charset{$cjk_enc}) {
>           # compatible; do nothing.
>       } else {
>           &write_warning ( "Only one charset allowed per document: $CHARSET");
>           &write_warning ( "Ignoring request for ".$CJK_charset{$cjk_enc});
>       }
>     }


Please advise ASAP if there is anything here that you think is incorrect
or inadequate.

Note how there is now a variable  $CJK_AUTO_CHARSET  which can be set in an
initialisation file. If it is not set, then the first  {CJK} or {CJK*}
environment that has an encoding argument will change the encoding from
the global default of  'big5'.
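
For example, a document that is UTF-8 throughout could fix the charset
up front from the initialisation file. This is only a sketch, assuming
the patched  CJK.perl  above is installed; the value 'utf8' is taken from
the  %CJK_charset  table in the patch, and the closing  1;  follows the
usual convention for LaTeX2HTML init files:

```perl
# ~/.latex2html-init  (sketch; assumes the patched CJK.perl is in use)

# Fix the charset for the whole document. The value should be one of
# those listed in %CJK_charset, e.g. 'big5', 'gbk', 'sjis', 'utf8'.
$CJK_AUTO_CHARSET = 'utf8';

1;  # an init file is loaded as Perl code, so it ends with a true value
```

With this set, a  {CJK}  environment requesting a different encoding
only produces the "Only one charset allowed per document" warning,
and the request is ignored.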


Please apply the patch, and report any problems.


All the best,

	Ross Moore

 
> > The errors, without these charset settings, occurred because
> > some 8-bit characters were being translated back to TeX accents, or
> > to macros for mathematical symbols, according to the latin-1 use of those
> > characters. This is clearly inappropriate for a CJK document.
> > 
> > 
> > Hope this helps,
> > 
> > 	Ross Moore
> 
>   I see, thanks for the clear explanations.


You're welcome.
Thanks for making me look at CJK.perl .
Until today, I'd never studied that package.  :-)
 
> 
> Rgds,
> Edward G.J. Lee