[l2h] L2H 2002 cannot handle CJK documents correctly!
Ross Moore
ross@ics.mq.edu.au
Wed, 24 Apr 2002 21:42:23 +1000 (EST)
> On Wed, Apr 24, 2002, Ross Moore wrote:
> > OK; I've got it, and can reproduce the problem.
> >
> > The fix is easy, but first a question.
> > Your example HTML files correctly have charset = text/big5 .
> > Where is this done in the processing, or do you do it yourself
> > after LaTeX2HTML has finished ?
>
> It's because (I had almost forgotten until you mentioned it :)
> I have a ~/.latex2html-init containing:
>
> $ADDRESS = "<I>Compiled by Edward G.J. Lee ($address_data[1])</I>";
> $default_language = 'taiwanese';
> $TITLES_LANGUAGE = "taiwanese";
> $charset = "big5";
> $BOTTOM_NAVIGATION = 1;

Ahah; there's the culprit.

> So I didn't do anything after running ``latex2html''. The
> 'taiwanese' setting is only for testing.
>
> > By simply inserting 2 lines into CJK.perl the problem
> > is fixed, and this charset is set automatically:
> >
> >
> > package main;
> >
> > $charset = 'big5'; ## insert these 2 lines
> > $CHARSET = 'big5'; ##
> >
> >
> > This should be sufficient for documents that have just Big5 characters.
> >
> > Please advise if you have example documents where this is not sufficient.
>
> Thanks, but I think configuring it in the rc file may be more convenient,
> because sometimes we might write HTML in UTF-8 or another charset.

Yes. Werner pointed out the same problem.
I'm going to update the LaTeX2HTML repository with the following
patch to CJK.perl:

landau.ics.mq.edu.au> cvs diff CJK.perl
Index: CJK.perl
===================================================================
RCS file: /home/latex2ht/cvs/latex2html/user/styles/CJK.perl,v
retrieving revision 1.5
diff -r1.5 CJK.perl
82a83,106
> # possible values for the 1st optional argument to \begin{CJK}
> # and the corresponding charset:
>
> %CJK_charset = (
> 'Bg5' , 'big5'
> , 'Bg5+' , 'big5+'
> , 'GB' , 'gb_2312'
> , 'GBt' , 'gbt_12345'
> , 'GBK' , 'gbk'
> , 'JIS' , 'jisx_0208'
> , 'SJIS' , 'sjis'
> , 'KS' , 'ks_1001'
> , 'UTF8' , 'utf8'
> , 'EUC-TW' , 'euc-tw'
> , 'EUC-JP' , 'euc-jp'
> );
>
> # Use 'Bg5' => 'big5' as default charset, for both input and output,
> # unless it is set already with a value for $CJK_AUTO_CHARSET
>
> $CJK_AUTO_CHARSET = '' unless (defined $CJK_AUTO_CHARSET);
> $charset = $CHARSET = $CJK_AUTO_CHARSET || $CJK_charset{'Bg5'};
>
>
118c142,155
< &get_next_optional_argument;
---
> my ($cjk_enc) = &get_next_optional_argument;
> $cjk_enc =~ s/^\s+|\s+$//g;
> if ($cjk_enc) {
> if (!defined $CJK_charset{$cjk_enc}) {
> &write_warning ( "unknown charset code: $cjk_enc in CJK environment.");
> } elsif (!$CJK_AUTO_CHARSET) {
> $CJK_AUTO_CHARSET = $charset = $CHARSET = $CJK_charset{$cjk_enc};
> } elsif ($CHARSET eq $CJK_charset{$cjk_enc}) {
> # compatible; do nothing.
> } else {
> &write_warning ( "Only one charset allowed per document: $CHARSET");
> &write_warning ( "Ignoring request for ".$CJK_charset{$cjk_enc});
> }
> }

Please advise ASAP if there is anything here that you think is incorrect
or inadequate.

Note how there is now a variable $CJK_AUTO_CHARSET which can be set in an
initialisation file. If it is not set, then the first {CJK} or {CJK*}
environment that has an encoding argument will change the encoding from
the global default of 'big5'.
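
For example, a site that mostly processes GB-encoded documents could preset the variable in ~/.latex2html-init. This is only a sketch: the value 'gb_2312' is taken from the %CJK_charset table in the patch, and it assumes the initialisation file is read before CJK.perl is loaded, as appears to be the case for the settings quoted earlier.

```perl
# ~/.latex2html-init  (sketch; assumes this file is read before CJK.perl)

# Preset the charset, so that CJK.perl does not fall back to 'big5'.
# The value should be one of those on the right-hand side of %CJK_charset.
$CJK_AUTO_CHARSET = 'gb_2312';

1;  # conventional last line, so that the file returns a true value
```

With such a preset, an environment whose encoding argument maps to a different charset should only trigger the two warnings shown in the patch; the preset value stays in force.
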
Please apply the patch, and report any problems.
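
To try the selection logic without running LaTeX2HTML, something like the following standalone script could be used. It is only a sketch of the rules in the patch: the function name resolve_cjk_charset and its return convention are hypothetical, the table is a subset of %CJK_charset, and it compares against the auto-charset rather than $CHARSET, which is equivalent only if $CHARSET has not been changed elsewhere.

```perl
#!/usr/bin/perl
# Standalone sketch of the charset selection in the CJK.perl patch.
# resolve_cjk_charset() is a hypothetical helper; CJK.perl itself works
# on the globals $charset, $CHARSET and $CJK_AUTO_CHARSET, and reports
# problems through &write_warning.
use strict;
use warnings;

# Subset of the %CJK_charset table from the patch.
my %CJK_charset = (
    'Bg5'  => 'big5',
    'GB'   => 'gb_2312',
    'SJIS' => 'sjis',
    'UTF8' => 'utf8',
);

# Takes the current auto-charset ('' if none yet) and the encoding
# argument of one environment.
# Returns the new auto-charset, followed by any warning messages.
sub resolve_cjk_charset {
    my ($auto, $enc) = @_;
    $enc = '' unless defined $enc;
    $enc =~ s/^\s+|\s+$//g;                  # trim surrounding whitespace
    return ($auto) unless $enc;              # no argument: nothing changes

    my $wanted = $CJK_charset{$enc};
    return ($auto, "unknown charset code: $enc in CJK environment.")
        unless defined $wanted;
    return ($wanted) unless $auto;           # first request is adopted
    return ($auto) if $auto eq $wanted;      # compatible; do nothing
    return ($auto,
            "Only one charset allowed per document: $auto",
            "Ignoring request for $wanted");
}

# Simulate a document whose environments ask for GB first, then Bg5.
my $auto = '';
for my $enc ('GB', 'Bg5') {
    my ($new_auto, @warnings) = resolve_cjk_charset($auto, $enc);
    $auto = $new_auto;
    my $charset = $auto || $CJK_charset{'Bg5'};   # 'big5' is the default
    print "$enc -> $charset\n";
    print "  warning: $_\n" for @warnings;
}
```

The first environment switches the charset to 'gb_2312'; the second request, for 'Bg5', is refused with the two warnings.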
All the best,
Ross Moore
> > The errors without these charset settings arose because
> > some 8-bit characters were being translated back to TeX accents, or
> > to macros for mathematical symbols, according to the Latin-1 meaning of those
> > characters. This is clearly inappropriate for a CJK document.
> >
> >
> > Hope this helps,
> >
> > Ross Moore
>
> I see, thanks for the clear explanations.
You're welcome.
Thanks for making me look at CJK.perl .
Until today, I'd never studied that package. :-)
>
> Rgds,
> Edward G.J. Lee