[l2h] Confused about Unicode support

Ross MOORE <ross@ics.mq.edu.au>
Sat, 3 Jul 1999 13:53:53 +1000 (EST)


> Hi,
> 
> Ross MOORE wrote:
> 
> > Ahh. OK, then the patch is not adequate.
> > Change it to:
> > 
> > $PREV_CHARSET= $CHARSET;
> >  require("$LATEX2HTMLVERSIONS${dd}latin1.pl");
> > $CHARSET=$PREV_CHARSET;
> > 
> 
> This works well, thank you very much.

Good. If you are convinced that  ...,latin2,unicode  
works correctly, and is readable in browsers,
then that is what you should be using.

The HTML page includes
  <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8">
which is what allows Netscape to get it right.


 
> > If you don't specify  ,unicode  but just  latin2  then your 8-bit
> > characters remain that way; however then \k{E} etc. become images.
> 
> Err..., that's true with latin1, the result with latin2 looks like this:
> 
> LaTeX source:
> \k{a} \'c \k{e} \l{} \'n \'o \'s \'z \.z \\
> \k{A} \'C \k{E} \L{} \'N \'O \'S \'Z \.Z
> 
> and l2h HTML output:
> &#177; &#230; &#234; &#179; &#241; &#243; &#182; &#188; &#191; <BR>
> &#161; &#198; &#202; &#163; &#209; &#211; &#166; &#172; &#175; 

OK. My mistake; these use the  &iso_map  subroutine,
defined in  latex2html .
This does the following:
 1. creates the entity name  e.g.  Aogon
 2. tries to find this in the current $CHARSET and gets the &#<num>;
 3. if 2. fails, then makes an image *provided* $ACCENT_IMAGES is 
	not empty --- it should contain the style to use; e.g. 'textrm'
 4. if 3. also fails, just omits the character entirely

Steps 3 and 4 both emit WARNINGS messages, printed at the end,
so you'll know what happened.
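
The four steps above can be sketched roughly as follows. This is only
an illustration, not the actual  &iso_map  code: the sub name
lookup_entity , the two-entry table and the placeholder <IMG> string
are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the table in latin2.pl; only two entries shown.
my %iso_8859_2_character_map = (
    'Aogon' => '&#161;',
    'aogon' => '&#177;',
);

my @warnings;   # collected here and printed at the end

# Illustrative only: not the real &iso_map from latex2html.
sub lookup_entity {
    my ($entity, $map, $accent_images) = @_;
    # Step 2: look the entity name up in the current charset's map.
    return $map->{$entity} if exists $map->{$entity};
    # Step 3: fall back to an image, provided an image style is configured.
    if (defined $accent_images && length $accent_images) {
        push @warnings, "no entity for '$entity'; using an image ($accent_images)";
        return "<IMG ALT=\"$entity\">";   # placeholder for the generated image
    }
    # Step 4: nothing worked, so the character is omitted.
    push @warnings, "no entity for '$entity'; character omitted";
    return '';
}

print lookup_entity('Aogon',  \%iso_8859_2_character_map, 'textrm'), "\n";
print lookup_entity('Agrave', \%iso_8859_2_character_map, 'textrm'), "\n";
print lookup_entity('Agrave', \%iso_8859_2_character_map, ''), "\n";
print "WARNING: $_\n" for @warnings;
```

Note that step 2 returns the latin2 position as a numeric reference,
which is exactly the behaviour being questioned in this thread.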


As Alan pointed out, this used to work in older browsers.
If newer versions have fixed it, then LaTeX2HTML should change too.


It looks to me as though step 2 is wrong.
Perhaps the entity should be searched for in just iso-8859-1
and/or iso-10646 listings ?
That is an easy-enough change to make.

Another (perhaps better) possibility is to:

 1.  look first in iso-8859-1 ; if found, use  &#<num>;
 2.  look in $CHARSET ; use  \<octal-num> if found
	unless $CHARSET =~/unicode|utf/;
 3.  use &#<bignum>;  when appropriate. 
 4.  use an image, if nothing else works
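
A minimal sketch of this alternative ordering, under stated
assumptions: the three small tables, the sub name proposed_lookup and
the placeholder <IMG> string are invented for the example, the current
8-bit charset is taken to be latin2, and "when appropriate" in step 3
is read as "when a Unicode number is known".

```perl
use strict;
use warnings;

# Hypothetical miniature tables; the real ones are far larger.
my %latin1  = ('Oacute' => 211);                   # iso-8859-1 code points
my %latin2  = ('Aogon'  => 161, 'Lstrok' => 163);  # iso-8859-2 byte values
my %unicode = ('Aogon'  => 260, 'Lstrok' => 321, 'OElig' => 338);

sub proposed_lookup {
    my ($entity, $charset) = @_;
    # 1. Found in iso-8859-1: use the numeric reference.
    return "&#$latin1{$entity};" if exists $latin1{$entity};
    # 2. Found in the current 8-bit charset: emit the raw byte,
    #    unless the output charset is unicode/utf.
    if ($charset !~ /unicode|utf/ && exists $latin2{$entity}) {
        return chr($latin2{$entity});
    }
    # 3. Otherwise use the large (Unicode) number, if one is known.
    return "&#$unicode{$entity};" if exists $unicode{$entity};
    # 4. Last resort: an image (placeholder string here).
    return "<IMG ALT=\"$entity\">";
}

print proposed_lookup('Oacute', 'iso-8859-2'), "\n";   # &#211;
print proposed_lookup('Aogon',  'utf-8'),      "\n";   # &#260;
print proposed_lookup('OElig',  'iso-8859-2'), "\n";   # &#338;
```

With  iso-8859-2  as the charset, 'Aogon' comes back as the single
byte 161 rather than as a numeric reference.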


Send me an example file for testing, and I'll implement this scheme.
Include both raw 8-bit characters and TeX accents.
If possible, also send a URL to a page that shows what you think
the results should look like.


> so again they are &#<latin2_code> and are not displayed correctly
> (tested with Netscape, Opera and hm... explorer). I have no way of
> checking it now, but I still think that those should be Unicode numbers
> (regardless of the selected charset); at least then they are displayed
> correctly.

Are they ? 
My tests show this only when  utf-8  is given as the charset.
But then, I don't have a full set of fonts for all the possible
encodings, on different platforms with different browsers and versions...

... which makes proper testing rather difficult.
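
Independently of any browser, the difference between the latin2
position and the Unicode number can be checked with Perl's  Encode
module. This small sketch assumes a Perl recent enough to ship
Encode  (5.8 or later); the two byte values are the ones that appear
as &#177; and &#161; in the output quoted above.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Bytes 177 and 161 are \k{a} and \k{A} in iso-8859-2.
for my $byte (0xB1, 0xA1) {
    my $char = decode('iso-8859-2', chr($byte));
    printf "latin2 byte %d -> Unicode &#%d;\n", $byte, ord($char);
}
# prints:
#   latin2 byte 177 -> Unicode &#261;
#   latin2 byte 161 -> Unicode &#260;
```

A browser that reads &#177; as a Unicode number shows the plus-minus
sign instead of the a-ogonek, which matches the symptom reported.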

> > What I'd like is for native-speakers to complete these modules.
> > Similarly, for translating back into specific 8-bit encodings,
> > that work should be done by someone with the need for it,
> > and the ability to do adequate testing.
> 
> I can, at least, give it a try; tell me more.
> I've played with the latin2.pl file. It looks like the translation is
> based on %iso_8859_2_character_map; e.g. by changing '&#161;' (next to
> 'Aogon') to '&#260;', or even to the latin2 8-bit character, I was able
> to get the needed entity or character in the HTML.

Yes, but that will ruin the conversion of raw 8-bit characters
to the correct &#<bignumber>;  for unicode/utf-8 .

The logic of the transformation, after the entity name has been
constructed, is described above;
there I loosely wrote  $CHARSET  to mean the hash named
 <charset>_character_map  (with - converted to _ ),
e.g.  %iso_8859_2_character_map .
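
As a small illustration of that naming convention (the helper
charset_map_name  is hypothetical and is not part of latex2html):

```perl
use strict;
use warnings;

# Illustrative helper: derive the map name from a charset string.
sub charset_map_name {
    my ($charset) = @_;
    (my $name = $charset) =~ s/-/_/g;   # iso-8859-2 -> iso_8859_2
    return "${name}_character_map";
}

print charset_map_name('iso-8859-2'), "\n";   # iso_8859_2_character_map
```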


> And what is %iso_8859_2_character_map_inv for?

This is needed when the particular portion of text ends up
being required for an image; e.g. within a {figure} or {makeimage}
or other unknown environment.
Then we must recover the LaTeX source, else image-creation
will fail.
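
Roughly, the inverse map is used like this. The two entries and the
sub name  restore_latex  are assumptions made for the example; the
real table in latin2.pl may be keyed or formatted differently.

```perl
use strict;
use warnings;

# Hypothetical inverse table: HTML entity => LaTeX source.
my %entity_to_latex = (
    '&#161;' => '\\k{A}',
    '&#177;' => '\\k{a}',
);

# Illustrative only: turn known entities back into LaTeX,
# leaving anything unknown untouched.
sub restore_latex {
    my ($html) = @_;
    $html =~ s/(&#\d+;)/exists $entity_to_latex{$1} ? $entity_to_latex{$1} : $1/ge;
    return $html;
}

print restore_latex('&#161; and &#177;'), "\n";   # \k{A} and \k{a}
```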

> And how is \k{A} translated into 'Aogon'?

Look at  sub generate_accent_commands  in the  latex2html script.
This creates further subroutines:
  do_cmd_k  do_cmd_b  do_cmd_d   etc.

(This is why you get redefinition warnings, if you try to
define commands like:  \newcommand{\b}{\beta} .)


Control sequences such as \' \` \^ etc.  get translated to 
 \acute  \grave  \circ  when  &normalize  is called
on a chunk of the input source. Later, when the main translation
is done within the prevailing environment context,
the subroutines set up by  &generate_accent_commands  are used. 
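
To give an idea of how such subroutines can be created at run time,
here is a simplified sketch. It is not the code from
generate_accent_commands : the table  %accent_suffix  is a
hypothetical subset, and these toy subs take just a letter, whereas
the real  do_cmd_...  subs work on the remaining input text.

```perl
use strict;
use warnings;

# Hypothetical subset: accent command => entity-name suffix.
my %accent_suffix = ('k' => 'ogon', 'c' => 'cedil');

# Define main::do_cmd_k, main::do_cmd_c, ... one per table entry.
for my $cmd (keys %accent_suffix) {
    my $suffix = $accent_suffix{$cmd};
    no strict 'refs';                 # needed for the symbolic glob name
    *{"main::do_cmd_$cmd"} = sub {
        my ($letter) = @_;
        return $letter . $suffix;     # 'A' . 'ogon' gives 'Aogon'
    };
}

print main::do_cmd_k('A'), "\n";   # Aogon
print main::do_cmd_c('C'), "\n";   # Ccedil
```

Because the names are generated for every accent command, a user
definition of the same command name collides with one of them.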


> > 
> > > PS: In the manual, page 15 - I think that there should be
> > > $TITLES_LANGUAGE = 'french'; rather than $LANGUAGE_TITLES = ...
> > 
> > Not sure, without checking.
Oops, yes that is an error.

> 
> Mariusz Pietrzak
> mariuszp@polbox.pl



Hope this helps clarify what LaTeX2HTML is doing.

Regards,

	Ross Moore