[l2h] Confused about Unicode support
Ross MOORE
Ross MOORE <ross@ics.mq.edu.au>
Sat, 3 Jul 1999 13:53:53 +1000 (EST)
> Hi,
>
> Ross MOORE wrote:
>
> > Ahh. OK, then the patch is not adequate.
> > Change it to:
> >
> > $PREV_CHARSET= $CHARSET;
> > require("$LATEX2HTMLVERSIONS${dd}latin1.pl");
> > $CHARSET=$PREV_CHARSET;
> >
>
> This works well, thank you very much.
Good. If you are convinced that ...,latin2,unicode
works correctly, and is readable in browsers,
then that is what you should be using.
The HTML page includes
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8">
which is what allows Netscape to get it right.
> > If you don't specify ,unicode but just latin2 then your 8-bit
> > characters remain that way; however then \k{E} etc. become images.
>
> Err..., that's true with latin1, the result with latin2 looks like this:
>
> LaTeX source:
> \k{a} \'c \k{e} \l{} \'n \'o \'s \'z \.z \\
> \k{A} \'C \k{E} \L{} \'N \'O \'S \'Z \.Z
>
> and l2h HTML output:
> ± æ ê ³ ñ ó ¶ ¼ ¿ <BR>
> ¡ Æ Ê £ Ñ Ó ¦ ¬ ¯
OK. My mistake; these use the &iso_map subroutine,
defined in the latex2html script.
This does the following:
1. creates the entity name e.g. Aogon
2. tries to find this in the current $CHARSET and gets the &#<num>;
3. if 2. fails, then makes an image *provided* $ACCENT_IMAGES is
not empty --- it should contain the style to use; e.g. 'textrm'
4. if 3. also fails, just omits the character entirely
Steps 3 and 4 both emit warning messages, printed at the end,
so you'll know what happened.
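
The four steps above can be sketched roughly as follows. This is an illustration in Python, not the Perl code of &iso_map itself: the function name, the table layout, and the image markup are all hypothetical, and only the two latin2 codes (161 and 177 for Aogon and aogon) come from the iso-8859-2 encoding.

```python
# Hypothetical stand-ins for the per-charset tables (entity name -> &#<num>;).
CHARACTER_MAPS = {
    "iso_8859_2": {"Aogon": "&#161;", "aogon": "&#177;"},
    "iso_8859_1": {"Agrave": "&#192;"},
}

warnings = []  # messages collected here would be printed at the end of a run


def iso_map_sketch(base, accent, charset, accent_images=""):
    # Step 1: build the entity name, e.g. "A" + "ogon" -> "Aogon".
    entity = base + accent
    # Step 2: look the name up in the table for the current charset.
    table = CHARACTER_MAPS.get(charset.replace("-", "_"), {})
    if entity in table:
        return table[entity]
    # Step 3: fall back to an image, but only if a style such as
    # 'textrm' has been configured (assumed placeholder markup).
    if accent_images:
        warnings.append("image used for " + entity)
        return '<IMG ALT="' + entity + '">'
    # Step 4: nothing worked, so the character is omitted entirely.
    warnings.append("character omitted: " + entity)
    return ""
```

With these tables, iso_map_sketch("A", "ogon", "iso-8859-2") gives &#161;, which is exactly the latin2 code that browsers then display as a latin1 character.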
As Alan pointed out, this used to work in older browsers.
If newer versions have fixed it, then LaTeX2HTML should change too.
It looks to me as though step 2 is wrong.
Perhaps the entity should be searched for in just iso-8859-1
and/or iso-10646 listings ?
That is an easy-enough change to make.
Another (perhaps better) possibility is to:
1. look first in iso-8859-1 ; if found, use &#<num>;
2. look in $CHARSET ; use \<octal-num> if found
unless $CHARSET =~/unicode|utf/;
3. use &#<bignum>; when appropriate.
4. use an image, if nothing else works
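
A minimal sketch of this proposed lookup order, again in Python rather than Perl. The table names, the function name, and the image placeholder are hypothetical. Reading "when appropriate" in step 3 as "when the charset is unicode/utf" is an assumption, as is returning bytes to stand for the raw 8-bit character of step 2.

```python
import re

# Hypothetical tables: entity name -> numeric code.
LATIN1 = {"Agrave": 192}                         # iso-8859-1
CHARSET_MAPS = {"iso_8859_2": {"Aogon": 0o241}}  # 8-bit code in that charset
UNICODE = {"Aogon": 260}                         # Unicode code point


def proposed_lookup(entity, charset):
    # Step 1: iso-8859-1 first; if found, use &#<num>;
    if entity in LATIN1:
        return b"&#%d;" % LATIN1[entity]
    is_unicode = re.search(r"unicode|utf", charset) is not None
    # Step 2: the current charset; emit the raw 8-bit character,
    # unless the charset is unicode/utf.
    table = CHARSET_MAPS.get(charset.replace("-", "_"), {})
    if entity in table and not is_unicode:
        return bytes([table[entity]])
    # Step 3 (assumed reading): &#<bignum>; for a unicode/utf charset.
    if is_unicode and entity in UNICODE:
        return b"&#%d;" % UNICODE[entity]
    # Step 4: an image, if nothing else works (placeholder markup).
    return b'<IMG ALT="%s">' % entity.encode("ascii")
```

Under these assumptions Aogon becomes the single byte 0xA1 in an iso-8859-2 page and &#260; in a utf-8 page, so neither case relies on the browser interpreting a latin2 number as an entity.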
Send me an example file for testing, and I'll implement this scheme.
Include both raw 8-bit characters and TeX accents.
If possible, also send a URL to a page that shows what you think
the results should look like.
> so again they are &#<latin2_code> and are not displayed correctly
> (tested with Netscape, Opera and hm... explorer). I have no way of
> checking it now, but I still think that those should be Unicode
> numbers (regardless of the selected charset),
> at least then they are displayed correctly.
Are they ?
In my tests they are, but only when utf-8 is given as the charset.
But then, I don't have a full set of fonts for all the possible
encodings, on different platforms with different browsers and versions...
... which makes proper testing rather difficult.
> > What I'd like is for native-speakers to complete these modules.
> > Similarly, for translating back into specific 8-bit encodings,
> > that work should be done by someone with the need for it,
> > and the ability to do adequate testing.
>
> I may, at least, give it a try, tell me more.
> I've played with the latin2.pl file. It looks like the translation is
> based on %iso_8859_2_character_map, e.g. by changing '¡' (next to
> 'Aogon') to 'Ą' or even a latin2 8-bit character I was able to get
> the needed entity or character in HTML.
Yes, but that will ruin the conversion of raw 8-bit characters
to the correct &#<bignumber>; for unicode/utf-8.
The logic of the transformation, after the entity name has been
constructed, is described above;
there I loosely wrote $CHARSET to mean the corresponding
..._character_map hash (with - converted to _),
e.g. %iso_8859_2_character_map for iso-8859-2.
> And what is %iso_8859_2_character_map_inv for?
This is needed when the particular portion of text ends up
being required for an image; e.g. within a {figure} or {makeimage}
or other unknown environment.
Then we must recover the LaTeX source, else image-creation
will fail.
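
As an illustration of that recovery step, here is a small Python sketch. It assumes the inverse table maps each &#<num>; back to a LaTeX accent command; the table contents and the function name are hypothetical, not copied from latin2.pl.

```python
import re

# Assumed shape of an inverse map: entity code -> LaTeX source.
CHARACTER_MAP_INV = {"&#161;": "\\k{A}", "&#177;": "\\k{a}"}


def revert_to_latex(text):
    # Replace every known &#<num>; with its LaTeX source, so the text
    # can be handed to LaTeX for image creation; unknown codes are kept.
    return re.sub(
        r"&#\d+;",
        lambda m: CHARACTER_MAP_INV.get(m.group(0), m.group(0)),
        text,
    )
```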
> And how is \k{A} translated into 'Aogon'?
Look at sub generate_accent_commands in the latex2html script.
This creates further subroutines:
do_cmd_k do_cmd_b do_cmd_d etc.
(This is why you get redefinition warnings, if you try to
define commands like: \newcommand{\b}{\beta} .)
Control sequences such as \' \` \^ etc. get translated to
\acute \grave \circ when &normalize is called,
on a chunk of the input-source. Later, when the main translation
is done within the prevailing environment context,
then the subroutines set up by &generate_accent_commands are used.
> >
> > > PS:In manual, page 15 - I think that there should be
> > > $TITLES_LANGUAGE = 'french'; rather than $LANGUAGE_TITLES = ...
> >
> > Not sure, without checking.
Oops, yes that is an error.
>
> Mariusz Pietrzak
> mariuszp@polbox.pl
Hope this helps clarify what LaTeX2HTML is doing.
Regards,
Ross Moore