[l2h] Confused about Unicode support

Ross MOORE <ross@ics.mq.edu.au>
Thu, 1 Jul 1999 09:41:22 +1000 (EST)

> Hi,
> Ross MOORE wrote:
> > Hmm. It certainly works correctly if you use \L and \l
> > for the Polish L characters; so I presume that you are using
> > upper-plane (129-255) characters directly in the source, yes ?
> Yes
> > OK, I think I see what is causing the problem.
> > In the file  ...../versions/unicode.pl
> > there is a line near the top:
> > 
> >         require("$LATEX2HTMLVERSIONS${dd}latin1.pl");
> > 
> > Change this to read:
> > 
> > require("$LATEX2HTMLVERSIONS${dd}latin1.pl") if ($CHARSET =~/iso\-8859\-1/);
> > 
> Thanks, the patch works, but ... 
> how about generating "polish" characters without using 8-bit font,
> (and without using images), by using standard commands:
> \k{a} \'c \k{e} \l{} \'n \'o \'s \'z \.z 
> \k{A} \'C \k{E} \L{} \'N \'O \'S \'Z \.Z 
> This worked with "-html_version 3.2,latin2,unicode" switch.
> Now (after the above patch) it works except \'o and \'O (l2h can't 
> convert them into available encodings - is it OK? before the patch it
> could). 

Ahh. OK, then the patch is not adequate.
Change it to:


Now you'll get all the TeX definitions again,
as with the default latin1,
but the translation table will map the
raw latin2-encoded characters to their Unicode code points.

> And when using Latin2 output ("-html_version 3.2,latin2"), the
> characters generated as above appear as &#<latin2_number>
> (only \'o and \'O appear as regular characters); thus at least
> my Netscape can't display them

One day Netscape will have full font support for all of Unicode.
There is probably a way to do it already, by defining mapping tables
into fonts on your system; however I don't know how to do this,
and it is probably a bit different on different platforms.
Any advice on this would be much appreciated.

> correctly - I think that &# requires the Unicode number (regardless
> of the selected charset), and maybe in future it would be possible to
> generate 8-bit characters rather than entities.

I haven't attempted to do this, mainly through lack of means
to do adequate testing, and also because latin1 is the official
charset of HTML 3.2 and Unicode is the document character set
for HTML 4.0.
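For example (a sketch in Python rather than latex2html's Perl, just to show the numbers involved): a character's byte value in latin2 and its Unicode code point generally differ, and browsers resolve &#N; against Unicode.

```python
ch = "\u0105"                          # a-ogonek, the Polish \k{a}
print(ch.encode("iso-8859-2")[0])      # 177 -- its byte value in latin2
print(ord(ch))                         # 261 -- its Unicode code point
# A browser treats &#177; as U+00B1 (the plus-minus sign), so only
# &#261; displays \k{a}, whatever charset the page declares.
```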

If you don't specify  ,unicode  but just  latin2  then your 8-bit
characters remain that way; however then \k{E} etc. become images.

I did some of the work needed to recognise other input encodings
and convert them to Unicode; however, it isn't complete
(e.g. for Greek, Arabic, Hebrew, etc.).
What I'd like is for native-speakers to complete these modules.
Similarly, for translating back into specific 8-bit encodings,
that work should be done by someone with the need for it,
and the ability to do adequate testing.
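The overall shape of such an input-encoding module is simple enough. A minimal sketch (in Python here, though the real modules are Perl, and the function name is illustrative only): decode the raw 8-bit input to Unicode, then emit &#N; entities for anything outside plain ASCII.

```python
def to_entities(raw_bytes, input_charset="iso-8859-2"):
    """Decode raw 8-bit input to Unicode, then write &#N; entities
    for every character outside plain ASCII."""
    text = raw_bytes.decode(input_charset)
    return "".join(c if ord(c) < 128 else f"&#{ord(c)};" for c in text)

# latin2 bytes for a-ogonek (0xB1) and c-acute (0xE6):
print(to_entities(b"\xb1\xe6"))        # &#261;&#263;
```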

> And one more question:
> Is there a difference between \usepackage[latin2]{inputenc}
> and setting latin2 using $CHARSET and $HTML_VERSION, and which
> one is the better way?

Things happen at slightly different times in the processing,
but the final result is supposed to be equivalent.

> PS: In the manual, page 15 - I think that there should be
> $TITLES_LANGUAGE = 'french'; rather than $LANGUAGE_TITLES = ...

Not sure, without checking.
I think either works, with one variable inheriting the value
of the other if only one has a value.
If *both* are defined, I'd need to check which one wins.

Hope this helps,

	Ross Moore