[l2h] Confused about Unicode support

Alan J. Flavell <flavell@a5.ph.gla.ac.uk>
Thu, 1 Jul 1999 12:58:13 +0100 (BST)


On Thu, 1 Jul 1999, Ross MOORE wrote:

> But let's confirm that.

I think the answer is "yes"; I'm confident I understand how HTML works
(and how HTML browsers are at least _supposed_ to work), but at some
points I'll have to take your word for it as to what latex2html
options are actually doing, as I realise my understanding is not
entirely clear.

> The character encoding is the encoding of the characters in the HTML
> file,

OK
[...]
> Stated loosely, this is the "input-encoding".
> When this is the value of $CHARSET then LaTeX2HTML does not generate
> parameter entities &#<number>;  with <number> larger than 255.

OK

> However, there is no attempt made to verify the validity of any
> parameter entities included within {rawhtml} environments.

Noted.

> When using command-line options such as:
> 
>     -html_version 4.0,.....,unicode,....
> 
> the situation is quite different:
> 
>   A. content="text/html; charset=utf8"  is put into the <META> tag;
>   B. the HTML file contains only ASCII characters

Right.  This corresponds to the "conservative" (Netscape-compatible)
recommendation that I made at
http://ppewww.ph.gla.ac.uk/~flavell/charset/quick.html#cons 

>        (except perhaps in {rawhtml} environments)

This is slightly worrying.  If people put coded characters into their
rawhtml environments, then they could get scrambled if users aren't
cautioned about the problem.

There need to be two independent parameters, one specifying how the
input latex is coded, and one specifying the desired HTML coding.  
rawhtml seems to be a pitfall in this regard, no?
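
To make the two-parameter idea concrete, here is a minimal sketch in
Python (not l2h's own Perl; the function name and both parameter names
are hypothetical, not existing latex2html options).  It decodes the
input under one declared coding and re-encodes under another, falling
back to &#<number>; references for anything the output coding cannot
represent:

```python
def transcode(raw: bytes, input_encoding: str, output_encoding: str) -> bytes:
    """Decode bytes under the input coding, re-encode under the output coding.

    Both encoding names are illustrative parameters, not latex2html options.
    """
    text = raw.decode(input_encoding)
    # Characters the output coding cannot represent are written as
    # decimal numeric character references, e.g. &#233;
    return text.encode(output_encoding, errors="xmlcharrefreplace")


latin1_source = "caf\u00e9".encode("iso-8859-1")   # one octet for e-acute
as_utf8 = transcode(latin1_source, "iso-8859-1", "utf-8")
as_ascii = transcode(latin1_source, "iso-8859-1", "us-ascii")
```

The point is only that the input and output codings are stated
separately, so neither has to be guessed from the other.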

>   C. any non-ASCII characters in the original LaTeX source are
>       translated into &#<number>; (up to 255) or &#<bignumber>;
>       according to Unicode (or ISO-10646) to the extent that such
>       translations have been implemented;

OK

> If the  -entities  command-line switch is also used, then most math
> symbols become "named entities"; e.g.  \alpha --> &alpha;  etc.

This is another point on which Netscape falls down, although as far as
HTML4.0 is concerned, what l2h is doing is entirely correct.  I
propose you ignore the shortcomings of NS <= 4.* in this regard; there
are more important things to invest a limited amount of effort on, and
NS will eventually release version 5 (whatever it'll be called), where
all known problems will be ... replaced by unknown problems.  ;-)
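
For illustration, the choice between named and numeric references can
be sketched as below.  This is Python rather than l2h's Perl, the
function name and its flag are hypothetical, and the name table is
Python's standard copy of the HTML 4 entity list, which may not match
whatever table l2h uses internally:

```python
from html.entities import codepoint2name


def reference_for(ch: str, use_named: bool = True) -> str:
    """Return an ASCII-only representation of one character.

    use_named is a hypothetical flag, loosely mirroring the -entities switch.
    """
    cp = ord(ch)
    if cp < 128:
        return ch                      # plain ASCII passes through unchanged
    if use_named and cp in codepoint2name:
        return f"&{codepoint2name[cp]};"
    return f"&#{cp};"                  # numeric fallback


named = reference_for("\u03b1")                      # Greek small alpha
numeric = reference_for("\u03b1", use_named=False)
```

A browser that mishandles the named forms can then be served the
numeric ones simply by turning the flag off.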

> So use of the term 'Unicode' here refers only to the range of
> allowable parameter entities, and the scheme used to select this
> for the desired character:  &#<bignumber>;

OK

> The input-encoding is *not* Unicode or UTF8, in the sense of octet-streams.

Meaning, I think, that it had better be us-ascii, if there aren't to
be hidden traps waiting for the user.

> Nor is the output in octet-streams.
> (It would be greatly appreciated if someone could provide Perl
> code for doing this.

http://ppewww.ph.gla.ac.uk/~flavell/charset/how-to.html
might be helpful in this regard?
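
As a rough sketch of the idea (in Python, not the Perl that was asked
for; the function name is made up for this example), the &#<number>;
references in an otherwise ASCII file can be turned into utf-8 octets
like this.  It assumes decimal references only, and deliberately leaves
the references for &, <, > and " alone so that the markup is not
disturbed:

```python
import re

_NUMERIC_REF = re.compile(r"&#(\d+);")
_KEEP_ESCAPED = {34, 38, 60, 62}       # " & < > must stay as references


def references_to_utf8(html_text: str) -> bytes:
    """Replace decimal numeric character references with utf-8 octets."""
    def replace(match: re.Match) -> str:
        cp = int(match.group(1))
        if cp in _KEEP_ESCAPED:
            return match.group(0)
        # Leave out-of-range values and surrogates untouched rather than fail.
        if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            return match.group(0)
        return chr(cp)

    return _NUMERIC_REF.sub(replace, html_text).encode("utf-8")


octets = references_to_utf8("&#945; &#60; &#233;")
```

The result would be served with charset=utf-8; hexadecimal references
and named entities are not handled here.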

> Does NS or MSIE support this yet? ) 

Sure.  It's one of the few options that NS handles correctly!

However, you do need to have it properly installed and with an
appropriate range of fonts at its disposal (NB it's not the name of
the font that counts here.  "Verdana" or "Times New Roman" may or may
not support an extended character repertoire; it depends which version
you've got).  In my case, I downloaded MS's multinational web font
resources for Win95, from their site.

I also have Bitstream Cyberbit font, as an experiment, but that
wouldn't be everyone's cup of tea.  The font is 12MBytes (!) and it
considerably slows down the rendering when used.  It's no longer
available from the vendor, but there is a copy out there at a download
site.  You don't really need that unless you plan to display Japanese
and Chinese and Cyrillic and Hebrew on your pages.

> I meant mapping  &#<bignumber>;  entities to (combinations of) glyphs
> drawn from the fonts available on the local system.

Indeed.

> e.g. using Symbol for the &#8xxx; math symbols, and Greek letters.

The browser is supposed to sort that out for itself!  In HTML terms,
the safest thing you can do is _not_ specify any font, and caution the
reader to configure their best font for the purpose.  Writing a Roman
letter and trying to make it look Greek by specifying FONT FACE=Symbol
(or the corresponding thing in CSS) is not a valid HTML technique,
though I can't argue with anyone (e.g. Ian Hutchinson) who claims that
it's currently going to give the desired visual results on many more
browser installations than the correct HTML can do.

> This is probably rather "Fontasmagorical", as you say.

OK, I think we've made our respective points.

> > Make that "coded characters", since the interesting character codings
> > are not limited to 8-bit characters.  This already works, even in
> > Netscape (at least on suitable platforms).
> 
> Yes; it's the "suitable platforms" limitation that deters me from
> following this path.  ;-)

Well, either it works as advertised (and then l2h's Unicode support is
useful), or it doesn't (and then one needs to use one of l2h's older
techniques).

According to my observations of browsers, useful support for
&#<bignumber>; and for utf-8-encoded data streams go pretty much hand in
hand.

However, as was remarked earlier, the use of utf-8-coded data could
be a pitfall for users who have used rawhtml environments to supply
coded characters in _their_ choice of character coding.
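
A check along the following lines could at least flag the problem.
This is a Python sketch under my own assumptions (the function name and
the three labels are invented here, and nothing like it is claimed to
exist in l2h): it classifies the octets of a rawhtml block as pure
ASCII, well-formed utf-8, or neither.

```python
def classify_rawhtml(raw: bytes) -> str:
    """Classify raw octets as 'ascii', 'utf8' or 'not-utf8'."""
    if all(octet < 0x80 for octet in raw):
        return "ascii"                 # safe under any ASCII-compatible charset
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return "not-utf8"              # would be scrambled if labelled utf-8
    return "utf8"


latin1_block = "caf\u00e9".encode("iso-8859-1")
utf8_block = "caf\u00e9".encode("utf-8")
```

Note that a block which happens to decode as utf-8 is not thereby
proven to have been written in utf-8; the check only catches the cases
that are certain to be displayed wrongly.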

> It is very easy to include such validation and/or post-processing
> as part of the LaTeX2HTML translation.

OK

all the best