[l2h] Confused about Unicode support

Thu, 1 Jul 1999 10:19:46 +0100 (BST)

On Thu, 1 Jul 1999, Ross MOORE wrote:

> One day Netscape will have full font support for all of Unicode.

At the moment, Netscape doesn't properly support even a reasonable
subset of Unicode, unless the character coding (that misleadingly-named
"charset" parameter) is set to something that implies a Unicode
encoding (e.g. utf-8).  So &#bignumber; is unavailable when one most
needs it, and works only when there is no need for it.  Paradox.

But Netscape's bugs do not define the HTML standard.  MSIE and Lynx
caught up with Alis Tango quite a while back in this respect.  And it
looks as if the development version of Mozilla has got this sorted out
(at least, I have verified this in the Win'95 version).

> There is probably a way to do it already, by defining mapping tables
> into fonts on your system; however I don't know how to do this,
> and it is probably a bit different on different platforms.

If you're talking about HTML here, rather than LaTeX, then let's try
to decouple "fonts" from character representation.  In HTML, either a
font supports a given character or it doesn't.  The HTML
specifications do _not_ envisage extending the character repertoire by
swapping named fonts (what I call "Fontasmagorical fantasies" in my
article http://ppewww.ph.gla.ac.uk/~flavell/charset/internat.html ).

> > correctly - I think that &# requires unicode number (regardless
> > selected charset),

Absolutely
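
To make that concrete, here is a minimal sketch in Python (my choice
of language for illustration; it has nothing to do with how l2h itself
works).  It assumes only the standard library's html.unescape, and the
euro sign is just an arbitrary example character.  The point it shows
is that the number in &#number; names a code point in the document
character set, whatever coding the bytes are sent in.

```python
import html

# A numeric character reference names a code point in the document
# character set (ISO 10646 / Unicode), not a byte value in whatever
# character coding the document happens to be transmitted in.
reference = "&#8364;"              # EURO SIGN, U+20AC
decoded = html.unescape(reference)

print(hex(ord(decoded)))           # 0x20ac

# The reference itself is plain ASCII, so it survives transmission in
# any ASCII-compatible coding and still resolves to the same character.
for coding in ("iso-8859-1", "utf-8", "koi8-r"):
    data = reference.encode(coding)
    assert html.unescape(data.decode(coding)) == "\u20ac"
```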

> > and maybe in future it would be possible to
> > generate 8-bit characters rather than entities.
             ^^^^^^^^^^^^^^^^

Make that "coded characters", since the interesting character codings
are not limited to 8-bit characters.  This already works, even in
Netscape (at least on suitable platforms).
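
As a rough illustration of why "8-bit" is too narrow a term, the
following Python sketch encodes one character in three codings.  The
helper name coded_forms and the choice of Greek alpha are mine, purely
for the example; only one of the three results is a single 8-bit unit.

```python
def coded_forms(char, codings):
    """Return {coding: hex string of the bytes that represent char}."""
    return {coding: char.encode(coding).hex() for coding in codings}

alpha = "\u03b1"  # GREEK SMALL LETTER ALPHA
forms = coded_forms(alpha, ("iso-8859-7", "utf-8", "utf-16-be"))

for coding, hexbytes in forms.items():
    # One byte in iso-8859-7, two bytes in each of the other codings.
    print(f"{coding:12} {hexbytes}")
```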

> I haven't attempted to do this, mainly through lack of means
> to do adequate testing, and also because  latin1  is the official
> charset of HTML3.2  and  Unicode  is the charset for  HTML4.0 .

Unicode (well, iso-10646 to be pedantic) is the Document Character Set
of HTML.  This was already advertised as the way forward in RFC1866
(HTML2.0).  Do not confuse this with the character coding (that
unfortunately-named "charset" parameter on the Content-type header).

In HTML3.2 it's true that the only "charset" that clients were
required to support was iso-8859-1, and this was assumed as default if
none was specified.  But RFC2070 had given fair warning of how other 
character codings ("charsets") are to be handled.

In HTML4.0, technically there _is_ no default charset (curiously, this
contradicts the HTTP specification).  Indeed, there is no actual
_requirement_ on client agents even to support iso-8859-1, although it
would seem crazy to fail to support it!
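
The difference between the two default rules can be sketched as
follows.  This is only an illustration in Python, using the standard
library's email.message.Message to parse the header value; the helper
name charset_of and its "default" argument are hypothetical, and real
client agents do considerably more than this.

```python
from email.message import Message

def charset_of(content_type, default=None):
    """Return the lowercased charset parameter of a Content-type
    header value, or `default` when the parameter is absent."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset(failobj=default)

# Explicit parameter: the same answer under either rule.
print(charset_of("text/html; charset=ISO-8859-1"))        # iso-8859-1

# No parameter, HTML3.2-style rule: assume iso-8859-1.
print(charset_of("text/html", default="iso-8859-1"))      # iso-8859-1

# No parameter, HTML4.0-style rule: nothing may be assumed.
print(charset_of("text/html"))                            # None
```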

> I did some of the work needed to recognise other input encodings,
> and convert to Unicode; however it isn't complete (e.g. for Greek,
> Arabic, Hebrew, etc.) 
> What I'd like is for native-speakers to complete these modules.

I'm no expert in that field, but it's been my observation that authors
working in many non-Latin writing systems are doing their HTML in
various pragmatic ways that are quite tangential to the published HTML
specifications.

It would seem a pity to crystallise those quasi-HTML techniques into
l2h, just when browser makers are finally taking the steps that were
drafted out for them back in RFC2070.

> Similarly, for translating back into specific 8-bit encodings,
> that work should be done by someone with the need for it,
> and the ability to do adequate testing.

Much of this work has already been done in related contexts, such as
the SP package, the WDG's HTML validator at
http://www.htmlhelp.org/tools/validator/ (in this regard, see
http://www.htmlhelp.org/tools/validator/supported-encodings.html )
and some Perl modules.
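
For what it's worth, the standards-conforming way of "translating
back" is simple to state: emit a character as a coded character where
the target coding has it, and as &#number; where it hasn't.  Here is
a minimal sketch in Python, relying on the built-in "xmlcharrefreplace"
error handler; the function name to_coding and the sample string are
mine, and this is not a description of what l2h or any of the tools
above actually do.

```python
def to_coding(text, coding):
    """Encode text in the given coding, falling back to decimal
    numeric character references for characters it cannot represent.
    Markup-significant characters such as & and < are NOT escaped here."""
    return text.encode(coding, errors="xmlcharrefreplace")

sample = "caf\u00e9 \u03b1"   # e-acute, then Greek small alpha

print(to_coding(sample, "iso-8859-1"))   # b'caf\xe9 &#945;'
print(to_coding(sample, "iso-8859-7"))   # b'caf&#233; \xe1'
print(to_coding(sample, "us-ascii"))     # b'caf&#233; &#945;'
```

Whichever coding is chosen, a conforming client ends up with the same
two characters, which is rather the point of the exercise.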

I see a considerable amount of discussion on the newsgroup
netscape.public.mozilla.i18n about the browser dealing with
nonstandard kinds of character representation in HTML.  Sending l2h
down that path would be a lot of work and, IMHO, misguided.  But, as
I say, my interest is in the technology of representing characters,
and I have very little live experience in using those non-Roman
character systems, so consider me prejudiced ;-)