[l2h] Re: Test with revision 1.29

"Alan J. Flavell" <flavell@a5.ph.gla.ac.uk>
Mon, 20 Sep 1999 21:54:57 +0100 (BST)


On Mon, 20 Sep 1999, Ross Moore wrote:

> Browsers seem to recognise more &#<bignumber>; entities when utf8
> is specified, than when not. 

This is a serious bug in Netscape.  I'm not aware of the effect
existing in any other browsers.  Of course, some browsers (Opera) 
still don't even _try_ to support unicode yet.

> Understandable, since that is what utf8 is for.

Excuse me blowing a raspberry at this point!

It may be "understandable" in terms of the general inability of
browser developers to read the specifications.

&#bignumber; representations are required precisely _when_ unicode
coded characters (utf-8 or some other acceptable coding of unicode)  
are _un_available, e.g. when some 8-bit coding (or even 7-bit us-ascii)
is used. So, in Netscape 4.xx this works when it isn't needed, and
fails precisely when it _is_ needed.
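To make the point concrete, here is a small sketch (mine, not from the thread): a numeric character reference names a unicode code point directly, so a conforming browser must resolve it no matter what transfer coding the document arrived in. The markup below is pure us-ascii, yet the references denote characters far outside any 8-bit repertoire.

```python
# Sketch: numeric character references identify Unicode code points
# directly, independent of the document's byte encoding.
from html import unescape

markup = "An em dash: &#8212; and a Greek alpha: &#945;"

# The markup itself is pure ASCII, so us-ascii and utf-8 transfer
# encodings produce identical bytes on the wire.
ascii_bytes = markup.encode("us-ascii")
utf8_bytes = markup.encode("utf-8")
assert ascii_bytes == utf8_bytes

# Resolving the references yields the Unicode characters either way.
text = unescape(ascii_bytes.decode("us-ascii"))
assert "\u2014" in text  # EM DASH, U+2014
assert "\u03b1" in text  # GREEK SMALL LETTER ALPHA, U+03B1
```

This is exactly the case Netscape 4.xx gets backwards: the references above are most useful when the document is confined to us-ascii, which is the situation in which that browser fails to resolve them.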

A failure to cope with this represents a fundamental
incomprehension of RFC2070, which is entirely excusable in the general
population, but hardly in a browser developer.

And indeed we see that Moz.5 finally has its teeth into this tiresome
problem.

> What is surprising is that *any* codes >255 are recognised at all
> when the charset only specifies up to 255.

If you mean it's surprising when a browser occasionally conforms to
the specification, then I suppose I can't argue.  But I don't know why
you seem to be looking for excuses for this shameless misbehaviour;
even Lynx knows how to get this right.

> Clearly they have been put under some pressure to provide characters
> not in latin1. It would be nice to have a definitive list, 

I think you're going to find that Netscape 4.x supports the character
repertoire of one of the Windows encodings, say codepage 1252, when
the incoming document uses an 8-bit coding.  I haven't studied the
details - I felt that the thing was basically so useless in this form
that it wasn't worth investigating more closely.  And hence my
previously-mentioned trick of coding everything in us-ascii, and
advertising it as utf-8 to fool Netscape.
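A quick sketch of why that trick is harmless: us-ascii is a strict subset of utf-8, so an all-ASCII document labelled as utf-8 consists of exactly the same bytes, and every receiver decodes them identically.

```python
# Sketch: every us-ascii byte sequence is already valid utf-8, so
# advertising an all-ASCII document as utf-8 changes nothing except
# the charset label the browser sees.
doc = "&#26085;&#26412;&#35486; spelled entirely in us-ascii references"
ascii_bytes = doc.encode("us-ascii")

# The identical bytes decode cleanly under utf-8.
assert ascii_bytes.decode("utf-8") == doc
```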

> or a statement of policy.

Fortunately, that's all water under the bridge now.  The policy now
is to implement the specification.  Looks as if this time the
developers really do understand the specification that they're
implementing.

> Whether browsers recognise  charset="cp1252"  correctly is something
> that I haven't tested. I'd suspect that some do, others not.

MS have stated that the correct charset value will be windows-1252,
and will be registered with IANA, to join all the other values
of windows-125x which have long since been residing there.  They
have no intention of registering cp1252, was the message I got, in
spite of its being the only designator that appears in their filing at
the unicode site!!   F.U.D., anyone?
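For what it's worth, a sketch of how this shakes out in practice: codec libraries tend to accept both spellings as aliases for the same encoding (shown here with Python's codec registry; browser behaviour with the charset parameter is a separate question, as noted above).

```python
# Sketch: the IANA name windows-1252 and the vendor name cp1252
# resolve to the same codec in Python's registry.
import codecs

assert codecs.lookup("windows-1252").name == codecs.lookup("cp1252").name
```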

> Note that cp1252  includes a few extra characters:

Look, rather than collecting them here I think you'd do better to
set a bookmark to the best "authoritative" source that I know, 
at the unicode ftp site:

ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
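As a quick illustration of what that table contains (sketched with Python's bundled codec rather than the FTP file itself): the bytes 0x80-0x9F, which are control codes in iso-8859-1, map to printable characters in windows-1252, and those are precisely the "extra" characters under discussion.

```python
# Sketch: the 0x80-0x9F range is where windows-1252 departs from
# iso-8859-1 -- smart quotes, dashes, ellipsis, the euro sign, etc.
# (Bytes such as 0x81 and 0x8D are undefined in cp1252 and omitted.)
extras = bytes([0x80, 0x85, 0x91, 0x92, 0x93, 0x94, 0x96, 0x97])
print(extras.decode("windows-1252"))

# Each maps to a code point above U+00FF, outside latin-1's repertoire.
for b in extras:
    assert ord(bytes([b]).decode("windows-1252")) > 0xFF
```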


There is of course no requirement on client agents (browsers etc.) to
accept any particular character coding, especially not a
vendor-defined one.  And indeed a pedantic reading of the RFCs would
indicate that an HTTP header that specifies a character coding that
has not been registered at IANA is in a state of sin...


I'm not sure what advice to give at this juncture.  At least, please
don't paint anything into a corner just because of the shortcomings of
one browser, no matter how popular, bearing in mind that Moz.5, if and
when it comes out, will eat all the earlier versions of _that_
vendor's line for breakfast.  Those who insisted on using its old
versions thereafter could hardly complain if forced to stick with
inline images in latex2html, hmmm?

all the best