[l2h] Re: Test with revision 1.29

Ross Moore <ross@ics.mq.edu.au>
Mon, 20 Sep 1999 11:04:35 +1000 (EST)


> 1. I have tested with html_version 3_2 and 4_0, combined with nothing,
> latin1 and latin9, the whole combined with unicode or not.
> 
> In all these tests, the only bug is "man\oe uvre" which is badly parsed
> as "man\oeuvre". I obviously expect Ross to track and correct it.

Yes; I plan to do this today.

> 
> 2. latin9 generates garbage in place of the \v{s} and \v{Z} (the characters
> are a lone umlaut, a broken |, a lone cedilla (not a circle, which I had
> examined badly) and an upper bar). => latin9 is not suited to my
> European browser, at least.

Yes; those are the latin1 symbols at the appropriate code-points.
It appears that the browser defaults to latin1 when it doesn't
recognise the stated charset.
Since latin9 is very new (perhaps not even accepted yet ?),
this behaviour is not surprising.

It is up to the browsers to fix this, else latin9 is still-born.
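
The charset is declared in the generated pages by a meta-tag, presumably
along these lines (iso-8859-15 being the registered name for latin9):

  <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-15">

A browser that doesn't recognise that name apparently just falls back
to latin1, which matches the symptoms you describe.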

> 
> 3_2 + latin1 omits the \v{} but puts correct s and Z; not so 3_2 +
> latin9.

If you set $ACCENT_IMAGES = 'textrm'  or 'textit' etc.
then you get an image of the accented characters.
The value of $ACCENT_IMAGES becomes the style command (\textrm, \textit, etc.)
used when typesetting the character for the image.

I don't think this default behaviour should be changed.
Other options could be tied to special values of $ACCENT_IMAGES.
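
For example, to have the accented characters typeset as italic images,
the variable can be set in an initialisation file (the usual
~/.latex2html-init mechanism; the exact path depends on your setup):

  # in  ~/.latex2html-init  -- this file is just Perl, read at start-up
  $ACCENT_IMAGES = 'textit';   # accented characters become images, set with \textit
  1;                           # an init file must end by returning a true value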

> 
> 3. The best results (no picture for OE and correct \v{s} \v{z}) are given
> with 3_2|4_0 + latin1|latin9 + unicode. In other words, any HTML
> version, any latin? AND unicode.
> 
> But "unicode" has the drawback of specifying charset=utf-8 which might
> be not recognised by some silly browsers. Therefore I took an html page
> generated with the "unicode" option, and replaced charset=utf-8 with
> charset=iso-8859-1.
> 
> The result was pretty good: the OE and oe were correct (although
> numerically > 255) and the \v was kept on the s or S. As expected, the
> Zcaron disappeared and was replaced with a square.

Yes, again.
Browsers seem to recognise more &#<bignumber>; entities when utf8
is specified than when not.
That is understandable, since that is what utf8 is for.
What is surprising is that *any* codes >255 are recognised at all
when the charset only specifies up to 255.
Clearly they have been put under some pressure to provide characters
not in latin1. It would be nice to have a definitive list, 
or a statement of policy.

> 
> ====> therefore, I suggest the following options:
> 
> 1. keep coding of the form &#nnn; for all characters > 127 with
> "unicode" (obvious)
> 
> 2. provide a new option which forces coding of the form &#nnn; EVEN when
> "unicode" is not stated. Note that I ask for an OPTION, not for a
> standard.

That is how LaTeX2HTML used to work. It was changed when Eastern European
users, for whom latin1 was not the usual working environment,
complained of inadequate support for their everyday needs.

This could be restored as an option;
request it by:   -html_version 3.2,unicode,latin1 

so that the last encoding specified becomes the one used,
but the entity numbering for Unicode is loaded.

I may need to do a little more programming to make this work.
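
For example (using a hypothetical  myfile.tex , just as an illustration):

  latex2html -html_version 3.2,unicode,latin1  myfile.tex

would then use the Unicode entity numbering, while the pages themselves
still declare charset=iso-8859-1.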

Note that pages done this way
 a. are not strictly valid SGML
	(since there are character references outside the stated
	 range of code-points for the charset)
 b. browsers may not recognise some characters > 255
	(as you have seen above)

> 
> 3. provide a second (orthogonal) option to code all characters defined
> in HTML 4.0 in the alphabetic form, e.g. &eacute; instead of &#233;.
> With such a representation, people having a poor browser will probably
> see  &eacute; instead of the accented letter, but it will be rather
> understandable (moreover, they can then know which characters pose
> problems).

This is a good suggestion.
Such a feature already exists for math-symbols when the  'math'
extension is loaded. It can be extended to apply to
text processing as well.

The switch is:   -entities   (or  -noentities  to turn it off
when  $USE_ENTITY_NAMES  has been set).

Programming this will be easy, except...

...for valid HTML documents, I'll need to include an appropriate
meta-tag to refer to a list of the relevant entity names.
Can anyone send me the best tags to use for this ?
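
Just to make the intended behaviour concrete, here is an illustrative
sketch (my notation, not the actual LaTeX2HTML code; the hash holds only
a sample of the HTML 4.0 names):

  #!/usr/bin/perl
  use strict;

  # use an entity name where one is known, else fall back to &#nnn;
  my %entity_name = ( 233 => 'eacute', 339 => 'oelig', 352 => 'Scaron' );

  sub char_entity {
      my ($cp) = @_;
      return exists $entity_name{$cp} ? "&$entity_name{$cp};" : "&#$cp;";
  }

  print char_entity(233), "\n";   # &eacute;
  print char_entity(381), "\n";   # &#381;  (no name in the sample, so numeric form)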


> 
> 4. provide a third (orthogonal) option to code numerically the
> characters having a duplicate position between 128 and 159, namely: 

This is the "Windows" version of Latin1, also known as "CP1252".
It is already supported in LaTeX2HTML and LaTeX,
using   -html_version 3.2,cp1252   or  -html_version 4.0,cp1252

Whether browsers recognise  charset="cp1252"  correctly is something
that I haven't tested. I'd suspect that some do, others not.
Confirmation please ?

>  131 = florin
>  132 = opening german quote (ligature ,, )
>  133 = ...
>  134 = dag
>  135 = ddag
>  136 = lone circumflex
>  137 = perthousand
>  138 = \v S
>  139 = a single opening guillemet
 ...

Note that cp1252  includes a few extra characters:

  128 = euro
  130 = single opening quote on baseline:  ,
  142 = \v Z
  158 = \v z
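
For anyone who wants to see where those positions sit in Unicode, here is
an illustrative table only, written as a Perl hash (not how LaTeX2HTML
stores its own tables):

  # cp1252 byte  =>  Unicode code-point
  my %cp1252_to_unicode = (
      128 => 0x20AC,   # euro
      130 => 0x201A,   # single opening quote on baseline
      131 => 0x0192,   # florin
      132 => 0x201E,   # opening German quote  ,,
      133 => 0x2026,   # ellipsis  ...
      134 => 0x2020,   # dag
      135 => 0x2021,   # ddag
      136 => 0x02C6,   # lone circumflex
      137 => 0x2030,   # perthousand
      138 => 0x0160,   # \v S
      139 => 0x2039,   # single opening guillemet
      142 => 0x017D,   # \v Z
      158 => 0x017E,   # \v z
  );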

> 
> My suggestions do not change the chosen standard, but if the user wants
> a different choice, he can then make it, under his own responsibility.

The option exists already.
Whether the browsers will recognise it is another question.


> 
> Note that the advantage of &#nnn; coding without "unicode" is that the
> page source is easy to look at, while Netscape views unicode HTML as two
> bytes, one of which is an empty square (terrible to look at...)
> 
Yes. Unicode HTML is meant to be 16-bit, hence octet-pairs
(i.e. 2 x 8-bit characters for each letter).

I'm already thinking about implementing this properly,
to avoid &#<bignumber>; altogether.

The other alternative is to do utf8 encoding properly,
with variable-length character representation;
i.e. 1 byte for ascii, 2 bytes for code-points < 2^{11},
3 bytes for code-points < 2^{16}, and 4, 5 or 6 for higher points.
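
A minimal sketch of that scheme (my own illustration, not code from
LaTeX2HTML; only the 1- to 4-byte forms are shown, and the 5- and 6-byte
forms follow the same pattern):

  #!/usr/bin/perl
  use strict;

  # turn one Unicode code-point into its utf8 byte sequence
  sub utf8_bytes {
      my ($cp) = @_;
      return chr($cp)                            if $cp < 0x80;      # ascii: 1 byte
      return chr(0xC0 |  ($cp >> 6))
           . chr(0x80 |  ($cp & 0x3F))           if $cp < 0x800;     # < 2^{11}: 2 bytes
      return chr(0xE0 |  ($cp >> 12))
           . chr(0x80 | (($cp >> 6) & 0x3F))
           . chr(0x80 |  ($cp & 0x3F))           if $cp < 0x10000;   # < 2^{16}: 3 bytes
      return chr(0xF0 |  ($cp >> 18))                                # higher: 4 bytes
           . chr(0x80 | (($cp >> 12) & 0x3F))
           . chr(0x80 | (($cp >> 6) & 0x3F))
           . chr(0x80 |  ($cp & 0x3F));
  }

  printf "%vX\n", utf8_bytes(381);   # the Zcaron comes out as  C5.BD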


It's probably too soon to be generally useful,
but surely it's the way to go for future developments.
Comments please.


All the best,

	Ross Moore