[l2h] Confused about Unicode support

Ross MOORE <ross@ics.mq.edu.au>
Thu, 1 Jul 1999 20:53:28 +1000 (EST)


Hi Alan,

Thanks for your comments; they are always valuable.

> 
> At the moment, Netscape doesn't properly support even a reasonable
> subset of unicode, unless the character coding (that misleadingly
> named "charset" parameter) is set to something that implies a unicode
> encoding (e.g utf-8).  So &#bignumber; is unavailable when one most
> needs it, and works only when there is no need for it.  Paradox.

Despite the looseness of my terminology, I think that we are saying
the same thing, and that LaTeX2HTML is at least trying to do
what would be expected. But let's confirm that.


The character encoding is the encoding used for the characters in the
HTML file, should it contain anything other than ASCII.
It corresponds to e.g. the 'latin2' or 'iso-8859-2' in

 1.  -html_version 3.2,latin2,.....   on the LaTeX2HTML command-line

 2.  \usepackage[latin2]{inputenc}    in the LaTeX source

 3.  content = "text/html; charset=iso-8859-2"   in a <META> tag

 4.  $CHARSET = 'iso-8859-2';         in .latex2html-init
	            (unless overridden on the command-line, as in 1.)

Stated loosely, this is the "input-encoding".
When such an encoding is the value of $CHARSET, LaTeX2HTML does not
generate numeric character references  &#<number>;  with <number>
larger than 255. Instead, images are created for glyphs that have no
code position within the specified character encoding.
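
To see how these pieces fit together, here is a small illustration,
assuming a document in the iso-8859-2 (latin2) encoding; the file
name  doc.tex  is purely for the example, and any one of the three
settings suffices:

     \usepackage[latin2]{inputenc}     % in the preamble of  doc.tex
 or
     $CHARSET = 'iso-8859-2';          # in  .latex2html-init
 or
     latex2html -html_version 3.2,latin2 doc.tex   # on the command-line

Each generated page then carries, roughly,

     <META HTTP-EQUIV="Content-Type"
           CONTENT="text/html; charset=iso-8859-2">

and, as just described, characters that the encoding can represent
come out as references no larger than  &#255; , while glyphs with no
place in that encoding become images.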

However, no attempt is made to verify the validity of any character
references included within {rawhtml} environments.


When using command-line options such as:

    -html_version 4.0,.....,unicode,....

the situation is quite different:

  A. content="text/html; charset=utf-8"  is put into the <META> tag;
  B. the HTML file contains only ASCII characters
       (except perhaps in {rawhtml} environments)
  C. any non-ASCII characters in the original LaTeX source are
      translated into &#<number>; (up to 255) or &#<bignumber>;
      according to Unicode (or ISO-10646) to the extent that such
      translations have been implemented;
  D. most TeX accents are translated to character references, as in C.
  E. most math symbols are also translated to character references.
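
To make C and D concrete, consider a source containing a c-acute,
whether typed as  \'{c}  or as a raw 8-bit character under a suitable
input encoding. In this mode the output would contain

     &#263;

(assuming the translation for this glyph has been implemented), since
LATIN SMALL LETTER C WITH ACUTE occupies position 263 (U+0107) in
Unicode/ISO-10646; the markup surrounding it is immaterial here.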

If the  -entities  command-line switch is also used, then most math
symbols become "named entities"; e.g.  \alpha --> &alpha;  etc.
Images are made for math symbols that cannot be translated to entities.
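
Purely by way of illustration: a fragment  $\alpha$  would normally
yield the numeric reference

     &#945;

(GREEK SMALL LETTER ALPHA is U+03B1, i.e. 945), whereas with  -entities
it becomes the named form

     &alpha;

subject, of course, to the Note below about how the mathematics is
being parsed.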


Note:
The comments concerning math symbols are subject to the various
modes for translating mathematics, so apply only when significant
parsing of mathematics is being performed (i.e. with  math.pl  loaded).
Even then, when math symbols occur within vertically-aligned
constructions (such as fractions), an image is still made.


So use of the term 'Unicode' here refers only to the range of
allowable character references, and to the scheme used to select
the number for the desired character:  &#<bignumber>;

Indeed only a limited subset of the allowable character references is
actually supported --- see the file  ..../versions/unicode.pl  .

The input-encoding is *not* Unicode or UTF-8, in the sense of octet streams.
Nor is the output written as UTF-8 octet streams.
(It would be greatly appreciated if someone could provide Perl
code for doing this. Does NS or MSIE support this yet? ) 
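
For what it is worth, the core of such a conversion need be little
more than the following sketch (untested, ignoring error-checking,
and with an arbitrary choice of name for the subroutine):

     # map a single code point <number>, as in &#<number>;,
     # to its UTF-8 octet sequence
     sub utf8_octets {
         my ($n) = @_;
         return chr($n) if $n < 0x80;                     # 1 octet: ASCII
         return chr(0xC0 | ($n >> 6))
              . chr(0x80 | ($n & 0x3F)) if $n < 0x800;    # 2 octets
         return chr(0xE0 | ($n >> 12))
              . chr(0x80 | (($n >> 6) & 0x3F))
              . chr(0x80 | ($n & 0x3F)) if $n < 0x10000;  # 3 octets
         return chr(0xF0 | ($n >> 18))                    # 4 octets
              . chr(0x80 | (($n >> 12) & 0x3F))
              . chr(0x80 | (($n >> 6) & 0x3F))
              . chr(0x80 | ($n & 0x3F));
     }

e.g.  utf8_octets(263)  gives the two octets 0xC4 0x87.
The awkward part is applying this consistently throughout the
translation, and knowing which browsers will honour the result.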



> But Netscape's bugs do not define the HTML standard.  MSIE and Lynx
> caught up with Alis Tango quite a while back in this respect. And it
> looks as if the development Mozilla has got this sorted out (at least
> I have verified this in the Win'95 version).
> 
> > There is probably a way to do it already, by defining mapping tables
> > into fonts on your system; however I don't know how to do this,
> > and it is probably a bit different on different platforms.
> 
> If you're talking about HTML here, rather than latex, then let's try
> to decouple "fonts" from character representation.  In HTML, either a
> font does support a given character or it doesn't.  HTML
> specifications do _not_ envisage extending the character repertoire by
> swapping named fonts (what I call "Fontasmagorical fantasies" in my
> article http://ppewww.ph.gla.ac.uk/~flavell/charset/internat.html )

I meant mapping  &#<bignumber>;  entities to (combinations of) glyphs
drawn from the fonts available on the local system.
e.g. using the Symbol font for the &#8xxx; math symbols and for Greek letters.
This is probably rather "Fontasmagorical", as you say.


> > > correctly - I think that &# requires unicode number (regardless
> > > selected charset),
> 
> Absolutely
> 
> > > and maybe in future it would be possible to
> > > generate 8-bit characters rather then entities.
>              ^^^^^^^^^^^^^^^^
> 
> Make that "coded characters", since the interesting character codings
> are not limited to 8-bit characters.  This already works, even in
> Netscape (at least on suitable platforms).

Yes; it's the "suitable platforms" limitation that deters me from
following this path.  ;-)

 
> > I haven't attempted to do this, mainly through lack of means
> > to do adequate testing, and also because  latin1  is the official
> > charset of HTML3.2  and  Unicode  is the charset for  HTML4.0 .
> 
> Unicode (well, iso-10646 to be pedantic) is the Document Character Set
> of HTML.  This was already advertised to be the way forward in RFC1866
> (HTML2.0).  Do not confuse this with the character coding (that
> unfortunately-named "charset" parameter on the Content-type header).
> 
> In HTML3.2 it's true that the only "charset" that clients were
> required to support was iso-8859-1, and this was assumed as default if
> none was specified.  But RFC2070 had given fair warning of how other 
> character codings ("charsets") are to be handled.

That is where  \usepackage[....]{inputenc}  and  -html_version ....,latin2,...
do their thing.


> In HTML4.0, technically there _is_ no default charset (curiously, this
> contradicts the HTTP specification), indeed there is no actual
> _requirement_ on client agents to even support iso-8859-1, although it
> would seem crazy to fail to support it!
> 
> > I did some of the work needed to recognise other input encodings,
> > and convert to Unicode; however it isn't complete (e.g. for Greek,
> > Arabic, Hebrew, etc.) 
> > What I'd like is for native-speakers to complete these modules.
> 
> I'm no expert in that field, but it's been my observation that many
> non-Latin writing systems are doing their HTML in various pragmatic
> ways that are quite tangential to the published HTML specifications.
> 
> It would seem a pity to crystallise those quasi-HTML techniques into
> l2h, just when browser makers are finally taking the steps that were
> drafted out for them back in RFC2070.
 
I have no enthusiasm for such a task, but if anyone else wants to
go that way, then their work can be included with the LaTeX2HTML
distribution, should they so desire.


> > Similarly, for translating back into specific 8-bit encodings,
> > that work should be done by someone with the need for it,
> > and the ability to do adequate testing.
> 
> Much of this work has already been done in related contexts, such as
> the SP package, the WDG's HTML validator at
> http://www.htmlhelp.org/tools/validator/ (in this regard, see
> http://www.htmlhelp.org/tools/validator/supported-encodings.html )
> and some Perl modules.

It is very easy to include such validation and/or post-processing
as part of the LaTeX2HTML translation.
Indeed with v99.2 it is simply a matter of setting a variable within
the new installation procedure; it may even be possible in v99.1.
For example, I automatically run  html-check  (basically  nsgmls )
on every page produced by LaTeX2HTML.
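
The post-processing step itself need be little more than something
along these lines (just a sketch: the SGML catalogs must be set up so
that  nsgmls  can find the relevant HTML DTD, and the real hook is the
variable in the installation procedure mentioned above):

     # run the validating parser over every generated page;
     # -s suppresses normal output, leaving only errors/warnings
     foreach my $page (<*.html>) {
         system("nsgmls -s $page") == 0
             or warn "validation problems in $page\n";
     }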

 
> I see quite a considerable discussion on the newsgroup
> netscape.public.mozilla.i18n about the browser dealing with
> nonstandard kinds of character representation in HTML.  Sending l2h
> down that path would be a considerable amount of work, and IMHO
> misguided, but, as I say, my interest is in the technology of
> representing characters, and I have very little live experience in
> using those non-Roman character systems, so consider me prejudiced ;-)
> 

I think some of it should be done, where there is a real need to translate
documents that are written with appropriate TeX/LaTeX installations. 
But this means implementing more than just the character coding
manipulations. There will be specific language/layout/ligature requirements
that require experienced users of the language to implement correctly;
that counts me out, though I'm willing to help as best I can.


All the best,

	Ross Moore