[l2h] Various bugs and workarounds for v99.1 under Linux

Ross MOORE <ross@ics.mq.edu.au>
Thu, 15 Jul 1999 18:49:04 +1000 (EST)


Great stuff Julius,

Thanks for doing this work.
I'll comment immediately on some of the points you raise,
leaving others till when I can take a closer look at them.


> Here is a report on a few bugs and work-arounds after shaking out a few things using l2h-99.1 on a new Linux system.  The only thing that's requiring time-consuming workarounds is the extremely fragile caption-matching logic (which makes figure numbers disappear when there are any math or font-setting commands in the caption).
> 
See below for comments on captions.

> -------------------------------------------------------
> 
> File redirection fails in &syswait() under Windows NT:
> 
> I never resolved failing file redirection via ">" as in
> &syswait("... texexpand ... > file ") under Windows NT.  After

This I'll leave to the NT experts. Uli? Marek?

> spending a few hours and several email messages on the problem, the
> easiest workaround for me was to move all l2h work to my Linux system.
> I know others have gotten l2h working under NT, but some other
> reports indicate that headbutts are common.  Since l2h
> leverages UNIX tools so heavily, it appears NT is a second-class
> platform for l2h work.  Maybe Windows 2003 will run on top of
> Linux, in which case I'll try again!  :-)
> 


> I will now report a few problems using l2h 99.1
> under Red Hat Linux 6.0 (Intel hardware):
> 
> -------------------------------------------------------
> 
> Manual bug: "$ACCENT_IMAGES = 'large';" doesn't work:

You are supposed to give the name of a LaTeX command that takes an
argument, not a declaration, here; e.g. textrm ,
since the constructed code is then  \textrm{<the accented char>} .

Even so, it's not clear to me why \large doesn't work.
Would you please send me the images.log file from such an attempt?


>        $ACCENT_IMAGES = 'large';
> 
> should be something more like
> 
>        $ACCENT_IMAGES = 'simplemathrm';
> 
> Also, why is this variable not defined by default?  It seems to me it

 a. It is not clear, at least not to me, what should be the default.

 b. Having it empty by default isn't that bad an idea.
If you wrote LaTeX source without math, so expecting no images,
then this is what you'll get. It's not uncommon in general writing
or typesetting to omit accents when you don't know how to get them.
The warning message at the end tells you there is a possible problem,
but you can ignore it and still have a valid document.

If accented characters are common, then it is best to have proper
8-bit font-characters; either iso-8859-1 (latin1) or iso-8859-2 (latin2)
etc. or even Unicode, via utf-8 as the character set.
I've just completed modifications that will be in v99.2 that allow for
upper-plane 8-bit characters to be retained (if that was the input-encoding)
or generated by LaTeX2HTML for macros (e.g. \'a etc.) or with Babel
shortcuts such as "a .

With this new code, the need for images of accented characters is greatly
reduced. 

 c. images of accented characters should be exceptions, not the rule,
in any document --- otherwise the input encoding has not been well-chosen.
For example, the need may occur only within foreign language quotations,
in which case  $ACCENT_IMAGES = 'textit';  would be best.
Or maybe they occur only in a list of names; e.g. in the bibliography,
or in a list of authors --- 100+ is typical for CERN preprints ;-)

 d. when you are aware of the $ACCENT_IMAGES variable, and its limitations
(e.g. the style is the same for *all* accents made into images, no matter
what the context in which the need arose) then it is easy enough to make
the correct/best choice for it.
For less experienced users, unaware of how it should be used, it is
better not to make incongruous images but to warn that the accents have
been omitted.
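
To make the 8-bit alternative in (b) concrete, here is a rough sketch of a
source file set up that way. It assumes the file is saved as ISO-8859-1 and
that the standard inputenc and fontenc packages are loaded; the sample words
are only illustrations, not taken from your document.

```latex
\documentclass{article}
% Assumption: the source file is saved in ISO-8859-1 (latin1).
\usepackage[latin1]{inputenc}
% T1 gives LaTeX real accented glyphs in the printed version.
\usepackage[T1]{fontenc}
\begin{document}
% Accent macros; 8-bit characters typed directly in the declared
% encoding could be used here as well.
Caf\'e, na\"{\i}ve, \"Uber.
\end{document}
```

With an encoding declared like this, $ACCENT_IMAGES should only matter for
accents that have no character in the chosen set.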


> should be added to latex2html.config with a line or two of
> explanation.
 
I think it requires several paragraphs to explain properly.
$ACCENT_IMAGES is meant to be a last resort, fall-back option,
when other/better alternatives have failed or are not appropriate
for some reason.


> -------------------------------------------------------
> 
> Can't safely modify DESTDIR in latex2html.config:

Ouch; please not in  latex2html.config .
By all means use it in  .latex2html-init  files, for jobs
specific to a single person or directory.
Since all jobs read  latex2html.config  you are asking for trouble
with clashes of filenames (e.g. of images) if you set it there.

> I tried:
> 
> # -dir
> #$DESTDIR = '';          # Put the result in this directory 
> $DESTDIR = 'HTML';
> 
> However, this breaks section linking when compiling the l2h manual.
> Files like Ointernals.pl are written to manual/, while files like
> node1.html are written to HTML.  l2h then later complains that it
> can't find Ointernals.pl, etc.

OK; that's a complicated situation with a segmented document.
Since you need to control this from a Makefile, you really
should put all non-standard settings inside the Makefile,
and pass them on the command-line for each part of the entire job.
Doing it any other way would be really difficult to control;
portability would become almost impossible.


> -------------------------------------------------------
> 
> Pattern-matching failures in figure \caption s:
> 
> The first document below converts to HTML correctly, while the
> following three do not.  The failure in each case is that the figure
> number is lost in the caption.
> 
> \documentclass{article}
> \begin{document}
> \begin{figure}
> 	The figure.
> 	\caption{A winning caption.}
> \end{figure}
> \end{document}
> 
> \documentclass{article}
> \begin{document}
> \begin{figure}
> 	The figure.
> 	\caption{A losing caption with $math\; in\; it$.}
> \end{figure}
> \end{document}

The reason for this problem is easy --- it is LaTeX's fault,
not LaTeX2HTML's.  Look at the .aux file for this:

\@writefile{lof}{\contentsline {figure}{\numberline {1}{\ignorespaces A losing caption with $math\mskip \thickmuskip  in\mskip \thickmuskip  it$.}}{1}}


You must write your code as:

\caption{A losing caption with $math\protect\; in\protect\; it$.}

Rerun LaTeX on the document, to get in the .aux :

\@writefile{lof}{\contentsline {figure}{\numberline {1}{\ignorespaces A losing caption with $math\; in\; it$.}}{1}}

which now *will* match what LaTeX2HTML sees for the caption.


The problem is that LaTeX is allowing \; to be expanded before writing
into the .aux file --- it is not a "robust" command.
Why not ? 
Ask the LaTeX3 team --- probably because it is inherited from TeX,
and the expansion causes no trouble in LaTeX documents.

IMHO it is theoretically wrong for LaTeX not to make the line written
to .aux identical to the line that was originally supplied for the \caption
by the author, replacing only counter values and suchlike, and perhaps
replacing user-defined macros by their expansions.
However, given the nature of TeX as a programming language, this is not an easy
thing to do; so I can well understand why LaTeX does not behave this way at present.



> \documentclass{article}
> \begin{document}
> \begin{figure}
> 	The figure.
> 	\caption{A losing caption with a discretionary hy\protect\-phen in it.}
                                                         ^^^^^^^^
> \end{figure}
> \end{document}
> 
> \documentclass{article}
> \begin{document}
> \begin{figure}
>         The figure.
> 	\caption{\protect\small A caption with a font size set in it.}
                 ^^^^^^^^
> \end{figure}
> \end{document}

Same problem, same solution.
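
For reference, the three failing captions can be collected into one test
file, with \protect added in each case. This is only a sketch assembled from
the examples above; other fragile commands may need the same treatment.

```latex
\documentclass{article}
\begin{document}

\begin{figure}
  The first figure.
  % \; is fragile, so each occurrence is protected
  \caption{A caption with $math\protect\; in\protect\; it$.}
\end{figure}

\begin{figure}
  The second figure.
  % the discretionary hyphen \- is protected in the same way
  \caption{A caption with a discretionary hy\protect\-phen in it.}
\end{figure}

\begin{figure}
  The third figure.
  % likewise the font-size declaration
  \caption{\protect\small A caption with a font size set in it.}
\end{figure}

\end{document}
```

After running LaTeX on this, check that the \@writefile{lof} lines in the
.aux file no longer contain expansions such as \mskip \thickmuskip .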

> I don't understand why caption recognition for associating figure
> numbers is so fragile.  Perhaps there's a good reason for this, but
> perhaps there's another possible approach to this function?  Missing

LaTeX2HTML could keep an internal counter for this function, to be used
when a number could not be obtained from the .aux file.
However what if the two methods somehow got out of synchronisation?
This could easily happen in a segmented document, where only a portion
of the document is being processed in a given run.

It would also upset the property of LaTeX2HTML processing whereby
environments are handled independently of each other (except when
nested, of course). Thus an error which causes one figure environment
to be totally messed-up, or even omitted altogether, should not
affect subsequent correctly formed environments.
With an internal counter that failed to be incremented,
*all* subsequent figures would be affected.

When this topic was last discussed on this list, the consensus viewpoint
was strongly in favour of the LaTeX2HTML numbering agreeing with what
LaTeX produces --- e.g. browse a document on the web, with a LaTeX
printed version at hand; the numbered references should agree.


My own view is that you should not use numbered references at all within
electronic documents --- leave it all up to the active hyperlinks.
However I feel that I'm in a minority with this, and have created many
documents where numbering is essential, because you know that
your readers are going to want to print the document --- question/answer
sheets for student assignments, for example.


> figure numbers is a recurring source of headbutts for a lot of people.
> It is very common to want to include a little math or change something
> about the font in a figure caption.

It's easy enough, once you learn the trick.


> -------------------------------------------------------
> Manual bug:
> 
> The figure caption recognition for figure numbers should be at least
> documented.

A better description of how numbering is obtained could be useful, yes.


> I didn't obtain simple failing cases as above, but I also had
> figure-number failures with captions that started with a newline or a
> quoted newline and which were passed in as a macro argument. For
> example,
> 
>        \doFigure{theLabel}{theFigure.eps}{%
> The caption.}
> 
> or
>        
>        \doFigure{theLabel}{theFigure.eps}{
> The very long caption.}
>        
> These failures would not occur in the small test example I constructed
> above.

Try your larger examples using \protect in appropriate places.
Tell me if any fail.
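
As a starting point, something like the following might do. It assumes a
definition of \doFigure along the lines of the one you give further on,
placed in the preamble; the caption text, and the choice of \small and \;
as the fragile commands, are my own illustration since I haven't seen your
real captions.

```latex
\documentclass{article}
% Assumed definition of \doFigure, placed in the preamble.
\newcommand{\doFigure}[3]{%
  \begin{figure}
    #2
    \caption{#3}
    \label{#1}
  \end{figure}%
}
\begin{document}
% Caption passed as a macro argument, starting on a new line;
% the fragile commands inside it carry \protect.
\doFigure{theLabel}{theFigure}{%
\protect\small The very long caption, with $a\protect\; b$ in it.}
\end{document}
```

If the figure number still goes missing, compare the caption in the source
with the corresponding \@writefile{lof} line in the .aux file.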
 
> -------------------------------------------------------
> Cannot define a figure macro before starting the document:
> 
> Below, the first form works and the second does not.  It fails in a
> complicated way as if the macro definition is being processed as an
> actual invocation with literal arguments #1, #2, and #3.
> 
> \documentclass{article}
> \newcommand{\doFigure}[3]{
> 	\begin{figure}
> 	        #2
> 		\caption{#3}
> 		\label{#1}
> 	\end{figure}
> }
> \begin{document}
> \doFigure{theLabel}{theFigure}{theCaption}
> \end{document}
> 
> 
> \documentclass{article}
> \begin{document}
> \newcommand{\doFigure}[3]{
> 	\begin{figure}
> 	        #2
> 		\caption{#3}
> 		\label{#1}
> 	\end{figure}
> }
> \doFigure{theLabel}{theFigure}{theCaption}
> \end{document}
> 
> -------------------------------------------------------

It is inadvisable to make complicated definitions after the \begin{document}
command. 

Indeed, as a general style for LaTeX, any \newcommand definition that
is not going to be over-ridden later with \renewcommand  should be made
within the preamble. 

Think in terms of LaTeX being "markup", not programming code.
Then the information of your document comes after \begin{document}
and what comes before is mainly about organisation and presentation;
with some specialised information contained in macro definitions.

With such a clear distinction of roles, your document has the potential
to be correctly interpreted by a processing system not based on TeX.
In a sense, LaTeX2HTML is precisely such a system.

Mixing the organisation with the information just makes it that much harder
for the processing system to unravel what it is that needs to be presented,
and what can be discarded as irrelevant to the form of output.



> Last but not least, I should add that I'm getting great results after
> finding the needed workarounds.  L2h is truly an awesome perl script!  

Good adjective.

 
>      "Latex2html is so improbable, that if it did not exist, the
>      possibility would not be worth discussing."

I like it.  ;-) ;-)

 

You must realise that TeX is a page-description language.
The order in which things appear in the .dvi file is irrelevant,
provided it appears in the correct place on the printed page.

HTML is not like this at all.
Order and correct nesting of the pieces is of paramount importance,
if browsers are to be able to create a reasonable page-layout.

LaTeX is a hybrid language, in which the "markup" ideas of HTML,
SGML and earlier systems are expressed using the TeX language.
However most LaTeX users see LaTeX as a programming language,
which it most definitely is not --- at least it isn't complete in
any sense; you need TeX for that.

LaTeX2HTML attempts to capture the user's intention with the markup,
and tries to do the best it can with any programming constructions.

Is it any wonder that the Perl script needs to be so *awesome* ?



All the best,

	Ross Moore