[tex4ht] oolatex odt needs cleanup

Martin Weis martin.weis.newsadress at gmx.de
Sat Jun 5 19:40:04 CEST 2010

Dear tex4ht-community!

I find tex4ht a very useful tool, and I use oolatex especially often.
However, I found some unexpected behaviour/markup in the resulting odt
file. Can somebody explain it?
My version of tex4ht is: tex4ht.c (2009-01-31-07:33 kpathsea)

Here is a minimal example to demonstrate:

 author = {First Author and Second Authorsname},
 title = {Title of the cited article},
 journal = {Journal of Test},
 year = {2010},
 volume = {20},
 number = {1},
 pages = {23-42}


\title{Title of the article}
\author{First Author\thanks{Affiliation of first author} \and Second
Authorsname\thanks{Affiliation of second author}}
This is the introduction, we refer here to the conclusions in
section~\ref{sec:conclusion}. And we would like to cite \cite{author2010}.

Our most used equation~\ref{eq:pythagoras} was invented by Phytagoras.
which can also be written as $c=\sqrt{a^{2}+b^{2}}$.

Since we did not introduce anything in section~\ref{sec:intro}, there
are not many conclusions to find here.



In the resulting odt there are links (for \ref and \cite commands), but
each is followed by an additional space. The same applies to inline
formulas. In the example:
> This is the introduction, we refer here to the conclusions in section
> 2 . And we would like to cite [1] .
 --^                             --^
> Our most used equation 1  was invented by Phytagoras.
> which can also be written as [formula] .

In content.xml (unzip the odt with unzip -d odt_unzipped example.odt)
the following XML snippet can be found (with the original line breaks,
sorry for the long lines):

> <text:p text:style-name="First-line-indent">   Our most used equation<text:s/>1<!--tex4ht:ref: eq:pythagoras 
> --><text:span text:style-name="reference-ref"><text:reference-ref text:ref-name="x1-1001r1" text:reference-format="text"> </text:reference-ref></text:span> was invented by Phytagoras. </text:p> 
> <table:table table:style-name="equation"><table:table-column table:style-name="equ-col"/> 
> <table:table-column table:style-name="equ-num-col"/> 
> <table:table-row><table:table-cell table:style-name="equ-cell"><text:p text:style-name="equ-p"><text:reference-mark text:name="x1-1001r1"> </text:reference-mark>
> <!--l. 34
> --><draw:frame draw:name="mobj-4" draw:style-name="mml-display" draw:z-index="0" text:anchor-type="paragraph"><draw:object xlink:actuate="onLoad" xlink:href="./odtclean-m4" xlink:show="embed" xlink:type="simple"/></draw:frame> </text:p></table:table-cell> 
> <table:table-cell table:style-name="equ-num-cell"><text:p text:style-name="equ-num-p">(1)</text:p></table:table-cell></table:table-row></table:table>
> <!--l. 37
> --><text:p text:style-name="Like-Text-body">
> which can also be written as <!--l. 38
> --><draw:frame draw:style-name="mml-inline" draw:name="mobj-5" text:anchor-type="as-char" draw:z-index="0"><draw:object xlink:href="./odtclean-m5" xlink:type="simple" xlink:show="embed" xlink:actuate="onLoad"/></draw:frame> .
>    </text:p> 

where the spaces can be found between the "text:reference-ref" tags:
text:reference-format="text"> </text:reference-ref>
and after
</draw:frame> .

There might be some more (e.g. before 'Our' and </text:p>), but these
seem to be interpreted well at least by OpenOffice.org.

I use this sed script to clean up the content.xml:

#!/bin/sed -f
# cleanup the spaces for refs
s#text:reference-format="text"> <#text:reference-format="text"><#g
# cleanup the additional spaces after formula frames (inline and display math)
s#</draw:frame> #</draw:frame>#g
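
A minimal sketch of how the two substitutions above might be applied, assuming a POSIX shell with sed available. The function name clean_content_xml and the sample string are hypothetical and only serve the illustration; the sample is modelled on the content.xml snippet quoted earlier.

```shell
#!/bin/sh
# Hypothetical helper: reads XML on stdin and applies the same two
# substitutions as the sed script above, writing the result to stdout.
clean_content_xml() {
    sed -e 's#text:reference-format="text"> <#text:reference-format="text"><#g' \
        -e 's#</draw:frame> #</draw:frame>#g'
}

# Made-up sample line, modelled on the content.xml snippet above.
sample='<text:reference-ref text:ref-name="x1-1001r1" text:reference-format="text"> </text:reference-ref> was <draw:frame/></draw:frame> .'
printf '%s\n' "$sample" | clean_content_xml

# Possible round trip (untested assumption, left commented out):
#   unzip -d odt_unzipped example.odt
#   clean_content_xml < odt_unzipped/content.xml > content.tmp
#   mv content.tmp odt_unzipped/content.xml
#   cd odt_unzipped
#   zip -0 -X ../example_clean.odt mimetype        # mimetype first, stored uncompressed
#   zip -r -X ../example_clean.odt . -x mimetype   # then everything else
```

The second substitution removes the space after every closing draw:frame tag, so it affects inline and display formulas alike. The repacking step is only a guess at a workable procedure; the mimetype entry is stored first and uncompressed there because the ODF package format expects that.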

Additionally the footnotes of the authors affiliation are at the wrong

If anybody can explain or change this behaviour, I would be glad.

Martin Weis

