[XeTeX] xe(la)tex to epub?

Khaled Hosny khaledhosny at eglug.org
Wed Aug 18 01:57:47 CEST 2010

On Wed, Aug 18, 2010 at 08:11:06AM +1000, Ross Moore wrote:
> Hi Khaled and Michiel,
> On 18/08/2010, at 6:58 AM, Khaled Hosny wrote:
> > On Tue, Aug 17, 2010 at 01:16:02PM -0700, Michiel Kamermans wrote:
> >> Khaled,
> >> 
> >>> AFAIK, EPUB is just a subset of xhtml with a subset of css2, so IMO not a kind of output format that is very well suited for TeX (well, I hardly consider html an output format at all, the output is what the browser renders out of it).
> >> For print media the epub format is, of course, nonsense. Hence the
> >> desire for parallel format generation.
> > 
> > I understand the benefits of EPUB, what I don't understand is the need
> > for TeX at all.
> To me the problem is not about using TeX for formatting,
> it is about obtaining different output formats from
> the same (La)TeX sources --- especially when math formulas,
> and other 2-dimensional layouts, are involved.
> Since ePub, and similar, are XML- or XHTML-based, you want the
> detailed structure of the tagging to be produced automatically,
> without having to make edits on each output result, to "get it right".
> You want to enter your information in just one place, in a language
> that the author already understands and can use effectively.
> Software should then do the rest, modulo possible minor tweaking 
> at the end.

If that is the case, I wouldn't start with TeX as the input format, but
with something else that is easier to parse with third-party tools to get
different output formats. XML is preferred by industry, and there are
structural XML-based formats like DocBook, with tools to convert them to
many output formats including HTML, LaTeX, and even EPUB.

However, if I were to do that myself, I'd even try something much simpler
like Markdown.
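
To make the "one simple source, several outputs" idea concrete, here is a
minimal sketch in Python. It is only an illustration under stated
assumptions: the toy markup (a "# " prefix for headings, blank lines
between paragraphs) and the names parse, to_xhtml, to_latex and
latex_escape are all made up for this example; it is not real Markdown
and not how any existing converter works.

```python
from html import escape

# Hypothetical toy markup: "# " starts a heading, blank lines separate blocks.
def parse(source):
    """Split the source into a list of (kind, text) blocks."""
    blocks = []
    for chunk in source.strip().split("\n\n"):
        # Join hard-wrapped lines of one block into a single line of text.
        text = " ".join(line.strip() for line in chunk.splitlines())
        if not text:
            continue
        if text.startswith("# "):
            blocks.append(("heading", text[2:]))
        else:
            blocks.append(("para", text))
    return blocks

def to_xhtml(blocks):
    """Render blocks as XHTML fragments (the kind of markup EPUB builds on)."""
    tags = {"heading": "h1", "para": "p"}
    return "\n".join(f"<{tags[k]}>{escape(t)}</{tags[k]}>" for k, t in blocks)

# Characters that need escaping in ordinary LaTeX text.
LATEX_SPECIALS = {
    "\\": r"\textbackslash{}", "&": r"\&", "%": r"\%", "$": r"\$",
    "#": r"\#", "_": r"\_", "{": r"\{", "}": r"\}",
    "~": r"\textasciitilde{}", "^": r"\textasciicircum{}",
}

def latex_escape(text):
    """Escape one character at a time so nothing gets escaped twice."""
    return "".join(LATEX_SPECIALS.get(ch, ch) for ch in text)

def to_latex(blocks):
    """Render the same blocks as LaTeX source."""
    out = []
    for kind, text in blocks:
        if kind == "heading":
            out.append(r"\section{" + latex_escape(text) + "}")
        else:
            out.append(latex_escape(text))
    return "\n\n".join(out)

if __name__ == "__main__":
    doc = parse("# Fonts & Scripts\n\nXeTeX handles 100% of\nUnicode.")
    print(to_xhtml(doc))
    print(to_latex(doc))
```

The point of the sketch is that the structure is parsed once into a
neutral list of blocks, and each output format is just another small
renderer over that list.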

> This is not just simply a matter of redefining macros, because the
> structure rules for the markup can be quite different for different
> output formats. So some kind of knowledge about what macros are being
> used for, and what kinds of things will follow after, is required 
> of any translation software. 
> Since LaTeX, processing to PDF as a major form of output, figures
> to be the comfortable input format, this is desirable for encoding
> the author's work --- though some may say it ought to be in XML.
> And since TeX already understands the expansion of macros and their 
> arguments, it is attractive to want to use it as a starting point
> for generating other formats; but certainly it cannot be the 
> whole shebang.

Trying to parse TeX input is something that I'd not try to do in my
right mind, but others have done that; PlasTeX seems to work nicely and
generates clean HTML. But since you lose all the visual formatting of
TeX, the remaining structural formatting is not worth the trouble: you
can get it with more parser-friendly formats.

> For instance, in my work for Tagged PDF, an XML version will be able
> to be exported (using Adobe Acrobat Pro) from the complete PDF.
> Mathematics will be fully tagged as MathML, in this view.
> Other PDF readers may only see the rendered pages, but others may
> be able to use the tagging to extract an alternative view suitable
> to their own display screen.
> > (X)HTML is dynamic by nature, you should be able to
> > resize or change text size and the layout will re-flow, forcing a rigid,
> > box based layout that is a direct translation of TeX output just does
> > not make much sense to me.
> I agree that it is not the TeX *output* that needs to be further 
> processed, but the input source --- or something intermediate 
> that can be generated and written to a file as a by-product 
> of LaTeX processing, with extra packages loaded to achieve this.
> TeX4ht works by putting extra information into the .dvi file,
> to encode the required tagging. An extra post-processor is required
> to extract this information, producing HTML or XML or whatever.
> That is very similar to what I do for Tagged PDF, where the 
> extra post-processor is Acrobat Pro. This is even more flexible
> than TeX4ht, since Acrobat can export into a range of formats,
> whereas TeX4ht only produces the format that was specified when 
> the .dvi was being created.

As I wrote above, if it is about the structural formatting, then it is
not worth the trouble; that can be achieved with almost every tool and
document format out there (even office suites can build structured
documents). It is the visual side, the precise output, where TeX excels,
and that is totally lost during such conversions.

This can be useful, however, if one has existing TeX material that needs
to be processed into another output format, though one can still argue
that converting it once to some sort of XML is a much better long-term
plan.

Don't get me wrong, I like TeX syntax and find it easier to author
with than many other markups, but I accept that it does not fit every


 Khaled Hosny
 Arabic localiser and member of Arabeyes.org team
 Free font developer
