[tex4ht] How to get PDF's page numbers in html output? (Accessibility issue)
hesitz at gmail.com
Fri Dec 23 00:41:25 CET 2011
Susan Jolly <easjolly at ...> writes:
> This poster's question is significant for accessibility. Braille, large
> print, speech, and other accessible versions of print editions typically use
> the page numbers of the (base) print (paged media) edition to allow users of
> accessible documents to communicate with each other and with users of the
> print edition. While I appreciate that the concept of "page number" is
> somewhat meaningless when using eReaders, the accessibility community has
> not AFAIK addressed alternative solutions. So at least in the foreseeable
> future this is a capability that accessible media producers need.
Good point, which I hadn't thought of. My own query is driven by a slightly
different but related need: an academic setting where students may be using
ebook, html, and/or pdf versions. Without some kind of location-based counter
common to the text of all versions, there's no good way for users of
different versions to refer to the location of a particular passage.
The counter doesn't need to be the pdf page number, but that's an
already-existing counter that makes sense. Whatever counter is used, it must be
present in all versions.
In looking further at tex4ht, I'm not sure that merely having the ability to
insert a counter at page breaks would solve this problem. I have .tex files that
I process to PDF using pdflatex, and which I process with tex4ht's htlatex to
get html.
The problem I see is that tex4ht alters the formatting in the process of
generating the html. tex4ht first compiles the document to an intermediate dvi,
then uses that dvi to generate the html. I had expected the pagination of the
dvi file to correspond to the pagination of the pdf generated by pdflatex.
Unfortunately, the pagination does not necessarily match. I'm not sure what
formatting changes tex4ht makes as part of compiling to dvi (besides disabling
headers and footers, which would not necessarily affect pagination). So merely
having the ability to hook in and insert a page counter for each new dvi page
would not necessarily give pagination markers that correspond to the PDF.
I wonder whether there are some optional settings in tex4ht that would make the
dvi pagination match (or even closely match) the pagination in the PDF.
I see a non-tex4ht-related way to generate the page numbers I want in the html,
but it's not trivial. Basically, the steps would be these:
Given a .tex document:
1. Generate a PDF using pdflatex.
2. Generate html using htlatex.
3. Use a utility like pdftohtml to extract the text from the PDF generated in
step 1. Assuming the PDF has page numbers in the header or footer, the html
from pdftohtml will have those as part of the text. (The html generated by
pdftohtml lacks much of the formatting you get from tex4ht, though, so it's
not likely a good solution in itself.)
4. Parse the html from pdftohtml to find page numbers. For each page number
found, find the first text occurring on that page. Then search the tex4ht
html for that text and insert the page counter at the appropriate spot.
Repeat for each page marker found in the pdftohtml html.
Step 4 above is non-trivial (greatly complicated by the fact that html markup
can occur anywhere in the text), but I think it's doable in a way that would
work well for most documents, especially documents that are primarily text
(i.e., no figures, tables, or images).
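To make step 4 concrete, here is a rough Python sketch of the matching part only.
It is not an existing tool, and the names text_projection and
insert_page_markers are made up for illustration. It assumes the (page number,
first words on the page) pairs have already been pulled out of the pdftohtml
output. It uses only the standard library rather than BeautifulSoup, so it
runs without extra packages. To cope with tags falling in the middle of the
text, it compares text with all tags and whitespace stripped, and keeps a map
back to positions in the tex4ht html. It does not decode entities such as
&amp;amp; and does not skip text inside <title> or <script>, so treat it as a
starting point.

```python
def text_projection(html):
    """Return (text, offsets): the characters of `html` that lie outside
    tags, with all whitespace dropped, plus each character's index in `html`."""
    chars, offsets = [], []
    in_tag = False
    for i, ch in enumerate(html):
        if ch == "<":
            in_tag = True
        elif ch == ">" and in_tag:
            in_tag = False
        elif not in_tag and not ch.isspace():
            chars.append(ch)
            offsets.append(i)
    return "".join(chars), offsets


def insert_page_markers(html, page_starts,
                        marker='<span class="pdf-page" id="page-{n}">[p. {n}]</span>'):
    """Insert a marker before the text that starts each PDF page.

    `page_starts` is a list of (page_number, first_words) in page order.
    The marker format is an assumption; use whatever markup suits you.
    """
    text, offsets = text_projection(html)
    inserts = []
    search_from = 0
    for number, first_words in page_starts:
        needle = "".join(first_words.split())  # drop whitespace, as in text
        pos = text.find(needle, search_from) if needle else -1
        if pos == -1:
            continue  # skip pages we cannot place rather than guess
        inserts.append((offsets[pos], marker.format(n=number)))
        search_from = pos + len(needle)
    # Apply from the end so earlier offsets stay valid.
    for offset, snippet in reversed(inserts):
        html = html[:offset] + snippet + html[offset:]
    return html


if __name__ == "__main__":
    sample = "<p>First page text. Second <b>pa</b>ge begins here.</p>"
    starts = [(1, "First page"), (2, "Second page begins")]
    print(insert_page_markers(sample, starts, marker="[[{n}]]"))
    # prints: <p>[[1]]First page text. [[2]]Second <b>pa</b>ge begins here.</p>
```

Searching forward from the previous match keeps the markers in page order and
reduces false hits when the same phrase recurs later in the document. Using
more words per page start makes a wrong match less likely, though hyphenation
introduced by the PDF line breaking would still defeat this simple comparison.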
Does anyone know whether there is a publicly available solution for this? I
would probably write it in Python using the BeautifulSoup html api; I wonder
whether something like this is already available on github or elsewhere.
Or maybe there actually is some way to get tex4ht to (1) generate dvi with
pagination that corresponds to PDF pagination, and (2) include a page counter in
the html that corresponds to the PDF pages.
-- Herb Sitz