[accessibility] Current packages and methods of generating tagged PDF from LaTeX

Jason White jason at jasonjgw.net
Fri Jun 28 18:34:29 CEST 2019

From a user’s standpoint, reading the LaTeX source and reading an HTML or (in the future, preferably a tagged) PDF rendering are two quite distinct activities. It seems to me that there’s a need to support both.


This is true not only for proofreading one’s own manuscripts, but also for reading the work of others. Much of the journal literature consists, unfortunately, of untagged PDF at the moment – either scanned images, or unstructured text.


I am also of the view that there is a need to support multiple output formats from LaTeX input, including HTML/CSS/MathML/SVG as well as accessible PDF.


The suggestion of extracting the document tree structure (in effect, elements and attributes), text and images from tagged PDF and then converting it to other formats would be worth further exploration. In effect, the tagged PDF would hold a structured rendering of the original LaTeX document, suitable for conversion to other formats as needed – as suggested earlier in this discussion.


Developers of tools such as TeX4HT might well be interested. When last I checked, they were using DVI files as their intermediate format.


The problem of mathematical notation merits a separate discussion. The minimal approach would be to embed the original LaTeX source of the expression as alternative text in the PDF. So far as I know, there’s no analogue of MathML tags that PDF reading applications can process to make mathematics accessible.


Then there are chemistry, music, and probably other notations to consider as well – just as is true on the web in general. So the solutions ought to be generalizable, so far as is feasible.


From: accessibility <accessibility-bounces+jason=jasonjgw.net at tug.org> On Behalf Of easjolly at ix.netcom.com
Sent: Friday, June 28, 2019 12:00 AM
To: accessibility at tug.org
Subject: Re: [accessibility] Current packages and methods of generating tagged PDF from LaTeX


I don’t know very much about the technical aspects of this specific problem but I would like to present some of my thoughts that might lead to additional discussion.


My experience in dealing with creating and interconverting file formats in other contexts is that the standard advice is to first create a so-called neutral file that may or may not be one of the target formats.  Then each target format is produced from the neutral file. This approach is of course intended to reduce the number of converters needed and to simplify adding new target formats at a later time.
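
To illustrate the neutral-file idea, here is a minimal Python sketch. Everything in it is hypothetical: the `Node` tree stands in for whatever neutral representation is eventually chosen, and the two writers are toy examples rather than real converters. It only shows that each new target needs one writer from the neutral form, so the converter count grows as sources plus targets rather than sources times targets.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Node:
    """Hypothetical neutral representation: one node of a tagged tree."""
    tag: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def to_html(node: Node) -> str:
    # Toy writer: no escaping or attributes, just nested tags.
    inner = node.text + "".join(to_html(child) for child in node.children)
    return f"<{node.tag}>{inner}</{node.tag}>"


def to_plain_text(node: Node) -> str:
    # Toy writer: drop the tags and keep the text, one item per line.
    parts = [node.text] if node.text else []
    parts.extend(to_plain_text(child) for child in node.children)
    return "\n".join(part for part in parts if part)


# Adding a target format means adding one entry here, nothing else.
WRITERS: Dict[str, Callable[[Node], str]] = {
    "html": to_html,
    "text": to_plain_text,
}


def convert(doc: Node, target: str) -> str:
    """Produce the requested target format from the neutral tree."""
    return WRITERS[target](doc)


def converters_needed(n_sources: int, n_targets: int) -> Tuple[int, int]:
    """Return (direct pairwise converters, converters via a neutral file)."""
    return n_sources * n_targets, n_sources + n_targets
```

For example, with 3 source formats and 4 target formats, `converters_needed(3, 4)` gives 12 direct converters against 7 when going through a neutral file.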


I understand that here we are addressing the issue that a large percentage of documents are currently authored in some flavor of LaTeX and that this situation is unlikely to change. And it seems so far that tagged PDF is being considered as the best neutral format. If that is correct, then one question is whether it is really the best option?


One item that got me to thinking about other options is the solution described in the article “Creating PDF documents with accessible formulae” by Ahmetovic, et al. in TUGboat 39(3) p. 224. This article proposes a method for retaining in a created PDF document a hidden copy of the LaTeX source associated with each math expression that’s in the document so it is available for generating accessible formulae.


Now that digital storage seems virtually unlimited via the cloud, I wonder if another option for retaining the source would be some protocol for separate storage of the entire LaTeX source used to create a PDF file? This would seem, at the very least, to have the advantage that it could be done quickly and independently from improving tagging or developing a different neutral file format.


In the same issue of TUGboat on p. 173 there is an article by Shultz and Koch on file encoding and TeXShop. It points out the need to let LaTeX or another typesetting engine know which file encoding was used. Could something similar be used to tell a renderer where to access/store its source file? Of course the rendered document would also need a copy of the storage location. Could this be part of its metadata?
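
As a rough sketch of what such a metadata record might hold, the Python below builds a small dictionary with the source location, its encoding, and a checksum so a reader can confirm that the stored source matches the rendered document. The field names (`SourceLocation`, `SourceEncoding`, `SourceSHA256`) and the example URL are assumptions made up for illustration, not part of any existing PDF or TeX convention, and how the record would actually be written into the PDF is left open.

```python
import hashlib
import json


def source_record(source_text: str, location: str, encoding: str = "utf-8") -> dict:
    """Build a hypothetical metadata record pointing back to the LaTeX source."""
    data = source_text.encode(encoding)
    return {
        "SourceLocation": location,   # where the full source is stored
        "SourceEncoding": encoding,   # encoding used for the source file
        "SourceSHA256": hashlib.sha256(data).hexdigest(),  # integrity check
    }


def matches(record: dict, source_text: str) -> bool:
    """Check that a retrieved source file is the one the record describes."""
    data = source_text.encode(record["SourceEncoding"])
    return hashlib.sha256(data).hexdigest() == record["SourceSHA256"]


if __name__ == "__main__":
    tex = "\\documentclass{article}\n\\begin{document}Hello\\end{document}\n"
    record = source_record(tex, "https://example.org/papers/hello.tex")
    print(json.dumps(record, indent=2))
    print(matches(record, tex))
```

The checksum is there because a stored location alone cannot tell a reader whether the source was changed after the PDF was produced.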


Best wishes,

Susan Jolly


