[pdftex] experiment with tagged PDF

Wed Apr 30 22:30:17 CEST 2008

On Tue, Apr 29, 2008 at 4:04 AM, Thierry Bouche <thierry.bouche at ujf-grenoble.fr> wrote:

> Hi Neil, Thanh, & others,
>
> N> For the math part, make sure you tag the math as "formula".  Ideally,
> N> you should tag each subexpression with the appropriate MathML element
> N> name (eg, "mfrac" for fractions), but at the very least, add a "tex"
> N> attribute to "formula" and include the TeX string.
>
> I think this is really something we are missing today but I am not sure
> I understand the implications: Would this help searching using tex code
> inside the formulas? Would this be solely exposed to nonvisual PDF screen
> reader, which would select what kind of alternative text they consume
> based on a format-type attribute?
>
> In this case, is it foreseen that any tex-aware screen reader will ever
> exist?

I think Paul Topping addressed the above, so I'll just deal with the rest:

>
>
> Given that there are no Unicode 3.0 math fonts around (or that
> not all math will be typeset with STIX hopefully anyway...), the
> characters string used to print math glyphs is useless for
> accessibility. Sometimes, the unicode character can be recovered from
> the glyph name in the font, or a ToUnicode if present. But not so often
> in our brave pdftex/CM paradigm.

You probably know more PDF details than I do, so I hope it is not
presumptuous to quote the PDF 1.7 Reference on this (10.7.1, page 820):

> Tagged PDF requires that every character code in a document can be mapped to
> a corresponding Unicode value. Unicode defines scalar values for most of the
> characters used in the world's languages and writing systems, as well as
> providing a private use area for application-specific characters.
> Information about Unicode can be found in the Unicode Standard, by the
> Unicode Consortium (see the Bibliography).
>
> The methods for mapping a character code to a Unicode value are described
> in Section 5.9.1, "Mapping Character Codes to Unicode Values." Tagged PDF
> producers should ensure that the PDF file contains enough information to
> map all character codes to Unicode by one of the methods described there.
>

This is a basic requirement because if the data (characters) is meaningless,
no amount of tagging of structure will make the actual text useful.  If you
do map the characters to Unicode, then assistive technology, search, and
even copy/paste can use this information and function properly.  There
have been some efforts at "OCR" of math in PDF, and the lack of this
information has been one of the biggest sources of errors, so even if
tagging isn't done, adding this info is extremely useful for trying to
recover information out of PDF. It seems like Ross Moore has a good solution
for this part and I hope he gets it into the code.
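
To make the character-code-to-Unicode mapping concrete, here is a rough
Python sketch of what the "bfchar" section of a ToUnicode CMap might look
like for a single-byte font.  This is only an illustration, not Ross Moore's
solution and not anything pdftex does internally; the function name and the
two example codes (which I am assuming correspond to alpha and beta in the
Computer Modern math italic layout) are my own choices, and a real CMap
needs the surrounding header and codespace range as well.

```python
def to_unicode_bfchar(mapping):
    """Build the bfchar section of a ToUnicode CMap.

    `mapping` maps single-byte character codes to Unicode strings.
    Assumption: the table is small; long tables have to be split into
    several bfchar sections, which this sketch does not do.
    """
    lines = [f"{len(mapping)} beginbfchar"]
    for code in sorted(mapping):
        # The destination is written as UTF-16BE hex digits, so characters
        # outside the BMP come out as surrogate pairs.
        utf16 = mapping[code].encode("utf-16-be").hex().upper()
        lines.append(f"<{code:02X}> <{utf16}>")
    lines.append("endbfchar")
    return "\n".join(lines)


# Illustrative codes only (assumed positions of alpha and beta).
example = {0x0B: "\u03b1", 0x0C: "\u03b2"}
print(to_unicode_bfchar(example))
```

With a table like this attached to the font, a consumer that finds code 0B
in the page content can recover U+03B1 instead of a meaningless byte.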

> Does the tagging infrastructure in
> pdftex's patch go as far as trying to match each printed glyph, math or
> text, to a unicode char? Would it allow for using external processes
> such as tralics that would be fed with a constant header, and the tex
> string of the formula, so that it could be possible to add a Formula
> tag with pMathML content and tex source in alt? (which seems to me the
> best we can hope for accessibility and functionality, unless I am
> completely misdirected)?
>
> N> You could also add an "alt" attribute
> N> to "formula" that contains the TeX, but as "alt" is meant to be human
> N> readable, it is questionable whether TeX is really appropriate there.
>
> Indeed, to me (and most working mathematicians), the tex code is
> precisely the most portable, readable, useful fallback textual version
> for a math formula. It is even what we'd dream to copy-paste from a PDF
> (or HTML) with our today's working environment!
>
> Remember Knuth said 'math coding in tex is like telling formulas with a
> colleague over the phone'?
>
> So putting the tex code in alt is not necessarily appropriate to
> anyone, but it is the only fully textual human-readable format bearing
> unambiguous math of any level (up to author's macros)...
>

Putting it in the alt field is certainly better than nothing.  As a TeX
user, it is pretty easy to *visually* read and understand (if it is short).
But for someone not familiar with TeX, it is not so obvious, and listening
to a screen reader say "backslash f r a c open curly brace x close curly
brace open curly brace 2 a close curly brace" for "\frac{x}{2a}" is not
really useful.  Some people have tried to make it sound friendlier by
mapping "\frac" to "fraction" in a screen reader, but that still leaves a
lot of room for improvement.  Because of this, it is problematic to
recommend TeX as the textual description of the math for those who can't
see it.  If the TeX is present somewhere, at least a machine can
understand it, modulo some restrictions (eg, it uses only predefined macros
of major packages, doesn't drop into text mode, etc), so it can be
processed into something more usable.
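
To illustrate the difference, here is a small Python sketch of the two
readings: spelling the TeX out character by character, and substituting a
spoken name for a known control word.  It does not model any actual screen
reader; the function names and the one-entry macro table are made up for
the example.

```python
import re

CHAR_NAMES = {"\\": "backslash", "{": "open curly brace", "}": "close curly brace"}
MACRO_NAMES = {"frac": "fraction"}  # hypothetical, deliberately tiny table


def spell_out(tex):
    """Read a TeX string character by character, skipping whitespace."""
    return " ".join(CHAR_NAMES.get(ch, ch) for ch in tex if not ch.isspace())


def friendlier(tex):
    """Speak known control words by name; spell out everything else."""
    words = []
    # A token is either a control word (backslash plus letters) or a
    # single non-space character.
    for token in re.findall(r"\\[A-Za-z]+|\S", tex):
        name = token[1:]
        if token.startswith("\\") and name in MACRO_NAMES:
            words.append(MACRO_NAMES[name])
        else:
            words.append(spell_out(token))
    return " ".join(words)


print(spell_out(r"\frac{x}{2a}"))
print(friendlier(r"\frac{x}{2a}"))
```

Even the second reading still announces every brace, which is the room for
improvement mentioned above.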

Neil Soiffer
Senior Scientist
Design Science, Inc.
www.dessci.com
~ Makers of Equation Editor, MathType, MathPlayer and MathFlow ~