[pdftex] experiment with tagged PDF

Ross Moore ross at ics.mq.edu.au
Tue Apr 29 21:51:53 CEST 2008

Hi Thierry, Thanh, and others,

On 29/04/2008, at 9:04 PM, Thierry Bouche wrote:
> Hi Neil, Thanh, & others,

> Given that there are no Unicode 3.0 math fonts around (or that
> not all math will be typeset with STIX hopefully anyway...), the
> characters string used to print math glyphs is useless for
> accessibility. Sometimes, the unicode character can be recovered from
> the glyph name in the font, or a ToUnicode if present. But not so often
> in our brave pdftex/CM paradigm.

> Indeed, to me (and most working mathematicians), the tex code is
> precisely the most portable, readable, useful fallback textual version
> for a math formula. It is even what we'd dream to copy-paste from a PDF
> (or HTML) with our today's working environment!

I'm working on exactly this approach.
Take a look at these PDF documents:


The fonts are just the traditional CM and AMS fonts
for the text and mathematics.
But try copy/paste of any of the text containing mathematics.

In the *-cmap.pdf you should get Unicode characters, so you'll
need to paste into an editor that supports these, e.g. via UTF-8.

In the *-mmap.pdf  you should get the TeX macro name of each
mathematical character or TeX construction.

The result can depend upon the PDF browser that is used for
the Copy part, as well as the rich-text capabilities of the
editor into which you Paste. For example, the attached image
shows what I get in the mmap case, using 2 different browsers.


This is implemented by simply including appropriately constructed
CMap resources using the /ToUnicode  hook.
Since it acts at just the character level, it isn't as useful
as tagging, which should handle each formula or snippet of mathematics
by supplying a textual representation such as MathML and/or TeX source.
(Copy/paste doesn't preserve structure such as superscripts.)
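
To make the character-level mechanism concrete, here is a minimal Python
sketch that writes out the text of a ToUnicode CMap for a single-byte font.
It is an illustration only, not the code used to build the PDFs mentioned
above: the function names, the CMap name "Sketch-UCS" and the sample font
slots are assumptions chosen for the example. Each font slot is mapped to a
UTF-16BE string, which may be one Unicode character (the cmap case) or a
whole TeX macro name (the mmap case).

```python
def utf16be_hex(text):
    """Hex digits of `text` encoded as UTF-16BE, as used in bfchar entries."""
    return text.encode("utf-16-be").hex().upper()


def tounicode_cmap(mapping, name="Sketch-UCS"):
    """Build the text of a ToUnicode CMap for a one-byte-per-glyph font.

    `mapping` takes font slots (0..255) to replacement strings.
    `name` is a hypothetical CMap name, used only for this sketch.
    """
    entries = [
        f"<{code:02X}> <{utf16be_hex(text)}>"
        for code, text in sorted(mapping.items())
    ]
    lines = [
        "/CIDInit /ProcSet findresource begin",
        "12 dict begin",
        "begincmap",
        "/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def",
        f"/CMapName /{name} def",
        "/CMapType 2 def",
        "1 begincodespacerange",
        "<00> <FF>",
        "endcodespacerange",
    ]
    # A single bfchar block may hold at most 100 entries, so split into chunks.
    for start in range(0, len(entries), 100):
        chunk = entries[start:start + 100]
        lines.append(f"{len(chunk)} beginbfchar")
        lines.extend(chunk)
        lines.append("endbfchar")
    lines += [
        "endcmap",
        "CMapName currentdict /CMap defineresource pop",
        "end",
        "end",
    ]
    return "\n".join(lines)


# Sample slots, assumed for illustration: 0x0B and 0x0C are taken to hold
# alpha and beta, as in the CM math italic layout.
unicode_style = tounicode_cmap({0x0B: "\u03b1", 0x0C: "\u03b2"})
macro_style = tounicode_cmap({0x0B: "\\alpha ", 0x0C: "\\beta "})
```

With the first mapping a viewer that honours /ToUnicode would paste the
Greek letters themselves; with the second it would paste the macro names.
Either way the mapping is per glyph, so it carries no information about how
the glyphs are arranged in the formula.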

It would be nice to have invisible begin/end-math delimiters that
are included in the copy/paste process. These could then be mapped
to $s, or \(....\)  and  \[....\] strings, to give extra usefulness.
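
As a sketch of what that mapping could look like on the receiving side,
suppose the pasted text carried invisible begin/end-math markers. Nothing
like this exists in the PDFs above; the four private-use code points and
the function name below are purely hypothetical, chosen only to show the
substitution step in Python.

```python
# Hypothetical marker characters (Unicode private-use area), assumed for
# this sketch only; no PDF viewer or TeX package is known to emit them.
BEGIN_INLINE = "\uE000"
END_INLINE = "\uE001"
BEGIN_DISPLAY = "\uE002"
END_DISPLAY = "\uE003"


def restore_delimiters(pasted, inline=("\\(", "\\)"), display=("\\[", "\\]")):
    """Replace the invisible markers in `pasted` with visible TeX delimiters."""
    table = {
        BEGIN_INLINE: inline[0],
        END_INLINE: inline[1],
        BEGIN_DISPLAY: display[0],
        END_DISPLAY: display[1],
    }
    # Characters that are not markers pass through unchanged.
    return "".join(table.get(ch, ch) for ch in pasted)


latex_style = restore_delimiters("so \uE000x^2\uE001 holds")
dollar_style = restore_delimiters("so \uE000x^2\uE001 holds", inline=("$", "$"))
```

Keeping the delimiter choice as a parameter lets the same pasted text be
turned into either $...$ or \(...\) form, whichever the target document
prefers.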

> Remember Knuth said 'math coding in tex is like telling formulas with a
> colleague over the phone'?
> So putting the tex code in alt is not necessarily appropriate to
> anyone, but it is the only fully textual human-readable format bearing
> unambiguous math of any level (up to author's macros)...

Sure; that's what LaTeX2HTML has been doing, with web-pages,
for the past decade.

> So many questions!

At least there is now the possibility of some useful answers.

> Th.



Ross Moore                                       ross at maths.mq.edu.au
Mathematics Department                           office: E7A-419
Macquarie University                             tel: +61 (0)2 9850 8955
Sydney, Australia  2109                          fax: +61 (0)2 9850 8114
