[pdftex] Tagged PDF generated from LaTeX using unpatched pdftex
Ross Moore
ross.moore at mq.edu.au
Tue Nov 22 03:52:22 CET 2016
Hi all,
I’ve recently returned to tackle the task of generating Tagged PDF using pdftex,
in particular for PDF 2.0, PDF/A-1a and PDF/UA format specifications.
This is using ordinary pdftex, not the one with extra primitives specially for the
tagging structures.
So far I’ve had more success than I’d originally thought possible.
The attached document below is an example that fully conforms to PDF/A-1a.
It also passes all the Accessibility tests in Acrobat Pro DC.
However, there are some smallish issues that would be good to have addressed.
Using \pdfliteral for the tagging, to insert material before and after textual content,
the page-content stream looks like:
>>> 36 0 obj
>>> <<
>>> /Length 2718
>>> >>
>>> stream
>>> 1 0 0 1 108.737 686 cm
>>> /T <</MCID 0 >> BDC
>>> 1 0 0 1 -108.737 -686 cm
>>> BT
>>> /F15 10.9091 Tf 108.737 686 Td [(Here)]TJ/F17 1 Tf( )Tj/F15 10.9091 Tf 25.788 0 Td [(is)]TJ/F17 1 Tf( )Tj/F15 10.9091 Tf 10.97 0 Td [(a)]TJ/F17 1 Tf( )Tj/F15 10.9091 Tf 9.091 0 Td [(paragraph)]TJ/F17 1 Tf( )Tj/F15 10.9091 Tf 52.182 0 Td [(b)28(y)]TJ/F17 1 Tf( )Tj/F15 10.9091 Tf 15.151 0 Td [(itself.)]TJ
>>> ET
>>> 1 0 0 1 252.586 686 cm
>>> EMC
>>> …
Note
1. the use of interword spaces between words;
2. the coordinate space adjustments prior to the tags and the BT textual content:
1 0 0 1 108.737 686 cm
/T <</MCID 0 >> BDC
1 0 0 1 -108.737 -686 cm
BT
Further down this becomes excessive, for just a single line of text with styling:
>>> 1 0 0 1 -179.576 -10.095 cm
>>> /T <</MCID 4 >> BDC
>>> 1 0 0 1 -108.737 -563.655 cm
>>> BT
>>> /F15 10.9091 Tf 108.737 563.655 Td [(And)]TJ/F17 1 Tf( )Tj/F15 10.9091 Tf 23.94 0 Td [(another)]TJ/F17 1 Tf( )Tj/F15 10.9091 Tf 40.03 0 Td [(with)]TJ/F17 1 Tf( )Tj/F15 10.9091 Tf 24.849 0 Td [(some)]TJ
>>> ET
>>> 1 0 0 1 224.889 563.655 cm
>>> EMC
>>> /T <</MCID 5 >> BDC
>>> 1 0 0 1 -224.889 -563.655 cm
>>> BT
>>> /F16 10.9091 Tf/F17 1 Tf( )Tj/F16 10.9091 Tf 224.889 563.655 Td [(b)-32(old)]TJ/F17 1 Tf( )Tj/F16 10.9091 Tf 28.227 0 Td [(text)]TJ
>>> ET
>>> 1 0 0 1 275.245 563.655 cm
>>> EMC
>>> /T <</MCID 6 >> BDC
>>> 1 0 0 1 -275.245 -563.655 cm
>>> BT
>>> /F15 10.9091 Tf/F17 1 Tf( )Tj/F15 10.9091 Tf 275.245 563.655 Td [(.)]TJ
>>> ET
>>> 1 0 0 1 283.124 563.655 cm
>>> EMC
The length of the output can be reduced (by approx 15–20%) using \pdfliteral direct ….
But there is a drawback, since BT … ET and BDC … EMC operators
must be correctly nested, else the PDF is malformed.
viz.
>>> 36 0 obj
>>> <<
>>> /Length 2232
>>> >>
>>> stream
>>> …
>>> ...
>>> BT
>>> /F17 1 Tf 108.737 563.655 Td [( )]TJ
>>> /T <</MCID 4 >> BDC
>>> /F15 10.9091 Tf 0 0 Td [(And)]TJ/F17 1 Tf( )Tj/F15 10.9091 Tf 23.94 0 Td [(another)]TJ/F17 1 Tf( )Tj/F15 10.9091 Tf 40.03 0 Td [(with)]TJ/F17 1 Tf( )Tj/F15 10.9091 Tf 24.849 0 Td [(some)]TJ
>>> EMC
>>> /T <</MCID 5 >> BDC
>>> /F16 10.9091 Tf/F17 1 Tf( )Tj/F16 10.9091 Tf 27.333 0 Td [(b)-32(old)]TJ/F17 1 Tf( )Tj/F16 10.9091 Tf 28.227 0 Td [(text)]TJ
>>> EMC
>>> /T <</MCID 6 >> BDC
>>> /F15 10.9091 Tf/F17 1 Tf( )Tj/F15 10.9091 Tf 22.129 0 Td [(.)]TJ
>>> EMC
>>> ET
Note here that \pdffakespace is used immediately before the first \pdfliteral direct;
otherwise one gets incorrect nesting, as in:
>>> /T <</MCID 4 >> BDC
>>> BT
>>> /F15 10.9091 Tf 0 0 Td [(And)]TJ/F17 1 Tf( )Tj/F15 10.9091 Tf 23.94 0 Td [(another)]TJ/F17 1 Tf( )Tj/F15 10.9091 Tf 40.03 0 Td [(with)]TJ/F17 1 Tf( )Tj/F15 10.9091 Tf 24.849 0 Td [(some)]TJ
>>> EMC
>>> /T <</MCID 5 >> BDC
>>> ...
However that “fake space” is viewed as content, for Accessibility purposes.
It must therefore be within tags — but that cannot be achieved this way.
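The nesting constraint discussed above can be checked mechanically. Here is a minimal Python sketch, not part of pdftex, that scans an uncompressed content stream and reports whether BT … ET and BDC … EMC are properly nested. The function name nesting_ok is hypothetical, and the tokenizer is an assumption that only copes with simple (non-nested) string literals such as those in the streams shown here.

```python
import re

# Literal strings such as (Here) or ( ); nested parentheses are NOT handled.
STRING = re.compile(r"\((?:\\.|[^\\()])*\)")
# Operators that open or close a text object or a marked-content sequence.
OPERATOR = re.compile(r"(?<![/\w])(BT|ET|BDC|BMC|EMC)(?!\w)")


def nesting_ok(stream):
    """Return True if BT/ET and BDC|BMC/EMC pairs are properly nested."""
    stack = []
    # Blank out string contents first so text like (BT) is not seen as an operator.
    for op in OPERATOR.findall(STRING.sub("()", stream)):
        if op == "BT":
            if "BT" in stack:          # text objects cannot be nested
                return False
            stack.append("BT")
        elif op in ("BDC", "BMC"):
            stack.append("MC")
        elif op == "ET":
            if not stack or stack.pop() != "BT":
                return False
        else:                          # EMC
            if not stack or stack.pop() != "MC":
                return False
    return not stack                   # everything opened must be closed
```

Applied to the streams above, the first two layouts (tags around BT … ET, or tags wholly inside one BT … ET) pass, while the fragment that starts with BDC before BT and then reaches EMC inside the text object is rejected.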
So here are my requests.
1. please add a new mode to \pdfliteral
e.g. \pdfliteral text {….}
which checks whether we have pdf_doing_text as true.
If so, just do what \pdfliteral direct does;
otherwise do
pdf_print_ln("BT");
pdf_doing_text := true;
then place the contents literally.
When used correctly, textual content would follow, without needing to change pdf_doing_text
nor include the initial “BT”.
Presumably there’ll need to be an adjustment to
procedure pdf_begin_text; {begin a text section}
to not do pdf_print_ln("BT"); when pdf_doing_text is already true.
2. It would be great to be able to do away with the \pdfinterwordspaceon/off for every word.
That is, generate shorter output with explicit spaces (when the font has it in slot 32 ) such as:
/T <</MCID 4 >> BDC
/F15 10.9091 Tf 0 0 Td [(And )<num>(another )<num>(with )<num>(some )]TJ
EMC
where each <num> is calculated using the width of the space character in the font.
Not only does this reduce the (uncompressed) size considerably, but it would also
allow for the “Reflow” effect in Adobe Reader and Acrobat Pro (and other?) PDF readers.
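To make request 1 concrete, here is a small Python model of the behaviour being asked for. It is only an illustration, not pdftex's WEB source: the class ContentStream and its methods literal_text, show_text and end_text are hypothetical names, and doing_text stands in for pdf_doing_text.

```python
class ContentStream:
    """Toy model of a page content stream with a pdf_doing_text-style flag."""

    def __init__(self):
        self.lines = []
        self.doing_text = False   # plays the role of pdf_doing_text

    def begin_text(self):
        # Emit BT only when no text object is open (the adjustment suggested
        # for pdf_begin_text in request 1).
        if not self.doing_text:
            self.lines.append("BT")
            self.doing_text = True

    def end_text(self):
        if self.doing_text:
            self.lines.append("ET")
            self.doing_text = False

    def literal_text(self, material):
        # Proposed "\pdfliteral text": open a text object if needed,
        # then place the material literally, as "direct" would.
        self.begin_text()
        self.lines.append(material)

    def show_text(self, operators):
        # Ordinary textual content; reuses an already open text object.
        self.begin_text()
        self.lines.append(operators)
```

Calling literal_text("/T <</MCID 4 >> BDC"), then show_text(...), then literal_text("EMC") and finally end_text() yields BT, BDC, the text, EMC, ET in that order, so the tag always lands inside the text object without any need for \pdffakespace.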
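For request 2, each <num> could be computed roughly as in the Python sketch below. It assumes the usual TJ convention (numbers are in thousandths of a text-space unit and are subtracted from the pen position) and that character spacing, word spacing and horizontal scaling are at their defaults. The function names and the sample widths in the usage note are hypothetical, not values taken from F15.

```python
def tj_adjustment(space_width, glue_pt, font_size):
    """TJ number to place after a space glyph.

    space_width: width of the font's space glyph, in 1/1000 text-space units
    glue_pt:     interword distance that TeX actually set, in points
    font_size:   size given to Tf, in points
    The glyph advances space_width/1000 * font_size; the TJ number removes
    the excess (or, when negative, adds the shortfall) to reach glue_pt.
    """
    return space_width - 1000.0 * glue_pt / font_size


def pdf_escape(text):
    """Escape backslashes and parentheses for a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def build_tj(words, glues_pt, space_width, font_size):
    """Build one TJ operator with an explicit space after each word."""
    parts = []
    for word, glue in zip(words, glues_pt):
        parts.append("(" + pdf_escape(word) + " )")
        num = round(tj_adjustment(space_width, glue, font_size))
        if num != 0:              # no adjustment needed at natural width
            parts.append(str(num))
    return "[" + "".join(parts) + "]TJ"
```

For instance, with a hypothetical space glyph of width 250 at size 10, a 5pt gap needs the number -250 after the space, while a 2.5pt gap matches the natural width and needs no number at all.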
All the best,
Ross
Dr Ross Moore
Mathematics Dept | Level 2, S2.638 AHH
Macquarie University, NSW 2109, Australia
T: +61 2 9850 8955 | F: +61 2 9850 8114
M:+61 407 288 255 | E: ross.moore at mq.edu.au
http://www.maths.mq.edu.au
-------------- next part --------------
A non-text attachment was scrubbed...
Name: tag-sample-valid.pdf
Type: application/pdf
Size: 49877 bytes
Desc: tag-sample-valid.pdf
URL: <http://tug.org/pipermail/pdftex/attachments/20161122/937ce490/attachment-0001.pdf>