[pdftex] does TEX destroy valid PDF syntax ?

hr01 at arcor.de
Tue Apr 6 18:19:52 CEST 2010


     hi there,

maybe a kind soul out there can give me some help or advice w.r.t. a problem i'm having with
TEX while reading certain PDF files. unfortunately, i'm by no means an expert in either TEX
or PDF...

for testing i've already tried to reduce the environment to an absolute minimum. originally
i intended to merge several PDF files into one big file and also set certain viewing attributes.
however, to reproduce the problem it's sufficient to let TEX read in a single small PDF file.
we've done this on LINUX (SUSE) as well as on Windows XP (MIKTEX) with identical results.

i've prepared a single white page which contains just the word "Test" at the top. this page
was scanned with a CANON Lide 700 (grey, 600 dpi) with OCR active (for searching the PDF file).
the PDF output file produced by the CANON software can successfully be viewed and searched
using ADOBE ACROBAT reader (V9.3.1).

now i'm reading this file with a very small TEX program to produce a new output PDF. this
one can no longer be viewed by ACROBAT due to error #135:

  "dictionary keys must be direct name objects."

it looks as if exactly this error had already been observed by others in 2004 but hasn't yet
been fixed:

  http://groups.google.com/group/comp.text.tex/browse_thread/thread/8ea09b9ce119b00b/9c8b16685e817202?hl=en&ie=UTF-8&oe=UTF-8&q=pdflatex+"invalid+other+resource"#

during processing of the CANON PDF file, TEX issues the following warning:

  pdfTeX warning: pdflatex (file C:/Tmp/CANON_test.pdf): PDF inclusion: invalid
  other resource which is no dict (key 'Subtype', type <stream>); copying it anyway.

however, in my eyes this warning isn't really justified, i can't determine any errors in the
CANON output (see below). when i use 3 other packages from the web for merging PDF files there
are no such warnings. moreover, all corresponding outputs are perfectly readable and searchable
with ACROBAT.

in order to find out exactly why ACROBAT complains i've analyzed the PDF output of CANON as
well as the one produced by TEX (i had never looked into a PDF file before, so please forgive
my "amateurish" description...).

the problem arises only when the OCR (optical character recognition) info is included. after
switching off this feature in the CANON scan software all is fine; however, in this case the
output, of course, isn't searchable.

the relevant part in the CANON output PDF looks like this:

---------------
11 0 obj <<
/Type /Page
/MediaBox [ 0 0 595.20 841.80 ]
/Parent 3 0 R
/Resources <<
  /Font << /F3 6 0 R >>
  /XObject << /Obj4 4 0 R >>
  /Subtype 7 0 R
  /ProcSet [ /PDF /Text /ImageB /ImageC /ImageI ]
  >>
/Contents [ 9 0 R 10 0 R 5 0 R ]
>>
endobj

7 0 obj
<< /Length 8 0 R >>
stream
Test

endstream
endobj

8 0 obj
5
endobj 
---------------

the value of the key "/Resources" is a "dictionary", delimited by "<<" and ">>".
inside a dictionary there must be pairs of "key" and "value", where each "key" must be a direct
name object (e.g. "/Myobject"). here the 1st key is "/Font"; its value is again a dictionary
"<< /F3 6 0 R >>".

now comes the interesting part: the value of the key "/Subtype" is a reference to the object
with number 7 ("7 0 R"). object #7 is defined as a stream object which contains the string
"Test" (that's the OCR info necessary for searching).

a stream object must start with a dictionary (here "<< /Length 8 0 R >>"), followed by the
keyword "stream", then the stream data, and is terminated by the keyword "endstream".
"/Length" refers to the small object #8 which defines the length 5 (length of "Test" plus
a linefeed).
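
as a rough illustration of this length bookkeeping (a sketch only, not part of the attached
test files): the small python check below takes the text of a stream object and a declared
length and tests whether "endstream" really follows the data. the helper names are
hypothetical, and the exact bytes of object #7 are an assumption reconstructed from the
listing above (the blank line is taken to be the end-of-line before "endstream", which is
not counted in "/Length").

```python
# hypothetical helpers for illustration; not part of pdfTeX or any PDF library

def skip_one_eol(data: bytes, pos: int) -> int:
    """Return pos advanced past at most one end-of-line marker."""
    if data.startswith(b"\r\n", pos):
        return pos + 2
    if data.startswith((b"\n", b"\r"), pos):
        return pos + 1
    return pos


def length_is_consistent(obj: bytes, declared: int) -> bool:
    """Check that 'endstream' follows exactly `declared` bytes of stream data."""
    # the data starts after the keyword "stream" and its end-of-line marker
    start = skip_one_eol(obj, obj.index(b"stream") + len(b"stream"))
    # one optional end-of-line may sit between the data and "endstream"
    end = skip_one_eol(obj, start + declared)
    return obj.startswith(b"endstream", end)


# assumed byte layout of object #7 from the CANON file (reconstructed from the listing)
OBJ7 = b"7 0 obj\n<< /Length 8 0 R >>\nstream\nTest\n\nendstream\nendobj\n"

print(length_is_consistent(OBJ7, 5))
```

with the bytes assumed here, the declared length 5 from object #8 is consistent with the
stream data "Test" plus one linefeed.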

in my opinion all appears fine here. now let's look what TEX has done to it; here are the
relevant parts in the output PDF produced by TEX:

---------------
1 0 obj <<
/Type /XObject
/Subtype /Form
/FormType 1
/Resources <<
 /Font << /F3 7 0 R >>
 /XObject << /Obj4 8 0 R >>
 /Subtype << /Length 9 0 R >>
stream
Test
endstream
 /ProcSet [ /PDF /Text /ImageB /ImageC /ImageI ]
 >>
>>
...
endobj

9 0 obj
5
endobj
---------------

TEX doesn't keep the valid key-value pair in the /Resources dictionary: while the CANON
PDF contains a correct pair "/Subtype 7 0 R", TEX incorrectly replaces this object reference
by the object's definition, which destroys the valid dictionary syntax:

---------------
 /Subtype << /Length 9 0 R >>
stream
Test
endstream
---------------

now ACROBAT interprets "<< /Length 9 0 R >>" as the value for key "/Subtype". syntactically
that's still OK but it doesn't make sense; then the next keyword "stream" is taken as the
next key in the dictionary, but a key must be a direct name object, which is no longer the
case here. at this point ACROBAT throws the error #135 and can't display the page.
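
to make this reasoning concrete, here is a minimal python sketch of such a key check. it is
only an illustration under simplifying assumptions: the tokenizer is hypothetical, ignores
PDF strings and comments, and is certainly not how ACROBAT or TEX actually parse a file.
the two byte strings are the /Resources dictionaries copied from the listings above.

```python
import re

# very rough tokenizer: "<<", ">>", "[", "]", names like /Font, and everything else
TOKEN = re.compile(rb"<<|>>|\[|\]|/[^\s<>\[\]/]*|[^\s<>\[\]/]+")


def tokenize(data: bytes) -> list:
    return TOKEN.findall(data)


def skip_value(toks: list, i: int) -> int:
    """Skip one value starting at toks[i]; return the index just after it."""
    tok = toks[i]
    if tok == b"<<":
        return skip_dict(toks, i)
    if tok == b"[":
        i += 1
        while toks[i] != b"]":
            i = skip_value(toks, i)
        return i + 1
    # indirect reference such as "7 0 R"
    if (tok.isdigit() and i + 2 < len(toks)
            and toks[i + 1].isdigit() and toks[i + 2] == b"R"):
        return i + 3
    return i + 1


def skip_dict(toks: list, i: int) -> int:
    """Skip a dictionary starting at toks[i]; raise if a key is not a name."""
    i += 1  # step over "<<"
    while toks[i] != b">>":
        key = toks[i]
        if not key.startswith(b"/"):
            raise ValueError(key)
        i = skip_value(toks, i + 1)
    return i + 1


def first_bad_key(data: bytes):
    """Return the first token used as a key that is not a name, or None."""
    try:
        skip_dict(tokenize(data), 0)
    except ValueError as err:
        return err.args[0]
    return None


CANON = b"""<<
  /Font << /F3 6 0 R >>
  /XObject << /Obj4 4 0 R >>
  /Subtype 7 0 R
  /ProcSet [ /PDF /Text /ImageB /ImageC /ImageI ]
>>"""

TEX = b"""<<
 /Font << /F3 7 0 R >>
 /XObject << /Obj4 8 0 R >>
 /Subtype << /Length 9 0 R >>
stream
Test
endstream
 /ProcSet [ /PDF /Text /ImageB /ImageC /ImageI ]
>>"""

print(first_bad_key(CANON))
print(first_bad_key(TEX))
```

in this sketch the CANON dictionary passes, while the TEX dictionary fails at the token
"stream", which sits where a key is expected.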

now that i know exactly what goes wrong i'm able to repair the corrupt TEX output file by hand
(re-define the value of "/Subtype" to be an object reference). then the error 135 goes away
and the document can be viewed correctly.

i've included a small ZIP file containing the TEX source, the CANON PDF and the TEX output
PDF. the observed behavior should be reproducible.

now my question to you: would you agree that the problem is located within TEX ?  if so,
what are the chances of having it fixed ?  on the other hand, if you should spot any errors
in the CANON output PDF i'd be happy if you could give me some explanations so that i can
effectively request a fix from CANON.

thanks very much for your help and patience !

cheers -

   Herbert





