[pdftex] pdftex deterministic?

Tue Apr 3 12:40:07 CEST 2007

On Sun, 2007-04-01 at 21:46 +0200, Thanh Han The wrote:
> I have made a script to compare 2 pdfs, which I use to
> quickly test whether a new version of pdftex produces
> different output from the previous version. The script uses
> gs or pdftoppm to generate all pages of the pdfs as bitmaps,
> then compare them page by page using diff. If they are
> different, they are further compared by the 'compare' tool
> from ImageMagick. If the difference is larger than a
> threshold, an image showing the differences is shown.
> 
> If anyone is interested in the script I will send it.
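
A minimal sketch of the page-by-page comparison described above, not
Thanh's actual script (which is not shown in this thread). It assumes
pdftoppm is on the PATH; the function names, the "page" file prefix and
the 72 dpi default are made up for illustration. Pages are compared
bytewise only.

```python
import filecmp
import subprocess
from pathlib import Path


def rasterize(pdf: Path, out_dir: Path, dpi: int = 72) -> None:
    """Render every page of `pdf` into `out_dir` as page-N.ppm via pdftoppm."""
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["pdftoppm", "-r", str(dpi), str(pdf), str(out_dir / "page")],
        check=True,
    )


def differing_pages(dir_a: Path, dir_b: Path) -> list:
    """Return page image names that exist on one side only or differ bytewise."""
    names_a = {p.name for p in dir_a.glob("*.ppm")}
    names_b = {p.name for p in dir_b.glob("*.ppm")}
    changed = sorted(names_a ^ names_b)  # pages present in only one PDF
    for name in sorted(names_a & names_b):
        # shallow=False forces a content comparison instead of a stat check
        if not filecmp.cmp(dir_a / name, dir_b / name, shallow=False):
            changed.append(name)
    return sorted(changed)
```

The pages reported by differing_pages() would then be the candidates for
the thresholded check with ImageMagick's 'compare', which is left out here.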

Hi Thanh,

I suppose this could be a very time-consuming operation
when you compare huge PDFs. In my experience,
even a smart diff (diffing two files while excluding known
differences induced by e.g. the filename, the IDs, etc.)
can take many times longer than matching checksums
(by a factor that can exceed 20).

Using checksums is the cheapest way of comparing two files
(even cheaper than a byte comparison, once the reference
checksum is stored) and gives you near-absolute
confidence in the result.
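
As a rough illustration of the checksum comparison, here is a minimal
sketch. It assumes SHA-256 (no algorithm is named in this thread) and a
digest stored from a known-good run, so that only the new file has to be
read; file_digest() and same_output() are hypothetical names.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def same_output(candidate: Path, reference_digest: str) -> bool:
    """Compare a fresh build against the digest of a known-good build."""
    return file_digest(candidate) == reference_digest
```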

The only drawback is that you have to slightly modify the driver
to make this possible (e.g. fixing the ID differences),
but this could be an explicit option given to pdftex to let
it strip runtime info and focus on the content. For
final production files you simply rerun pdftex
with this option turned off.
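
Until such an option exists, a post-processing workaround is conceivable.
The sketch below is not a pdftex feature: it blanks the run-dependent
entries (dates and /ID) in memory before hashing. It assumes those
entries are stored uncompressed in the file; the patterns and the name
normalized_digest() are illustrative only, and the PDF itself is never
rewritten.

```python
import hashlib
import re

# Run-dependent entries, each replaced by a fixed placeholder (assumption:
# they appear as plain text, not inside a compressed object stream).
VOLATILE = [
    (re.compile(rb"/(CreationDate|ModDate)\s*\(D:[^)]*\)"), rb"/\1 (D:0)"),
    (
        re.compile(rb"/ID\s*\[\s*<[0-9A-Fa-f]*>\s*<[0-9A-Fa-f]*>\s*\]"),
        rb"/ID [<00> <00>]",
    ),
]


def normalized_digest(pdf_bytes: bytes) -> str:
    """Hash the PDF bytes after replacing dates and IDs with placeholders."""
    for pattern, replacement in VOLATILE:
        pdf_bytes = pattern.sub(replacement, pdf_bytes)
    return hashlib.sha256(pdf_bytes).hexdigest()
```

Two runs of the same source should then give the same digest as long as
nothing else in the file depends on the run.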

I ask this because in my environment I need to compare
the source LaTeX document with the one produced by my production system,
and since I need the fastest comparison, I require (for now) PS checksum
equivalence in order to detect regressions.

I will surely need to do the same with PDFs once I move to pdftex
(and I would much prefer to have an official pdftex option ;))

Is this feasible, or does it clash with any PDF specs?

Thanks in advance,

-m

> 
> Thanh
> 
> On Sat, Mar 31, 2007 at 02:04:48PM -0300, George N. White III wrote:
> > On 3/31/07, Geoffrey Alan Washburn <geoffw at cis.upenn.edu> wrote:
> >
> > >         I could swear I had read something about this in the past, but I
> > > couldn't remember the correct keywords to find anything via search.  In
> > > any event, I recently wanted to make some changes to the source of a
> > > document and to make sure that these changes did not actually affect the
> > > document I tried diffing the before and after PDFs.  Unfortunately,
> > > after some further experimentation it does not seem that even repeated
> > > runs of the same document produce identical output.  Is there any way I
> > > can modify my documents or the parameters to pdftex to produce identical
> > > output on identical inputs?  I realize this very well may not be
> > > possible, and if so, what alternatives do people use in practice?  Thanks!
> >
> > Acrobat has several types of side-by-side comparisons. There are other
> > commercial tools, (one advantage of using .pdf format is that there
> > are lots of people using it, so you can draw on general-purpose tools
> > from outside the TeX community) but I have no experience with them:
> >             <http://www.zizasoft.com>, <http://www.docucomp.com/>
> >
> > Did you try "diff --text" (e.g., with diff from GNU diffutils 2.8.1)?
> >
> > Dates and some generated "ID" are stored in the pdf file so a plain
> > "diff" always says: "Binary files 1/foo.pdf and 2/foo.pdf differ".
> > If you use "diff --text 1/foo.pdf 2/foo.pdf" you should get
> > something like:
> >
> > 405,406c405,406
> > < /CreationDate (D:20070324132126-03'00')
> > < /ModDate (D:20070324132126-03'00')
> > ---
> > > /CreationDate (D:20070331131842-03'00')
> > > /ModDate (D:20070331131842-03'00')
> > 436c436
> > < /ID [<E1671B8E332FB4F759BF968FAE32724A> <E1671B8E332FB4F759BF968FAE32724A>] >>
> > ---
> > > /ID [<FB02C8CC764462DED0414047DF118FBC> <FB02C8CC764462DED0414047DF118FBC>] >>
> >
> > pdftool (from Artifex fitz <http://ccxvii.net/apparition/>) can
> > extract individual objects for analysis after diff identifies a
> > problem.  Another approach is to compare rasterized pages using image
> > differences.
> >
> > It is worth the effort to get pdftool (and apparition works well on
> > machines that get bogged down by acroread).  I don't know if any
> > linux distro has binaries, but everything but the jbig2dec library is
> > widely available in linux package form.
> >
> >   -------------- (from the README) ---------------
> > PREREQUISITES
> >
> >  Before compiling Fitz you need to install third party dependencies.
> >
> >    zlib
> >    libjpeg
> >    libpng
> >    freetype2
> >    expat
> >
> >  There are a few optional dependencies that you don't strictly need.
> >  You will probably want the versions that Ghostscript maintains.
> >
> >     jbig2dec
> >     jasper
> >
> >  Fitz uses the Perforce Jam build tool. You need the Perforce version 2.5
> >  or later. Earlier versions (including the FTJam fork) have crippling bugs.
> >  Boost Jam is not backwards compatible. If you do not have a compiled
> >  binary for your system, you can find the Jam homepage here:
> >    <http://www.perforce.com/jam/jam.html>
> >               -------------------------------------------------------------------------------
> >
> > --
> > George N. White III <aa056 at chebucto.ns.ca>
> > Head of St. Margarets Bay, Nova Scotia
> > _______________________________________________
> > pdftex mailing list
> > pdftex at tug.org
> > http://tug.org/mailman/listinfo/pdftex