[pdftex] Consider removing dependence of PDF ID field on current directory name

Anders Kaseorg andersk at mit.edu
Sat Sep 2 07:52:43 CEST 2017


For the Debian Reproducible Builds effort, I’ve been debugging the 
nondeterministic behavior of pdftex when invoked by dblatex.  I think the 
last remaining issue involves the PDF ID field, which pdftex generates by 
hashing the current time and the full path to the input file (function 
printID).  Previous discussion 
(https://tug.org/pipermail/pdftex/2015-May/008940.html, 
https://tug.org/pipermail/pdftex/2015-July/008952.html) has led to support 
for the SOURCE_DATE_EPOCH environment variable, which nicely controls the 
time nondeterminism.  The output still depends on the input path, however.
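
To make the remaining dependence concrete, here is a rough Python sketch of an ID derived from a timestamp and the full input path.  It is not pdftex’s actual code: the function name sketch_pdf_id, the use of MD5, the timestamp format, and the example paths are all assumptions made for illustration.

```python
import hashlib
import os
import time


def sketch_pdf_id(input_path, env=None):
    """Hypothetical stand-in for printID: hash a timestamp plus the full input path."""
    env = os.environ if env is None else env
    # Use SOURCE_DATE_EPOCH when it is set; otherwise fall back to the wall clock.
    epoch = int(env.get("SOURCE_DATE_EPOCH", time.time()))
    stamp = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime(epoch))
    full_path = os.path.abspath(input_path)
    return hashlib.md5((stamp + full_path).encode("utf-8")).hexdigest().upper()


# Fix the time, then vary only the (made-up) temporary directory name.
env = {"SOURCE_DATE_EPOCH": "1504331563"}
id_a = sketch_pdf_id("/tmp/tmpAAAA/doc.tex", env)
id_a_again = sketch_pdf_id("/tmp/tmpAAAA/doc.tex", env)
id_b = sketch_pdf_id("/tmp/tmpBBBB/doc.tex", env)

print(id_a == id_a_again)  # True: same time, same path
print(id_a == id_b)        # False: same time, different directory
```

With the time pinned, the only thing left that changes the ID between the two runs is the directory component of the path.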

For many packages that’s sufficient, as Debian does not (yet?) require 
determinism under build path variation in its definition of 
reproducibility.  However, dblatex invokes pdflatex on generated input 
within a randomly named temporary directory.  That makes it hard for 
packages using dblatex to build reproducibly, even when the main build 
path is fixed, without resorting to per-package patches to remove the ID 
field.

Earlier it was mentioned that the algorithm used by pdftex’s printID was 
inspired by the section “File Identifiers” in the PDF Reference, which 
suggests hashing the time, pathname, file size, and document information 
dictionary.  However, a note in an appendix makes it clear that the 
particular algorithm is unimportant:
“Note that the calculation of the file identifier need not be 
reproducible; all that matters is that the identifier is likely to be 
unique.  For example, two implementations of this algorithm might use 
different formats for the current time, causing them to produce different 
file identifiers for the same file created at the same time, but the 
uniqueness of the identifier is not affected.”
(https://wwwimages.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/pdf_reference_1-7.pdf, 
Appendix H, implementation note 163)

With that in mind, could printID be changed to avoid depending on the 
current directory name, either by default, or if the default won’t be 
changed, then perhaps just when a reproducible build has been requested 
via the presence of SOURCE_DATE_EPOCH?
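
As one possible shape for the second option, here is a Python sketch that drops the directory from the hash only when SOURCE_DATE_EPOCH is present.  Again, this is an illustration and not a patch: the function name sketch_pdf_id_proposed, the use of MD5, the timestamp format, and the choice to keep the file’s base name in the hash are my assumptions.

```python
import hashlib
import os
import time


def sketch_pdf_id_proposed(input_path, env=None):
    """Hypothetical variant: ignore the directory when a reproducible build is requested."""
    env = os.environ if env is None else env
    source_date_epoch = env.get("SOURCE_DATE_EPOCH")
    if source_date_epoch is not None:
        # Reproducible build requested: fixed time, and only the base name is hashed.
        epoch = int(source_date_epoch)
        name = os.path.basename(input_path)
    else:
        # Default behaviour in this sketch: current time and the full path.
        epoch = int(time.time())
        name = os.path.abspath(input_path)
    stamp = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime(epoch))
    return hashlib.md5((stamp + name).encode("utf-8")).hexdigest().upper()


env = {"SOURCE_DATE_EPOCH": "1504331563"}
same_a = sketch_pdf_id_proposed("/tmp/tmpAAAA/doc.tex", env)
same_b = sketch_pdf_id_proposed("/tmp/tmpBBBB/doc.tex", env)
other = sketch_pdf_id_proposed("/tmp/tmpAAAA/other.tex", env)

print(same_a == same_b)  # True: the temporary directory no longer matters
print(same_a == other)   # False: different documents still get different IDs
```

Keeping the base name means two different documents built with the same SOURCE_DATE_EPOCH still receive different IDs, while the random temporary directory used by dblatex no longer affects the result.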

Anders



