[pdftex] Consider removing dependence of PDF ID field on current directory name
Anders Kaseorg
andersk at mit.edu
Sat Sep 2 07:52:43 CEST 2017
For the Debian Reproducible Builds effort, I’ve been debugging the
nondeterministic behavior of pdftex when invoked by dblatex. I think the
last remaining issue involves the PDF ID field, which pdftex generates by
hashing the current time and the full path to the input file (function
printID). Previous discussion
(https://tug.org/pipermail/pdftex/2015-May/008940.html,
https://tug.org/pipermail/pdftex/2015-July/008952.html) has led to support
for the SOURCE_DATE_EPOCH environment variable, which nicely controls the
time nondeterminism. This leaves the output depending on the input path.
For many packages controlling the time is sufficient, as Debian does not (yet?) require
determinism under build path variation in its definition of
reproducibility. However, dblatex invokes pdflatex on generated input
within a randomly named temporary directory. That makes it hard for
packages using dblatex to build reproducibly, even when the main build
path is fixed, without resorting to per-package patches to remove the ID
field.
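To illustrate the dependence described above, here is a minimal Python sketch. It is not pdftex's actual C code: the function name sketch_file_id, the use of MD5, the timestamp format, and the example paths are all assumptions chosen for illustration. It only shows that once the time is pinned, hashing the full input path still makes the ID vary with the temporary directory name.

```python
import hashlib


def sketch_file_id(timestamp: str, abs_path: str) -> str:
    """Hypothetical stand-in for printID: hash a time string plus the full input path."""
    digest = hashlib.md5()  # assumed hash; the point is only that the path is an input
    digest.update(timestamp.encode("utf-8"))
    digest.update(abs_path.encode("utf-8"))
    return digest.hexdigest().upper()


# Pretend SOURCE_DATE_EPOCH has already pinned the time component.
fixed_time = "20170902T055243Z"

# Two runs of the same build, differing only in the randomly named temp directory.
id_a = sketch_file_id(fixed_time, "/tmp/tmpab12cd/doc.tex")
id_b = sketch_file_id(fixed_time, "/tmp/tmpzz98yx/doc.tex")

print(id_a == id_b)
```

The two IDs differ even though the time and the file name are identical, which is the situation dblatex creates.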
Earlier it was mentioned that the algorithm used by pdftex’s printID was
inspired by the section “File Identifiers” in the PDF Reference, which
suggests hashing the time, pathname, file size, and document information
dictionary. However, a note in an appendix makes it clear that the
particular algorithm is unimportant:
“Note that the calculation of the file identifier need not be
reproducible; all that matters is that the identifier is likely to be
unique. For example, two implementations of this algorithm might use
different formats for the current time, causing them to produce different
file identifiers for the same file created at the same time, but the
uniqueness of the identifier is not affected.”
(https://wwwimages.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/pdf_reference_1-7.pdf,
Appendix H, implementation note 163)
With that in mind, could printID be changed to avoid depending on the
current directory name, either by default, or if the default won’t be
changed, then perhaps just when a reproducible build has been requested
via the presence of SOURCE_DATE_EPOCH?
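One possible shape of the second option, again as a hedged Python sketch rather than a patch against pdftex: the function name sketch_reproducible_id, the MD5 choice, and the decision to hash only the file's base name are assumptions made for illustration. When SOURCE_DATE_EPOCH is present in the environment, the directory part of the path is left out of the hash; otherwise the full path is hashed as before.

```python
import hashlib
import posixpath


def sketch_reproducible_id(timestamp: str, abs_path: str, environ: dict) -> str:
    """Hypothetical variant: ignore the directory when SOURCE_DATE_EPOCH is set."""
    if "SOURCE_DATE_EPOCH" in environ:
        # Reproducible build requested: hash only the file name (assumed policy).
        name = posixpath.basename(abs_path)
    else:
        # Default behaviour in this sketch: hash the full path.
        name = abs_path
    digest = hashlib.md5()
    digest.update(timestamp.encode("utf-8"))
    digest.update(name.encode("utf-8"))
    return digest.hexdigest().upper()


fixed_time = "20170902T055243Z"
path_a = "/tmp/tmpab12cd/doc.tex"
path_b = "/tmp/tmpzz98yx/doc.tex"

repro_env = {"SOURCE_DATE_EPOCH": "1504331563"}
plain_env = {}

same_when_requested = (
    sketch_reproducible_id(fixed_time, path_a, repro_env)
    == sketch_reproducible_id(fixed_time, path_b, repro_env)
)
same_by_default = (
    sketch_reproducible_id(fixed_time, path_a, plain_env)
    == sketch_reproducible_id(fixed_time, path_b, plain_env)
)
print(same_when_requested, same_by_default)
```

With this policy the ID is stable across temporary directories only when a reproducible build was requested, so the default uniqueness behaviour is untouched.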
Anders