[XeTeX] \(pdf)mdfivesum

Wed Jul 8 02:10:24 CEST 2015

Sorry to add yet another voice to the discussion.  I agree with
Apostolos Syropoulos that adding primitives to XeTeX should be
limited, but I disagree on other points.

On 7/2/15, Apostolos Syropoulos <asyropoulos at yahoo.com> wrote:
> So someone will step in and implement this primitive but then we
> will realize we need another primitive to handle the more advanced
> sha256. Programming languages have libraries for this and they do
> not modify the language to handle every new feature. So the best
> solution is to introduce some library mechanism that would make
> it possible to introduce new commands without affecting the kernel.

The difference is that programming languages provide access to the
filesystem and everything else the programmer might need.  Then
libraries can put these raw features together in nice abstractions.
XeTeX currently provides no safe* way to do various operations that
pdfTeX/LuaTeX allow.  The obstruction to implementing the md5 hash in
a package is that XeTeX provides no way to access bytes in a file: it
can only read files encoded in utf8.

*By "safe" I am excluding shell-escape, which would allow for
arbitrary code execution.

If it were possible to read a file's bytes, then implementing md5,
sha1, sha256 would be straightforward.  For this, I suggest the pdfTeX
primitive \pdffiledump, which expands to a hexadecimal representation
of some bytes in a file.  An identical primitive could safely be added
to XeTeX.  It would allow computing the md5 hash of a file while
being sure that this is indeed the same file as what XeTeX would \read
or \input: the PerlTeX approach cannot ensure this, as the path
searched by (Xe)TeX is different from that searched by Perl.  For
definiteness, here is the description of \pdffiledump from the pdfTeX
manual.

\pdffiledump [ offset ⟨number⟩ ] [ length ⟨number⟩ ] ⟨general text⟩
(expandable)
Expands to the dump of the file ⟨general text⟩ in uppercase hexadecimal
format (same as \pdfescapehex), starting at offset ⟨number⟩ or 0 with
length ⟨number⟩, if given. The first ten bytes of the source of this manual
are 2520696E746572666163. The primitive was introduced in pdfTeX 1.30.0.
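
To make the semantics concrete, here is a minimal sketch in Python (not
TeX) of what the description above says: take a byte range of a file and
return it as uppercase hexadecimal.  The function name file_dump and the
demo file contents are my own assumptions for illustration; only the
ten-byte hex string comes from the manual.  It decodes to the ASCII text
"% interfac".

```python
import os
import tempfile

def file_dump(path, offset, length):
    # Hypothetical stand-in for \pdffiledump: read `length` bytes
    # starting at `offset` and return them as uppercase hexadecimal.
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length).hex().upper()

# Demo file: assumed contents that merely begin with the ten bytes
# quoted in the manual; this is not the real manual source.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"% interface and some more bytes\n")
    demo_path = tmp.name

first_ten = file_dump(demo_path, 0, 10)
decoded = bytes.fromhex(first_ten).decode("ascii")
os.remove(demo_path)

print(first_ten)
print(decoded)
```

Because the result is plain ASCII hex digits, it stays readable to the
engine whatever the bytes in the file are.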

Adding this primitive settles the question of md5, sha1, sha256 hashes,
of reading back in _exactly_ a file that has been written by XeTeX,
and also IIRC of finding the bounding box in some eps images.  For
other "missing" primitives one should evaluate whether they are
implementable as library code, and how useful they are.

\pdfcreationdate : not sure how useful it is, perhaps for compliance
with some standards.

\pdfescapestring, \pdfescapename, \pdfescapehex, \pdfunescapehex :
implementable in TeX, and anyway it is unclear how chars >127 should
be treated.

\pdfuniformdeviate, \pdfnormaldeviate, \pdfrandomseed,
\pdfsetrandomseed : pseudo-randomness is implementable in TeX, but
perhaps better random numbers are needed for some uses.  It seems very
specific.
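
As an illustration of "implementable in TeX", here is a minimal sketch
(in Python, for brevity) of one generator that needs only the integer
operations TeX has.  The choice of the Park-Miller "minimal standard"
generator with Schrage's decomposition is my assumption; it is not what
the pdfTeX primitives use.  Every intermediate value stays below
2^31 - 1, which is also TeX's largest integer.

```python
M = 2147483647   # modulus 2**31 - 1, equal to TeX's maximum integer
A = 16807        # Park-Miller multiplier
Q = M // A       # 127773
R = M % A        # 2836

def next_state(seed):
    # Schrage's method: compute (A * seed) mod M without ever forming
    # the product A * seed, which would overflow a 32-bit integer.
    hi, lo = divmod(seed, Q)
    t = A * lo - R * hi      # always strictly between -M and M
    return t if t > 0 else t + M

# Published check for this generator: starting from seed 1,
# the state after 10000 steps is 1043618065.
state = 1
for _ in range(10000):
    state = next_state(state)
print(state)
```

The same steps translate to \divide, \multiply and \advance on count
registers; whether the quality is sufficient depends on the use.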

\pdffilemoddate : not strictly necessary.  \pdffilesize : might be
necessary for \pdffiledump.  \pdfmdfivesum : see this whole
discussion thread.
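
To show why a size and a dump would be enough for hashing, here is a
minimal sketch, again in Python rather than TeX.  The helpers file_size
and file_dump are hypothetical stand-ins for \pdffilesize and
\pdffiledump, the chunk size is arbitrary, and hashlib stands in for
what would be macro code in a package.

```python
import hashlib
import os
import tempfile

CHUNK = 64  # bytes requested per dump call; an arbitrary choice

def file_size(path):
    # Hypothetical stand-in for \pdffilesize.
    return os.path.getsize(path)

def file_dump(path, offset, length):
    # Hypothetical stand-in for \pdffiledump (uppercase hex of a range).
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length).hex().upper()

def md5_via_dump(path):
    # Walk the file in chunks, using the size to know where to stop,
    # and feed the bytes recovered from each hex dump to the hash.
    digest = hashlib.md5()
    size = file_size(path)
    for offset in range(0, size, CHUNK):
        hex_chunk = file_dump(path, offset, min(CHUNK, size - offset))
        digest.update(bytes.fromhex(hex_chunk))
    return digest.hexdigest().upper()

# Demo on arbitrary bytes, including values that are not valid UTF-8.
payload = bytes(range(256)) * 3
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    demo_path = tmp.name

via_dump = md5_via_dump(demo_path)
direct = hashlib.md5(payload).hexdigest().upper()
os.remove(demo_path)

print(via_dump == direct)
```

Swapping hashlib.md5 for hashlib.sha1 or hashlib.sha256 changes nothing
else in the loop, which is the point: one byte-access primitive serves
all of these hashes.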

So all in all, I'd be in favor of adding \pdffilesize and \pdffiledump
to XeTeX, and leaving the other primitives out, including the mdfive one.

Regards,

Bruno