[luatex] LuaTeX picky about internal PDF encoding, breaks self-hosted embedded documents

Fri Mar 27 10:57:36 CET 2020

tl;dr: The {filecontents} environment only writes UTF-8 encoded files,
so PDFs in any other encoding cannot be included by luatex, which is
more pedantic about PDF byte offsets (xref etc.) than pdftex.

Hi LuaTeX devs,

Strange issue here; it took me some time to find out where the “error”
(if any) is. LuaTeX is a lot more pedantic about PDFs adhering to
standards than pdfTeX. This is not necessarily a bad thing, as nobody
expects broken documents (bad xref tables/stream lengths) to work well
with any program. I certainly don't want to request that luatex imitate
pdftex's liberal interpreter (don't make it too easy for folks like me
to manually edit PDF files). But I seem to have found a corner case
where this does make a difference, and I just want to make sure that
I'm not totally on the wrong track with that:
The {filecontents} environment “embeds” plaintext documents into LaTeX
and writes them into a new file. I tried to use this for shipping some
self-contained PDFs inside the document itself. This worked for pdftex,
but not for luatex. It turned out that {filecontents} always writes
UTF-8 files, but the copy-pasted (unpacked, so largely ASCII) PDFs
contain an "%âãÏÓ" encoding safeguard (?) comment (second line, see
attachment). In UTF-8 this comment is longer than luatex expects from
the PDF's original iso8859-1 encoding: each of the four non-ASCII
characters takes two bytes instead of one, so every later byte offset
shifts. (To make things worse, the difference is invisible in some
“intelligent” editors and diff tools.) So, LuaTeX will die with:

internal error: unknown image type
!  ==> Fatal error occurred, no output PDF file produced!
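
To make the length difference concrete, here is a minimal Python sketch
(not part of the attached MWE). It only assumes that the comment line is
exactly "%âãÏÓ" and compares how many bytes it occupies in each encoding:

```python
# The comment found on the second line of the PDFs in question.
marker = "%âãÏÓ"

latin1_bytes = marker.encode("latin-1")  # one byte per character
utf8_bytes = marker.encode("utf-8")      # two bytes per non-ASCII character

# Number of bytes by which everything after the comment is displaced.
shift = len(utf8_bytes) - len(latin1_bytes)

print("latin-1:", latin1_bytes.hex(" "), f"({len(latin1_bytes)} bytes)")
print("utf-8:  ", utf8_bytes.hex(" "), f"({len(utf8_bytes)} bytes)")
print("offset shift:", shift)
```

With this comment alone, anything the xref table points to after the
second line would sit 4 bytes later than recorded.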

Since there is no way (at least I didn't come up with one) to manually
specify the output encoding of {filecontents}, or to trick the PDF
input drivers of luatex into reading PDFs with a different encoding
than usual, this makes embedding self-hosted PDFs in LuaTeX impossible
whenever they were created in an encoding other than UTF-8. Most
probably there are other kinds of encoding-sensitive data that someone
might embed via {filecontents}, since this has worked in pdftex for
ages (at least in conjunction with filecontents.sty).
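
For what it's worth, the transcoding itself looks reversible outside of
TeX. The following is a minimal Python sketch, assuming the written file
is valid UTF-8 and contains only characters that exist in latin-1. The
helper name restore_latin1 is made up for illustration, and how such a
step could be hooked in between {filecontents} and \includegraphics is
left open here:

```python
def restore_latin1(data: bytes) -> bytes:
    """Map UTF-8 encoded text back to one byte per character (latin-1).

    Raises UnicodeDecodeError if data is not valid UTF-8 and
    UnicodeEncodeError if it contains characters outside latin-1.
    """
    return data.decode("utf-8").encode("latin-1")


# Simulate what happens to the first lines of a PDF: the original bytes
# are read as latin-1 text and written back out as UTF-8.
original = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n1 0 obj\n"
as_written = original.decode("latin-1").encode("utf-8")

assert as_written != original                   # the file got longer
assert restore_latin1(as_written) == original   # and can be mapped back
```

This would not help with PDFs whose streams contain arbitrary binary
data, which cannot be pasted into {filecontents} in the first place.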

Find attached a tarball with an MWE: it contains the PDFs for two
\includegraphics, and a third one is created while the document is
compiled. Obviously, pdftex can cope with both the “right” latin1 and
the “wrong” UTF-8 PDFs, but luatex cannot. Which engine is closer to
“ideal” behaviour? Or did I overlook something important?

Best,
Johannes
-------------- next part --------------
A non-text attachment was scrubbed...
Name: mwe_pdf_encoding.tar
Type: application/x-tar
Size: 10240 bytes
Desc: not available
URL: <https://tug.org/pipermail/luatex/attachments/20200327/68fc7572/attachment.tar>