[luatex] LuaTeX picky about internal PDF encoding, breaks self-hosted embedded documents

Fri Mar 27 12:12:51 CET 2020

On 3/27/2020 10:57 AM, Johannes Hielscher wrote:
> tl;dr: The {filecontents} environment only writes UTF-8 encoded files,
> so differently encoded PDFs cannot be referenced by luatex, being more
> pedantic about PDF byte offsets (xref etc.) than pdftex.
> 
> Hi LuaTeX devs,
> 
> Strange issue here, for which I needed some time to find out where the
> “error” (if any) is. LuaTeX is a lot more pedantic about PDFs adhering
> to standards than pdfTeX. This is not necessarily a bad thing, as nobody
> expects broken documents (bad xref tables/stream lengths) to work well
> with any program. I totally don't want to request that luatex imitated
> pdftex's liberal interpreter (don't make it too easy for folks like me
> to manually edit PDF files). But I seemingly found a corner case where
> this indeed makes a difference, and I just want to make sure that I'm not
> totally on the wrong track with that:
> The {filecontents} environment “embeds” plaintext documents into LaTeX
> and writes them into a new file. I tried to use this for shipping some
> self-contained PDFs for within the document. This did work for pdftex,
> but not for luatex. Turned out that {filecontents} always writes UTF-8
> files, but in the copy-pasted (unpacked, so largely ASCII) PDFs, there
> is an "%âãÏÓ" encoding safeguard (?) comment (second line, see attach-
> ment). This has a different length for UTF-8 than luatex expected from
> the iso8859-1 original internal encoding of the PDF file (that's, just
> to make things worse, invisible in some “intelligent” editors and diff
> tools). So, LuaTeX will die with:
> 
> internal error: unknown image type
> !  ==> Fatal error occurred, no output PDF file produced!
> 
> Since there is no way (at least I didn't come up with one) to manually
> specify the output encoding of {filecontents}, or to trick the PDF in-
> put drivers of luatex into reading PDFs with a different encoding than
> usual, this makes embedding self-hosted PDFs in LuaTeX impossible,
> given that they have been created in another encoding than UTF-8. Most
> probably there are other cases of encoding-sensitive data that someone
> might embed via {filecontents} as it has worked in pdftex for ages (at
> least in conjunction with filecontents.sty).
> 
> Find attached a tarball with a MWE, with PDFs for two \includegraphics
> and the third one created during translation of the document. Obvious-
> ly, pdftex can cope with the “right” latin1 and the “wrong” UTF-8 PDFs
> but luatex cannot. Which engine is closer to “ideal” behaviour? Or did
> I overlook something important?

These are two separate issues:

- The UTF-8 PDF file is wrong in the sense that its xref table was made
for single-byte characters: as far as I can see, it counts each
multibyte UTF-8 character as one byte, so the recorded byte offsets no
longer match the file. The PDF library in LuaTeX assumes a correct xref
table and does no magic in reconstructing it (read: gambling). If you
want bad files to be read, you can consider feeding them into some
external program that fixes them.

- When you embed some PDF stream in the source file, you depend on your
macro package for dealing with how that input results in something
usable. LuaTeX is a UTF-8 engine and assumes UTF-8 input. I don't know
what that environment does, but that's the level at which one has to
deal with it, as the PDF library is not involved in that.
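
To illustrate the first point, here is a minimal Python sketch (not
LuaTeX code, and not how the PDF library checks anything). It builds a
tiny PDF-like fragment with a correct xref table, re-encodes it from
ISO 8859-1 to UTF-8 the way a UTF-8-only writer would, and then checks
whether the xref offsets still point at object headers. The function
names and the sample objects are made up for this example, and the
fragment has no trailer, so it is not a displayable PDF.

```python
import re

# The binary-marker comment from the report ("%" followed by the four
# characters U+00E2 U+00E3 U+00CF U+00D3), written with escapes so the
# example does not depend on the encoding of this source file.
MARKER = "%\u00e2\u00e3\u00cf\u00d3\n"

# In-use xref entries look like "0000000015 00000 n".
XREF_ENTRY = re.compile(rb"(\d{10}) (\d{5}) n")


def utf8_growth(text: str) -> int:
    """Extra bytes needed to store `text` as UTF-8 instead of ISO 8859-1."""
    return len(text.encode("utf-8")) - len(text.encode("latin-1"))


def build_sample() -> bytes:
    """Build a PDF-like fragment (hypothetical objects) with correct offsets."""
    objects = [
        "1 0 obj\n<< /Type /Catalog >>\nendobj\n",
        "2 0 obj\n<< /Type /Pages /Count 0 >>\nendobj\n",
    ]
    body = ("%PDF-1.4\n" + MARKER).encode("latin-1")
    offsets = []
    for obj in objects:
        offsets.append(len(body))  # byte offset where this object starts
        body += obj.encode("latin-1")
    xref = "xref\n0 3\n0000000000 65535 f \n"
    for off in offsets:
        xref += f"{off:010d} 00000 n \n"
    return body + xref.encode("latin-1")


def bad_xref_offsets(data: bytes) -> list:
    """Return in-use xref offsets that do not point at an 'N G obj' header."""
    bad = []
    for match in XREF_ENTRY.finditer(data):
        off = int(match.group(1))
        if not re.match(rb"\d+ \d+ obj", data[off:off + 20]):
            bad.append(off)
    return bad


good = build_sample()
# What a UTF-8-only writer effectively does to the single-byte original.
mangled = good.decode("latin-1").encode("utf-8")

print(utf8_growth(MARKER))        # 4: each of the four characters gains a byte
print(bad_xref_offsets(good))     # []
print(bad_xref_offsets(mangled))  # [15, 51]: both offsets are now 4 bytes short
```

Every object after the marker comment moves by the same four bytes, so
a reader that trusts the xref table lands in the middle of other data.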

Hans


-----------------------------------------------------------------
                                           Hans Hagen | PRAGMA ADE
               Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
        tel: 038 477 53 69 | www.pragma-ade.nl | www.pragma-pod.nl
-----------------------------------------------------------------