[luatex] LuaTeX picky about internal PDF encoding, breaks self-hosted embedded documents

Fri Mar 27 13:16:04 CET 2020

On Fri, 27 Mar 2020 12:12:51 +0100,
Hans Hagen <j.hagen at xs4all.nl> wrote:

> On 3/27/2020 10:57 AM, Johannes Hielscher wrote:
> > tl;dr: The {filecontents} environment only writes UTF-8 encoded
> > files, so differently encoded PDFs cannot be referenced by luatex,
> > being more pedantic about PDF byte offsets (xref etc.) than pdftex.
> > 
> > Hi LuaTeX devs,
> > 
> > Strange issue here; it took me some time to find out where the
> > “error” (if any) is. LuaTeX is a lot more pedantic about PDFs
> > adhering to standards than pdfTeX. This is not necessarily a bad
> > thing, as nobody expects broken documents (bad xref tables/stream
> > lengths) to work well with any program. I definitely don't want to
> > request that luatex imitate pdftex's liberal interpreter (don't
> > make it too easy for folks like me to manually edit PDF files). But
> > I seem to have found a corner case where this does make a
> > difference, and I just want to make sure that I'm not totally on
> > the wrong track.
> > 
> > The {filecontents} environment “embeds” plaintext documents into
> > LaTeX and writes them into a new file. I tried to use this for
> > shipping some self-contained PDFs along with the document. This
> > worked with pdftex, but not with luatex. It turned out that
> > {filecontents} always writes UTF-8 files, but the copy-pasted
> > (unpacked, so largely ASCII) PDFs contain a "%âãÏÓ" encoding
> > safeguard (?) comment (second line, see attachment). In UTF-8 this
> > comment has a different byte length than luatex expects from the
> > original iso8859-1 encoding of the PDF file (a difference that is,
> > just to make things worse, invisible in some “intelligent” editors
> > and diff tools). So, LuaTeX will die with:
> > 
> > internal error: unknown image type
> > !  ==> Fatal error occurred, no output PDF file produced!
> > 
> > Since there is no way (at least I didn't come up with one) to
> > manually specify the output encoding of {filecontents}, or to trick
> > the PDF input drivers of luatex into reading PDFs with a different
> > encoding than usual, this makes embedding self-hosted PDFs in
> > LuaTeX impossible if they were created in an encoding other than
> > UTF-8. Most probably there are other cases of encoding-sensitive
> > data that someone might embed via {filecontents}, as this has
> > worked in pdftex for ages (at least in conjunction with
> > filecontents.sty).
> > 
> > Find attached a tarball with an MWE, with PDFs for two
> > \includegraphics and a third one created during compilation of
> > the document. Obviously, pdftex can cope with both the “right”
> > latin1 and the “wrong” UTF-8 PDFs, but luatex cannot. Which engine
> > is closer to “ideal” behaviour? Or did I overlook something
> > important?
> >  
> These are two separate issues:
> 
> - The utf8 pdf file is wrong in the sense that the xref table is made
> for single-byte characters; afaics it counts each multibyte utf
> character as one byte. The pdf library in luatex assumes a correct
> xref table and does no magic in reconstructing (read: gambling). If
> you want bad files to be read you can consider feeding them into some
> external program that fixes them.
> 
> - When you embed some pdf stream in the source file you depend on
> your macro package for dealing with how that input results in
> something usable. Luatex is a utf engine and assumes utf input. I
> don't know what that environment does, but that's the level at which
> one has to deal with it, as the pdf library is not involved in that.

You are 100% right. That's why I did not call it a bug in the first
place: everyone does their job right, and nothing has to be fixed. I
found this out the hard way and just wanted to leave it somewhere; it
might be helpful for someone else scratching their head over the
sparse evidence that pdftex is less pedantic about buggy PDFs than
luatex.
As already stated, no mercy for people who don't have their PDF
encoding/xref tables under control, and even a bit less in luatex
(which is not necessarily a bad thing!). Fall-out with respect to
hard-to-detect edge cases in high-level environments included.
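
To make the offset problem concrete, here is a minimal Python sketch
(not part of the attached MWE; the names MARKER, byte_shift and
restore_latin1 are made up for illustration). It shows that the
"%âãÏÓ" comment grows by four bytes when written as UTF-8, so every
xref offset after it would point four bytes too early. The second
helper assumes the file was merely re-encoded from latin1 to UTF-8,
which should only be the case for unpacked, largely ASCII PDFs like
the ones discussed here; it is not a general PDF repair tool.

```python
# The marker comment from the second line of the PDF, written with
# escapes so this script does not depend on its own file encoding.
MARKER = "%\u00e2\u00e3\u00cf\u00d3"  # "%âãÏÓ"


def byte_shift(text: str) -> int:
    """Extra bytes gained when latin1 text is written out as UTF-8."""
    return len(text.encode("utf-8")) - len(text.encode("latin-1"))


def restore_latin1(data: bytes) -> bytes:
    """Undo an accidental latin1 -> UTF-8 re-encoding.

    Assumption: `data` is valid UTF-8 and contains only code points
    up to U+00FF; otherwise decode/encode raises an exception.
    """
    return data.decode("utf-8").encode("latin-1")


latin1_bytes = MARKER.encode("latin-1")
utf8_bytes = MARKER.encode("utf-8")

# Each of the four non-ASCII characters takes 2 bytes in UTF-8.
print(len(latin1_bytes), len(utf8_bytes), byte_shift(MARKER))  # 5 9 4

# Round trip: the original single-byte form is recovered.
print(restore_latin1(utf8_bytes) == latin1_bytes)  # True
```

Applying something like restore_latin1 to the file written by
{filecontents} before including it would be one way to follow the
"external program" suggestion above, under the stated assumptions.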

Thanks,
Johannes

> 
> Hans
> 
> 
> 
> -----------------------------------------------------------------
>                                            Hans Hagen | PRAGMA ADE
>                Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
>         tel: 038 477 53 69 | www.pragma-ade.nl | www.pragma-pod.nl
> -----------------------------------------------------------------