[pdftex] [Tex] very subtle endobj bug in latest pdftex

Sat Jul 26 20:51:59 CEST 2014

Hi Reinhard, and others.

Sorry not to have replied sooner.
Too busy with teaching/exams/marking etc.
then an overseas trip + conferences.

TUG 2014 is next week, so perhaps this is a good time
to address some of those issues with PDF/A, etc.

On 28/06/2014, at 3:26 PM, Reinhard Kotucha wrote:

> On 2014-06-24 at 09:10:47 +1000, Ross Moore wrote:
> 
>> But Reinhard makes an interesting point.
>> Maybe if I use Acrobat Pro to Save As...  PDF/A then the
>> line-ends will be changed from \r  to \n  and this
>> problem can be avoided?

> 
> Hi Ross,
> did you consider trying Ghostscript?  Despite its name, ps2pdf
> can read PDF files.

Thanh seems to have fixed the code in pdftex that this was
related to, so I've no real desire to explore this further.

> 
> ---------------------------------------------------
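> #!/bin/sh
> 
> # Output name: strip a trailing .pdf from the input, append -icc.pdf
> OUTPUTFILE="${1%%.pdf}-icc.pdf"
> 
> # A shell assignment takes no spaces around '='; quote the multi-word value
> GS_OPTS="-dPDFA=1 -dUseCIEColor -sProcessColorModel=DeviceCMYK"
> 
> ps2pdf --gsopts="${GS_OPTS}" -o "${OUTPUTFILE}" "$1"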
> ---------------------------------------------------
> 

> The preferred version of Ghostscript is 9.10 because the color
> management system is quite new and 9.13 breaks hyperref.

Yeah; that's something that really needs to be addressed.
If Nelson Beebe is in Portland, I'll see what his take on this is.

>> Also, it seems to have changed the colors somewhat.
> 
> In such cases I convert PDF to PostScript (using Ghostscript) and
> reverse-engineer the PostScript file.  Did you encounter this problem
> with JPEG/PNG files or with vector graphics?  If colors are changed in
> vector graphics, could you send me the files?

These were vector graphics, created using pdfTeX then edited
in Adobe Illustrator CS2  (quite old).

> If the files are created by LaTeX, it's helpful to avoid any text
> (fonts) and to add these lines to the preamble:
> 
>  \pdfcompresslevel=0
>  \pagestyle{empty}

Nope; that I cannot do.
The images require text.

I just take a typeset page, open it in Illustrator, and remove
the objects that I definitely don't want (e.g. page numbers,
and surrounding text content).
Other text, whose sizing is important, I shift into a separate layer.
This layer is marked as hidden, rather than deleted.

Attached below is an example of the original PDF, created as just described,
and a version in which I used Acrobat Pro 11 to change a font.
Some colours changed too, quite unintentionally. 

-------------- next part --------------
A non-text attachment was scrubbed...
Name: Figure3-ai.pdf
Type: application/pdf
Size: 132488 bytes
Desc: not available
URL: <http://tug.org/pipermail/pdftex/attachments/20140726/d8cda87c/attachment-0002.pdf>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Figure3-Asana-pdfa.pdf
Type: application/pdf
Size: 324856 bytes
Desc: not available
URL: <http://tug.org/pipermail/pdftex/attachments/20140726/d8cda87c/attachment-0003.pdf>
-------------- next part --------------

> 
> BTW, you are providing plenty of files which differ only in a few
> lines.  The ideal solution in respect of maintainability is to derive
> them all from a single .dtx file.  I recently told you that it doesn't
> work because TeX translates the bytes of the BOM to its ^^-syntax.
> 
> Fortunately I met Heiko Oberdiek at the Dante Conference in April and
> he told me that it works if the -8bit option is used, for instance
> 
>  tex -8bit pdfx.ins
> 
> I tried it with my test file and it works like a charm.  Vim displays
> 
>  <?xpacket begin="<feff>" id="W5M0MpCehiHzreSzNTczkc9d"?>
> 
> Without the -8bit option I get
> 
>  <?xpacket begin="^^ef^^bb^^bf" id="W5M0MpCehiHzreSzNTczkc9d"?>

OK.
But this is from the  xmpincl.sty  package, which  pdfx.sty
uses but does not control.
The author is  Maarten Sneep  <sneep at nat.vu.nl> .

> 
> Hence I think that deriving everything from a single .dtx file is
> feasible and I'm convinced that it's easier to maintain this way.

I'm not familiar with all the possibilities with  .dtx  files.
My intention was to just do relatively minimal changes to  pdfx.sty
that would allow me to support features that I use myself but not
affect things that I do not use, so cannot properly test.

More things should be added, including more extensive possibilities
within the Metadata, and getting multiple files from the same .dtx .
But I'll need help to be able to do that properly and/or completely.

> 
> You posted the pdfx package to the list some time ago.  Could you
> provide an update?

Not sure why it didn't make it into  TeXLive 2014.
Presumably because CTAN still has the  04-May-2009  version.

> 
> BTW, I'm glad that all free validators are happy now with the stuff
> I created with LuaTeX.  The only exception is the Pdftron validator
> which complains that "CIDSet is incomplete".  I reverse-engineered the
> output of LuaTeX and came to the conclusion that it's compliant with
> the standard.

Part of the problem with this game is that there are so many
little details, and no single set of guidelines that cope with
everything; especially within a (La)TeX context.

For example, a few days ago I noticed that in the metadata for
a DOI, I was getting

  10.1007/978-3-319-08434-3\13714

instead of:

  10.1007/978-3-319-08434-3_14

because within \pdfstringdef  \_ becomes octal \137 .
Do we really need to use \pdfstringdef when specifying
Metadata elements?
Is it a good idea to use it, ever?
Sometimes?
If so, then \pdfstringdefDisableCommands needs setting to catch
character specifications like \_ \% \$ etc., which authors might
expect to be supported.
What other macros are likely to be used, also needing special
treatment?
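
To illustrate the mechanism (this is not hyperref code, and the function
name decode_pdf_octal is made up for the example), here is a minimal
Python sketch that reads backslash-plus-octal-digits the way a PDF literal
string does. It shows how \137 followed by the digits 14 encodes "_14",
which is fine inside a PDF string but wrong when the same bytes are
written into XMP, where no such escape exists. It assumes only
ASCII-range codes; PD1 codes above 127 would need a proper encoding table.

```python
import re

def decode_pdf_octal(s: str) -> str:
    """Replace each backslash followed by 1-3 octal digits with the
    character having that code (ASCII range assumed)."""
    return re.sub(r"\\([0-7]{1,3})",
                  lambda m: chr(int(m.group(1), 8)),
                  s)

# The DOI as it appeared in the metadata, with \_ already turned into \137
doi_as_written = r"10.1007/978-3-319-08434-3\13714"
print(decode_pdf_octal(doi_as_written))
```

A PDF reader performs this decoding for /Info strings; an XMP consumer
does not, so the escaped form is taken literally there.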

Other metadata problems that I've encountered are with
document /Info items not showing up in the expected place.
For example, I have a PDF document with:

>> 1006 0 obj
>> <<
>> /Title(PDF/A-3u as an archival format for Accessible mathematics)/Author(Ross Moore)/Subject(Using PDF/A-3u to deliver the LaTeX source of mathematical expressions, for Accessibility and other purposes)/Keywords(PDF/A-3u, Accessible mathematics)/Creator(LaTeX with hyperref package)/CreationDate(D:20140714175022-07'00)/ModDate(D:20140714175022-07'00)/Producer(pdfTeX)/Trapped /False /GTS_PDFA1Version (PDF/A-3u:2012)
>> /PTEX.Fullbanner (This is pdfTeX, Version 3.14159265-2.6-1.40.15 (TeX Live 2014/dev) kpathsea version 6.2.0dev)
>> >>
>> endobj
>> 

and

>> trailer
>> << /Size 1007
>> /Root 1005 0 R
>> /Info 1006 0 R
>> /ID [<97AECB41667017AB5E575238987012A3> <97AECB41667017AB5E575238987012A3>] >>
>> 

Yet the  /Author  value does not show up 
(see attached screenshot)

-------------- next part --------------
A non-text attachment was scrubbed...
Name: Screen shot 2014-07-25 at 4.18.21 PM.png
Type: image/png
Size: 115957 bytes
Desc: not available
URL: <http://tug.org/pipermail/pdftex/attachments/20140726/d8cda87c/attachment-0003.png>
-------------- next part --------------

It seems that field is not controlled by the  /Author  entry at all,
but by a Metadata entry of:
  <dc:creator> ... </dc:creator>
in the XMP metadata.
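
To make the association concrete, here is a minimal Python sketch (not
part of pdfx.sty or xmpincl.sty; the function name dc_creator is made up
for illustration) that wraps author names in the dc:creator structure.
The XMP specification defines dc:creator as an ordered array, hence the
rdf:Seq wrapper around the rdf:li items.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
ET.register_namespace("dc", DC)
ET.register_namespace("rdf", RDF)

def dc_creator(authors):
    """Return a dc:creator element (as text) holding an ordered list
    of author names, one rdf:li per author."""
    creator = ET.Element(f"{{{DC}}}creator")
    seq = ET.SubElement(creator, f"{{{RDF}}}Seq")
    for name in authors:
        ET.SubElement(seq, f"{{{RDF}}}li").text = name
    return ET.tostring(creator, encoding="unicode")

# The same author string that the /Info dictionary carries
print(dc_creator(["Ross Moore"]))
```

The standalone output carries its own xmlns declarations; inside a full
XMP packet these would normally sit on the enclosing rdf:Description.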

Is this kind of non-obvious association written down anywhere?
If so, where?

OK, here's a possible place:

  http://www.niso.org/apps/group_public/download.php/10256/Z39-85-2012_dublin_core.pdf

>> The Dublin Core Metadata Element Set 
>> 
>> Abstract: Defines fifteen metadata elements for resource description in a cross-disciplinary information environment.
>> 

Presumably all 15 DC elements should be supported.
Currently pdfx.sty  supports just 8 of them.

Most of them show up only when you look at:
  Additional Metadata...  >  Advanced 

So my question above is more about how one knows which Metadata
elements show up in Acrobat's various Metadata panels.
Any pointers?

The Copyright entries are particularly troublesome.
An attached image shows Acrobat's XMP panel:

-------------- next part --------------
A non-text attachment was scrubbed...
Name: Screen shot 2014-07-26 at 11.15.40 AM.png
Type: image/png
Size: 118317 bytes
Desc: not available
URL: <http://tug.org/pipermail/pdftex/attachments/20140726/d8cda87c/attachment-0004.png>
-------------- next part --------------

I've been able to feed the  Copyright Notice, using

 <dc:rights><rdf:Alt><rdf:li xml:lang="x-default">\xmpCopyright</rdf:li></rdf:Alt></dc:rights>

but my attempts to set the  Copyright Status  popup,
and the  Copyright Info  field  and URL , using 

  <xmpRights:Marked>True</xmpRights:Marked>
  <xmpRights:UsageTerms> __Copyright__ </xmpRights:UsageTerms>
  <xmpRights:WebStatement>  __URL__ </xmpRights:WebStatement>

fail to validate.
Any ideas?
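
Nothing in this thread establishes why validation fails. One thing that
may be worth checking: the XMP specification types xmpRights:UsageTerms
as a language alternative, the same shape as the dc:rights entry above,
rather than as plain text. The Python sketch below builds that shape; the
function name usage_terms and the sample text are made up for the
example, and whether this satisfies a given validator is untested here.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
XMPRIGHTS = "http://ns.adobe.com/xap/1.0/rights/"
XML = "http://www.w3.org/XML/1998/namespace"
ET.register_namespace("rdf", RDF)
ET.register_namespace("xmpRights", XMPRIGHTS)

def usage_terms(text, lang="x-default"):
    """Return xmpRights:UsageTerms (as text) as an rdf:Alt with one
    language-tagged rdf:li item."""
    terms = ET.Element(f"{{{XMPRIGHTS}}}UsageTerms")
    alt = ET.SubElement(terms, f"{{{RDF}}}Alt")
    li = ET.SubElement(alt, f"{{{RDF}}}li", {f"{{{XML}}}lang": lang})
    li.text = text
    return ET.tostring(terms, encoding="unicode")

# Placeholder text, standing in for the __Copyright__ value
print(usage_terms("Placeholder usage terms"))
```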

> 
> Regards,
>  Reinhard

All the best,

	Ross

------------------------------------------------------------------------
Ross Moore                                       ross.moore at mq.edu.au 
Mathematics Department                           office: E7A-206      
Macquarie University                             tel: +61 (0)2 9850 8955
Sydney, Australia  2109                          fax: +61 (0)2 9850 8114
------------------------------------------------------------------------
