[tex-live] Making texts externally replaceable in PDFs, e.g. with sed(1)

Fri Dec 14 18:03:01 CET 2018

Hi Phil,

Am 2018-12-14 um 17:43 schrieb Philip Taylor:
> 
> 
> Osipov, Michael wrote:
>> Hi folks,
>>
>> we are using XeTeX 3.14159265-2.6-0.99999 (TeX Live 2018) on Windows 
>> and FreeBSD.
>>
>> After studying the PDF specification [1] and how XeLaTeX and xdvipdfmx 
>> work with Unicode (from PDF samples), I believe that my request is 
>> (virtually) impossible.
>> I'd be happy if someone could either confirm this or prove me wrong.
>>
>> Task: We are producing PDFs on our server (from LaTeX source) for the 
>> client which takes the PDF and uploads it to another service which may 
>> replace placeholders, e.g., %DOCID% with the actual document ID in the 
>> target system. [Remainder snipped]
> 
> Well.  I used the following source :
> 
>> Now is the time \%DOCID\% for all good men
>>
>> to come to the aid of the party.
>>
>> \end
>>
> 
> which generated the attached PDF (Test.pdf).  I then opened "Test.pdf" 
> in Adobe Acrobat Pro DC, selected "Tools  / Edit PDF", and replaced 
> "%DOCID%" by "The quick brown fox jumps right over the lazy dog's 
> back".  The text re-flowed as one would hope.  If Adobe Acrobat Pro DC 
> can do it, then it clearly can be done; all that is needed is to write 
> code to emulate  Adobe Acrobat Pro DC's behaviour w.r.t. editing text.

Thanks for your quick reply, but this will not work; it rests on a 
conceptual misunderstanding.

Look closely at Test.pdf: it is compressed, so it cannot be processed 
with sed(1). Even if you decompress it, it contains a Type 1 font which 
has no /Encoding or /Ordering. /FontFile3 references 10 0 obj, which 
contains the entire font. This does not resemble my Unicode case at all.
The content is in 5 0 obj:
>  q 1 0 0 1 72 769.89 cm BT /F1 9.9626 Tf 19.925 -9.963 Td[(No)27(w)-332(is)-333(the)-334(time)-333(%DOCID%)-333(for)-334(all)-333(go)-28(o)-28(d)-333(men)-333(to)-334(come)-333(to)-333(the)-334(aid)-333(of)-333(the)-334(part)27(y)83(.)]TJ 211.584 -654.747 Td[(1)]TJ ET Q
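
A minimal sketch of the compression point, assuming the stream is stored 
with /FlateDecode (zlib); the variable names are illustrative and the 
content is a shortened excerpt of the stream above. It also shows that, 
even uncompressed, words can be split by kerning inside the TJ array 
("good" is stored as (go)-28(o)-28(d)), so a placeholder is only 
byte-searchable if no kern happens to fall inside it:

```python
import re
import zlib

# Shortened excerpt of the 5 0 obj content stream quoted above.
content = (b"BT /F1 9.9626 Tf 19.925 -9.963 Td"
           b"[(No)27(w)-332(is)-333(the)-334(time)-333(%DOCID%)-333(for)"
           b"-334(all)-333(go)-28(o)-28(d)-333(men)]TJ ET")

# Assumption: the PDF writer applied /FlateDecode, i.e. zlib compression.
compressed = zlib.compress(content)

placeholder = b"%DOCID%"
print(placeholder in compressed)                   # what sed(1) sees on disk
print(placeholder in zlib.decompress(compressed))  # after decompression

# Join the string operands of the TJ array. Word spaces are encoded as
# kerns (e.g. -333), so the joined text contains no blanks.
shown_text = b"".join(re.findall(rb"\(([^()]*)\)", content))
print(b"good" in content)     # raw bytes: the word is split by kerning
print(b"good" in shown_text)  # reassembled string operands
```

Here the placeholder survives intact in the decompressed stream only 
because this Type 1 sample uses a simple one-byte encoding and no kern 
falls inside it.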

As for Adobe Acrobat Pro DC: it is a full-fledged PDF suite which knows 
the format best and operates on an abstract in-memory representation of 
the PDF, while sed(1) operates on raw bytes.
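
To illustrate the difference, here is a rough sketch with a hypothetical, 
uncompressed stream object (the object layout is simplified and the 
replacement ID "DOC-2018-000123" is made up). A byte-level replacement 
with a value of a different length leaves the declared /Length stale, 
and in a real file the byte offsets in the cross-reference table would 
shift as well; a PDF library rewrites both, sed(1) does not:

```python
import re

# Hypothetical, simplified stream object; not taken from Test.pdf.
stream_body = b"BT /F1 9.9626 Tf [(time)-333(%DOCID%)-333(for)]TJ ET"
obj = (b"5 0 obj\n<< /Length %d >>\nstream\n" % len(stream_body)
       + stream_body + b"\nendstream\nendobj\n")

def declared_and_actual_length(data: bytes) -> tuple[int, int]:
    """Return the /Length value and the real size of the stream body."""
    declared = int(re.search(rb"/Length (\d+)", data).group(1))
    body = re.search(rb"stream\n(.*)\nendstream", data, re.S).group(1)
    return declared, len(body)

# Byte-level substitution, as sed(1) would do it.
patched = obj.replace(b"%DOCID%", b"DOC-2018-000123")

print(declared_and_actual_length(obj))      # lengths agree
print(declared_and_actual_length(patched))  # actual body is 8 bytes longer
```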

The operations this suite performs can only be achieved with a 
library like iText or PDFBox [1]. This isn't a route I really want to 
go down, especially because the post-processing on the target side is 
entirely out of my hands.

Regards,

Michael

[1] https://stackoverflow.com/q/52027733/696632