[tex-live] Making texts externally replaceable in PDFs, e.g. with sed(1)

Osipov, Michael michael.osipov at siemens.com
Fri Dec 14 16:50:17 CET 2018


Hi folks,

we are using XeTeX 3.14159265-2.6-0.99999 (TeX Live 2018) on Windows and 
FreeBSD.

After studying the PDF specification [1] and how XeLaTeX and xdvipdfmx 
work with Unicode (from PDF samples), I believe that my request is 
(virtually) impossible.
I'd be happy if someone could either confirm this or prove me wrong.

Task: We are producing PDFs on our server (from LaTeX source) for the 
client, which takes the PDF and uploads it to another service. That 
service may replace placeholders, e.g., %DOCID%, with the actual 
document ID in the target system. So the PDF has to be uncompressed 
(xdvipdfmx -z 0) and has to contain the literal strings "(%DOCID%)Tj" 
or "[(%DOCID%)]TJ" as defined in the PDF spec.
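
To make the requirement concrete, here is a minimal Python sketch of the 
kind of byte-level replacement the downstream service would perform. It 
assumes the placeholder really is present as a PDF literal string; the 
function name replace_placeholder and the sample document ID are 
hypothetical and only serve as illustration.

```python
def replace_placeholder(pdf_bytes: bytes, placeholder: bytes, value: bytes) -> bytes:
    """Swap a placeholder stored as a PDF literal string, much like sed(1) would."""
    # Inside a PDF literal string, backslash and parentheses must be escaped.
    escaped = (
        value.replace(b"\\", b"\\\\")
        .replace(b"(", b"\\(")
        .replace(b")", b"\\)")
    )
    return pdf_bytes.replace(b"(" + placeholder + b")", b"(" + escaped + b")")


# Content stream fragment as we would like XeLaTeX/xdvipdfmx to emit it.
sample = b"BT /F1 9.9626 Tf 0 -11.955 Td[(%DOCID%)]TJ ET"
print(replace_placeholder(sample, b"%DOCID%", b"A-2018-42"))
# b'BT /F1 9.9626 Tf 0 -11.955 Td[(A-2018-42)]TJ ET'
```

The sketch ignores that a replacement of a different length shifts byte 
offsets, so the stream /Length and the xref table would no longer match.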

XeLaTeX produces the following:
> BT /F1 5.9776 Tf -40.819 -756.627 Td[<00270052004e00580050004800510057005100580050005000480055>]TJ /F1 9.9626 Tf 0 -11.955 Td[<0008002700320026002c00270008>]TJ ET

> begincmap
> /CMapName /C:-WINDOWS-fonts-siemens_global_roman.ttf,000-UTF16 def
> /CMapType 2 def
> /CIDSystemInfo <<
>   /Registry (Adobe)
>   /Ordering (UCS)
>   /Supplement 0
>>> def
> 1 begincodespacerange
> <0000> <FFFF>
> endcodespacerange
> 13 beginbfchar
> <0008> <0025>
> <0017> <0034>
> <001B> <0038>
> <002A> <0047>
> <002C> <0049>
> <002E> <004B>
> <0032> <004F>
> <0033> <0050>
> <0039> <0056>
> <005C> <0079>
> <005D> <007A>
> <008B> <00A9>
> <00B3> <2014>
> endbfchar
> 5 beginbfrange
> <0010> <0015> <002D>
> <0024> <0028> <0041>
> <0035> <0037> <0052>
> <0044> <0053> <0061>
> <0055> <0059> <0072>
> endbfrange
> endcmap

So it writes hexadecimal character codes which map to Unicode code 
points in our TrueType font Siemens Global.

So for a sed(1)-based postprocessor it is virtually impossible to map 
"<0008002700320026002c00270008>" to "%DOCID%" without analyzing the PDF 
objects.
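
What "analyzing the PDF objects" would involve can be sketched in 
Python: parse the bfchar and bfrange sections of the /ToUnicode CMap 
quoted above and decode the hex strings through them. This is only a 
sketch; the helper names are made up, and it handles just the simple 
forms that occur in this CMap (2-byte codes, single code points, no 
array form of bfrange).

```python
import re

# bfchar/bfrange sections copied from the CMap quoted above.
CMAP = """
13 beginbfchar
<0008> <0025>
<0017> <0034>
<001B> <0038>
<002A> <0047>
<002C> <0049>
<002E> <004B>
<0032> <004F>
<0033> <0050>
<0039> <0056>
<005C> <0079>
<005D> <007A>
<008B> <00A9>
<00B3> <2014>
endbfchar
5 beginbfrange
<0010> <0015> <002D>
<0024> <0028> <0041>
<0035> <0037> <0052>
<0044> <0053> <0061>
<0055> <0059> <0072>
endbfrange
"""

HEX = r"<([0-9A-Fa-f]+)>"


def parse_to_unicode(cmap_text):
    """Return a dict mapping character codes to Unicode characters."""
    mapping = {}
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(HEX + r"\s*" + HEX, block):
            mapping[int(src, 16)] = chr(int(dst, 16))
    for block in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for lo, hi, dst in re.findall(HEX + r"\s*" + HEX + r"\s*" + HEX, block):
            # Each code in lo..hi maps to consecutive code points from dst.
            for offset in range(int(hi, 16) - int(lo, 16) + 1):
                mapping[int(lo, 16) + offset] = chr(int(dst, 16) + offset)
    return mapping


def decode_hex_string(hex_string, mapping):
    """Decode a PDF hex string of 2-byte codes, e.g. '<00080027>'."""
    digits = hex_string.strip("<>")
    codes = [int(digits[i:i + 4], 16) for i in range(0, len(digits), 4)]
    # Unknown codes become U+FFFD instead of raising an error.
    return "".join(mapping.get(code, "\ufffd") for code in codes)


to_unicode = parse_to_unicode(CMAP)
print(decode_hex_string("<0008002700320026002c00270008>", to_unicode))
# %DOCID%
print(decode_hex_string(
    "<00270052004e00580050004800510057005100580050005000480055>", to_unicode))
# Dokumentnummer
```

So the placeholder is recoverable, but only with the CMap of the right 
font at hand, which is well beyond what a plain sed(1) expression can do.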

Requesting XeLaTeX to produce
> BT /F1 5.9776 Tf -40.819 -756.627 Td[<00270052004e00580050004800510057005100580050005000480055>]TJ /F1 9.9626 Tf 0 -11.955 Td[(%DOCID%)]TJ ET

will not work because the /ToUnicode CMap does not have a character 
mapping from the literal "%" (etc.) to the corresponding Unicode code 
point. On top of that, the characters of the real document ID that 
replace the placeholder would also need to be in the bfchar listing.
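
The last point can be illustrated with a small Python sketch that 
inverts the CMap quoted above and tries to encode a replacement text as 
a hex string. It rests on an assumption: that the codes listed in the 
/ToUnicode CMap are exactly the glyphs present in the embedded subset. 
The function names and the sample document IDs are hypothetical.

```python
# Entries copied from the CMap quoted above: (code, Unicode code point).
BFCHAR = [
    (0x0008, 0x0025), (0x0017, 0x0034), (0x001B, 0x0038), (0x002A, 0x0047),
    (0x002C, 0x0049), (0x002E, 0x004B), (0x0032, 0x004F), (0x0033, 0x0050),
    (0x0039, 0x0056), (0x005C, 0x0079), (0x005D, 0x007A), (0x008B, 0x00A9),
    (0x00B3, 0x2014),
]
# (first code, last code, Unicode code point of the first code).
BFRANGE = [
    (0x0010, 0x0015, 0x002D), (0x0024, 0x0028, 0x0041),
    (0x0035, 0x0037, 0x0052), (0x0044, 0x0053, 0x0061),
    (0x0055, 0x0059, 0x0072),
]


def build_reverse_map():
    """Map each Unicode character back to its character code."""
    reverse = {}
    for code, unicode_point in BFCHAR:
        reverse[chr(unicode_point)] = code
    for first, last, unicode_point in BFRANGE:
        for offset in range(last - first + 1):
            reverse[chr(unicode_point + offset)] = first + offset
    return reverse


def encode_text(text, reverse):
    """Encode text as a PDF hex string, or fail if a character is unmapped."""
    missing = sorted({ch for ch in text if ch not in reverse})
    if missing:
        # Assumption: an unmapped character has no glyph in the font subset.
        raise ValueError("not in the CMap: " + " ".join(missing))
    return "<" + "".join(f"{reverse[ch]:04x}" for ch in text) + ">"


reverse = build_reverse_map()
print(encode_text("%DOCID%", reverse))  # <0008002700320026002c00270008>

for doc_id in ("DOC-2018", "DOC-2019"):
    try:
        print(doc_id, "->", encode_text(doc_id, reverse))
    except ValueError as error:
        print(doc_id, "-> cannot be encoded:", error)
```

With this CMap, "DOC-2018" can be encoded while "DOC-2019" fails, because 
"9" does not occur anywhere else in the document and so has no entry.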

Producing a corresponding PDF that does allow the replacement, with the 
PDF XChange printer driver, embedded Siemens Global twice: once with 
Identity-H encoding (subset) and once with WinAnsiEncoding (complete). 
Without the char code to glyph mapping it seems to be possible. So the 
approach has to be an 8-bit font encoding:
> /Type /Font
> /Subtype /TrueType
> /BaseFont /SiemensSansGlobal-Regular
> /FirstChar 32
> /LastChar 220
> /Encoding /WinAnsiEncoding

This is impossible with XeLaTeX because of its Unicode nature. It will 
always use CID with Identity-H and UCS ordering.

This will get even more complicated if glyph spacing is involved.

I'd be happy if someone could drop a comment or two on the issue.

Regards,

Michael

PS: I haven't yet looked into whether the pdfx package could solve the 
issue with XeLaTeX. Plus, my knowledge of the PDF spec and LaTeX is 
very limited.

[1] 
https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf

