[pdftex] pdflatex adds spaces between letters, which makes search impossible

Ross Moore ross.moore at mq.edu.au
Fri Dec 6 03:27:11 CET 2019


Hi Olivier, and Allin

On 6 Dec 2019, at 12:52 pm, Allin Cottrell <cottrell at wfu.edu> wrote:

On Fri, 6 Dec 2019, Olivier via pdftex wrote:

Hello,

[sorry, I couldn't find a website where I could search through the past discussions to check if that question was already submitted to the list]

My problem is that PDF files produced by `pdflatex` are not searchable with `mupdf`. Given such a file named "test.pdf", we get the result below, which explains why searching for the string "lorem" fails:

$ mutool draw -F txt test.pdf | head -1
Lor e m

It is observed that spaces are added arbitrarily between the letters.

TeX normally does not include space characters between words.
PDF consuming software must use heuristics to deduce word boundaries on such PDFs.
So depending on what software you use, you can get different results ...


But who's adding them?

Since you are seeing some spaces with mutool, that raises a question:

  Does mutool have any parameters that control how large a gap must be to count as an interword space?

Maybe the lack of explicit spaces causes it to find the largest gaps between letters and interpret those as interword spaces?



I don't have the "lmodern" (font) package installed, but if I run pdftotext on a PDF generated as you describe except for the omission of "\usepackage{lmodern}", then


 … as here:

pdftotext test.pdf - | grep lorem

displays the expected results, with intact "lorem"s.

(I see the same with lmodern.)


However, it *is* actually possible to make  pdfLaTeX  include (faked) inter-word spaces,
using the primitive command:

     \pdfinterwordspaceon

Try this with your example before testing again with mutool.
Does it make a difference?

By “faked” I mean that the spaces have almost zero width (roughly 10^{-5} points) on the PDF page,
so they have no noticeable effect on the typeset layout.
But when text is extracted, they come out as real spaces.
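
For anyone wanting to try this, here is a minimal sketch of a test file. The preamble and the sample text are assumptions (the actual test file is not shown in this thread); the only essential line is \pdfinterwordspaceon, and it requires a pdfTeX recent enough to know that primitive.

```latex
% Minimal test file (hypothetical; adapt to your own example).
\documentclass{article}
\usepackage{lmodern}   % the font package mentioned in the original report

% Ask pdfTeX to write a (nearly zero-width) space between words.
\pdfinterwordspaceon

\begin{document}
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
\end{document}
```

Compile it with pdflatex, then rerun the same text-extraction commands quoted above and compare the output with and without the \pdfinterwordspaceon line.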



BTW, it has a companion: \pdfinterwordspaceoff.

Both of these are absolutely *vital* if you want to produce Archivable and Accessible PDFs,
which validate against ISO standards: PDF/A and PDF/UA.

It takes very tricky programming to turn on/off generation of these (fake) interword spaces
at exactly the correct places in the output, so as to satisfy the PDF/UA standard.
(I spoke about this issue at this year’s TUG meeting.)
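
As a rough illustration of the switching only (not of the PDF/UA-conforming placement, which is far more involved), the sketch below turns the fake spaces off for one paragraph and back on afterwards. It assumes that the two primitives take effect at the point where they appear in the output rather than following TeX grouping; please verify that against the pdfTeX manual before relying on it.

```latex
% Sketch: toggling fake interword spaces (sample text is made up).
\documentclass{article}
\begin{document}

\pdfinterwordspaceon
Fake spaces are written between the words of this paragraph.

\pdfinterwordspaceoff
No fake spaces are written in this paragraph.

\pdfinterwordspaceon
Fake spaces are written again from here on.

\end{document}
```

Extracting the text of the resulting PDF should show whether the middle paragraph is treated differently by your extraction tool.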



Allin Cottrell



Hope this helps.

Ross


Dr Ross Moore
Department of Mathematics and Statistics
12 Wally’s Walk, Level 7, Room 734
Macquarie University, NSW 2109, Australia
T: +61 2 9850 8955  |  F: +61 2 9850 8114
M: +61 407 288 255  |  E: ross.moore at mq.edu.au
http://www.maths.mq.edu.au
