[pdftex] pdflatex adds spaces between letters, which makes search impossible

Olivier oc-spam65 at laposte.net
Fri Dec 6 23:54:49 CET 2019


Thank you for your explanations and observations.

> TeX normally does not include space characters between words.
> PDF consuming software must use heuristics to deduce word boundaries on such PDFs.
> So depending on what software you use, you can get different results ...

> Since you are seeing some spaces with mutool, it raises a question:
> 
>    Does mutool have any parameters which affect how much space should be 
> considered as an interword gap?

I don't know, but if such a parameter exists, I don't think it is exposed to users.

> However, it *is* actually possible to make pdfLaTeX include (faked) 
> inter-word spaces, using the primitive command:
> 
>       \pdfinterwordspaceon
> 
> Try this with your example, before testing again with mutool.
> Does it make a difference?

Yes, it makes a difference. The result with `mutool` now shows a space between 
every letter, instead of "Lor e m":

$ mutool draw -F txt test.pdf
(...)
L o r e m
(...)

For the same test, the result with `pdftotext` is not affected:

$ pdftotext test.pdf -
Lorem ipsum dolor sit amet (...)
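
For reference, here is a minimal sketch of the kind of file that reproduces 
this test. The exact source of test.pdf is not included in this message, so 
the document class and the sample text below are assumptions; only 
\pdfinterwordspaceon comes from the suggestion quoted above.

```latex
% test.tex -- assumed reconstruction of the test file, compile with pdflatex
\documentclass{article}

% pdfTeX primitive suggested above: inserts faked (near zero width)
% inter-word spaces into the PDF so that text extractors can find them.
% Comment this line out to compare against the default behaviour.
\pdfinterwordspaceon

\begin{document}
Lorem ipsum dolor sit amet
\end{document}
```

Running the `mutool` and `pdftotext` commands above on the PDF built with and 
without the \pdfinterwordspaceon line should show whether the primitive is 
what changes the extracted text.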

> By “faked”, the spaces have almost 0 width (roughly 10^{-5} points) on the PDF 
> page, so they have no noticeable effect on the typeset layout.
> But when text is extracted they come out as a real space.

So it seems that `mupdf` treats every tiny space as a full space.

But how come the search function of `mupdf` works very well with 95% 
of the PDF files that I get from the internet? Could it be that `pdflatex` 
is not following a de facto standard shared by the other 95%? [Sorry for my 
lack of knowledge in this field.]

I'm not accusing anyone. I'm just trying to find out how the situation could 
be improved, and which software should be improved, while avoiding a situation 
where each side says "it's the fault of the other software".

Would it be sensible for me to open a bug report against `mupdf`?

Olivier
