[pdftex] pdflatex adds spaces between letters, which makes search impossible
Olivier
oc-spam65 at laposte.net
Fri Dec 6 23:54:49 CET 2019
Thank you for your explanations and observations.
> TeX normally does not include space characters between words.
> PDF consuming software must use heuristics to deduce word boundaries on such PDFs.
> So depending on what software you use, you can get different results ...
> Since you are seeing some spaces with mutool, it raises a question:
>
> Does mutool have any parameters which affect how much space should be
> considered as an interword gap?
I don't know, but if such a parameter exists, I don't think it is exposed to users.
> However, it *is* actually possible to make pdfLaTeX include (faked)
> inter-word spaces,
> using the primitive command:
>
> \pdfinterwordspaceon
>
> Try this with your example, before testing again with mutool.
> Does it make a difference?
Yes, it makes a difference. The result with `mutool` now shows a space between
every letter, instead of "Lor e m":
$ mutool draw -F txt test.pdf
(...)
L o r e m
(...)
For the same test, the result with `pdftotext` is not affected:
$ pdftotext test.pdf -
Lorem ipsum dolor sit amet (...)
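A minimal file for reproducing this kind of test might look like the sketch
below. The file contents are an assumption (the test.tex behind the output
above is not included in this message); only `\pdfinterwordspaceon` comes from
the quoted advice.

```latex
% Hypothetical test.tex: a sketch, not the file used for the output above.
\documentclass{article}
\begin{document}
% pdfTeX primitive from the quoted advice: asks pdfTeX to emit (faked)
% inter-word spaces into the PDF so that text extractors can see them.
\pdfinterwordspaceon
Lorem ipsum dolor sit amet.
\end{document}
```

Compiling this with `pdflatex test.tex` and running the two extraction commands
above on the resulting test.pdf, once with and once without the
`\pdfinterwordspaceon` line, should allow the two tools to be compared.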
> By “faked”, the spaces have almost 0 width (roughly 10^{-5} points) on the PDF
> page, so they have no noticeable effect on the typeset layout.
> But when text is extracted they come out as a real space.
So it seems that `mupdf` treats every tiny space as a full space.
But how come the search function of `mupdf` performs very well with 95%
of the PDF files that I get from the internet? Isn't it `pdflatex` that fails to
follow a de facto standard shared by the other 95%? [sorry for my lack
of knowledge in that field]
I'm not accusing anyone. I'm just trying to find out how the situation could be
improved and which software should be changed, while avoiding the situation
where each side says "it's the fault of the other software".
Would it make sense for me to open a bug report against `mupdf`?
Olivier