[pdftex] pdflatex adds spaces between letters, which makes search impossible

Ross Moore ross.moore at mq.edu.au
Sat Dec 7 06:20:42 CET 2019


Hi Olivier,



On 7 Dec 2019, at 9:54 am, Olivier via pdftex <pdftex at tug.org> wrote:

Since you are seeing some spaces with  mutool , it raises a question:
  Does  mutool  have any parameters that affect how much space is treated as an inter-word gap?

I don't know, but if it exists, I don't think users have access to it.

OK; fine.
I downloaded the source to  mupdf , but it doesn’t say that it is compatible with macOS,
so I may not be able to compile it successfully.
Anyway, I’ll give it a try.


However, it *is* actually possible to make  pdfLaTeX  include (faked) inter-word spaces,
using the primitive command:
     \pdfinterwordspaceon
Try this with your example, before testing again with  mutool .
Does it make a difference?

Yes, it makes a difference. The result with `mutool` now shows a space between every letter, instead of "Lor e m":

Oh my. That is really something to laugh about.

$ mutool draw -F txt test.pdf
(...)
L o r e m
(...)

For the same test, the result with `pdftotext` is not affected:

$ pdftotext test.pdf -
Lorem ipsum dolor sit amet (…)

Yes; I’d already tried that.


By “faked” I mean that the spaces have almost zero width (roughly 10^{-5} points) on the PDF page, so they have no noticeable effect on the typeset layout.
But when text is extracted, they come out as real spaces.
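
To make the suggestion concrete, here is a minimal sketch of a test document with the primitive switched on. The file name test.tex, the document class and the sample sentence are assumptions for illustration, not the actual example discussed here; the only thing taken from this thread is \pdfinterwordspaceon itself, which needs a reasonably recent pdfTeX.

```latex
% test.tex (hypothetical file name, chosen to match test.pdf above)
\documentclass{article}

% Ask pdfTeX to insert faked, near-zero-width space characters
% between words in the PDF content stream.
\pdfinterwordspaceon

\begin{document}
Lorem ipsum dolor sit amet.
\end{document}
```

Compiling this with pdflatex and re-running the two extraction commands shown above should be enough to compare the tools; \pdfinterwordspaceoff switches the behaviour off again if only part of a document should be affected.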

So, it seems that `mupdf` considers every tiny space as a fully qualified space.

Maybe.


But how come the search function of `mupdf` performs very well with 95% of the PDF files that I get from the internet?

Well, there is certainly some selection going on there.

Isn't it `pdflatex` that isn't conforming to a natural standard shared by the other 95%? [sorry for my lack of knowledge in that field]

There is no *standard* involved there, apart from the specification of the PDF language and what is possible with it.
Specification of what is, or is not, a word is the domain of *subset* standards such as PDF/A and PDF/UA.
I doubt that any of your examples claim validation for these.
(That’s exactly what I am working on --- for pdftex to be able to produce standards-conforming documents.)


I'm not accusing. I'm just trying to find how the situation could be improved, and which software should be improved, while avoiding the situation where both say "it's the fault of the other software".

Almost *all* PDF consuming software needs improvement; especially with regard to the published standards.
Consult:
    https://en.wikipedia.org/wiki/PDF
    https://en.wikipedia.org/wiki/PDF/A
    https://en.wikipedia.org/wiki/PDF/UA
    https://en.wikipedia.org/wiki/PDF/E
    https://en.wikipedia.org/wiki/PDF/VT
    https://en.wikipedia.org/wiki/PDF/X


Different pieces of software have a tendency to concentrate on classes of documents that they do well.
In so doing, it is very easy to think that some things are standard, when in fact they are not.


Would it be sensible that I open a bug report against `mupdf`?

Sure; go ahead.
It’s clearly wrong to put a space between every letter.
But really, what is the description of the functionality of  mutool ,
when extracting content?
Does it claim to be extracting sensible parsed sentences,
or just to recognise text snippets, or character shapes?


Olivier



Hope this helps.

Ross


Dr Ross Moore
Department of Mathematics and Statistics
12 Wally’s Walk, Level 7, Room 734
Macquarie University, NSW 2109, Australia
T: +61 2 9850 8955  |  F: +61 2 9850 8114
M: +61 407 288 255  |  E: ross.moore at mq.edu.au
http://www.maths.mq.edu.au