<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">
Hi Olivier,<br class="">
<div><br class="">
</div>
<div><br class="">
</div>
<div><br class="">
<blockquote type="cite" class="">
<div class="">On 7 Dec 2019, at 9:54 am, Olivier via pdftex <<a href="mailto:pdftex@tug.org" class="">pdftex@tug.org</a>> wrote:</div>
</blockquote>
<br class="">
<blockquote type="cite" class="">
<div class="">
<div class="">
<blockquote type="cite" class="">Since you are seeing some spaces with mutool , it begs a question:<br class="">
Does mutool have any parameters which affect how much space should be considered as an interword gap ?<br class="">
</blockquote>
<br class="">
I don't know, but if any exist, I don't think users have access to them.<br class="">
</div>
</div>
</blockquote>
<div><br class="">
</div>
<div>OK; fine.</div>
<div>I downloaded the source for mupdf, but it doesn’t say that it is compatible with macOS,</div>
<div>so I may not be able to compile it successfully.</div>
<div>Anyway, I’ll give it a try.</div>
<br class="">
<blockquote type="cite" class="">
<div class="">
<div class=""><br class="">
<blockquote type="cite" class="">However, it *is* actually possible to make pdfLaTeX include (faked) inter-word spaces,<br class="">
using the primitive command:<br class="">
\pdfinterwordspaceon<br class="">
Try this with your example before testing again with mutool.<br class="">
Does it make a difference?<br class="">
</blockquote>
<br class="">
Yes, it makes a difference. The result with `mutool` now shows a space between every letter, instead of "Lor e m":<br class="">
</div>
</div>
</blockquote>
<div><br class="">
</div>
<div>Oh my. That is really something to laugh about.</div>
<br class="">
<blockquote type="cite" class="">
<div class="">
<div class="">$ mutool draw -F txt test.pdf<br class="">
(...)<br class="">
L o r e m<br class="">
(...)<br class="">
<br class="">
For the same test, the result with `pdftotext` is not affected:<br class="">
<br class="">
$ pdftotext test.pdf -<br class="">
Lorem ipsum dolor sit amet (…)<br class="">
</div>
</div>
</blockquote>
<div><br class="">
</div>
<div>Yes; I’d already tried that.</div>
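<div>A minimal sketch of a test file for this comparison follows. The file name test.tex and the sample sentence are assumptions, chosen only to match the test.pdf and the "Lorem ipsum" output in the commands quoted above; \pdfinterwordspaceon is the primitive already mentioned in this thread, and the sketch assumes a pdfTeX recent enough to provide it.</div>
<pre class="">
```latex
% test.tex (hypothetical file name)
\documentclass{article}
\begin{document}
% Switch on the faked, near-zero-width inter-word spaces
% for everything typeset from this point onwards.
\pdfinterwordspaceon
Lorem ipsum dolor sit amet
\end{document}
```
</pre>
<div>Compiling this with pdflatex, once with and once without the \pdfinterwordspaceon line, and then running the two extraction commands quoted above on each resulting PDF, should allow the outputs to be compared side by side.</div>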
<br class="">
<blockquote type="cite" class="">
<div class="">
<div class=""><br class="">
<blockquote type="cite" class="">By “faked”, the spaces have almost 0 width (roughly 10^{-5} points) on the PDF page, so they have no noticeable effect on the typeset layout.<br class="">
But when text is extracted they come out as a real space.<br class="">
</blockquote>
<br class="">
So, it seems that `mupdf` considers every tiny space as a fully qualified space.<br class="">
</div>
</div>
</blockquote>
<div><br class="">
</div>
<div>Maybe.</div>
<div><br class="">
</div>
<br class="">
<blockquote type="cite" class="">
<div class="">
<div class="">But how comes that the search function of `mupdf` performs very well with 95% of the PDF files that I get from the internet?
</div>
</div>
</blockquote>
<div><br class="">
</div>
<div>Well, there is certainly some selection going on there.</div>
<br class="">
<blockquote type="cite" class="">
<div class="">
<div class="">Isn't it `pdflatex` that isn't conforming to a natural standard shared by the other 95%? [sorry for my lack of knowledge in that field]<br class="">
</div>
</div>
</blockquote>
<div><br class="">
</div>
<div>There is no *standard* involved there, apart from the specification of the PDF language and what is possible with it.</div>
<div>Specification of what is, or is not, a word is the domain of *subset* standards such as PDF/A and PDF/UA.</div>
<div>I doubt that any of your examples claim validation for these.</div>
<div>(That’s exactly what I am working on: enabling pdftex to produce standards-conforming documents.)</div>
<br class="">
<blockquote type="cite" class="">
<div class="">
<div class=""><br class="">
I'm not accusing. I'm just trying to find how the situation could be improved, and which software should be improved, while avoiding the situation where both say "it's the fault of the other software".<br class="">
</div>
</div>
</blockquote>
<div><br class="">
</div>
<div>Almost *all* PDF-consuming software needs improvement, especially with regard to the published standards.</div>
<div>Consult:</div>
<a href="https://en.wikipedia.org/wiki/PDF" class="">https://en.wikipedia.org/wiki/PDF</a></div>
<div> <a href="https://en.wikipedia.org/wiki/PDF/A" class="">https://en.wikipedia.org/wiki/PDF/A</a></div>
<div> <a href="https://en.wikipedia.org/wiki/PDF/UA" class="">https://en.wikipedia.org/wiki/PDF/UA</a></div>
<div> <a href="https://en.wikipedia.org/wiki/PDF/E" class="">https://en.wikipedia.org/wiki/PDF/E</a></div>
<div> <a href="https://en.wikipedia.org/wiki/PDF/VT" class="">https://en.wikipedia.org/wiki/PDF/VT</a></div>
<div> <a href="https://en.wikipedia.org/wiki/PDF/X" class="">https://en.wikipedia.org/wiki/PDF/X</a></div>
<div><br class="">
<div><br class="">
</div>
<div>Different pieces of software tend to concentrate on the classes of documents that they handle well.</div>
<div>As a result, it is very easy to think that some things are standard when in fact they are not.</div>
<br class="">
<blockquote type="cite" class="">
<div class="">
<div class=""><br class="">
Would it be sensible that I open a bug report against `mupdf`?<br class="">
</div>
</div>
</blockquote>
<div><br class="">
</div>
<div>Sure; go ahead.</div>
<div>It’s clearly wrong to put a space between every letter.</div>
<div>But really, what is the documented functionality of mutool</div>
<div>when extracting content?</div>
<div>Does it claim to be extracting sensible parsed sentences,</div>
<div>or just to recognise text snippets, or character shapes?</div>
<br class="">
<blockquote type="cite" class="">
<div class="">
<div class=""><br class="">
Olivier<br class="">
<br class="">
</div>
</div>
</blockquote>
</div>
<div class=""><br class="">
</div>
<div class=""><br class="">
</div>
Hope this helps.
<div class=""><br class="">
</div>
<div class=""><span class="Apple-tab-span" style="white-space:pre"></span>Ross</div>
<div class=""><br class="">
<div class=""><br class="">
Dr Ross Moore<br class="">
Department of Mathematics and Statistics
<div class="">12 Wally’s Walk, Level 7, Room 734<br class="">
Macquarie University, NSW 2109, Australia<br class="">
T: +61 2 9850 8955 | F: +61 2 9850 8114<br class="">
M: +61 407 288 255 | E: <a href="mailto:ross.moore@mq.edu.au" class="">ross.moore@mq.edu.au</a><br class="">
<a href="http://www.maths.mq.edu.au" class="">http://www.maths.mq.edu.au</a><span style="font-size: 12px; line-height: normal;"><a href="http://mq.edu.au/" target="_blank" style="font-size: 12px; line-height: normal;" class=""><span><br class="Apple-interchange-newline" style="caret-color: rgb(0, 105, 217); color: rgb(0, 105, 217); font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; font-family: Arial, sans-serif; orphans: 2; widows: 2;">
<span style="caret-color: rgb(0, 105, 217); color: rgb(0, 105, 217); font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; font-family: Arial, sans-serif; orphans: 2; widows: 2;"><span><span><span><span><img apple-inline="yes" id="B80C1386-3EBF-4051-A656-8F59A70FFDEF" src="cid:image001.png@01D030BE.D37A46F0" class=""></span><br style="font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); font-family: Helvetica; text-decoration: none;" class="">
<span style="font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); font-family: Helvetica; text-decoration: none; float: none; display: inline !important;" class="">CRICOS
Provider Number 00002J. Think before you</span><span style="font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); font-family: Helvetica; text-decoration: none; float: none; display: inline !important;" class=""> </span><span style="font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); font-family: Helvetica; text-decoration: none; float: none; display: inline !important;" class="">print. </span><br style="font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); font-family: Helvetica; text-decoration: none;" class="">
<span style="font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); font-family: Helvetica; text-decoration: none; float: none; display: inline !important;" class="">Please
consider the environment before printing this</span><span style="font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); font-family: Helvetica; text-decoration: none; float: none; display: inline !important;" class=""> </span><span style="font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); font-family: Helvetica; text-decoration: none; float: none; display: inline !important;" class="">email.</span><br style="font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); font-family: Helvetica; text-decoration: none;" class="">
<br style="font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); font-family: Helvetica; text-decoration: none;" class="">
<span style="font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); font-family: Helvetica; text-decoration: none; float: none; display: inline !important;" class="">This
message is intended for the addressee named</span><span style="font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); font-family: Helvetica; text-decoration: none; float: none; display: inline !important;" class=""> </span><span style="font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); font-family: Helvetica; text-decoration: none; float: none; display: inline !important;" class="">and
may </span><br style="font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); font-family: Helvetica; text-decoration: none;" class="">
<span style="font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); font-family: Helvetica; text-decoration: none; float: none; display: inline !important;" class="">contain
confidential information. If you are not the</span><span style="font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); font-family: Helvetica; text-decoration: none; float: none; display: inline !important;" class=""> </span><span style="font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); font-family: Helvetica; text-decoration: none; float: none; display: inline !important;" class="">intended </span><br style="font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); font-family: Helvetica; text-decoration: none;" class="">
<span style="font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); font-family: Helvetica; text-decoration: none; float: none; display: inline !important;" class="">recipient,
please delete it and notify the sender. Views</span><span style="font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); font-family: Helvetica; text-decoration: none; float: none; display: inline !important;" class=""> </span><span style="font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); font-family: Helvetica; text-decoration: none; float: none; display: inline !important;" class="">expressed </span><br style="font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); font-family: Helvetica; text-decoration: none;" class="">
<span style="font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); font-family: Helvetica; text-decoration: none; float: none; display: inline !important;" class="">in
this message are those of the individual sender, and</span><span style="font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); font-family: Helvetica; text-decoration: none; float: none; display: inline !important;" class=""> </span><span style="font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); font-family: Helvetica; text-decoration: none; float: none; display: inline !important;" class="">are
not </span><br style="font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); font-family: Helvetica; text-decoration: none;" class="">
<span style="font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); font-family: Helvetica; text-decoration: none; float: none; display: inline !important;" class="">necessarily
the views of Macquarie University.</span> </span></span></span></span></span></a></span></div>
<a href="http://mq.edu.au/" target="_blank" style="font-size: 12px; line-height: normal;" class=""></a></div>
<br class="">
</div>
</body>
</html>