<div dir="ltr"><div class="gmail_default" style="font-family:georgia,serif">I am attaching a sample PDF and its OCRed text, produced with Tesseract OCR (<a href="https://github.com/tesseract-ocr/tesseract">https://github.com/tesseract-ocr/tesseract</a>). </div><div class="gmail_default" style="font-family:georgia,serif"><br></div><div class="gmail_default" style="font-family:georgia,serif">The resulting PDF allows search as well as copy/paste of Devanagari Unicode text. <br></div><div class="gmail_default" style="font-family:georgia,serif"><br></div><div class="gmail_default" style="font-family:georgia,serif">The PDF is rendered using the original image, but the OCRed text is available as a text layer, making it a searchable PDF. I do not think it uses ActualText, but I could be wrong. It allows searching for letters or partial words, but the highlight is only in the ballpark, not always on the exact letter.</div><div class="gmail_default" style="font-family:georgia,serif"><br></div><div class="gmail_default" style="font-family:georgia,serif">(Please note that a search may not find the text as displayed in the PDF, because OCR is not accurate for Devanagari.) <br></div><div class="gmail_default" style="font-family:georgia,serif"><br></div><div class="gmail_default" style="font-family:georgia,serif"><br></div><div class="gmail_default" style="font-family:georgia,serif"><br></div><div class="gmail_default" style="font-family:georgia,serif"><br></div></div><div class="gmail_extra"><br clear="all"><div><div class="gmail_signature"><div dir="ltr">ShreeDevi<br>____________________________________________________________<br>भजन - कीर्तन - आरती @ <a href="http://bhajans.ramparivar.com" target="_blank">http://bhajans.ramparivar.com</a><br></div></div></div>
<br><div class="gmail_quote">On Tue, Feb 23, 2016 at 4:22 PM, Jonathan Kew <span dir="ltr"><<a href="mailto:jfkthame@gmail.com" target="_blank">jfkthame@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">On 23/2/16 10:37, Zdenek Wagner wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Hi Jonathan,<br>
<br>
How do you put the ActualText into the PDF? Is it per syllable or per word?<br>
</blockquote>
<br></span>
Per word.<span class=""><br>
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
We have commercial OCR software that can convert a scanned PDF to pages<br>
with selectable text. I have not examined it thoroughly, but it seems to<br>
me that it analyzes the scanned image, splits it into subimages "per word",<br>
and attaches ActualText to each word. As a result, it is impossible to<br>
select just a group of characters; the smallest entity that can be<br>
copied and pasted (or searched for) is a word. It might fix the<br>
highlighting problem, but I am just guessing.<br>
</blockquote>
<br></span>
I don't think so. Even single-syllable words like भी don't highlight well in the example.<br>
<br>
(FWIW, it is possible to search for a substring within a word, and Acrobat finds it OK, but it can't accurately highlight what's been found; you get the same (inaccurate) highlighting of the word regardless of what substring within it was searched.)<br>
<br>
Setting ActualText per syllable would make finer-grained copy/paste possible (currently, entire words are always copied), but would be significantly more complex to implement (as well as adding to the PDF file bloat). I think the per-word version should be a useful start, at least.<br>
<br>
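For readers unfamiliar with the mechanism, here is a minimal Python sketch of what per-word ActualText tagging can look like in a PDF content stream.<br>
It assumes the common marked-content form (a /Span with an ActualText entry, bracketed by BDC and EMC) and is not taken from the XeTeX implementation.<br>
The function name and the glyph operators passed in are hypothetical placeholders.<br>

```python
def actualtext_span(word: str, glyph_ops: str) -> str:
    """Wrap glyph-drawing operators in a /Span marked-content sequence
    that carries the word's Unicode text as ActualText."""
    # PDF text strings may be encoded as UTF-16BE with a leading U+FEFF
    # byte-order mark; here the bytes are written as a hex string.
    hex_text = ("\ufeff" + word).encode("utf-16-be").hex().upper()
    return f"/Span <</ActualText <{hex_text}>>> BDC\n{glyph_ops}\nEMC"


# Hypothetical glyph IDs; a real stream would use the font's own glyph codes.
print(actualtext_span("भी", "<00410042> Tj"))
```

Tagging per syllable would mean emitting one such wrapper for each syllable instead of one per word, which is where the extra complexity and file size would come from.<br>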
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">
<br>
<br>
Zdeněk Wagner<br>
<a href="http://ttsm.icpf.cas.cz/team/wagner.shtml" rel="noreferrer" target="_blank">http://ttsm.icpf.cas.cz/team/wagner.shtml</a><br>
<a href="http://icebearsoft.euweb.cz" rel="noreferrer" target="_blank">http://icebearsoft.euweb.cz</a><br>
<br>
2016-02-23 11:06 GMT+01:00 Jonathan Kew <<a href="mailto:jfkthame@gmail.com" target="_blank">jfkthame@gmail.com</a><br></span>
<mailto:<a href="mailto:jfkthame@gmail.com" target="_blank">jfkthame@gmail.com</a>>>:<div><div class="h5"><br>
<br>
On 23/2/16 02:54, Andrew Cunningham wrote:<br>
<br>
It would probably more than double. I was under the impression that<br>
ActualText was a tag attribute, so extensive tagging would be needed,<br>
and actual text added to the tags.<br>
<br>
<br>
The ActualText tagging is highly compressible, so in practice the<br>
increase in overall PDF size is not all that great.<br>
<br>
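A rough way to see why the tagging compresses well: the wrapper around every word is nearly identical, so a deflate-style compressor removes most of it.<br>
The Python sketch below is only an illustration under stated assumptions: the wrapper syntax is the usual /Span form rather than confirmed XeTeX output, the function name is hypothetical, and the repeated three-word sample overstates the redundancy of real text.<br>

```python
import zlib


def actualtext_overhead(words):
    """Return (raw, compressed) byte counts for the ActualText wrappers alone."""
    parts = []
    for w in words:
        # UTF-16BE with BOM, written as a PDF hex string.
        hex_text = ("\ufeff" + w).encode("utf-16-be").hex().upper()
        parts.append(f"/Span <</ActualText <{hex_text}>>> BDC EMC\n")
    raw = "".join(parts).encode("ascii")
    return len(raw), len(zlib.compress(raw))


# Sample words chosen for illustration only, repeated to simulate a longer text.
raw, packed = actualtext_overhead(["यूनिकोड", "क्या", "है"] * 200)
print(raw, packed)
```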
<br>
But the question is how to practically make use of ActualText if there<br>
is a visible text layer.<br>
<br>
PDF/UA, for instance, leaves the question deliberately ambiguous.<br>
ActualText is the way to make the content accessible, but developers<br>
creating tools for PDF do not actually have to process the ActualText.<br>
<br>
So to index and search PDF files, you need to build a discovery system<br>
utilising tools that allow you to specify the use of ActualText in<br>
preference to a visible text layer.<br>
<br>
<br>
Acrobat Reader uses it, if present, so that Copy/Paste from the PDF<br>
results in the correct Unicode text (more or less), and Find behaves<br>
as expected.<br>
<br>
Other PDF readers (such as Apple's Preview) may well ignore the<br>
ActualText tagging, in which case it doesn't help. I don't know<br>
whether tools like Evince or Okular handle it....<br>
<br>
<br>
I'm attaching two sample PDFs with a simple chunk of Hindi text<br>
(from the Unicode web site). The first, dev-old.pdf, is what XeTeX<br>
currently generates (using the "Annapurna SIL" OpenType font). In<br>
general, Copy/Paste and text search don't work very well -- a few<br>
characters may be OK, but others are junk.<br>
<br>
The second sample, dev-actualtext.pdf, was generated with an<br>
experimental new \XeTeXgenerateactualtext feature, which<br>
automatically "tags" each word with an ActualText representation.<br>
<br>
Some points to note:<br>
<br>
- The file size is 24662 bytes, while dev-old was 22875 bytes. Not<br>
too bad. Of course, a lot of that is the embedded font data; with<br>
longer documents that have lots of text but only a few fonts, the<br>
difference would presumably be somewhat greater.<br>
<br>
- Copy/Paste and Search work pretty well in Acrobat Reader. Not in<br>
Preview.app.<br>
<br>
- Highlighting of selected text (in Acrobat Reader) is somewhat<br>
broken, apparently due to the ActualText tagging (it looks better in<br>
dev-old). This may be fixable by tweaking exactly how the tagging is<br>
written into the PDF; I haven't investigated it further.<br>
<br>
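To put the two file sizes quoted above in proportion, a quick calculation in Python (the variable names are just for illustration):<br>

```python
old_size = 22875  # dev-old.pdf, in bytes
new_size = 24662  # dev-actualtext.pdf, in bytes

added = new_size - old_size
percent = 100 * added / old_size
print(f"{added} bytes added ({percent:.1f}% larger)")
# prints: 1787 bytes added (7.8% larger)
```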
<br>
No guarantees at this point as to whether/when this feature will<br>
actually be available. It was just a quick attempt to hack something<br>
up, to see how promising the results might be...<br>
<br>
JK<br>
<br>
<br>
<br>
<br>
--------------------------------------------------<br>
Subscriptions, Archive, and List information, etc.:<br>
<a href="http://tug.org/mailman/listinfo/xetex" rel="noreferrer" target="_blank">http://tug.org/mailman/listinfo/xetex</a><br>
<br>
</div></div></blockquote><div class="HOEnZb"><div class="h5">
<br>
</div></div></blockquote></div><br></div>