<div dir="ltr"><div class="gmail_default" style="font-family:georgia,serif">I am attaching a sample PDF and its OCRed text, produced using Tesseract OCR (<a href="https://github.com/tesseract-ocr/tesseract">https://github.com/tesseract-ocr/tesseract</a>). </div><div class="gmail_default" style="font-family:georgia,serif"><br></div><div class="gmail_default" style="font-family:georgia,serif">The resulting PDF allows search as well as copy and paste of Devanagari Unicode text. <br></div><div class="gmail_default" style="font-family:georgia,serif"><br></div><div class="gmail_default" style="font-family:georgia,serif">The PDF is rendered using the original image, but the OCRed text is available as a text layer, making it a searchable PDF. I do not think it uses 'ActualText', but I could be wrong. It allows searching for letters or partial words, but the highlight is only approximately placed, not always on the exact letter.</div><div class="gmail_default" style="font-family:georgia,serif"><br></div><div class="gmail_default" style="font-family:georgia,serif">(Please note that search may not find the original text as displayed in the PDF, because OCR is not accurate for Devanagari.) <br></div><div class="gmail_default" style="font-family:georgia,serif"><br></div><div class="gmail_default" style="font-family:georgia,serif"><br></div><div class="gmail_default" style="font-family:georgia,serif"><br></div><div class="gmail_default" style="font-family:georgia,serif"><br></div></div><div class="gmail_extra"><br clear="all"><div><div class="gmail_signature"><div dir="ltr">ShreeDevi<br>____________________________________________________________<br>भजन - कीर्तन - आरती @ <a href="http://bhajans.ramparivar.com" target="_blank">http://bhajans.ramparivar.com</a><br></div></div></div>

<br><div class="gmail_quote">On Tue, Feb 23, 2016 at 4:22 PM, Jonathan Kew <span dir="ltr"><<a href="mailto:jfkthame@gmail.com" target="_blank">jfkthame@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">On 23/2/16 10:37, Zdenek Wagner wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Hi Jonathan,<br>

<br>

how do you put the ActualText into the PDF? Is it per syllable, or per word?<br>

</blockquote>

<br></span>

Per word.<span class=""><br>

<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

We have commercial OCR software that can convert a scanned PDF to pages<br>

with selectable text. I have not examined it thoroughly, but it seems to<br>

me that it analyzes the scanned image, splits it into subimages "per word"<br>

and attaches ActualText to each word. That way it is impossible to<br>

select just a group of characters; the smallest entity that can be<br>

copied & pasted (or searched for) is a word. It might fix the<br>

highlighting problem, but I am just guessing.<br>

</blockquote>

<br></span>

I don't think so. Even single-syllable words like भी don't highlight well in the example.<br>

<br>

(FWIW, it is possible to search for a substring within a word, and Acrobat finds it OK, but it can't accurately highlight what's been found; you get the same (inaccurate) highlighting of the word regardless of what substring within it was searched.)<br>

<br>

Setting ActualText per syllable would make finer-grained copy/paste possible (currently, entire words are always copied), but would be significantly more complex to implement (as well as adding to the PDF file bloat). I think the per-word version should be a useful start, at least.<br>
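<br>

As a rough illustration of what per-word tagging involves, the sketch below builds a PDF marked-content span that carries an /ActualText entry for one word. It assumes the standard /Span ... BDC ... EMC mechanism with the text written as a UTF-16BE hex string; the function name and the glyph code are hypothetical, and the bytes XeTeX actually writes may well differ.<br>

```python
def actualtext_span(word: str, glyph_hex: str) -> str:
    """Wrap one word's show-text operator in a span carrying /ActualText.

    Hypothetical helper for illustration only; not XeTeX's actual output.
    """
    # PDF text strings can be UTF-16BE with a leading byte-order mark (FE FF),
    # written here as a hex string.
    payload = (b"\xfe\xff" + word.encode("utf-16-be")).hex().upper()
    # glyph_hex stands in for the font-specific glyph codes of the shaped word.
    return f"/Span <</ActualText <{payload}>>> BDC\n<{glyph_hex}> Tj\nEMC"


# Example with a made-up glyph code for the single-syllable word mentioned above.
print(actualtext_span("भी", "00A3"))
```

Tagging per syllable would mean emitting one such span for every syllable instead of one per word, which requires splitting both the Unicode text and the shaped glyph run at matching boundaries.<br>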

<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">

<br>

<br>

Zdeněk Wagner<br>

<a href="http://ttsm.icpf.cas.cz/team/wagner.shtml" rel="noreferrer" target="_blank">http://ttsm.icpf.cas.cz/team/wagner.shtml</a><br>

<a href="http://icebearsoft.euweb.cz" rel="noreferrer" target="_blank">http://icebearsoft.euweb.cz</a><br>

<br>

2016-02-23 11:06 GMT+01:00 Jonathan Kew <<a href="mailto:jfkthame@gmail.com" target="_blank">jfkthame@gmail.com</a>>:<br></span>

<div><div class="h5"><br>

<br>

    On 23/2/16 02:54, Andrew Cunningham wrote:<br>

<br>

        It would probably more than double; I was under the impression that<br>

        ActualText was a tag attribute, so extensive tagging would be<br>

        needed,<br>

        and actual text added to the tags.<br>

<br>

<br>

    The ActualText tagging is highly compressible, so in practice the<br>

    increase in overall PDF size is not all that great.<br>

<br>

<br>

        But the question is how to practically make use of ActualText if<br>

        there<br>

        is a visible text layer.<br>

<br>

        PDF/UA for instance leaves the question deliberately ambiguous.<br>

        ActualText is the way to make the content accessible, but developers<br>

        creating tools for PDF do not actually have to process the<br>

        ActualText.<br>

<br>

        So to index and search PDF files you need to build a discovery<br>

        system<br>

        utilising tools that allow you to specify the use of ActualText in<br>

        preference to a visible text layer.<br>

<br>

<br>

    Acrobat Reader uses it, if present, so that Copy/Paste from the PDF<br>

    results in the correct Unicode text (more or less), and Find behaves<br>

    as expected.<br>

<br>

    Other PDF readers (such as Apple's Preview) may well ignore the<br>

    ActualText tagging, in which case it doesn't help. I don't know<br>

    whether tools like Evince or Okular handle it....<br>

<br>

<br>

    I'm attaching two sample PDFs with a simple chunk of Hindi text<br>

    (from the Unicode web site). The first, dev-old.pdf, is what XeTeX<br>

    currently generates (using the "Annapurna SIL" OpenType font). In<br>

    general, Copy/Paste and text search don't work very well -- a few<br>

    characters may be OK, but others are junk.<br>

<br>

    The second sample, dev-actualtext.pdf, was generated with an<br>

    experimental new \XeTeXgenerateactualtext feature, which<br>

    automatically "tags" each word with an ActualText representation.<br>

<br>

    Some points to note:<br>

<br>

    - The file size is 24662 bytes, while dev-old was 22875 bytes. Not<br>

    too bad. Of course, a lot of that is the embedded font data; with<br>

    longer documents that have lots of text but only a few fonts, the<br>

    difference would presumably be somewhat greater.<br>

<br>

    - Copy/Paste and Search work pretty well in Acrobat Reader. Not in<br>

    Preview.app.<br>

<br>

    - Highlighting of selected text (in Acrobat Reader) is somewhat<br>

    broken, apparently due to the ActualText tagging (it looks better in<br>

    dev-old). This may be fixable by tweaking exactly how the tagging is<br>

    written into the PDF; I haven't investigated it further.<br>

<br>

<br>

    No guarantees at this point as to whether/when this feature will<br>

    actually be available. It was just a quick attempt to hack something<br>

    up, to see how promising the results might be...<br>

<br>

    JK<br>

<br>

<br>

<br>

<br>

    --------------------------------------------------<br>

    Subscriptions, Archive, and List information, etc.:<br>

    <a href="http://tug.org/mailman/listinfo/xetex" rel="noreferrer" target="_blank">http://tug.org/mailman/listinfo/xetex</a><br>

<br>

<br>

<br>

<br>

<br>

<br>


<br>

</div></div></blockquote><div class="HOEnZb"><div class="h5">

<br>

<br>

<br>


</div></div></blockquote></div><br></div>