<div dir="ltr"><div>Hi Jonathan,<br><br></div>how do you put the ActualText into the PDF? Is it per syllable, or per word? We have commercial OCR software that can convert scanned PDFs to pages with selectable text. I have not examined it thoroughly, but it seems to me that it analyzes the scanned image, splits it into subimages "per word" and attaches ActualText to each word. As a result it is impossible to select just a group of characters; the smallest unit that can be copied and pasted (or searched for) is a word. It might fix the highlighting problem, but I am just guessing.<br><br></div><div class="gmail_extra"><br clear="all"><div><div class="gmail_signature">Zdeněk Wagner<br><a href="http://ttsm.icpf.cas.cz/team/wagner.shtml" target="_blank">http://ttsm.icpf.cas.cz/team/wagner.shtml</a><br><a href="http://icebearsoft.euweb.cz" target="_blank">http://icebearsoft.euweb.cz</a></div></div>
<br><div class="gmail_quote">2016-02-23 11:06 GMT+01:00 Jonathan Kew <span dir="ltr">&lt;<a href="mailto:jfkthame@gmail.com" target="_blank">jfkthame@gmail.com</a>&gt;</span>:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">On 23/2/16 02:54, Andrew Cunningham wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
It would probably more than double; I was under the impression that<br>
ActualText was a tag attribute, so extensive tagging would be needed,<br>
and actual text added to the tags.<br>
</blockquote>
<br></span>
The ActualText tagging is highly compressible, so in practice the increase in overall PDF size is not all that great.<span class=""><br>
<br>
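The compressibility point can be checked roughly with a minimal sketch, assuming the tagging is written as hex-encoded UTF-16BE strings inside a Flate-compressed content stream (as is common in PDFs). The helper name and the word list are hypothetical, and the sample is artificially repetitive, so the ratio it prints is optimistic compared with real text.<br>

```python
import zlib


def tag_for(word: str) -> bytes:
    """Build an illustrative ActualText marked-content wrapper for one word."""
    # UTF-16BE with a leading byte order mark (FE FF), written as a hex string.
    hex_text = (b"\xfe\xff" + word.encode("utf-16-be")).hex().upper()
    return f"/Span << /ActualText <{hex_text}> >> BDC EMC\n".encode("ascii")


# Hypothetical sample: five Hindi words repeated 40 times.
words = ["यूनिकोड", "क्या", "है", "प्रत्येक", "अक्षर"] * 40

raw = b"".join(tag_for(w) for w in words)
packed = zlib.compress(raw)  # zlib implements the Deflate scheme used by Flate

print(f"raw tagging: {len(raw)} bytes, compressed: {len(packed)} bytes")
```

The fixed wrapper text and the shared high byte of Devanagari code points (09) repeat in every tag, which is the kind of redundancy Deflate handles well.<br>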
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
But the question is how to practically make use of ActualText if there<br>
is a visible text layer.<br>
<br>
PDF/UA for instance leaves the question deliberately ambiguous.<br>
ActualText is the way to make the content accessible, but developers<br>
creating tools for PDF do not actually have to process the ActualText.<br>
<br>
So to index and search PDF files you need to build a discovery system<br>
utilising tools that allow you to specify the use of ActualText in<br>
preference to a visible text layer.<br>
<br>
</blockquote>
<br></span>
Acrobat Reader uses it, if present, so that Copy/Paste from the PDF results in the correct Unicode text (more or less), and Find behaves as expected.<br>
<br>
Other PDF readers (such as Apple's Preview) may well ignore the ActualText tagging, in which case it doesn't help. I don't know whether tools like Evince or Okular handle it....<br>
<br>
<br>
I'm attaching two sample PDFs with a simple chunk of Hindi text (from the Unicode web site). The first, dev-old.pdf, is what XeTeX currently generates (using the "Annapurna SIL" OpenType font). In general, Copy/Paste and text search don't work very well -- a few characters may be OK, but others are junk.<br>
<br>
The second sample, dev-actualtext.pdf, was generated with an experimental new \XeTeXgenerateactualtext feature, which automatically "tags" each word with an ActualText representation.<br>
<br>
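For readers unfamiliar with the mechanism, here is a minimal sketch of what per-word ActualText tagging can look like in a PDF content stream: the glyph-showing operators for a word are wrapped in a marked-content sequence (BDC ... EMC) whose property dictionary carries the Unicode text. This illustrates the general PDF mechanism only and is not necessarily the exact output of the experimental XeTeX feature; the function name and the glyph IDs are hypothetical.<br>

```python
def actualtext_span(word: str, show_ops: str) -> str:
    """Wrap glyph-showing operators in a marked-content sequence with ActualText."""
    # The replacement text is encoded as UTF-16BE with a leading byte order
    # mark (FE FF) and written as a PDF hex string.
    hex_text = (b"\xfe\xff" + word.encode("utf-16-be")).hex().upper()
    return f"/Span << /ActualText <{hex_text}> >> BDC\n{show_ops}\nEMC"


# The glyph IDs in the Tj operand are placeholders, not IDs from a real font.
print(actualtext_span("हिंदी", "<0012003400560078> Tj"))
```

A viewer that honours ActualText is expected to use the tagged string, rather than a glyph-to-character mapping, for everything between BDC and EMC. That would be consistent with a word being the smallest selectable unit when tagging is done per word.<br>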
Some points to note:<br>
<br>
- The file size is 24662 bytes, while dev-old was 22875 bytes, an increase of 1787 bytes (about 8%). Not too bad. Of course, a lot of that is the embedded font data; with longer documents that have lots of text but only a few fonts, the difference would presumably be somewhat greater.<br>
<br>
- Copy/Paste and Search work pretty well in Acrobat Reader. Not in Preview.app.<br>
<br>
- Highlighting of selected text (in Acrobat Reader) is somewhat broken, apparently due to the ActualText tagging (it looks better in dev-old). This may be fixable by tweaking exactly how the tagging is written into the PDF; I haven't investigated it further.<br>
<br>
<br>
No guarantees at this point as to whether/when this feature will actually be available. It was just a quick attempt to hack something up, to see how promising the results might be...<span class="HOEnZb"><font color="#888888"><br>
<br>
JK<br>
<br>
</font></span><br><br>
<br>
<br></blockquote></div><br></div>