[XeTeX] New feature REQUEST for xetex

Tue Feb 23 12:06:04 CET 2016

I am attaching a sample pdf and its OCRed text, produced with Tesseract OCR (
https://github.com/tesseract-ocr/tesseract).

The resulting pdf allows search as well as copy/paste of Devanagari
Unicode text.

The pdf is rendered using the original image, but the OCRed text is
available as a text layer, making it a searchable pdf. I do not think it uses
'ActualText' but I could be wrong. It allows searching for letters/partial
words, but the highlight is only in the ballpark, not always on the exact
letter.
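
One way to check whether a pdf uses ActualText would be something like the
sketch below. It is a rough heuristic rather than a real PDF parser: it looks
for the /ActualText key in the raw file and inside any stream that inflates
as Flate data. The function name is made up for illustration, and I have not
run this against the attached file.

```python
import re
import zlib


def pdf_mentions_actualtext(pdf_bytes: bytes) -> bool:
    """Heuristic: True if /ActualText occurs in the raw bytes or in any
    stream that can be inflated with zlib (FlateDecode)."""
    if b"/ActualText" in pdf_bytes:
        return True
    # Content streams are usually Flate-compressed, so inflate each one
    # and look again. Streams in other encodings are simply skipped.
    for match in re.finditer(rb"stream\r?\n(.*?)endstream", pdf_bytes, re.DOTALL):
        try:
            # decompressobj tolerates the end-of-line bytes before "endstream"
            data = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue
        if b"/ActualText" in data:
            return True
    return False
```

It would be called on the file contents, e.g. the bytes read from
sanskrit2003skt.jpg.pdf opened in binary mode. Encrypted files or streams
with other filters would need a proper PDF library instead.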

(Please note that search may not find the original text as displayed in the
pdf, because OCR is not accurate for Devanagari.)

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Tue, Feb 23, 2016 at 4:22 PM, Jonathan Kew <jfkthame at gmail.com> wrote:

> On 23/2/16 10:37, Zdenek Wagner wrote:
>
>> Hi Jonathan,
>>
>> how do you put the ActualText to PDF? Is it per syllable, or per word?
>>
>
> Per word.
>
>> We have a commercial OCR software that can convert scanned PDF to pages
>> with selectable texts. I have not examined it thoroughly but it seems to
>> me that it analyzes the scanned image, splits it into subimages "per word"
>> and attaches ActualText to each word. In such a way it is impossible to
>> select just a group of characters, the smallest entity that can be
>> copied & pasted (or searched for) is a word. It might fix the
>> highlighting problem but I am just guessing.
>>
>
> I don't think so. Even single-syllable words like भी don't highlight well
> in the example.
>
> (FWIW, it is possible to search for a substring within a word, and Acrobat
> finds it OK, but it can't accurately highlight what's been found; you get
> the same (inaccurate) highlighting of the word regardless of what substring
> within it was searched.)
>
> Setting ActualText per syllable would make finer-grained copy/paste
> possible (currently, entire words are always copied), but would be
> significantly more complex to implement (as well as adding to the PDF file
> bloat). I think the per-word version should be a useful start, at least.
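
For anyone who has not looked inside such a pdf: as far as I understand the
PDF spec, per-word ActualText is a marked-content span (BDC ... EMC) around
the operators that draw the word, with the replacement text stored as
UTF-16BE preceded by a byte-order mark. The sketch below builds such a span.
The function name and the glyph string passed in are made up for
illustration, and I do not know whether this matches exactly what the
experimental XeTeX feature writes.

```python
def actualtext_span(word: str, show_ops: str) -> str:
    """Wrap glyph-showing operators in a marked-content span that carries
    the word's Unicode text as ActualText."""
    # Unicode text strings in PDF are UTF-16BE with a leading BOM (FEFF),
    # written here as a hex string.
    hex_text = ("\ufeff" + word).encode("utf-16-be").hex().upper()
    return f"/Span << /ActualText <{hex_text}> >> BDC\n{show_ops}\nEMC"


# Hypothetical glyph ID string; a real one depends on the embedded font.
print(actualtext_span("भी", "<0012> Tj"))
```

Doing this per syllable would mean one such span per syllable instead of one
per word, which is where the extra size would come from.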
>
>
>>
>> Zdeněk Wagner
>> http://ttsm.icpf.cas.cz/team/wagner.shtml
>> http://icebearsoft.euweb.cz
>>
>> 2016-02-23 11:06 GMT+01:00 Jonathan Kew <jfkthame at gmail.com>:
>>
>>
>>     On 23/2/16 02:54, Andrew Cunningham wrote:
>>
>>         It would probably more than double. I was under the impression
>>         that ActualText was a tag attribute, so extensive tagging would
>>         be needed, and actual text added to the tags.
>>
>>
>>     The ActualText tagging is highly compressible, so in practice the
>>     increase in overall PDF size is not all that great.
>>
>>
>>         But the question is how to practically make use of ActualText
>>         if there is a visible text layer.
>>
>>         PDF/UA for instance leaves the question deliberately ambiguous.
>>         ActualText is the way to make the content accessible, but
>>         developers creating tools for PDF do not actually have to
>>         process the ActualText.
>>
>>         So to index and search PDF files you need to build a discovery
>>         system utilising tools that allow you to specify the use of
>>         ActualText in preference to a visible text layer.
>>
>>
>>     Acrobat Reader uses it, if present, so that Copy/Paste from the PDF
>>     results in the correct Unicode text (more or less), and Find behaves
>>     as expected.
>>
>>     Other PDF readers (such as Apple's Preview) may well ignore the
>>     ActualText tagging, in which case it doesn't help. I don't know
>>     whether tools like Evince or Okular handle it....
>>
>>
>>     I'm attaching two sample PDFs with a simple chunk of Hindi text
>>     (from the Unicode web site). The first, dev-old.pdf, is what XeTeX
>>     currently generates (using the "Annapurna SIL" OpenType font). In
>>     general, Copy/Paste and text search don't work very well -- a few
>>     characters may be OK, but others are junk.
>>
>>     The second sample, dev-actualtext.pdf, was generated with an
>>     experimental new \XeTeXgenerateactualtext feature, which
>>     automatically "tags" each word with an ActualText representation.
>>
>>     Some points to note:
>>
>>     - The file size is 24662 bytes, while dev-old was 22875 bytes. Not
>>     too bad. Of course, a lot of that is the embedded font data; with
>>     longer documents that have lots of text but only a few fonts, the
>>     difference would presumably be somewhat greater.
>>
>>     - Copy/Paste and Search work pretty well in Acrobat Reader. Not in
>>     Preview.app.
>>
>>     - Highlighting of selected text (in Acrobat Reader) is somewhat
>>     broken, apparently due to the ActualText tagging (it looks better in
>>     dev-old). This may be fixable by tweaking exactly how the tagging is
>>     written into the PDF; I haven't investigated it further.
>>
>>
>>     No guarantees at this point as to whether/when this feature will
>>     actually be available. It was just a quick attempt to hack something
>>     up, to see how promising the results might be...
>>
>>     JK
>>
>>
>>
>>
>>     --------------------------------------------------
>>     Subscriptions, Archive, and List information, etc.:
>>     http://tug.org/mailman/listinfo/xetex
-------------- next part --------------
श्नीग्म्याणैशायनम: ।।
नमस्तेऽस्तुगङ्गेत्वदङ्गंप्नसङ्गाद्भुजंगारत्तुस्ताक्नुद्ङ्गाल्लुवङ्गाप्तं
अनङ्गारिखाग्ससङ्गांक्खिम्राभुजङ्गाधिपग्ङ्गीकृत्तग्ङ्गामवन्ति" १।।

-------------- next part --------------
A non-text attachment was scrubbed...
Name: sanskrit2003skt.jpg.pdf
Type: application/pdf
Size: 40536 bytes
Desc: not available
URL: <http://tug.org/pipermail/xetex/attachments/20160223/23531061/attachment-0001.pdf>