[XeTeX] New feature REQUEST for xetex

Tue Feb 23 11:52:22 CET 2016

On 23/2/16 10:37, Zdenek Wagner wrote:
> Hi Jonathan,
>
> how do you put the ActualText into the PDF? Is it per syllable, or per word?

Per word.

> We have commercial OCR software that can convert scanned PDFs to pages
> with selectable text. I have not examined it thoroughly but it seems to
> me that it analyzes the scanned image, splits it into subimages "per word"
> and attaches ActualText to each word. In such a way it is impossible to
> select just a group of characters; the smallest entity that can be
> copied & pasted (or searched for) is a word. It might fix the
> highlighting problem but I am just guessing.

I don't think so. Even single-syllable words like भी don't highlight 
well in the example.

(FWIW, it is possible to search for a substring within a word, and 
Acrobat finds it OK, but it can't accurately highlight what's been 
found; you get the same (inaccurate) highlighting of the word regardless 
of what substring within it was searched.)

Setting ActualText per syllable would make finer-grained copy/paste 
possible (currently, entire words are always copied), but would be 
significantly more complex to implement (as well as adding to the PDF 
file bloat). I think the per-word version should be a useful start, at 
least.
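
For readers unfamiliar with the mechanism, here is a rough sketch of what
per-word tagging can look like in a PDF content stream. It assumes the
marked-content form (a /Span tag with an /ActualText entry wrapped around
the glyph-showing operators) and a UTF-16BE string with a leading FEFF
byte order mark. The helper name actualtext_span and the glyph IDs are
hypothetical, and this is not necessarily what the experimental XeTeX
code writes.

```python
def actualtext_span(word: str, glyph_ops: str) -> str:
    """Wrap glyph-showing operators in a marked-content sequence that
    carries the Unicode text of one word as /ActualText.

    Hypothetical helper for illustration only.
    """
    # PDF text strings in Unicode are UTF-16BE with a leading FEFF BOM,
    # written here as a hex string between angle brackets.
    hex_text = (b"\xfe\xff" + word.encode("utf-16-be")).hex().upper()
    return f"/Span << /ActualText <{hex_text}> >> BDC\n{glyph_ops}\nEMC"


# The glyph IDs in the Tj operand are made up; real ones depend on the font.
print(actualtext_span("भी", "<00120034> Tj"))
```

A viewer that honours ActualText returns the tagged string for the whole
span when copying or searching, which is consistent with entire words
always being copied when the tagging is done per word.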

>
>
> Zdeněk Wagner
> http://ttsm.icpf.cas.cz/team/wagner.shtml
> http://icebearsoft.euweb.cz
>
> 2016-02-23 11:06 GMT+01:00 Jonathan Kew <jfkthame at gmail.com
> <mailto:jfkthame at gmail.com>>:
>
>     On 23/2/16 02:54, Andrew Cunningham wrote:
>
>         It would probably more than double. I was under the impression that
>         ActualText was a tag attribute, so extensive tagging would be
>         needed, with the actual text added to the tags.
>
>
>     The ActualText tagging is highly compressible, so in practice the
>     increase in overall PDF size is not all that great.
>
>
>         But the question is how to make practical use of ActualText if
>         there is a visible text layer.
>
>         PDF/UA, for instance, leaves the question deliberately ambiguous.
>         ActualText is the way to make the content accessible, but developers
>         creating tools for PDF do not actually have to process the
>         ActualText.
>
>         So to index and search PDF files you need to build a discovery
>         system utilising tools that allow you to specify the use of
>         ActualText in preference to a visible text layer.
>
>
>     Acrobat Reader uses it, if present, so that Copy/Paste from the PDF
>     results in the correct Unicode text (more or less), and Find behaves
>     as expected.
>
>     Other PDF readers (such as Apple's Preview) may well ignore the
>     ActualText tagging, in which case it doesn't help. I don't know
>     whether tools like Evince or Okular handle it....
>
>
>     I'm attaching two sample PDFs with a simple chunk of Hindi text
>     (from the Unicode web site). The first, dev-old.pdf, is what XeTeX
>     currently generates (using the "Annapurna SIL" OpenType font). In
>     general, Copy/Paste and text search don't work very well -- a few
>     characters may be OK, but others are junk.
>
>     The second sample, dev-actualtext.pdf, was generated with an
>     experimental new \XeTeXgenerateactualtext feature, which
>     automatically "tags" each word with an ActualText representation.
>
>     Some points to note:
>
>     - The file size is 24662 bytes, while dev-old was 22875 bytes. Not
>     too bad. Of course, a lot of that is the embedded font data; with
>     longer documents that have lots of text but only a few fonts, the
>     difference would presumably be somewhat greater.
>
>     - Copy/Paste and Search work pretty well in Acrobat Reader. Not in
>     Preview.app.
>
>     - Highlighting of selected text (in Acrobat Reader) is somewhat
>     broken, apparently due to the ActualText tagging (it looks better in
>     dev-old). This may be fixable by tweaking exactly how the tagging is
>     written into the PDF; I haven't investigated it further.
>
>
>     No guarantees at this point as to whether/when this feature will
>     actually be available. It was just a quick attempt to hack something
>     up, to see how promising the results might be...
>
>     JK
>
>
>
>
>     --------------------------------------------------
>     Subscriptions, Archive, and List information, etc.:
>     http://tug.org/mailman/listinfo/xetex
>