[XeTeX] New feature REQUEST for xetex

Tue Feb 23 11:37:03 CET 2016

Hi Jonathan,

how do you put the ActualText into the PDF? Is it per syllable or per word?
We have commercial OCR software that converts scanned PDFs into pages with
selectable text. I have not examined it thoroughly, but it seems to analyze
the scanned image, split it into "per word" subimages, and attach ActualText
to each word. As a result it is impossible to select just a group of
characters; the smallest entity that can be copied and pasted (or searched
for) is a word. Tagging per word might fix the highlighting problem, but I am
just guessing.
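
To make the "per word" idea concrete, here is a minimal sketch in Python of
how one word could be wrapped in a /Span marked-content sequence carrying
ActualText, the general mechanism PDF provides for replacement text. The
function name, the glyph-showing operators and the glyph IDs are hypothetical,
and I do not know whether the OCR software or the experimental XeTeX feature
writes exactly this form.

```python
def actualtext_span(word: str, show_ops: str) -> str:
    """Wrap glyph-showing operators in a /Span marked-content sequence.

    `show_ops` is a placeholder for whatever content-stream operators draw
    the word; the glyph IDs in the usage below are made up for illustration.
    """
    # ActualText written as a PDF hex string: UTF-16BE preceded by the
    # FEFF byte-order mark, so a viewer can recover the Unicode text.
    hex_text = "FEFF" + word.encode("utf-16-be").hex().upper()
    return f"/Span <</ActualText <{hex_text}>>> BDC\n{show_ops}\nEMC"


def tag_words(words_with_ops):
    """Tag each (word, operators) pair separately, i.e. one span per word."""
    return "\n".join(actualtext_span(w, ops) for w, ops in words_with_ops)


if __name__ == "__main__":
    # Two words, each with hypothetical glyph IDs in its Tj operator.
    print(tag_words([("हिन्दी", "<00120034> Tj"), ("Ab", "<00050006> Tj")]))
```

With one span per word, a viewer that honours ActualText can only map a
selection to whole words, which would match the behaviour I described above.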

Zdeněk Wagner
http://ttsm.icpf.cas.cz/team/wagner.shtml
http://icebearsoft.euweb.cz

2016-02-23 11:06 GMT+01:00 Jonathan Kew <jfkthame at gmail.com>:

> On 23/2/16 02:54, Andrew Cunningham wrote:
>
>> It would probably more than double; I was under the impression that
>> ActualText was a tag attribute, so extensive tagging would be needed,
>> and actual text added to the tags.
>>
>
> The ActualText tagging is highly compressible, so in practice the increase
> in overall PDF size is not all that great.
>
>
>> But the question is how to practically make use of ActualText if there
>> is a visible text layer.
>>
>> PDF/UA, for instance, leaves the question deliberately ambiguous.
>> ActualText is the way to make the content accessible, but developers
>> creating tools for PDF do not actually have to process the ActualText.
>>
>> So to index and search PDF files you need to build a discovery system
>> utilising tools that allow you to specify the use of ActualText in
>> preference to a visible text layer.
>>
>>
> Acrobat Reader uses it, if present, so that Copy/Paste from the PDF
> results in the correct Unicode text (more or less), and Find behaves as
> expected.
>
> Other PDF readers (such as Apple's Preview) may well ignore the ActualText
> tagging, in which case it doesn't help. I don't know whether tools like
> Evince or Okular handle it....
>
>
> I'm attaching two sample PDFs with a simple chunk of Hindi text (from the
> Unicode web site). The first, dev-old.pdf, is what XeTeX currently
> generates (using the "Annapurna SIL" OpenType font). In general, Copy/Paste
> and text search don't work very well -- a few characters may be OK, but
> others are junk.
>
> The second sample, dev-actualtext.pdf, was generated with an experimental
> new \XeTeXgenerateactualtext feature, which automatically "tags" each word
> with an ActualText representation.
>
> Some points to note:
>
> - The file size is 24662 bytes, while dev-old was 22875 bytes. Not too
> bad. Of course, a lot of that is the embedded font data; with longer
> documents that have lots of text but only a few fonts, the difference would
> presumably be somewhat greater.
>
> - Copy/Paste and Search work pretty well in Acrobat Reader. Not in
> Preview.app.
>
> - Highlighting of selected text (in Acrobat Reader) is somewhat broken,
> apparently due to the ActualText tagging (it looks better in dev-old). This
> may be fixable by tweaking exactly how the tagging is written into the PDF;
> I haven't investigated it further.
>
>
> No guarantees at this point as to whether/when this feature will actually
> be available. It was just a quick attempt to hack something up, to see how
> promising the results might be...
>
> JK