[XeTeX] New feature REQUEST for xetex

Jonathan Kew jfkthame at gmail.com
Tue Feb 23 11:06:41 CET 2016


On 23/2/16 02:54, Andrew Cunningham wrote:
> It would probably more than double. I was under the impression that
> ActualText was a tag attribute, so extensive tagging would be needed,
> and actual text added to the tags.

The ActualText tagging is highly compressible, so in practice the 
increase in overall PDF size is not all that great.

>
> But the question is how to practically make use of ActualText if there
> is a visible text layer.
>
> PDF/UA for instance leaves the question deliberately ambiguous.
> ActualText is the way to make the content accessible, but developers
> creating tools for PDF do not actually have to process the ActualText.
>
> So to index and search PDF files you need to build a discovery system
> utilising tools that allow you to specify the use of ActualText in
> preference to a visible text layer.
>

Acrobat Reader uses it, if present, so that Copy/Paste from the PDF 
results in the correct Unicode text (more or less), and Find behaves as 
expected.

Other PDF readers (such as Apple's Preview) may well ignore the 
ActualText tagging, in which case it doesn't help. I don't know whether 
tools like Evince or Okular handle it....


I'm attaching two sample PDFs with a simple chunk of Hindi text (from 
the Unicode web site). The first, dev-old.pdf, is what XeTeX currently 
generates (using the "Annapurna SIL" OpenType font). In general, 
Copy/Paste and text search don't work very well -- a few characters may 
be OK, but others are junk.

The second sample, dev-actualtext.pdf, was generated with an 
experimental new \XeTeXgenerateactualtext feature, which automatically 
"tags" each word with an ActualText representation.
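
As a rough illustration of what per-word tagging of this kind looks like 
inside a PDF content stream, here is a minimal Python sketch. It relies on 
the marked-content mechanism from the PDF specification (a /Span tag with 
an /ActualText entry, wrapped in BDC ... EMC) and on PDF text strings being 
UTF-16BE with a leading byte-order mark. It is not taken from what XeTeX 
actually writes; the helper names and the glyph-showing operator string are 
made up for the example.

```python
def actualtext_hex(word: str) -> str:
    """Encode a word as a PDF hex string: UTF-16BE preceded by the FEFF BOM."""
    data = b"\xfe\xff" + word.encode("utf-16-be")
    return "<" + data.hex().upper() + ">"


def wrap_span(word: str, glyph_ops: str) -> str:
    """Wrap glyph-showing operators in a marked-content span carrying ActualText.

    glyph_ops is whatever the content stream would contain anyway; here it is
    a placeholder, since the real glyph IDs depend on the font.
    """
    return f"/Span << /ActualText {actualtext_hex(word)} >> BDC\n{glyph_ops}\nEMC"


# Hypothetical glyph IDs; only the ActualText part is meaningful here.
print(wrap_span("\u0939\u093f", "<00120034> Tj"))
```

A reader that honours ActualText would use the string in the span for 
Copy/Paste and Find instead of trying to map the glyph IDs back to 
characters.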

Some points to note:

- The file size is 24662 bytes, while dev-old.pdf was 22875 bytes. Not too 
bad. Of course, a lot of that is the embedded font data; with longer 
documents that have lots of text but only a few fonts, the relative 
difference would presumably be somewhat greater.

- Copy/Paste and Search work pretty well in Acrobat Reader. Not in 
Preview.app.

- Highlighting of selected text (in Acrobat Reader) is somewhat broken, 
apparently due to the ActualText tagging (it looks better in dev-old). 
This may be fixable by tweaking exactly how the tagging is written into 
the PDF; I haven't investigated it further.
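
To put the two file sizes quoted above in proportion, a quick calculation 
(the variable names are just for this example):

```python
old_size = 22875  # dev-old.pdf, in bytes
new_size = 24662  # dev-actualtext.pdf, in bytes

extra = new_size - old_size
percent = 100 * extra / old_size

print(f"{extra} bytes extra ({percent:.1f}% larger)")
# prints "1787 bytes extra (7.8% larger)"
```

This is for a one-paragraph sample where the embedded font accounts for 
much of the file, so it should not be read as a typical figure.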


No guarantees at this point as to whether/when this feature will 
actually be available. It was just a quick attempt to hack something up, 
to see how promising the results might be...

JK

-------------- next part --------------
A non-text attachment was scrubbed...
Name: dev-old.pdf
Type: application/pdf
Size: 22875 bytes
Desc: not available
URL: <http://tug.org/pipermail/xetex/attachments/20160223/e1d8702c/attachment-0002.pdf>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: dev-actualtext.pdf
Type: application/pdf
Size: 24662 bytes
Desc: not available
URL: <http://tug.org/pipermail/xetex/attachments/20160223/e1d8702c/attachment-0003.pdf>

