[XeTeX] New feature REQUEST for xetex
Jonathan Kew
jfkthame at gmail.com
Tue Feb 23 11:06:41 CET 2016
On 23/2/16 02:54, Andrew Cunningham wrote:
> It would probably more than double, I was under the impression that
> ActualText was a tag attribute, so extensive tagging would be needed,
> and actual text added to the tags.
The ActualText tagging is highly compressible, so in practice the
increase in overall PDF size is not all that great.
>
> But the question is how to practically make use of ActualText if there
> is a visible text layer.
>
> PDF/UA for instance leaves the question deliberately ambiguous.
> ActualText is the way to make the content accessible, but developers
> creating tools for PDF do not actually have to process the ActualText.
>
> So to index and search PDF files you need to build a discovery system
> utilising tools that allow you to specify the use of ActualText in
> preference to a visible text layer.
>
Acrobat Reader uses it, if present, so that Copy/Paste from the PDF
results in the correct Unicode text (more or less), and Find behaves as
expected.
Other PDF readers (such as Apple's Preview) may well ignore the
ActualText tagging, in which case it doesn't help. I don't know whether
tools like Evince or Okular handle it.
I'm attaching two sample PDFs with a simple chunk of Hindi text (from
the Unicode web site). The first, dev-old.pdf, is what XeTeX currently
generates (using the "Annapurna SIL" OpenType font). In general,
Copy/Paste and text search don't work very well -- a few characters may
be OK, but others are junk.
The second sample, dev-actualtext.pdf, was generated with an
experimental new \XeTeXgenerateactualtext feature, which automatically
"tags" each word with an ActualText representation.
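For anyone unfamiliar with what such per-word tagging looks like inside the PDF, here is a rough Python sketch of the general shape: a marked-content sequence whose property list carries the word's Unicode text as a UTF-16BE hex string. This is only an illustration of the PDF mechanism, not necessarily byte-for-byte what the experimental XeTeX code writes; the function name and the glyph IDs in the example call are made up.

```python
def actualtext_span(word: str, show_ops: str) -> str:
    """Wrap glyph-showing operators in a /Span marked-content sequence
    that carries the word's Unicode text as /ActualText.

    Hypothetical helper for illustration only.
    """
    # A UTF-16BE PDF text string starts with the byte-order mark FE FF.
    hex_text = (b"\xfe\xff" + word.encode("utf-16-be")).hex().upper()
    return f"/Span << /ActualText <{hex_text}> >> BDC\n{show_ops}\nEMC"


# The glyph IDs in the Tj operand are invented; real ones depend on the font.
print(actualtext_span("हिंदी", "<0045002A0031> Tj"))
```

A reader that honours ActualText should take the text between BDC and EMC from the hex string rather than trying to map the glyph IDs back to characters, which is where Copy/Paste goes wrong for reordered or ligated Devanagari glyphs.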
Some points to note:
- The file size is 24662 bytes, while dev-old was 22875 bytes (an
increase of 1787 bytes, or about 8%). Not too bad. Of course, a lot of
that is the embedded font data; with longer documents that have lots of
text but only a few fonts, the relative difference would presumably be
somewhat greater.
- Copy/Paste and Search work pretty well in Acrobat Reader. Not in
Preview.app.
- Highlighting of selected text (in Acrobat Reader) is somewhat broken,
apparently due to the ActualText tagging (it looks better in dev-old).
This may be fixable by tweaking exactly how the tagging is written into
the PDF; I haven't investigated it further.
No guarantees at this point as to whether/when this feature will
actually be available. It was just a quick attempt to hack something up,
to see how promising the results might be...
JK
-------------- next part --------------
A non-text attachment was scrubbed...
Name: dev-old.pdf
Type: application/pdf
Size: 22875 bytes
Desc: not available
URL: <http://tug.org/pipermail/xetex/attachments/20160223/e1d8702c/attachment-0002.pdf>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: dev-actualtext.pdf
Type: application/pdf
Size: 24662 bytes
Desc: not available
URL: <http://tug.org/pipermail/xetex/attachments/20160223/e1d8702c/attachment-0003.pdf>