[XeTeX] potential new feature: \XeTeXgenerateactualtext

Jonathan Kew jfkthame at gmail.com
Wed Feb 24 11:06:42 CET 2016


On 24/2/16 09:22, ShreeDevi Kumar wrote:
> Testing dev-actualtext.pdf sent by JK
>
>   * Adobe Acrobat Reader XI on Windows 10
>       o Does not highlight text fully
>       o SEARCH finds words and word parts correctly but usually
>         highlights only beginning of the word containing the letter
>       o COPY paste to NOTEPAD++, OPENOFFICE WRITER works correctly,
>       o Save as TXT file does not work correctly - only saves ... in it,
>         not the actual unicode text which can be copied

So it looks like Acrobat makes use of the ActualText for Search and 
Copy, but sadly its "Save as Text" doesn't support Unicode.

I'm pleasantly surprised to see the Gmail previewer also handles it.

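The others (Foxit, Edge) sound like they're just working from the glyph 
stream, which is basically doomed to failure for a script like 
Devanagari, where glyph order and ligatures don't map straightforwardly 
back to Unicode characters.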
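
As a rough illustration of the difference (a sketch, not XeTeX's actual 
output): in the PDF content stream, ActualText is attached to a 
marked-content span around the glyphs, with the Unicode text given as a 
UTF-16BE hex string starting with the FE FF byte-order mark. The helper 
name actualtext_span and the glyph IDs below are made up for the example.

```python
def actualtext_span(text: str, glyph_hex: str) -> str:
    """Wrap a glyph-showing operator in a /Span carrying ActualText.

    `glyph_hex` stands in for whatever glyph IDs the font uses; the
    values passed below are hypothetical.
    """
    # Text strings outside PDFDocEncoding are written as UTF-16BE,
    # prefixed with the byte-order mark FE FF.
    payload = "FEFF" + text.encode("utf-16-be").hex().upper()
    return (
        f"/Span << /ActualText <{payload}> >> BDC\n"
        f"<{glyph_hex}> Tj\n"
        "EMC"
    )


# U+092F U+0942 is the first syllable of the sample word in the test PDF.
print(actualtext_span("\u092f\u0942", "00A3014F"))
```

A viewer that honours ActualText takes its text from the dictionary 
entry; one that ignores it has to guess from the glyph IDs inside the 
span.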

For a further data point, I tried Evince (Document Viewer) on Ubuntu 
15.10, and found that Copy and Search work well; it looks like it is 
using the ActualText correctly. This is thanks to the poppler library, I 
believe. The (poppler-based) "pdftotext" tool was also able to extract 
the Unicode text correctly from the PDF, although "pdftohtml" didn't do 
so well.

One issue with Evince is that drag-selecting text to highlight it (as 
for Copy/Paste) looks bad: the highlighting completely obscures the 
selected text, although it will end up being copied correctly. 
Interestingly, its highlighting of search results doesn't suffer from 
this problem, and it even makes a fair attempt (not completely accurate) 
at highlighting specific letters within a word, not just entire words.

JK


>   * Foxit Reader 7.3 on Windows 10
>       o Highlights text fully,
>       o smallest highlight unit is word,
>       o COPY paste to notepad++ as well as SEARCH does NOT work
>         correctly as Unicode text is not fully correct.
>
>             ूय
>
>             िनकोड क्या ह ? ै
>
>       o Save as TXT file does not work correctly - saves the unicode
>         text with same problems as in copy and paste
>
>   * Microsoft Edge Viewer on Windows 10
>       o Highlights text fully,
>       o COPY paste to notepad++ as well as SEARCH does NOT work
>         correctly as Unicode text is not fully correct.
>
>                     य ूिनकोड क्या है?
>
>   * Previewing from within gmail in Chrome on Windows 10 -
>       o Highlights text fully,
>       o smallest highlight unit is word,
>       o COPY paste to NOTEPAD++, OPENOFFICE WRITER works correctly,
>       o (highlights only first letter of first word in
>         paragraph यू rather than full word यूनिकोड)
>       o there is NO SEARCH feature
>       o there is no save as TXT file feature
>   * Same as above while Previewing from within gmail in Internet
>     Explorer on Windows 10
>
>
> ShreeDevi
> ____________________________________________________________
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
> On Tue, Feb 23, 2016 at 11:30 PM, Jonathan Kew <jfkthame at gmail.com
> <mailto:jfkthame at gmail.com>> wrote:
>
>     On 23/2/16 17:39, Philip Taylor wrote:
>
>         Using Akira-san's "actest.pdf" as sample, Adobe Acrobat Pro 7.1
>         allows
>         me to select only half of the text whereas Adobe Reader DC
>         allows me to
>         select it all; neither allows me to select individual kanji.
>
>
>     Ah, right... as there are no spaces between the kanji, they'll end
>     up in the same text object. That's a shortcoming of how the current
>     implementation works, for scripts that don't use inter-word spaces.
>
>     In either case, copy&paste actually gives you the whole text, even
>     though AAPro only highlights half of it, I guess?
>
>     JK
>
>
>
>
>     --------------------------------------------------
>     Subscriptions, Archive, and List information, etc.:
>     http://tug.org/mailman/listinfo/xetex
>


