[XeTeX] search Arabic text in PDF using Adobe Reader 7.0

François Charette firmicus at ankabut.net
Wed Feb 6 15:09:47 CET 2008


Jonathan Kew wrote:
> On 6 Feb 2008, at 9:00 am, François Charette wrote:
>   
>> This seems to be an issue (not only for copying but also for  
>> searching) with the font Scheherazade, which also occurs when it is  
>> typeset with plain xetex (and so is not related to your operating  
>> system or your PDF viewer). In fact, only *isolated* characters can  
>> be correctly copied or searched, the other characters come out, as  
>> you say, as "garbage" (actually as characters with code-points  
>> above U+100000, in the so-called "Supplementary Private Use Area B"  
>> of Unicode). I suppose Jonathan should be able to tell us more  
>> about this...
>>     
>
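
For anyone who wants to confirm the symptom on their own file, here is a minimal sketch (not part of any XeTeX tool) that lists the characters of a copied string falling in Supplementary Private Use Area-B, i.e. U+100000 to U+10FFFD. The function name and the sample string are purely illustrative.

```python
# Bounds of Supplementary Private Use Area-B in Unicode.
PUA_B_START = 0x100000
PUA_B_END = 0x10FFFD


def find_pua_b(text):
    """Return (index, 'U+XXXXXX') pairs for every PUA-B character in text."""
    return [
        (i, f"U+{ord(ch):06X}")
        for i, ch in enumerate(text)
        if PUA_B_START <= ord(ch) <= PUA_B_END
    ]


# Hypothetical sample: a real Arabic letter (beh) followed by one PUA-B character,
# standing in for text pasted from the PDF viewer.
sample = "\u0628\U00100123"
print(find_pua_b(sample))  # [(1, 'U+100123')]
```

An empty list means the pasted text contains no such "garbage" code points.
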
> It's actually an issue with xdvipdfmx, I think. I have just (two  
> minutes ago) fixed a bug that prevented the proper ToUnicode mappings  
> being generated for unencoded glyphs (such as contextual forms).
>
> The Linotype font worked differently because (I assume) it encodes  
> all the contextual forms in the Arabic Presentation Forms blocks, and  
> then Adobe Reader probably "knows" to map these back to the Basic  
> Arabic letters. But that whole approach is flawed, as not all  
> characters have a full set of Presentation Form codepoints; this is  
> even more obvious in the case of complex calligraphic fonts with many  
> variants. So relying on the glyphs having direct Unicode mappings in  
> the 'cmap' is inherently inadequate.
>
> xdvipdfmx tries to deal with this by generating additional ToUnicode  
> mappings from the glyph names, wherever possible, but there was a bug  
> in that code. It should work better now.
>
>   

That makes perfect sense. Thanks for that informative report and for the 
bugfix in xdvipdfmx! I'll compile the new version from svn later on. 
I'll let you know if I encounter further problems.
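
To make the idea of deriving ToUnicode mappings from glyph names concrete, here is a rough sketch of how I understand it. It is not the xdvipdfmx code. It only handles the "uniXXXX" and "uXXXX" naming conventions from Adobe's glyph naming guidelines, and it assumes (I have not checked) that the font names its contextual forms along the lines of "uni0628.init". Names that would need a glyph-list lookup are simply left unmapped.

```python
import re

# "uni" followed by one or more groups of four uppercase hex digits.
_UNI = re.compile(r"uni((?:[0-9A-F]{4})+)")
# "u" followed by four to six uppercase hex digits.
_U = re.compile(r"u([0-9A-F]{4,6})")


def _valid(cp):
    """True for code points that are in range and not surrogates."""
    return cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def glyph_name_to_text(name):
    """Map a glyph name such as 'uni0628.init' to Unicode text, or None."""
    base = name.split(".", 1)[0]      # drop a suffix such as '.init' or '.fina'
    if not base:
        return None                   # e.g. '.notdef'
    chars = []
    for part in base.split("_"):      # ligature components are joined by '_'
        m = _UNI.fullmatch(part)
        if m:
            digits = m.group(1)
            cps = [int(digits[i:i + 4], 16) for i in range(0, len(digits), 4)]
        else:
            m = _U.fullmatch(part)
            if not m:
                return None           # would need a glyph-list lookup; not covered
            cps = [int(m.group(1), 16)]
        if not all(_valid(cp) for cp in cps):
            return None
        chars.extend(chr(cp) for cp in cps)
    return "".join(chars)


print(glyph_name_to_text("uni0628.init") == "\u0628")              # True
print(glyph_name_to_text("uni0644_uni0627.fina") == "\u0644\u0627")  # True
print(glyph_name_to_text("lam_alef"))                              # None
```

The point is that an initial, medial or final form with no code point of its own can still be mapped back to the basic Arabic letter, as long as its name encodes that letter.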

> Another issue, though, is directionality (and character reordering,  
> in the case of Indic scripts); I doubt this is handled properly yet.  
> In principle, I think the only robust solution would be the use of  
> the ActualText feature in PDF, but that is not yet supported.
>   
I guess this is probably not handled correctly at present. I had never 
heard of the ActualText feature, but I have just consulted §10.8.3 of the 
PDF Reference v1.7. It is still not entirely clear to me how that relates 
to directionality... Perhaps together with /ReversedChars? Well, I 
obviously know too little about PDF internals :)
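
As far as I can tell from that section, ActualText is attached to a marked-content sequence and supplies replacement text for whatever is drawn inside it, so the logical-order string could be given independently of the order in which the glyphs are shown. Below is a small sketch, under that reading, that builds such a content-stream fragment. The function name is made up, the glyph IDs in the usage line are hypothetical, and I have not verified how any viewer actually treats the result.

```python
def actual_text_span(text, show_ops):
    """Wrap content-stream operators in a /Span marked-content sequence
    whose ActualText entry carries the logical-order Unicode text."""
    # PDF text strings may be UTF-16BE with a leading byte-order mark (FEFF);
    # here the bytes are written out as a hex string.
    hex_text = (b"\xfe\xff" + text.encode("utf-16-be")).hex().upper()
    return f"/Span << /ActualText <{hex_text}> >> BDC\n{show_ops}\nEMC"


# Hypothetical glyph IDs 0012 and 0034, drawn in visual order,
# with the logical text "beh, alef" attached as ActualText.
print(actual_text_span("\u0628\u0627", "<00120034> Tj"))
# /Span << /ActualText <FEFF06280627> >> BDC
# <00120034> Tj
# EMC
```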

FC



