[XeTeX] Ligatures and searching in PDFs

David J. Perry hospes.primus at verizon.net
Fri Jun 11 05:26:21 CEST 2010


Gareth,

Everything that Khaled said in his message is correct, particularly about 
PDFs relying on glyph names and about not using the Unicode presentation 
forms.  My comments about ligatures not having PUA assignments were written 
under the assumption that they were all correctly named (e.g., 'Th' as T_h), 
but I should have made that clear.

> In Estrangelo Edessa the joined glyphs are 'unmapped' (don't have
> Unicode code points). So, is it that they are unmapped that makes them
> unsearchable or that PDFs baulk at RTL scripts? Some Latin ligatures
> have Unicode code points at U+FB00-6. The Unicode blocks Arabic
> Presentation Forms-A and -B provide the joined forms for Arabic-based
> scripts, so it can be searched and copied from a PDF (although when
> pasted I get the isolated forms separated by spaces, which is still
> better than copying the raw joining glyphs).

What happens when you enter Arabic text (using standard Unicode fonts, 
Unicode-based word processor, etc.) is that the plain letters (= isolated 
forms) are stored in the document and OpenType features (or AAT features, on 
the Mac with an AAT font) are automatically applied by the OS to produce the 
correct display.  The presentation forms are not used.
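
The relationship between the stored letters and the presentation forms can be seen in the Unicode character data itself. The following is a minimal Python sketch using only the standard `unicodedata` module; the helper name `to_base_letters` and the sample strings are chosen here for illustration. NFKC normalization is used only to show that each presentation form is a compatibility variant of a plain letter. It is not a description of what the OS shaping engine or a PDF viewer actually does.

```python
import unicodedata


def to_base_letters(text):
    """Replace compatibility presentation forms with the plain letters they stand for."""
    return unicodedata.normalize("NFKC", text)


# Plain letters as a Unicode-based editor would store them:
# BEH (U+0628) followed by ALEF (U+0627).
stored = "\u0628\u0627"

# The same two letters spelled with presentation forms:
# BEH INITIAL FORM (U+FE91) followed by ALEF FINAL FORM (U+FE8E).
presentation = "\uFE91\uFE8E"

for ch in presentation:
    base = to_base_letters(ch)
    # Show each presentation form next to the plain letter it maps back to.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> U+{ord(base):04X} {unicodedata.name(base)}")

# True when the presentation-form spelling folds back to the stored spelling.
print(to_base_letters(presentation) == stored)
```

The same folding applies to the Latin ligatures at U+FB00-6: `to_base_letters("\uFB01")` gives the two separate letters `fi`.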

Khaled didn't think that the PDF mechanism for going from display forms back 
to basic Unicode characters via glyph names was implemented except for the 
Latin script.  Yet it seems to be implemented to some extent for Arabic, 
since you get the isolated characters when you copy and paste correctly 
formed Arabic from a PDF; without it, you would get either nothing or 
(maybe, if you had the same font used to create the document) the display 
forms.  The implementation is not perfect: otherwise you would not get the 
spaces, and without the spaces the OS should automatically produce the 
correct display for the pasted text.
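
To make the glyph-name mechanism discussed above concrete, here is a minimal Python sketch of the naming conventions in the Adobe Glyph List specification: a suffix after the first period is dropped, underscores separate ligature components, and `uniXXXX` or `uXXXX` components encode code points directly. The function names, the tiny `AGL_SUBSET` table, and the sample name `uni0712.init` are assumptions made for illustration. The real glyph list is far larger, and the actual glyph names in Estrangelo Edessa may differ. How any particular PDF viewer implements this is not shown here.

```python
import re

# Tiny stand-in for the Adobe Glyph List (assumption: the real list has
# thousands of entries; only the names needed for the examples appear here).
AGL_SUBSET = {"T": "T", "h": "h", "f": "f", "i": "i"}

_UNI = re.compile(r"uni((?:[0-9A-F]{4})+)")  # uni0712, uni07120713, ...
_U = re.compile(r"u([0-9A-F]{4,6})")         # u0712, u1F600, ...


def _is_scalar(cp):
    """True for code points that are valid Unicode scalar values."""
    return cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def component_to_text(component):
    """Map one glyph-name component to a string, or '' if it cannot be mapped."""
    if component in AGL_SUBSET:
        return AGL_SUBSET[component]
    match = _UNI.fullmatch(component)
    if match:
        digits = match.group(1)
        cps = [int(digits[i:i + 4], 16) for i in range(0, len(digits), 4)]
        return "".join(map(chr, cps)) if all(map(_is_scalar, cps)) else ""
    match = _U.fullmatch(component)
    if match and _is_scalar(int(match.group(1), 16)):
        return chr(int(match.group(1), 16))
    return ""


def glyph_name_to_text(name):
    """Recover the underlying characters from a glyph name such as 'T_h'."""
    base = name.split(".", 1)[0]  # drop a suffix such as '.init'
    return "".join(component_to_text(part) for part in base.split("_"))


print(glyph_name_to_text("T_h"))                 # the ligature named in this thread
print(repr(glyph_name_to_text("uni0712.init")))  # hypothetical Syriac display form
```

A glyph whose name follows none of these conventions maps to the empty string in this sketch, which corresponds to the "you would get nothing" case when copying from a PDF.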

Why doesn't it work for Syriac, at least to the same extent as it does for 
Arabic?  In Estrangelo Edessa, the names of the display forms are correct; I 
checked.  So it should be possible with this or other correctly-made fonts. 
It may be that Adobe has not done anything about Syriac support, since it is 
less widely used than Arabic.  There may be significant cross-platform 
issues here too.  Are you on a Mac?  I think Mac OS still uses only AAT for 
Syriac (it now can use either AAT or OT for Arabic; that's a recent 
development; Leopard or Snow Leopard, I forget which).  I don't work with 
Syriac, but I have one small bit of Syriac in a PDF made by XeTeX, and all 
the characters copied and pasted, but came out in the reverse direction (on 
Windows).  If you want to send me a better sample off-list I'll try it.

David 


