[XeTeX] Ligatures and searching in PDFs

Thu Jun 10 20:01:33 CEST 2010

On Thu, Jun 10, 2010 at 06:26:12PM +0100, Gareth Hughes wrote:
> David J. Perry wrote:
> > I am curious; are you using standard Unicode Syriac fonts?  In such
> > fonts, there is no need for, nor should there be, PUA assignments for
> > the joined shapes.  (And any font whose maker puts joined shapes
> > "somewhere that's going to spare" needs to go back to Unicode 101 and
> > learn some good practices.  There is no such place in Unicode and
> > putting one's private characters in codepoints marked reserved or used
> > for other scripts is really bad.)  I just looked at the Estrangelo
> > Edessa font and it (correctly) has no PUA assignments for other than the
> > isolated shapes.  (If you are using older fonts, created before Syriac
> > was supported in Unicode, of course there will be all sorts of
> > nonstandard things.  But we can't use those to judge whether XeTeX is
> > doing the right thing.)
> > 
> > Another fundamental question is whether Adobe even claims that rtl or
> > mixed directional text can be searched or copied correctly from a PDF. 
> > I did some googling on RTL support in PDFs and didn't really find an
> > answer.  But the overall support for RTL in PDF seems pretty spotty,
> > which is perhaps not surprising given Adobe's track record with RTL in
> > other products such as InDesign.  So the non-searchable PDFs may not be
> > the fault of XeTeX.  If you or anyone else knows the answer, please let
> > us know--I agree with you completely that it is an important issue.
> > 
> > David
> 
> Thanks, David.
> 
> In Estrangelo Edessa the joined glyphs are 'unmapped' (don't have
> Unicode code points). So, is it that they are unmapped that makes them
> unsearchable or that PDFs baulk at RTL scripts? Some Latin ligatures
> have Unicode code points at U+FB00-6. The Unicode blocks Arabic
> Presentation Forms-A and -B provide the joined forms for Arabic-based
> scripts, so it can be searched and copied from a PDF (although when
> pasted I get the isolated forms separated by spaces, which is still
> better than copying the raw joining glyphs). This would suggest that a
> similar thing would be possible for Syriac too, but, like a 'Th'
> ligature, would we need to have explicit Unicode code points for
> these to work properly? For starters, how much is possible within XeTeX,
> and how much would require mass lobbying of Adobe or the Unicode
> Consortium to make it work?

No way: Unicode explicitly states that it will not encode any new
ligatures or presentation forms. All such glyphs in Unicode were
included for round-trip compatibility with older encodings, and their
use in new texts is strongly discouraged. Arabic presentation forms
don't actually make searching or extracting Arabic text any better, and
neither do the Latin f ligatures. Adobe Reader, IIRC, has some
heuristics to guess the code point from the glyph name, so it is highly
advisable to name such glyphs in a way it recognizes. For example, a
glyph named i.sc will be copied as a regular lowercase i, and a glyph
named f_f_i will give the sequence ffi, etc., but I don't recall the
exact details offhand. This seems to be implemented for Latin script
only, and as far as my experience goes, there is no PDF generator or
reader that writes full Arabic text into PDFs or retrieves it from them.
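
To make the glyph-name idea concrete, here is a rough Python sketch of
the kind of mapping I mean. It follows the naming rules published in the
Adobe Glyph List specification as I understand them (drop everything
from the first period, split on underscores, accept uniXXXX and uXXXXXX
names); whether Adobe Reader does exactly this is an assumption on my
part. The function name and the tiny SAMPLE_GLYPH_LIST table are made up
for illustration, and the real glyph list is far larger.

```python
import re

# Hypothetical stand-in for the Adobe Glyph List; the real list has
# thousands of entries. Only what the examples need is included here.
SAMPLE_GLYPH_LIST = {"f": "f", "i": "i", "l": "l", "T": "T", "h": "h"}


def glyph_name_to_text(name, glyph_list=SAMPLE_GLYPH_LIST):
    """Guess the text a glyph stands for from its name alone."""
    # Everything from the first period onward is a variant suffix (.sc, .init).
    base = name.split(".", 1)[0]
    out = []
    # Underscores join the components of a ligature.
    for part in base.split("_"):
        if part in glyph_list:
            out.append(glyph_list[part])
        elif re.fullmatch(r"uni(?:[0-9A-F]{4})+", part):
            # "uni" followed by one or more groups of four hex digits.
            digits = part[3:]
            for i in range(0, len(digits), 4):
                out.append(chr(int(digits[i:i + 4], 16)))
        elif re.fullmatch(r"u[0-9A-F]{4,6}", part):
            # "u" followed by a single code point of four to six hex digits.
            value = int(part[1:], 16)
            if value <= 0x10FFFF:
                out.append(chr(value))
        # Components that match nothing contribute no text.
    return "".join(out)


print(glyph_name_to_text("i.sc"))   # i
print(glyph_name_to_text("f_f_i"))  # ffi
```

Under these rules a joined Syriac glyph named, say, uni0712_uni0720.fina
would map back to the two base letters U+0712 U+0720, but only if the
reader applies the rules to non-Latin names as well.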

It might be possible to coerce LuaTeX into generating ActualText tags
with some node list processing magic, but I don't yet have a clear
picture of how such an implementation would look.
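
For reference, this is roughly what would have to end up in the PDF
content stream. The Python sketch below is not LuaTeX code; it only
builds the marked-content wrapper that the PDF specification defines for
ActualText (a /Span property list opened with BDC and closed with EMC).
The function name is made up, and the glyph operators in the usage line
are placeholder values rather than output from a real font.

```python
def actual_text_span(text, glyph_ops):
    """Wrap glyph-showing operators in a /Span carrying /ActualText."""
    # A UTF-16BE PDF text string starts with the byte order mark FE FF.
    payload = b"\xfe\xff" + text.encode("utf-16-be")
    # Write it as a PDF hex string so no escaping is needed.
    hex_string = payload.hex().upper()
    return f"/Span <</ActualText <{hex_string}>>> BDC\n{glyph_ops}\nEMC"


# Two Syriac letters; "<0102> Tj" is a placeholder for the real glyph IDs.
print(actual_text_span("\u0712\u0720", "<0102> Tj"))
```

A reader that honours ActualText should then copy and search the text
given in the tag instead of guessing from the glyphs inside the span.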

Regards,
 Khaled

-- 
 Khaled Hosny
 Arabic localiser and member of Arabeyes.org team
 Free font developer