[XeTeX] Ligatures and searching in PDFs

David J. Perry hospes.primus at verizon.net
Tue Jun 8 04:00:30 CEST 2010

My understanding is that the only way to make PDFs 100% searchable is to use 
fonts that avoid the PUA altogether.  In other words, the 'Th' ligature 
would be accessible only via an OT or AAT feature such as discretionary 
ligatures; the user could not put it into a document by entering a PUA value 
because the 'Th' has none in the font.  Many font makers have included both 
PUA values and OT features in their fonts because of the lack of widespread 
support for advanced typographic features, particularly on Windows and 
Linux.  Some are now abandoning the PUA, partly because of the problem with 
searching PDFs, as support on Windows gradually gets better.  You have to 
check the particular font you want to use.
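
As a rough illustration of what "checking the font" could look like, here is
a minimal Python sketch that lists the entries of a character map that fall
in the Private Use Area. The sample map and the glyph name "T_h" are made up
for illustration; only U+E049 for 'Th' comes from this thread. A real map
would have to be read from the font with a font-inspection tool, which this
sketch does not attempt.

```python
# Unicode Private Use Area ranges: the BMP block plus the two
# supplementary private-use planes.
PUA_RANGES = [(0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]


def is_pua(codepoint):
    """Return True if the code point lies in any Private Use Area range."""
    return any(lo <= codepoint <= hi for lo, hi in PUA_RANGES)


def pua_entries(cmap):
    """Return sorted (codepoint, glyph_name) pairs of cmap that are in the PUA."""
    return sorted((cp, name) for cp, name in cmap.items() if is_pua(cp))


# Hypothetical excerpt of a font's character map (code point -> glyph name).
# The glyph names are assumptions, not taken from any real font.
sample_cmap = {0x0054: "T", 0x0068: "h", 0xFB01: "fi", 0xE049: "T_h"}

for cp, name in pua_entries(sample_cmap):
    print(f"U+{cp:04X} -> {name}")
```

If a font reports no PUA entries for its ligatures, they are presumably
reachable only through OT or AAT features, which is the situation described
above.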

XeTeX users, of course, have no problem with advanced features  :-)


----- Original Message ----- 
From: "Andy Lin" <kiryen at gmail.com>
To: <xetex at tug.org>
Sent: Monday, June 07, 2010 9:39 PM
Subject: Re: [XeTeX] Ligatures and searching in PDFs

It seems I misunderstood what exactly the TECkit mapping does. All it
does is change the input as instructed. The other "features" I'd
attributed to TECkit -- copy/paste and search compatibility -- actually
come from the PDF reader (in my case, Adobe Reader).

So, when Adobe Reader encounters an f-ligature, it knows to treat it
as 'f' and another character; these ligatures have specific Unicode
code points, and thus any program can decompose them if it needs to.
However, the 'ch' and 'Th' ligatures in Linux Libertine are in the
Private Use Area, whose code points are, by definition, non-standard,
so a PDF reader cannot anticipate them.
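
The difference can be seen with Python's standard unicodedata module. This is
only a sketch of the Unicode side of the story, not of what Adobe Reader does
internally: U+FB01 (the 'fi' ligature) carries a compatibility decomposition,
while U+E03B (the 'ch' slot used in the mapping quoted below) is a private-use
code point with no decomposition at all.

```python
import unicodedata

fi = "\uFB01"      # LATIN SMALL LIGATURE FI, a standard code point
ch_pua = "\uE03B"  # 'ch' ligature slot in the PUA, per the mapping in this thread

# The standard ligature has a name and decomposes to plain letters.
print(unicodedata.name(fi))
print(unicodedata.normalize("NFKD", fi))

# The PUA code point is category "Co" (private use) and has no
# decomposition, so normalization leaves it unchanged.
print(unicodedata.category(ch_pua))
print(unicodedata.decomposition(ch_pua) == "")
print(unicodedata.normalize("NFKD", ch_pua) == ch_pua)
```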

Now, I'm assuming it's possible to make these ligatures
copy/paste/search-able, just as it's possible to make small caps
searchable (although Charis SIL is the only one I've found that's managed
it), but TECkit is not the way to do it. All TECkit does is take the
input, modify it based on the mapping, and pass the result to the
font/type engine without any additional information.
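
To make that concrete, here is a small Python simulation of the effect. It is
not TECkit and does not use its API; it only mimics the substitution, using
the two code points from the mapping quoted below, and assumes the text a
reader later recovers is the mapped code points.

```python
# The two substitutions from the mapping quoted in this thread.
MAPPING = {"ch": "\uE03B", "Th": "\uE049"}


def apply_mapping(text):
    """Replace each mapped letter pair with its PUA code point."""
    for src, dst in MAPPING.items():
        text = text.replace(src, dst)
    return text


mapped = apply_mapping("The church")

# The original letters are gone; only PUA code points remain in their place.
print([f"U+{ord(c):04X}" for c in mapped])

# A plain-text search for the original letters no longer matches.
print("ch" in mapped)
print("Th" in mapped)
```

Nothing in the mapped string records that U+E03B stood for 'c' followed by
'h', which is the missing "additional information".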

The TECkit mapping worked for the fonts I mentioned in my previous
post because they had the ligatures at both the standard Unicode
code point and in the PUA, but, for whatever reason, had their ligature
tables point to the PUA glyph. At least, I think that's what was
happening.

If I am mistaken, please correct me.

-Andy Lin

> I had noticed that the ligatures 'ch' and 'Th' are not searchable in
> Linux Libertine. I added the following mappings:
> U+0063 U+0068 <> U+E03B ; ch -> ch ligature
> U+0054 U+0068 <> U+E049 ; Th -> Th ligature
> But these do not make it possible to search or copy/paste as uncompiled.
> The .tec file is compiled correctly and XeTeX finds it. Any thoughts?
