[XeTeX] Ligatures and searching in PDFs

Andy Lin kiryen at gmail.com
Tue Jun 1 19:19:18 CEST 2010


Sorry to revive this topic, but I think I've found a solution.

The original post described a problem with the rare ligatures
(e.g. "fty") in the Junicode font: text set with them could not be
found by searching for the individual letters. At the time, it was
suggested that the /ActualText PDF feature would be useful, but no
implementation was given.

I'll save the details of how I stumbled onto the solution for another
time, but here's the result:

There are two ways to go about this: font encoding and text mapping.
If you have any Adobe OpenType fonts, you might have noticed that the
ffi and ffl ligatures can be copied from a PDF intact, but the fi and
fl ligatures show up as ??. If you use Latin Modern, on the other
hand, you will not run into this problem, because the font tables in
LM were done properly.

If your font does not have the proper tables, you can supplement them
with a TECkit mapping. These mappings are quite powerful. (I posted in
Sept '09 about using them for Inuktitut syllabary-romanization
conversion, and I've also used them for Persian script-transliteration
conversion.) You've probably used Mapping=tex-text at some point, and
the solution I'm proposing only requires you to add a few lines to the
tex-text.map file and compile it (you may wish to make a copy and
change that instead).

When you open the tex-text.map file (in \fonts\misc\xetex\fontmapping
for MiKTeX Portable), you'll see mappings from sequences of input
characters to single Unicode characters, for example:
; ligatures from Knuth's original CMR fonts
U+002D U+002D			<>	U+2013	; -- -> en dash
U+002D U+002D U+002D	<>	U+2014	; --- -> em dash
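
To make the rule format concrete, here is a small Python sketch that
reads lines of the simple form shown above (code points, "<>", code
point, optional ";" comment) and turns them into pairs of strings. The
function name parse_rule is made up for illustration, and the sketch
handles only this simple form, not the rest of the TECkit mapping
language.

```python
import re

def parse_rule(line):
    """Parse one 'U+XXXX ... <> U+XXXX ; comment' line into (lhs, rhs).

    Returns None for blank lines and comment-only lines.
    (Hypothetical helper; covers only the simple rule form.)
    """
    body = line.split(";", 1)[0].strip()  # drop the trailing ';' comment
    if not body:
        return None
    left, right = body.split("<>")

    def to_text(side):
        # Turn every U+XXXX token on this side into the character it names.
        codes = re.findall(r"U\+([0-9A-Fa-f]{4,6})", side)
        return "".join(chr(int(cp, 16)) for cp in codes)

    return to_text(left), to_text(right)

rules = [
    "; ligatures from Knuth's original CMR fonts",
    "U+002D U+002D\t\t\t<>\tU+2013\t; -- -> en dash",
    "U+002D U+002D U+002D\t<>\tU+2014\t; --- -> em dash",
]
parsed = [r for r in map(parse_rule, rules) if r is not None]
for lhs, rhs in parsed:
    print(repr(lhs), "<>", "U+%04X" % ord(rhs))
```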

In order to make the common f/ff ligatures searchable in PDFs, add the
following lines and compile the map file with teckit_compile (should
be in the bin folder):
U+0066 U+0066	<>	U+FB00	; ff -> ff ligature
U+0066 U+0069	<>	U+FB01	; fi -> fi ligature
U+0066 U+006C	<>	U+FB02	; fl -> fl ligature
U+0066 U+0066 U+0069	<>	U+FB03	; ffi -> ffi ligature
U+0066 U+0066 U+006C	<>	U+FB04	; ffl -> ffl ligature
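
As a rough model of what these five rules do to the text, here is a
Python sketch. It assumes longer sequences are tried before shorter
ones (so "ffi" wins over "ff" followed by "i"); that ordering and the
name apply_ligatures are my own assumptions for illustration, not
something taken from TECkit itself. The last step only shows that the
Unicode ligature characters decompose back to plain letters under
NFKC normalization; whether a given PDF viewer does the equivalent
when searching depends on the viewer.

```python
import unicodedata

# The five rules from the message: plain letters -> Unicode ligature.
LIGATURES = {
    "ff": "\uFB00",
    "fi": "\uFB01",
    "fl": "\uFB02",
    "ffi": "\uFB03",
    "ffl": "\uFB04",
}

def apply_ligatures(text, rules=LIGATURES):
    """Replace letter sequences with ligatures, longest sequence first.

    (Hypothetical helper; the longest-first order is an assumption.)
    """
    keys = sorted(rules, key=len, reverse=True)
    out, i = [], 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(rules[key])
                i += len(key)
                break
        else:
            # No rule matched at this position: copy the character as-is.
            out.append(text[i])
            i += 1
    return "".join(out)

encoded = apply_ligatures("office file shuffle")
# Compatibility normalization maps each ligature back to its letters.
decoded = unicodedata.normalize("NFKC", encoded)
print([hex(ord(c)) for c in encoded if ord(c) > 0x7F])
print(decoded)
```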

I've attached such a map file and the resulting tec file for those who
aren't interested in the nitty-gritty. Simply drop these into the
fonts\misc\xetex\fontmapping folder and run texhash/mktexlsr.

BTW, when you use this TECkit mapping for ligatures, it bypasses the
OpenType ligature setting, i.e. you can't turn the ligatures off
unless you use a different mapping. It also won't check whether your
font has the required glyphs. However, it does let you easily access
ligatures in fonts that don't have an OT ligature table (e.g. Times
New Roman and Georgia, which is why I made the map file in the first
place).

Hope someone will find this useful.

-Andy Lin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: tex-text-ms.map
Type: application/octet-stream
Size: 1043 bytes
Desc: not available
URL: <http://tug.org/pipermail/xetex/attachments/20100601/be30fde2/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: tex-text-ms.tec
Type: application/octet-stream
Size: 458 bytes
Desc: not available
URL: <http://tug.org/pipermail/xetex/attachments/20100601/be30fde2/attachment-0001.obj>