[XeTeX] Ligatures and searching in PDFs

Ross Moore ross.moore at mq.edu.au
Tue Jun 8 07:07:19 CEST 2010


Hi Andy,

On 08/06/2010, at 11:39 AM, Andy Lin wrote:

> It seems I misunderstood what exactly the TECkit mapping does. All it
> does is change the input as instructed. All the other "features" that
> I'd attributed to TECkit (copy/paste and search compatibility) are
> actually those of the PDF reader (in my case, Adobe Reader).
>
> So, when Adobe Reader encounters the f-ligature, it knows to treat it
> as 'f' and another character; they have specific Unicode code points
> and thus any program can decompose them if they need to. However, the
> 'ch' and 'Th' ligatures in Linux Libertine are in the Private Use
> Area, which are, by definition, non-standard, so they cannot be
> anticipated by a PDF reader.

Yes, that is true.
However, PDF has two separate mechanisms to overcome this.

  1.  a CMap resource for the font
  2.  the /ActualText  tagging construction
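
The difference Andy describes can be checked against the Unicode
character database. This is only a sketch using Python's standard
unicodedata module; the helper name  describe  is made up for
illustration, and it says nothing about what any particular PDF reader
actually does internally.

```python
import unicodedata

def describe(code_point):
    """Report whether a code point has a compatibility decomposition."""
    ch = chr(code_point)
    decomposed = unicodedata.normalize("NFKD", ch)
    return {
        "category": unicodedata.category(ch),   # 'Co' means private use
        "decomposes": decomposed != ch,
        "text": decomposed,
    }

# U+FB01 and U+FB04 are standard ligature code points; U+E03A is in the PUA.
for cp in (0xFB01, 0xFB04, 0xE03A):
    print("U+%04X" % cp, describe(cp))
```

The standard ligatures decompose to plain letters, whereas a PUA code
point carries no such information, so the letters have to be supplied
by the PDF itself.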

>
> Now, I'm assuming it's possible to make these ligatures
> copy/paste/search-able, just as it's possible to make small caps
> searchable (although Charis SIL is the only I've found that's managed
> it), but TECkit is not the way to do it. All TECkit does is take the
> input, modify it based on the mapping, and pass the result to the
> font/type engine without any additional information.

That seems to be accurate.

>
> The reason why the TECkit mapping worked for the fonts I mentioned in
> my previous post is because they had the ligatures at both the
> standard Unicode codepoint and in the PUA, but for whatever reason,
> had their ligature tables point to the PUA glyph. At least, I think
> that's what was happening.


Concerning method 1.  CMap resources:

With "Linux Libertine O" a CMap is created on the fly, covering just the
characters that are used in the document.
For example, for the following text:
      "Play in the field; riffle the deck."
(11 letters + 3 ligatures + 2 punctuation marks)
the CMap generated by a XeTeX run is:

>> /CIDInit /ProcSet findresource begin
>> 12 dict begin
>> begincmap
>> /CMapName /LinLibertineO/H/65536/0,000-UTF16 def
>> /CMapType 2 def
>> /CIDSystemInfo <<
>>   /Registry (Adobe)
>>   /Ordering (UCS)
>>   /Supplement 0
>> >> def
>> 1 begincodespacerange
>> <0000> <FFFF>
>> endcodespacerange
>> 17 beginbfchar
>> <000F> <002E>
>> <0012> <0031>
>> <001C> <003B>
>> <0031> <0050>
>> <0042> <0061>
>> <0045> <0064>
>> <0046> <0065>
>> <0049> <0068>
>> <004A> <0069>
>> <004D> <006C>
>> <004F> <006E>
>> <0053> <0072>
>> <0055> <0074>
>> <005A> <0079>
>> <08A2> <E03A>
>> <0977> <FB01>
>> <097A> <FB04>
>> endbfchar
>> endcmap
>> CMapName currentdict /CMap defineresource pop
>> end
>> end


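Beware that '1' also occurs, as the page number; that accounts for
the entry <0012> <0031>, making 17 in all.

See how the 3 ligature glyphs are each mapped to a single code point:
ck goes into the PUA, while fi and ffl go to the standard ligature
code points U+FB01 and U+FB04.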

>> <08A2> <E03A>
>> <0977> <FB01>
>> <097A> <FB04>

For searchability, these really should be:

>> <08A2> <0063006B>
>> <0977> <00660069>
>> <097A> <00660066006C>

for the  ck , fi and  ffl  ligatures respectively.
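
As a minimal sketch of the kind of override I am asking about (not
something  xdvipdfmx  is known to offer), the following Python rewrites
bfchar lines so that chosen glyph IDs point at letter sequences encoded
as UTF-16BE. The function names and the replacement table are
hypothetical; it works on the text of the entries only, not on a PDF file.

```python
import re

# One bfchar entry: a 4-digit glyph ID followed by its Unicode target.
BFCHAR = re.compile(r"<([0-9A-Fa-f]{4})>\s*<([0-9A-Fa-f]+)>")

def utf16be_hex(text):
    """Hex digits of the UTF-16BE encoding of text, e.g. 'ck' -> '0063006B'."""
    return text.encode("utf-16-be").hex().upper()

def patch_bfchar(lines, replacements):
    """Return lines with the targets of the listed glyph IDs replaced.

    replacements maps a glyph ID (hex string such as '08A2') to the
    letters that the glyph stands for.
    """
    patched = []
    for line in lines:
        match = BFCHAR.search(line)
        glyph = match.group(1).upper() if match else None
        if glyph in replacements:
            patched.append("<%s> <%s>" % (glyph, utf16be_hex(replacements[glyph])))
        else:
            patched.append(line)
    return patched

entries = ["<005A> <0079>", "<08A2> <E03A>", "<0977> <FB01>", "<097A> <FB04>"]
ligatures = {"08A2": "ck", "0977": "fi", "097A": "ffl"}
print("\n".join(patch_bfchar(entries, ligatures)))
```

Patching a real PDF would also mean dealing with stream lengths and
compression, which this sketch leaves out entirely.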

I don't know where that CMap resource is being constructed.
Presumably it is by  xdvipdfmx  as it subsets the font
for inclusion. Presumably it is getting information from
the complete font itself.
Is there a way to override some entries and get those
ligatures pointing to letter combinations?
Again, I don't know. Maybe someone else can comment.



Concerning method 2.  /ActualText tagging:

Here is an example document that demonstrates how it
  a.  does work with pdfTeX
but
  b.  produces broken PDFs with XeTeX + xdvipdfmx.


>>> \documentclass[11pt]{article}
>>> \usepackage{geometry}                % See geometry.pdf to learn the layout options. There are lots.
>>> \geometry{letterpaper}                   % ... or a4paper or a5paper or ...
>>>
>>> \usepackage{ifxetex,ifpdf}
>>>
>>> \ifxetex
>>> \usepackage{xltxtra}
>>> %\setmainfont{Charis SIL}
>>> \setmainfont{Linux Libertine O}
>>>
>>> \newcommand{\XetexActualText}[2]{%
>>>  \special{pdf:literal BT /Span <</ActualText<#2>>> BDC}#1\special{pdf:literal  EMC ET}}
>>>
>>> \newcommand{\FFI}{\XetexActualText{ffi}{006600660069}}
>>> \newcommand{\FF}{\XetexActualText{ff}{00660066}}
>>> \newcommand{\FI}{\XetexActualText{fi}{00660069}}
>>> \newcommand{\FFL}{\XetexActualText{ffl}{00660066006c}}
>>> \newcommand{\FL}{\XetexActualText{fl}{0066006c}}
>>> \newcommand{\CK}{\XetexActualText{ck}{0063006b}}
>>> \fi
>>>
>>> \ifpdf
>>> \pdfcompresslevel 0
>>> \newcommand{\PDFTeXActualText}[2]{%
>>>  \pdfliteral direct {/Span<</ActualText<#2>>> BDC}#1\pdfliteral direct {EMC}}
>>>
>>> \newcommand{\FFI}{\PDFTeXActualText{ffi}{006600660069}}
>>> \newcommand{\FF}{\PDFTeXActualText{ff}{00660066}}
>>> \newcommand{\FI}{\PDFTeXActualText{fi}{00660069}}
>>> \newcommand{\FFL}{\PDFTeXActualText{ffl}{00660066006C}}
>>> \newcommand{\FL}{\PDFTeXActualText{fl}{0066006C}}
>>> \newcommand{\CK}{\PDFTeXActualText{ck}{0063006b}}
>>>
>>> \fi
>>>
>>> \begin{document}
>>>
>>> Play in the {\FI}eld; ri{\FFL}e the de\CK.
>>>
>>> Play in the field; riffle the deck.
>>>
>>>
>>> \end{document}
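
The hex arguments passed to the macros above are the UTF-16BE code
units of the replacement text. A small sketch for generating them
follows; the function name  actual_text_hex  is made up. One assumption
to flag: the PDF specification describes Unicode text strings as
starting with the byte-order mark FEFF, which the macros in this
message omit, so the helper can emit either form. Whether a given
reader insists on the mark is not something this sketch settles.

```python
def actual_text_hex(text, bom=True):
    """Hex string for an /ActualText value, as UTF-16BE code units.

    With bom=True the result starts with FEFF, the byte-order mark that
    the PDF specification uses to flag a Unicode text string.
    """
    encoded = text.encode("utf-16-be").hex().upper()
    return "FEFF" + encoded if bom else encoded

# The ligatures used in the example document.
for letters in ("ff", "fi", "fl", "ffi", "ffl", "ck"):
    print(letters, actual_text_hex(letters, bom=False), actual_text_hex(letters))
```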



When processed by XeTeX this file produces a PDF that is readable
in Apple's Preview, and also in Adobe Reader and Acrobat Pro.

However, Acrobat Pro reports the content stream to be malformed.
It looks as follows:

stream
  q 1 0 0 1 72 720 cm 0 G 0 g
  BT /F1 10.909 Tf 36.74 -34 Td
  [<0031>-11<004d0042005a>-250<004a004f>-250<005500490046>]TJ ET
  1 0 0 1 86.24 -34 cm
  BT /Span <</ActualText<00660069>>> BDC
  1 0 0 1 -86.24 34 cm
  BT /F1 10.909 Tf 86.24 -34 Td[<0977>]TJ ET
  1 0 0 1 92.2 -34 cm
  EMC ET
  1 0 0 1 -92.2 34 cm
  BT /F1 10.909 Tf 92.2 -34 Td
  [<0046004d0045001c>-249<0053004a>]TJ ET
  1 0 0 1 117.09 -34 cm
  BT /Span <</ActualText<00660066006c>>> BDC
  1 0 0 1 -117.09 34 cm
  BT /F1 10.909 Tf 117.09 -34 Td[<097a>]TJ ET
  1 0 0 1 125.65 -34 cm
  EMC ET
  1 0 0 1 -125.65 34 cm
  BT /F1 1 ...

Note how there is
     ... BT ... ET ... BT /Span ... BDC ... BT ... ET ... EMC ET ...
when it really should be nested like:
     ... BT ... /Span ... BDC ... EMC ... ET ...
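
The nesting problem can be detected mechanically. Here is a rough
sketch in Python, under simplifying assumptions: it looks only at the
BT/ET and BMC/BDC/EMC operators, strips hex strings first, and ignores
literal strings and inline images. The name  check_nesting  is my own,
not part of any PDF tool.

```python
import re

HEX_STRING = re.compile(r"<[0-9A-Fa-f\s]*>")           # e.g. <0977>
OPERATOR = re.compile(r"(?<![/\w])(BT|ET|BMC|BDC|EMC)(?!\w)")

def check_nesting(stream):
    """Return a list of nesting problems among text objects and marked content."""
    problems = []
    stack = []
    cleaned = HEX_STRING.sub(" ", stream)   # hex digits must not look like operators
    for match in OPERATOR.finditer(cleaned):
        op = match.group(1)
        if op == "BT":
            if "BT" in stack:
                problems.append("BT inside an open text object")
            stack.append("BT")
        elif op in ("BMC", "BDC"):
            stack.append("MC")
        else:
            expected = "BT" if op == "ET" else "MC"
            if stack and stack[-1] == expected:
                stack.pop()
            else:
                problems.append(op + " does not close the innermost open construct")
    if stack:
        problems.append("unclosed constructs at end of stream")
    return problems

bad = ("BT /F1 10.909 Tf [<0031>]TJ ET BT /Span <</ActualText<00660069>>> BDC "
       "BT /F1 10.909 Tf [<0977>]TJ ET EMC ET")
good = ("BT /F1 10.909 Tf [<0031>]TJ /Span <</ActualText<00660069>>> BDC "
        "[<0977>]TJ EMC ET")
print(check_nesting(bad))
print(check_nesting(good))
```

The first string mimics the pattern that XeTeX + xdvipdfmx produced
above; the second has the marked content properly inside one text object.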

If the macro definition is changed to:

>>> \newcommand{\XetexActualText}[2]{%
>>>  \special{pdf:literal /Span <</ActualText<#2>>> BDC}#1\special{pdf:literal  EMC}}

then the PDF content stream is still malformed.
So much so that Adobe software will not show anything,
even though Apple software does produce a display.

In neither case, using XeTeX, does Copy/Paste respect the /ActualText.


So my conclusion is that  xdvipdfmx  does not provide a way to put
tagging directly into the content stream, which is what would allow
/ActualText (and other forms of tagging) to be used.

pdfTeX, on the other hand, does allow this to some extent.
That is, /ActualText works in some situations.
Other kinds of tagging are more delicate, requiring a specially
modified version of pdfTeX that has extra primitives.

I gave a talk on this at the TUG 2009 meeting last year,
and will be giving another at TUG 2010 a few weeks from now.


>
> If I am mistaken, please correct me.

You are not mistaken in that XeTeX cannot use /ActualText
at present, unless there have been some recent developments
in  XeTeX  or  xdvipdfmx  of which I am not aware.
(That's quite possibly the case.)

Where you would be mistaken is in thinking it cannot be done:
what you want is certainly doable, so far as the PDF
specifications are concerned.

>
> -Andy Lin
>
>> I had noticed that the ligatures 'ch' and 'Th' are not searchable in
>> Linux Libertine. I added the following mappings:
>> U+0063 U+0068   <>      U+E03B  ; ch -> ch ligature
>> U+0054 U+0068   <>      U+E049  ; Th -> Th ligature
>> But these do not make it possible to search or copy/paste as uncompiled.
>> The .tec file is compiled correctly and XeTeX finds it. Any thoughts?



Hope this helps,

	Ross

------------------------------------------------------------------------
Ross Moore                                       ross.moore at mq.edu.au
Mathematics Department                           office: E7A-419
Macquarie University                             tel: +61 (0)2 9850 8955
Sydney, Australia  2109                          fax: +61 (0)2 9850 8114
------------------------------------------------------------------------




