[XeTeX] Ligatures and searching in PDFs

Gareth Hughes garzohugo at gmail.com
Thu Jun 10 00:09:43 CEST 2010


Ross Moore wrote:
> However, PDF has two separate mechanisms to overcome this.
> 
>  1.  a CMap resource for the font
>  2.  the /ActualText  tagging construction
> 
> Concerning method 1.  CMap resources:
> 
> I don't know where that CMap resource is being constructed.
> Presumably it is by  xdvipdfmx  as it subsets the font
> for inclusion. Presumably it is getting information from
> the complete font itself.
> Is there a way to override some entries and get those
> ligatures pointing to letter combinations?
> Again, I don't know. Maybe someone else can comment.
> 
> Concerning method 2.  /ActualText tagging:
> 
> Here is an example document that demonstrates how it
>  a.  does work with pdfTeX
> but
>  b. produces broken PDFs with XeTeX + xdvipdfmx .
> 
> When processed by XeTeX this file produces a PDF that is readable
> in both Apple's Preview, and in Adobe Reader and Acrobat Pro.
> 
> However, Acrobat Pro reports the content stream to be malformed.
> In neither case, using XeTeX, does  Copy/Paste respect the  /ActualText .
> 
> So my conclusion is that  xdvipdfmx  does not provide the method
> to put tagging directly into the content stream, thereby allowing
> /ActualText --- and other forms of tagging --- to be used.
> 
> pdfTeX, on the other hand, does allow this to some extent.
> That is, /ActualText works in some situations.
> Other kinds of tagging are more delicate, requiring an especially
> modified version of pdfTeX having extra primitives.
> 
> I gave a talk on this at the TUG 2009 meeting last year,
> and will be giving another at TUG 2010 a few weeks from now.
> 
> You are not mistaken in that XeTeX cannot use /ActualText
> at present --- unless there have been some recent developments
> to  XeTeX  or  xdvipdfmx  of which I am not aware.
> (That's quite possibly the case.)
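
For anyone following along who has not seen method 2 in practice, here
is a minimal sketch of what /ActualText tagging can look like under
pdfLaTeX. It is not Ross's example document. The macro name \actualtext
is hypothetical, the replacement text is assumed to be plain ASCII
without parentheses or backslashes, and nothing here is expected to
work with XeTeX + xdvipdfmx.

```latex
% Minimal sketch for pdfLaTeX only; not expected to work with
% XeTeX + xdvipdfmx.
\documentclass{article}
\usepackage[T1]{fontenc}

% \actualtext{<replacement text>}{<typeset material>} is a hypothetical
% helper. It wraps the typeset material in a marked-content span whose
% /ActualText entry is the text a viewer should use for copy and search.
% The replacement text is written into a PDF literal string unescaped,
% so it is assumed to contain only plain ASCII and no ( ) or \ .
\newcommand*{\actualtext}[2]{%
  \pdfliteral direct{/Span <</ActualText (#1)>> BDC}%
  #2%
  \pdfliteral direct{EMC}%
}

\begin{document}
The word \actualtext{office}{office} contains an ffi ligature.
\end{document}
```

Whether a given viewer honours the span on copy and paste varies, and
the accsupp package provides a more careful interface to the same PDF
construction (with proper string escaping) for real documents.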

Obviously, it is important that we are able to produce PDFs that are
searchable and that allow us to copy and paste plain text from them. We
could restrict ourselves to 'safe' ligatures, but that only sidesteps
the problem.

What is more, I do a lot of work with Syriac, a cursive script for which
most joined shapes are encoded in the Private Use Area (PUA) or at
whatever code points happen to be going spare. This means that my XeTeX
PDFs aren't searchable or copyable in Syriac: only one or two Syriac
letters per word can be searched or copied.

Is this issue considered a priority for the future development of
XeTeX? Given XeTeX's speciality in working with Unicode and OpenType,
I feel it is an important one.

Gareth.


