<html>

  <head>

    <meta content="text/html; charset=ISO-8859-1"

      http-equiv="Content-Type">

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    <div class="moz-cite-prefix">It's all in the font, really. If an OpenType (OT)

      substitution results in a character from the font's PUA being

      inserted in the character stream (except for a few standard

      ligatures), then the result will be broken searches. Because of

      this, modern fonts (including those from Adobe) are avoiding the

      PUA and placing the targets of OT substitutions in unencoded slots

      with names that enable searches (like "Q.alt", "Q_u").<br>
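      <br>
      As a rough illustration of why such names keep text searchable, here
      is a simplified Python sketch of decoding a glyph name back to
      characters. It loosely follows the Adobe glyph naming convention:
      drop any suffix after the first period, then split ligature
      components on underscores. The function name glyph_name_to_text is
      hypothetical, and only single-character and "uniXXXX" components are
      handled; a real implementation would also consult the full Adobe
      Glyph List.<br>
      <pre>
```python
import re

def glyph_name_to_text(name):
    """Simplified sketch: decode a glyph name such as "Q_u" or "Q.alt" to text.

    Returns None when a component would need the full Adobe Glyph List.
    """
    base = name.split(".", 1)[0]      # "Q.alt" becomes "Q"
    parts = base.split("_")           # "Q_u" becomes ["Q", "u"]
    out = []
    for part in parts:
        m = re.fullmatch(r"uni((?:[0-9A-F]{4})+)", part)
        if m:
            digits = m.group(1)
            # Each group of four hex digits is one code point.
            for i in range(0, len(digits), 4):
                out.append(chr(int(digits[i:i + 4], 16)))
        elif len(part) == 1:
            # Single letters and digits are named after themselves.
            out.append(part)
        else:
            return None
    return "".join(out)

for glyph in ("Q.alt", "Q_u", "f_f_b"):
    print(glyph, glyph_name_to_text(glyph))
```
</pre>
      A PUA code point carries no such hint, so a PDF reader has nothing
      to decode.<br>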

      <br>

      My advice is to seek out fonts that avoid the PUA as much as

      possible (at least for standard features such as the lookups in the

      "liga" feature), and lobby the makers of fonts such as Libertine to

      start avoiding it as well. In the Libertine "liga" feature, the pair

      "Qu" produces a ligature named "Q_u" that is at location U+E048 in

      the PUA. I see a number of other ligatures in that section of the

      PUA as well: fb, ffb, ffh, ffj, ffk, fft, fh, fj, fk, ft and so

      on. The result is a great many very nice looking PDFs that can't

      be searched reliably.<br>
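      <br>
      For PDFs that already exist, one possible workaround is to
      post-process the extracted text. A minimal Python sketch, assuming a
      hand-made table PUA_TO_TEXT (hypothetical; the only entry taken from
      this thread is U+E048 for "Qu", and any other entries would have to
      be read from the font itself):<br>
      <pre>
```python
# Hypothetical lookup table: private-use code point to the text it stands for.
# Only U+E048 ("Qu" in Libertine) is taken from this thread.
PUA_TO_TEXT = {0xE048: "Qu"}

def restore_ligatures(text, table=PUA_TO_TEXT):
    """Replace known private-use code points with ordinary characters."""
    # str.translate accepts a dict mapping code points to replacement strings.
    return text.translate(table)

print(restore_ligatures("\ue048antitative"))
```
</pre>
      This only helps with extracted text; searching inside the PDF viewer
      itself stays broken.<br>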

      <br>

      Peter<br>

      <br>

      On 10/14/2012 06:04 PM, Andrew Cunningham wrote:<br>

    </div>

    <blockquote

cite="mid:CAGJ7U-VGR1EJQXNeCXdTVeAsquHR86L+oDwAeRzRcgLJTVU9qg@mail.gmail.com"

      type="cite">

      <p>This is the nature of the PDF format. It is a preprint format

        that focuses on glyphs rather than characters.</p>

      <p>It partly depends on the font, and the OT features being used.

      </p>

      <p>In theory you can have ActualText in the PDF, but once you move

        to complex scripts all bets are off. Without a complete rewrite

        of the PDF standard, fidelity to the text is not really

        possible. The PDF format wasn't designed to do it.</p>

      <p>The way we use PDFs is well outside the design parameters of

        the format.</p>

      <p>It is possible to extract text, but even in the best case,

        post-processing would be needed to reorder characters in some

        complex scripts. <br>

      </p>

      <p>Andrew</p>

      <div class="gmail_quote">On Oct 15, 2012 7:57 AM, "Peter Dyballa"

        <<a moz-do-not-send="true" href="mailto:Peter_Dyballa@web.de">Peter_Dyballa@web.de</a>>

        wrote:<br type="attribution">

        <blockquote class="gmail_quote" style="margin:0 0 0

          .8ex;border-left:1px #ccc solid;padding-left:1ex">

          <br>

          Am 14.10.2012 um 16:30 schrieb Joe Corneli:<br>

          <br>

          > However, if I extend the MWE there slightly, I can find

          "prefix", but<br>

          > not "quantitative".  (My PDF reader is Evince on Ubuntu

          12.04.)<br>

          <br>

          The capital Q is not what you see… GNU Emacs tells me:<br>

          <br>

                              character:  (displayed as ) (codepoint

          57416, #o160110, #xe048)<br>

                      preferred charset: unicode (Unicode (ISO10646))<br>

                  code point in charset: 0xE048<br>

          <br>

          The code point is in the PUA, Private Use Area. I used

          pdftotext version 0.20.4 to extract the text.<br>
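          <br>
          A quick way to check extracted text for such code points, as a
          minimal Python sketch. The helper name find_private_use is
          hypothetical; it relies on the Unicode general category "Co",
          which covers the PUA as well as the two supplementary
          private-use planes.<br>
          <pre>
```python
import unicodedata

def find_private_use(text):
    """Return (offset, 'U+XXXX') for every private-use character in text."""
    return [(i, "U+%04X" % ord(ch))
            for i, ch in enumerate(text)
            if unicodedata.category(ch) == "Co"]

# The extracted word from this thread: U+E048 followed by "antitative".
print(find_private_use("\ue048antitative"))
```
</pre>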

          <br>

          When I use pdftohtml version 0.20.4 to extract the text and

          create HTML files, I see in OmniWeb the word: î ˆantitative…<br>

          <br>

          --<br>

          Greetings<br>

          <br>

            Pete<br>

          <br>

          Got Mole problems?<br>

          Call Avogadro 6.02 x 10^23<br>

          <br>

          <br>

          <br>

          <br>

          --------------------------------------------------<br>

          Subscriptions, Archive, and List information, etc.:<br>

            <a moz-do-not-send="true"

            href="http://tug.org/mailman/listinfo/xetex" target="_blank">http://tug.org/mailman/listinfo/xetex</a><br>

        </blockquote>

      </div>

      <br>


    </blockquote>

    <br>

  </body>

</html>