[XeTeX] how to do (better) searchable PDFs in xelatex?

Mon Oct 15 21:54:41 CEST 2012

Hi Peter, Jonathan,

On 16/10/2012, at 2:02, Peter Baker <psb6m at virginia.edu> wrote:

> On 10/15/12 10:59 AM, Jonathan Kew wrote:
>> 
>> That's exactly the problem - these glyphs are encoded at PUA codepoints, so that's what (most) tools will give you as the corresponding character data. If they were unencoded, (some) tools would use the glyph names to infer the relevant characters, which would work better.
>> 
>>> Small caps are named like "a.sc" and they are unencoded.
>> And as they're unencoded, (some) tools will look at the glyph name and map it to the appropriate character.
> 
> I've been trying to explain this:  but Jonathan does it much better than I did, and with more authority.

Yes, but why would the tools be designed this way?
Surely unencoded means that the code-point has not been assigned yet, and may be assigned in future. So using these is asking for trouble.
Was not the intention of the PUA to be the place to put characters that you need now, but which have no corresponding Unicode code point? This is precisely where using the glyph name should work. Or am I missing something?

So why would the tool be designed to infer the right composition of characters when a ligature is properly named at an unencoded point, but that same algorithm is not used when it is at a PUA point?
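To make concrete what "infer from the glyph name" means, here is a minimal Python sketch of the kind of name parsing Jonathan describes. It loosely follows the convention in the Adobe Glyph List specification (drop any suffix after the first period, split ligature names on underscores, then decode each component). The table AGL_SUBSET and the function glyph_name_to_text are names made up for this illustration, the table is a tiny stand-in for the real glyph list, and real tools may well differ in detail.

```python
import re

# Hypothetical, tiny stand-in for the Adobe Glyph List (the real list is far larger).
AGL_SUBSET = {"a": "a", "f": "f", "i": "i", "l": "l", "s": "s", "t": "t"}

def glyph_name_to_text(name):
    """Guess the character string for a glyph from its name alone."""
    # Drop any suffix after the first period, e.g. "a.sc" -> "a".
    base = name.split(".", 1)[0]
    pieces = []
    # Ligature names join their components with underscores, e.g. "f_f_i".
    for comp in base.split("_"):
        if comp in AGL_SUBSET:
            pieces.append(AGL_SUBSET[comp])
        elif re.fullmatch(r"uni(?:[0-9A-F]{4})+", comp):
            # "uniXXXX" with one or more groups of four uppercase hex digits.
            digits = comp[3:]
            pieces.append("".join(chr(int(digits[i:i + 4], 16))
                                  for i in range(0, len(digits), 4)))
        elif re.fullmatch(r"u[0-9A-F]{4,6}", comp):
            # "uXXXX" to "uXXXXXX": a single code point.
            pieces.append(chr(int(comp[1:], 16)))
        # Simplification: any other component contributes nothing.
    return "".join(pieces)

print(glyph_name_to_text("a.sc"))   # small-cap a resolves to plain "a"
print(glyph_name_to_text("f_f_i"))  # ligature resolves to its three letters
```

Nothing in this parsing depends on whether the glyph also has a code point in the font's cmap, which is why I would expect the same algorithm to be usable for a well-named glyph sitting at a PUA code point.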

> 
> P.

Perplexed.

    Ross

PS. Would not this particular issue with ligatures be resolved by a /ToUnicode CMap for the font, which can do one–many assignments?
Yes, this does not handle the many–one and many–many requirements of complex scripts, but that isn't what was being reported here, and is a much harder recognition problem.
Besides, it isn't clear there what copy-paste should best produce. Nor how to specify the desired search.
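
As a rough illustration of the one-to-many assignment I mean, the following Python sketch writes the bfchar section of a /ToUnicode CMap, mapping a single glyph code to a string of several characters encoded as UTF-16BE hex. The function name bfchar_block and the glyph codes are hypothetical, chosen only for the example. A complete CMap also needs its header and a codespacerange, and I believe long tables have to be split into blocks of at most 100 entries, which this sketch does not do.

```python
def bfchar_block(mapping):
    """Format {glyph_code: text} as one bfchar section of a ToUnicode CMap."""
    lines = [f"{len(mapping)} beginbfchar"]
    for code, text in sorted(mapping.items()):
        # The destination is the target text as UTF-16BE, written in hex.
        dst = text.encode("utf-16-be").hex().upper()
        # Assumption: two-byte source codes, as with an Identity-H CID font.
        lines.append(f"<{code:04X}> <{dst}>")
    lines.append("endbfchar")
    return "\n".join(lines)

# Hypothetical glyph codes for an "ffi" ligature and a small-cap "a".
print(bfchar_block({0x0123: "ffi", 0x0124: "a"}))
```

With an entry like the first one, copying or searching the single ligature glyph should give the three characters f, f, i, regardless of where the glyph is encoded in the font.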