[XeTeX] accented character ṛ within \section{ṛ}

Mon Apr 26 23:37:53 CEST 2010

Hi Michiel,

On 27/04/2010, at 5:56 AM, Michiel Kamermans wrote:

> On 4/26/2010 12:41 PM, Ross Moore wrote:
>> Hi Herb,
>>> Just curious... what happens when you try to do search within or  
>>> a copy from a pdf which has such combined characters?
>>
>> PDF has the /ActualText(...)  replacement tagging feature. This  
>> allows you to capture a sequence of content characters
>> and declare the whole collection to be equivalent to a single (or  
>> sequence of) Unicode point(s).
>
> But, that only works if you add an /ActualText command.

Yes, but this isn't too hard to implement ...

> As far as I can tell, using a compound glyph as discussed here will  
> not be a problem in a search, *provided* that the software you're  
> using implemented the unicode collation algorithm correctly, in  
> which case for this type of thing it shouldn't need the /ActualText  
> command for searching to work.

This is true, provided that you have used the correct combining
glyphs and the browser has done the right thing, as you say.
For copy/paste you also need to view the pasted result in
a font that supports those characters, or rely on your OS
to choose an alternate font automatically to display them.
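
To make the search case concrete, here is a minimal Python sketch.
It is only an illustration, not what any particular viewer does, and
it uses Unicode normalization (NFC) as a simpler stand-in for the
full collation algorithm. The precomposed ṛ (U+1E5B) and the sequence
r + combining dot below (U+0323) compare equal only once both have
been normalized. The name `same_text` is hypothetical.

```python
import unicodedata

def same_text(a: str, b: str) -> bool:
    """Compare two strings after NFC normalization (canonical equivalence)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

precomposed = "\u1e5b"   # ṛ as a single code point
combined = "r\u0323"     # r followed by COMBINING DOT BELOW

print(precomposed == combined)            # raw comparison of code points
print(same_text(precomposed, combined))   # comparison after normalization
```

A viewer that skips this kind of step will fail to find one form
when you search for the other.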

>
> That said, I have no idea how many PDF readers other than Adobe's  
> Acrobat actually use a correctly and fully implemented unicode  
> collation algorithm.

Agreed.
The most irritating thing about all this is that different
PDF browsers do not consistently support everything
that is now possible.

   ... so the great benefit of /ActualText is that you
can use whatever methods you like to get an acceptable
on-screen representation, then map this to the correct
Unicode code point(s) for searchability and accessibility.
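
As a rough sketch of what such tagging looks like in the page
content stream: the operators that draw the glyphs are wrapped in
a marked-content span whose /ActualText entry holds the intended
text, written here as a UTF-16BE hex string with a leading
byte-order mark. The Python helper `actualtext_span` and the
sample operators `(r) Tj` are hypothetical, for illustration only.
How you inject the resulting operators depends on your TeX engine
and driver.

```python
def actualtext_span(content_ops: str, actual: str) -> str:
    """Wrap PDF content operators in a /Span carrying /ActualText.

    The replacement text is encoded as UTF-16BE, prefixed with the
    byte-order mark FEFF, and written as a PDF hex string.
    """
    hex_text = ("\ufeff" + actual).encode("utf-16-be").hex().upper()
    return f"/Span << /ActualText <{hex_text}> >> BDC\n{content_ops}\nEMC"

# Placeholder content: whatever operators draw the 'r' and its dot below.
print(actualtext_span("(r) Tj", "\u1e5b"))
```

Everything between BDC and EMC is still painted as usual; only
text extraction (search, copy/paste, screen readers) sees the
replacement string.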

Among the possibilities are:

  *  mapping Latin-1 characters in math to the proper
     math-italic characters in Unicode Plane-1.
     Similarly mapping ISO-greek to Plane-1 math-greek.

  *  large math-delimiters built from several characters
     mapped to a single bracket, brace or parenthesis.
     Similarly for constructed square-root signs, and
     extended over/under-braces and arrows.

  *  fake-bold characters built with overstrikes,
     mapped to a single character (e.g., CJK glyphs)
     when your font has no bold variant.

  *  mapping the old CJK Type1 fonts, having characters
     built from individual strokes, to their proper
     Unicode code-points.

I have examples of PDFs that have done these things,
generated from (La)TeX source, using pdfTeX.

You can probably think of other uses too.
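
For the first item in the list above, the mapping itself is simple
arithmetic on code points. The following Python sketch is my own
illustration and covers only the ASCII letters; the function names
`math_italic` and `utf16_hex` are hypothetical. It maps a letter to
its Plane-1 mathematical italic counterpart and shows the UTF-16BE
hex form (a surrogate pair) that would go into /ActualText.

```python
def math_italic(ch: str) -> str:
    """Map an ASCII letter to its Plane-1 mathematical italic counterpart."""
    if ch == "h":
        # The italic small h slot in Plane-1 is reserved;
        # Unicode uses U+210E PLANCK CONSTANT instead.
        return "\u210e"
    if "A" <= ch <= "Z":
        return chr(0x1D434 + ord(ch) - ord("A"))
    if "a" <= ch <= "z":
        return chr(0x1D44E + ord(ch) - ord("a"))
    return ch  # leave everything else unchanged

def utf16_hex(text: str) -> str:
    """UTF-16BE hex string with BOM, as used for PDF text strings."""
    return ("\ufeff" + text).encode("utf-16-be").hex().upper()

print(utf16_hex(math_italic("x")))   # code point U+1D465 as a surrogate pair
```

Characters outside the Basic Multilingual Plane always need two
16-bit units in this encoding, which is why the hex string for a
single math-italic letter is eight digits long after the BOM.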

>
> - Mike "Pomax" Kamermans
> nihongoresources.com

Hope this helps,

	Ross

------------------------------------------------------------------------
Ross Moore                                       ross at maths.mq.edu.au
Mathematics Department                           office: E7A-419
Macquarie University                             tel: +61 (0)2 9850 8955
Sydney, Australia  2109                          fax: +61 (0)2 9850 8114
------------------------------------------------------------------------