[XeTeX] Whitespace in input

Sat Nov 19 00:11:12 CET 2011

Hi Zdenek,

On 19/11/2011, at 9:51 AM, Zdenek Wagner wrote:

> This is a demonstration that glyphs are not the same as characters. I
> will start with a simpler case and will not put Devanagari into the
> mail message. If you wish to write a syllable RU, you have to add a
> dependent vowel (matra) U to a consonant RA. There is a ligature RU,
> so in PDF you will not see RA consonant with U matra but a RU glyph.
> Similarly, TRA is a single glyph representing the following
> characters: TA+VIRAMA+RA. The toUnicode map supports 1:1 and 1:many
> mappings thus it is possible to handle these cases when copying text
> from a PDF or when searching. A more difficult case is I matra (short
> dependent vowel I). As a character it must always follow a consonant
> (this is a general rule for all dependent vowels) but visually (as a
> glyph) it precedes the consonant group after which it is pronounced.
> The sample word was kitab (it means a book). In Unicode (as
> characters) the order is KA+I-matra+TA+A-matra(long)+BA. Visually
> I-matra precedes KA. XeTeX (knowing that it works with a Devanagari
> script) runs the character sequence through ICU and the result is the
> glyph sequence. The original sequence is lost so that when the text is
> copied from PDF, we get (not exactly) i*katab.

/ActualText is your friend here.
You tag the content and provide the string that you want to appear
with Copy/Paste as the value associated to a dictionary key.
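
As a rough sketch of what such tagging puts into the page content
stream (not what any particular macro package emits), the Python below
builds a /Span marked-content sequence whose property dictionary holds
/ActualText. The function name actual_text_span and the placeholder
content are made up for illustration; the string is written as a
UTF-16BE hex string with a leading byte order mark, which is the PDF
text-string form for Unicode.

```python
def actual_text_span(text: str, content: bytes) -> bytes:
    """Wrap content-stream operators in a /Span carrying /ActualText.

    Hypothetical helper for illustration only.
    """
    # PDF Unicode text string: byte order mark FEFF, then UTF-16BE,
    # written here as a hex string between < and >.
    hex_string = (b"\xfe\xff" + text.encode("utf-16-be")).hex().upper()
    header = b"/Span <</ActualText <" + hex_string.encode("ascii") + b">>> BDC\n"
    return header + content + b"\nEMC"


# kitab in character order: KA + I-matra + TA + AA-matra + BA
KITAB = "\u0915\u093f\u0924\u093e\u092c"

# The content is a placeholder; a real stream would have the
# glyph-showing operators for the word here.
fragment = actual_text_span(KITAB, b"% glyph-showing operators go here")
print(fragment.decode("ascii"))
```

Whatever glyphs are drawn between BDC and EMC, a reader that honours
/ActualText should hand the tagged string to Copy/Paste instead.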

There is a macro package that can do this with pdfTeX, and it is 
a vital part of my Tagged PDF work for mathematics.
Also, I have an example where the CJK.sty package is extended
to tag Chinese characters built from multiple glyphs so that
Copy/Paste works correctly (modulo PDF reader quirks).

Not sure about XeTeX.

I once tried to talk with Jonathan Kew about what would be needed
to implement this properly, but he got totally the wrong idea
concerning glyphs and characters, and about what needed to be done
internally and what by macros. The conversation went nowhere.

> Microsoft suggested
> what additional characters should appear in Indic OpenType fonts. One
> of them is a dotted ring which denotes a missing consonant. I-matra
> must always follow a consonant (in character order). If it is moved to
> the beginning of a word, it is wrong. If you paste it to a text
> editor, the OpenType rendering engine should display a missing
> consonant as a dotted ring (if it is present in the font). In
> character order the dotted ring will precede I-matra but in visual
> (glyph) order it will be just the opposite. Thus the asterisk shows the
> place where you will see the dotted circle. This is just one simple
> case. I-matra may follow a consonant group, such as in the word PRIY
> (dear) which is PA+VIRAMA+RA+I-matra+YA or STRIYOCIT (good for women)
> which is SA+VIRAMA+TA+VIRAMA+RA+I-matra+YA+O-matra+CA+I-matra+TA. Both
> words will start with the I-matra glyph. The latter will contain two
> ordering bugs after copy&paste. Consider also the word MURTI (statue)
> which is a sequence of characters

This sounds like each word needs its own /ActualText.
So some intricate programming is certainly necessary.
But \XeTeXinterchartoks (is that the right spelling?)
should make this possible.
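
To make the per-word point concrete, here is a deliberately simplified
Python model of the reordering Zdenek describes. It is not ICU and not
what XeTeX does internally: it only moves I-matra in front of the
consonant cluster it follows, works on characters rather than glyphs,
and ignores the RA+VIRAMA case from MURTI. The names visual_order and
needs_actual_text are invented for this sketch.

```python
VIRAMA = "\u094d"
I_MATRA = "\u093f"


def is_consonant(ch: str) -> bool:
    # Devanagari consonants KA..HA occupy U+0915..U+0939.
    return "\u0915" <= ch <= "\u0939"


def visual_order(word: str) -> str:
    """Toy model: move each I-matra before its consonant cluster."""
    out = []
    for ch in word:
        if ch != I_MATRA:
            out.append(ch)
            continue
        # Walk back over the cluster: consonant (VIRAMA consonant)*
        i = len(out)
        while i > 0 and is_consonant(out[i - 1]):
            i -= 1
            if i > 0 and out[i - 1] == VIRAMA:
                i -= 1
            else:
                break
        out.insert(i, ch)
    return "".join(out)


def needs_actual_text(word: str) -> bool:
    # Under this model, a word needs tagging when its visual order
    # differs from its character order.
    return visual_order(word) != word


KITAB = "\u0915\u093f\u0924\u093e\u092c"  # KA I-matra TA AA-matra BA
PRIY = "\u092a\u094d\u0930\u093f\u092f"   # PA VIRAMA RA I-matra YA

for w in (KITAB, PRIY):
    print([f"U+{ord(c):04X}" for c in visual_order(w)], needs_actual_text(w))
```

Both sample words come out starting with I-matra, so under this model
both would need their own /ActualText; a word with no I-matra comes
out unchanged.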

> MA+U-matra(long)+RA+VIRAMA+TA+I-matra. Visually the long U-matra will
> appear as an accent below the MA glyph. The next glyph will be I-matra
> followed by TA followed by RA shown as an upper accent at the right
> edge of the syllable. Generally in RA+VIRAMA+consonant+matra the RA
> glyph appears at the end of the syllable although logically (in
> character order) it belongs to the beginning. These cases cannot be
> solved by toUnicode map because many-to-many mappings are not allowed.

Agreed. /ToUnicode is not the right PDF construction for this.
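
A small Python simulation of why: text extraction through a per-glyph
map looks up each glyph on its own, in the order the glyphs appear on
the page. The glyph names and the TO_UNICODE table below are made up
for illustration (a real /ToUnicode CMap is keyed by character codes
in the font), but the limitation is the same.

```python
# Hypothetical per-glyph map: each glyph yields one or more characters.
TO_UNICODE = {
    "g_i_matra": "\u093f",
    "g_ka": "\u0915",
    "g_ta": "\u0924",
    "g_aa_matra": "\u093e",
    "g_ba": "\u092c",
    "g_tra": "\u0924\u094d\u0930",  # 1:many: TRA ligature -> TA VIRAMA RA
}


def extract(glyphs):
    """Concatenate the mapped strings in glyph (visual) order."""
    return "".join(TO_UNICODE[g] for g in glyphs)


# The 1:many case works: one ligature glyph gives back three characters.
tra_text = extract(["g_tra"])

# kitab as drawn: the I-matra glyph comes first.
kitab_glyphs = ["g_i_matra", "g_ka", "g_ta", "g_aa_matra", "g_ba"]
kitab_logical = "\u0915\u093f\u0924\u093e\u092c"
kitab_copied = extract(kitab_glyphs)

print(tra_text == "\u0924\u094d\u0930")   # ligature round-trips
print(kitab_copied == kitab_logical)      # ordering is lost
```

No choice of entries in the table fixes the second case, because the
map has no way to say that two separate glyphs should swap places.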

> Moreover, a huge amount of mappings will be needed. It would be better
> to do the reverse processing independent of toUnicode mappings, to use
> ICU or Pango or Uniscribe or whatever to analyze the glyphs and
> convert them to characters. The rules are unambiguous but AR does not
> do it.

Having an external pre-processor is what I do for tagging mathematics.
It seems like a similarly intricate problem here.

> 
> We discuss nonbreakable spaces while we are not yet able to convert
> properly printable glyphs to characters when doing copy&paste from
> PDF...

  :-)

> 
> 
> -- 
> Zdeněk Wagner
> http://hroch486.icpf.cas.cz/wagner/
> http://icebearsoft.euweb.cz

Hope this helps,

	Ross

------------------------------------------------------------------------
Ross Moore                                       ross.moore at mq.edu.au 
Mathematics Department                           office: E7A-419      
Macquarie University                             tel: +61 (0)2 9850 8955
Sydney, Australia  2109                          fax: +61 (0)2 9850 8114
------------------------------------------------------------------------