[XeTeX] Whitespace in input
zdenek.wagner at gmail.com
Sat Nov 19 00:30:58 CET 2011
2011/11/19 Ross Moore <ross.moore at mq.edu.au>:
> Hi Zdenek,
> On 19/11/2011, at 9:51 AM, Zdenek Wagner wrote:
>> This is a demonstration that glyphs are not the same as characters. I
>> will start with a simpler case and will not put Devanagari into the
>> mail message. If you wish to write a syllable RU, you have to add a
>> dependent vowel (matra) U to a consonant RA. There is a ligature RU,
>> so in PDF you will not see RA consonant with U matra but a RU glyph.
>> Similarly, TRA is a single glyph representing the following
>> characters: TA+VIRAMA+RA. The toUnicode map supports 1:1 and 1:many
>> mappings thus it is possible to handle these cases when copying text
>> from a PDF or when searching. A more difficult case is I matra (short
>> dependent vowel I). As a character it must always follow a consonant
>> (this is a general rule for all dependent vowels) but visually (as a
>> glyph) it precedes the consonant group after which it is pronounced.
>> The sample word was kitab (it means a book). In Unicode (as
>> characters) the order is KA+I-matra+TA+A-matra(long)+BA. Visually
>> I-matra precedes KA. XeTeX (knowing that it works with a Devanagari
>> script) runs the character sequence through ICU and the result is the
>> glyph sequence. The original sequence is lost so that when the text is
>> copied from PDF, we get (not exactly) i*katab.
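To make the kitab example concrete, here is a minimal Python sketch of the two orders. The code points are the standard Unicode Devanagari ones, written as escapes so the mail stays ASCII-only. The function naive_visual_order is a hypothetical helper, not anything XeTeX or ICU provides; it only handles a single consonant before the I-matra, not consonant groups.

```python
import unicodedata

# Logical (character) order of "kitab": KA + I-matra + TA + AA-matra + BA.
KITAB = "\u0915\u093f\u0924\u093e\u092c"
I_MATRA = "\u093f"


def describe(text):
    """Return (code point, Unicode name) pairs in stored character order."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


def naive_visual_order(text):
    """Swap each I-matra with the single consonant before it.

    Illustration only: real shaping moves the I-matra in front of a whole
    consonant group and also forms ligatures.
    """
    out = list(text)
    for i in range(1, len(out)):
        if out[i] == I_MATRA:
            out[i - 1], out[i] = out[i], out[i - 1]
    return "".join(out)


for code, name in describe(KITAB):
    print(code, name)
```

Copying the glyph order back as if it were character order gives a string that begins with the I-matra, which is the "i*katab" effect described above.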
> /ActualText is your friend here.
> You tag the content and provide the string that you want to appear
> with Copy/Paste as the value associated to a dictionary key.
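A rough sketch of what such tagging could look like in a content stream, written in Python. It assumes that /ActualText takes a PDF text string and that the reader accepts the UTF-16BE form with a leading byte order mark; that assumption should be checked against the PDF specification version in use. The function name actual_text_span and the sample operators are hypothetical.

```python
def actual_text_span(logical_text, shown_operators):
    """Wrap content-stream operators in a marked-content span with /ActualText.

    Assumption: the value is a hex string holding UTF-16BE text that starts
    with the byte order mark FE FF.
    """
    encoded = ("\ufeff" + logical_text).encode("utf-16-be")
    hex_string = encoded.hex().upper()
    return (
        f"/Span << /ActualText <{hex_string}> >> BDC\n"
        f"{shown_operators}\n"
        "EMC"
    )


# Logical order of "kitab"; the Tj operand stands in for the real glyph string.
print(actual_text_span("\u0915\u093f\u0924\u093e\u092c", "(glyphs) Tj"))
```

With this approach every reordered word carries its logical text a second time, which is the file-size cost discussed in this thread.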
I do not know whether the PDF specification has evolved since I last
read it. /ActualText allows only single-byte characters, i.e.
those with codes between 0 and 255, not arbitrary Unicode characters.
/ActualText is demonstrated on German hyphenated words such as Zucker
which is hyphenated as Zuk- ker. I have tried to put /ActualText
manually via a special, I could see it in the PDF file but it did not
When converting white space to a space character, some [complex]
heuristics are needed, while proper conversion of glyphs to characters
of Indic scripts requires just a few strict rules. Ligatures such as TRA
have to appear in the toUnicode map, otherwise their meaning will be
unclear. If you see the I-matra, go to the last consonant in the
sequence and put the I-matra character there. If you see the RA glyph
at the right edge of a syllable, go back to the leftmost consonant in
the group and prepend RA+VIRAMA there. This is all that has to be done
with Devanagari. Other Indic scripts contain two-part vowels but the
rules will be similarly simple. We should not be forced to double the
size of the PDF file. AR and other PDF rendering programs should learn
these simple rules and use them when extracting text.
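The two rules can be sketched in Python as follows. This is only an illustration under stated assumptions: ligatures have already been expanded through the toUnicode map, the RA glyph at the right edge is represented by a hypothetical marker REPH, consonants are limited to U+0915..U+0939 and matras to U+093E..U+094C, and the function names are made up for this example.

```python
VIRAMA = "\u094d"
I_MATRA = "\u093f"
RA = "\u0930"
REPH = "<reph>"  # hypothetical marker for the RA glyph at the right edge


def is_consonant(g):
    return len(g) == 1 and "\u0915" <= g <= "\u0939"


def is_matra(g):
    return len(g) == 1 and "\u093e" <= g <= "\u094c"


def glyphs_to_characters(glyphs):
    """Convert a list of glyphs in visual order to a string in character order."""
    out = []
    i, n = 0, len(glyphs)
    while i < n:
        g = glyphs[i]
        if g == I_MATRA:
            # Rule 1: copy the consonant group that follows, then the I-matra.
            i += 1
            while i < n and is_consonant(glyphs[i]):
                out.append(glyphs[i])
                i += 1
                if i + 1 < n and glyphs[i] == VIRAMA and is_consonant(glyphs[i + 1]):
                    out.append(VIRAMA)
                    i += 1
                else:
                    break
            out.append(I_MATRA)
        elif g == REPH:
            # Rule 2: walk back over matras to the leftmost consonant of the
            # group and prepend RA+VIRAMA there.
            j = len(out)
            while j > 0 and is_matra(out[j - 1]):
                j -= 1
            if j > 0 and is_consonant(out[j - 1]):
                j -= 1
                while j >= 2 and out[j - 1] == VIRAMA and is_consonant(out[j - 2]):
                    j -= 2
            out[j:j] = [RA, VIRAMA]
            i += 1
        else:
            out.append(g)
            i += 1
    return "".join(out)


# MURTI in visual order: MA, long U-matra, I-matra, TA, RA glyph at the edge.
murti = glyphs_to_characters(["\u092e", "\u0942", I_MATRA, "\u0924", REPH])
print([f"U+{ord(c):04X}" for c in murti])
```

For MURTI the result is MA + U-matra(long) + RA + VIRAMA + TA + I-matra, the character order given elsewhere in this thread. A real extractor would also have to cover nukta forms and the two-part vowels of other scripts.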
> There is a macro package that can do this with pdfTeX, and it is
> a vital part of my Tagged PDF work for mathematics.
> Also, I have an example where the CJK.sty package is extended
> to tag Chinese characters built from multiple glyphs so that
> Copy/Paste works correctly (modulo PDF reader quirks).
> Not sure about XeTeX.
> I once tried to talk with Jonathan Kew about what would be needed
> to implement this properly, but he got totally the wrong idea
> concerning glyphs and characters, and what was needed to be done
> internally and what by macros. The conversation went nowhere.
>> Microsoft suggested
>> what additional characters should appear in Indic OpenType fonts. One
>> of them is a dotted ring which denotes a missing consonant. I-matra
>> must always follow a consonant (in character order). If it is moved to
>> the beginning of a word, it is wrong. If you paste it to a text
>> editor, the OpenType rendering engine should display a missing
>> consonant as a dotted ring (if it is present in the font). In
>> character order the dotted ring will precede I-matra but in visual
>> (glyph) order it will be just the opposite. Thus the asterisk shows the
>> place where you will see the dotted circle. This is just one simple
>> case. I-matra may follow a consonant group, such as in the word PRIY
>> (dear) which is PA+VIRAMA+RA+I-matra+YA or STRIYOCIT (good for women)
>> which is SA+VIRAMA+TA+VIRAMA+RA+I-matra+YA+O-matra+CA+I-matra+TA. Both
>> words will start with the I-matra glyph. The latter will contain two
>> ordering bugs after copy&paste. Consider also the word MURTI (statue)
>> which is a sequence of characters
> This sounds like each word needs its own /ActualText .
> So some intricate programming is certainly necessary.
> But \XeTeXinterchartoks (is that the right spelling?)
> should make this possible.
>> MA+U-matra(long)+RA+VIRAMA+TA+I-matra. Visually the long U-matra will
>> appear as an accent below the MA glyph. The next glyph will be I-matra
>> followed by TA followed by RA shown as an upper accent at the right
>> edge of the syllable. Generally in RA+VIRAMA+consonant+matra the RA
>> glyph appears at the end of the syllable although logically (in
>> character order) it belongs to the beginning. These cases cannot be
>> solved by the toUnicode map because many-to-many mappings are not allowed.
> Agreed. /ToUnicode is not the right PDF construction for this.
>> Moreover, a huge number of mappings will be needed. It would be better
>> to do the reverse processing independent of toUnicode mappings, to use
>> ICU or Pango or Uniscribe or whatever to analyze the glyphs and
>> convert them to characters. The rules are unambiguous but AR does not
>> do it.
> Having an external preprocessor is what I do for tagging mathematics.
> It seems like a similarly intricate problem here.
>> We discuss nonbreakable spaces while we are not yet able to convert
>> properly printable glyphs to characters when doing copy&paste from
>> Zdeněk Wagner
> Hope this helps,
> Ross Moore ross.moore at mq.edu.au
> Mathematics Department office: E7A-419
> Macquarie University tel: +61 (0)2 9850 8955
> Sydney, Australia 2109 fax: +61 (0)2 9850 8114