[XeTeX] Whitespace in input

Zdenek Wagner zdenek.wagner at gmail.com
Fri Nov 18 23:51:54 CET 2011


2011/11/18 maxwell <maxwell at umiacs.umd.edu>:
> On Fri, 18 Nov 2011 13:52:56 +0100, Zdenek Wagner
> <zdenek.wagner at gmail.com>
> wrote:
>> 2011/11/18 Philip TAYLOR <P.Taylor at rhul.ac.uk>:
>>> Is it safe to assume that these "code listings"
>>> are restricted to the ASCII character set ?  If
>>> so, yes, spaces are likely to be a problem, but
>>> if the code listing can also include ligature-
>>> digraphs, then these are likely to prove even
>>> more problematic.
>>>
>> If the code listing is typeset in a fixed width font, it is usually no
>> problem. I copied a few code samples from books in PDF, most of them
>> were typeset by TeX. If I want to copy text in Devanagari, it is
>> almost impossible.
>
> Besides TeX, Dr. Knuth also invented Literate Programming.  In our own
> project, we use LP to extract the code listings from the original source
> code, rather than from the PDF.  One advantage is that in addition to the
> re-ordering at the character level (mentioned in part of Zdenek's email
> that I didn't copy over), this allows re-ordering at any arbitrary level,

This is a demonstration that glyphs are not the same as characters. I
will start with a simpler case and will not put Devanagari into the
mail message. If you wish to write the syllable RU, you have to add the
dependent vowel (matra) U to the consonant RA. There is a ligature RU,
so in the PDF you will not see the RA consonant with the U matra but a
single RU glyph.
Similarly, TRA is a single glyph representing the following
characters: TA+VIRAMA+RA. The toUnicode map supports 1:1 and 1:many
mappings, so it is possible to handle these cases when copying text
from a PDF or when searching. A more difficult case is the I-matra
(short dependent vowel I). As a character it must always follow a
consonant (this is a general rule for all dependent vowels), but
visually (as a glyph) it precedes the consonant group after which it
is pronounced.
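
The character sequences named so far can be inspected with a short
Python sketch. It is only an illustration: the variable and function
names are made up for this example, and the code points are written as
escapes so that the message itself stays ASCII.

```python
import unicodedata

# Logical (character) order of the two ligature examples from the text.
RU = "\u0930\u0941"         # RA + U-matra, drawn as one RU glyph
TRA = "\u0924\u094D\u0930"  # TA + VIRAMA + RA, drawn as one TRA glyph
I_MATRA = "\u093F"          # short dependent vowel I

def names(text):
    """Return the Unicode character name of every code point in text."""
    return [unicodedata.name(ch) for ch in text]

if __name__ == "__main__":
    for label, text in (("RU", RU), ("TRA", TRA), ("I-matra", I_MATRA)):
        print(label, len(text), names(text))
```

Each ligature is one glyph on the page but two or three characters in
the string, which is exactly what a 1:many toUnicode entry has to
record.
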
The sample word was kitab (it means "book"). In Unicode (as
characters) the order is KA+I-matra+TA+A-matra(long)+BA. Visually the
I-matra precedes KA. XeTeX (knowing that it works with the Devanagari
script) runs the character sequence through ICU, and the result is the
glyph sequence. The original sequence is lost, so when the text is
copied from the PDF we get (not exactly) i*katab.

Microsoft suggested what additional characters should appear in Indic
OpenType fonts. One of them is a dotted circle, which denotes a missing
consonant. The I-matra must always follow a consonant (in character
order). If it is moved to the beginning of a word, it is wrong. If you
paste it into a text editor, the OpenType rendering engine should
display the missing consonant as a dotted circle (if it is present in
the font). In character order the dotted circle will precede the
I-matra, but in visual (glyph) order it will be just the opposite. Thus
the asterisk shows the place where you will see the dotted circle.

This is just one simple case. The I-matra may follow a consonant group,
as in the word PRIY (dear), which is PA+VIRAMA+RA+I-matra+YA, or
STRIYOCIT (good for women), which is
SA+VIRAMA+TA+VIRAMA+RA+I-matra+YA+O-matra+CA+I-matra+TA. Both words
will start with the I-matra glyph. The latter will contain two
ordering bugs after copy&paste.

Consider also the word MURTI (statue), which is the sequence of
characters MA+U-matra(long)+RA+VIRAMA+TA+I-matra. Visually the long
U-matra will appear as an accent below the MA glyph. The next glyph
will be the I-matra, followed by TA, followed by RA shown as an upper
accent at the right edge of the syllable. Generally, in
RA+VIRAMA+consonant+matra the RA glyph appears at the end of the
syllable although logically (in character order) it belongs to the
beginning. These cases cannot be solved by the toUnicode map because
many-to-many mappings are not allowed.
Moreover, a huge number of mappings would be needed. It would be better
to do the reverse processing independently of the toUnicode mappings,
using ICU or Pango or Uniscribe or whatever to analyze the glyphs and
convert them to characters. The rules are unambiguous, but AR does not
do it.
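
A toy sketch of such reverse processing, in Python, for the I-matra
case only. It assumes the ligature glyphs have already been expanded
to characters (the 1:many step), so the input is a character string in
visual order. The function name and the simplified consonant-group
rule are assumptions made for this illustration; this is not how ICU,
Pango or Uniscribe work internally, and the RA-at-the-end case from
MURTI is not handled.

```python
VIRAMA = "\u094D"
I_MATRA = "\u093F"

def is_consonant(ch):
    # Devanagari consonants KA..HA occupy U+0915..U+0939.
    return "\u0915" <= ch <= "\u0939"

def visual_to_logical(chars):
    """Move each I-matra behind the consonant group drawn after it."""
    out = []
    i = 0
    n = len(chars)
    while i < n:
        if chars[i] == I_MATRA and i + 1 < n and is_consonant(chars[i + 1]):
            # Consonant group: consonant (VIRAMA consonant)*
            j = i + 2
            while j + 1 < n and chars[j] == VIRAMA and is_consonant(chars[j + 1]):
                j += 2
            out.extend(chars[i + 1:j])  # the group comes first logically
            out.append(I_MATRA)         # then the dependent vowel
            i = j
            continue
        out.append(chars[i])
        i += 1
    return "".join(out)

if __name__ == "__main__":
    # kitab in visual order: I-matra, KA, TA, A-matra(long), BA
    visual = "\u093F\u0915\u0924\u093E\u092C"
    logical = visual_to_logical(visual)
    print([f"U+{ord(c):04X}" for c in logical])
```

For kitab this puts KA back in front of the I-matra. For PRIY and
STRIYOCIT the whole consonant group (PA+VIRAMA+RA, SA+VIRAMA+TA+VIRAMA+RA)
is moved as one unit.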

We are discussing non-breaking spaces while we are not yet able to
properly convert printable glyphs to characters when doing copy&paste
from PDF...


-- 
Zdeněk Wagner
http://hroch486.icpf.cas.cz/wagner/
http://icebearsoft.euweb.cz


