[XeTeX] Whitespace in input
Keith J. Schultz
keithjschultz at web.de
Sat Nov 19 09:16:05 CET 2011
I do not think anybody disputes the fact that characters are not glyphs.
The confusion arises because a character in CS is well defined and has a history.
To be more exact, it is just one byte in size, so there can be only 256 characters.
Unicode has changed all this, and we now have a Unicode character whose size differs
depending on the Unicode encoding used.
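To make the size point concrete, here is a small Python sketch. The sample characters and the helper name `encoded_sizes` are my own choices for illustration; the encodings are the standard UTF-8, UTF-16 and UTF-32 codecs (little-endian variants, so no byte order mark is counted).

```python
# Byte length of one and the same character under three Unicode encodings.
samples = [
    "A",           # U+0041 LATIN CAPITAL LETTER A
    "\u00e9",      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
    "\u0915",      # U+0915 DEVANAGARI LETTER KA
    "\U0001D11E",  # U+1D11E MUSICAL SYMBOL G CLEF (outside the BMP)
]

def encoded_sizes(ch):
    """Return (utf8, utf16, utf32) byte lengths of ch, without a BOM."""
    return tuple(len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le"))

for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

Only UTF-32 gives every character the same size (4 bytes); in UTF-8 the four samples take 1, 2, 3 and 4 bytes respectively.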
It gets even hairier, as in Unicode several characters can be combined (composed).
The result to be output is known as a glyph!
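A minimal illustration of composing and decomposing, using Python's standard `unicodedata` module. The choice of "é" as the example and the helper `code_points` are assumptions made for this sketch; it only shows that one visible letter may be stored as one code point or as several.

```python
import unicodedata

composed = "\u00e9"     # single code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"  # two code points: e + COMBINING ACUTE ACCENT

def code_points(s):
    """List the code points of s in U+XXXX notation."""
    return [f"U+{ord(c):04X}" for c in s]

# NFD splits the precomposed letter, NFC puts it back together.
nfd = unicodedata.normalize("NFD", composed)
nfc = unicodedata.normalize("NFC", decomposed)

print(code_points(composed), code_points(nfd))
print(nfd == decomposed, nfc == composed)
```

Both strings normally display as the same letter, yet a naive byte-wise or code-point-wise comparison treats them as different unless the text is normalized first.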
The average user considers a glyph to be the same as a "letter" and thereby a character.
Now, in order to process the glyphs with a computer, they must be decomposed back to Unicode.
How well this is done depends on the system itself. If the system is not fully Unicode-aware and
does not implement it properly, then there will be problems. What adds to the complexity of the problem is that
not all fonts used for displaying Unicode contain all code points, thereby creating your many-to-many
As for getting junk when copying Unicode, just copy between two texts using different fonts, where one font does
not contain the glyph.
The only true way to master this problem would be for the computer world to go completely Unicode, with
fonts supporting the full Unicode code set!
That is impractical for the time being.
The only advice I can give is: choose your tools wisely.
On 18.11.2011 at 23:51, Zdenek Wagner wrote:
> 2011/11/18 maxwell <maxwell at umiacs.umd.edu>:
>> On Fri, 18 Nov 2011 13:52:56 +0100, Zdenek Wagner
>> <zdenek.wagner at gmail.com>
>>> 2011/11/18 Philip TAYLOR <P.Taylor at rhul.ac.uk>:
>>>> Is it safe to assume that these "code listings"
>>>> are restricted to the ASCII character set ? If
>>>> so, yes, spaces are likely to be a problem, but
>>>> if the code listing can also include ligature-
>>>> digraphs, then these are likely to prove even
>>>> more problematic.
>>> If the code listing is typeset in a fixed width font, it is usually no
>>> problem. I copied a few code samples from books in PDF, most of them
>>> were typeset by TeX. If I want to copy text in Devanagari, it is
>>> almost impossible.
>> Besides TeX, Dr. Knuth also invented Literate Programming. In our own
>> project, we use LP to extract the code listings from the original source
>> code, rather than from the PDF. One advantage is that in addition to the
>> re-ordering at the character level (mentioned in part of Zdenek's email
>> that I didn't copy over), this allows re-ordering at any arbitrary level,
> This is a demonstration that glyphs are not the same as characters. I
> will start with a simpler case and will not put Devanagari in the
> mail message. If you wish to write a syllable RU, you have to add a
> dependent vowel (matra) U to a consonant RA. There is a ligature RU,
> so in PDF you will not see RA consonant with U matra but a RU glyph.
> Similarly, TRA is a single glyph representing the following
> characters: TA+VIRAMA+RA. The toUnicode map supports 1:1 and 1:many
> mappings thus it is possible to handle these cases when copying text
> from a PDF or when searching. A more difficult case is the I-matra (short
> dependent vowel I). As a character it must always follow a consonant
> (this is a general rule for all dependent vowels) but visually (as a
> glyph) it precedes the consonant group after which it is pronounced.
> The sample word was kitab (it means a book). In Unicode (as
> characters) the order is KA+I-matra+TA+A-matra(long)+BA. Visually
> I-matra precedes KA. XeTeX (knowing that it works with a Devanagari
> script) runs the character sequence through ICU and the result is the
> glyph sequence. The original sequence is lost so that when the text is
> copied from PDF, we get (not exactly) i*katab. Microsoft suggested
> what additional characters should appear in Indic OpenType fonts. One
> of them is a dotted ring which denotes a missing consonant. I-matra
> must always follow a consonant (in character order). If it is moved to
> the beginning of a word, it is wrong. If you paste it to a text
> editor, the OpenType rendering engine should display a missing
> consonant as a dotted ring (if it is present in the font). In
> character order the dotted ring will precede I-matra but in visual
> (glyph) order it will be just the opposite. Thus the asterisk shows the
> place where you will see the dotted circle. This is just one simple
> case. I-matra may follow a consonant group, such as in the word PRIY
> (dear) which is PA+VIRAMA+RA+I-matra+YA or STRIYOCIT (good for women)
> which is SA+VIRAMA+TA+VIRAMA+RA+I-matra+YA+O-matra+CA+I-matra+TA. Both
> words will start with the I-matra glyph. The latter will contain two
> ordering bugs after copy&paste. Consider also the word MURTI (statue)
> which is a sequence of characters
> MA+U-matra(long)+RA+VIRAMA+TA+I-matra. Visually the long U-matra will
> appear as an accent below the MA glyph. The next glyph will be I-matra
> followed by TA followed by RA shown as an upper accent at the right
> edge of the syllable. Generally in RA+VIRAMA+consonant+matra the RA
> glyph appears at the end of the syllable although logically (in
> character order) it belongs to the beginning. These cases cannot be
> solved by the toUnicode map because many-to-many mappings are not allowed.
> Moreover, a huge amount of mappings will be needed. It would be better
> to do the reverse processing independent of toUnicode mappings, to use
> ICU or Pango or Uniscribe or whatever to analyze the glyphs and
> convert them to characters. The rules are unambiguous but AR does not
> do it.
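The kitab and PRIY cases above can be reproduced with a small Python sketch. The code points are the standard Devanagari ones, but `toy_visual_order` is a hypothetical helper written only for this illustration: it moves the I-matra in front of its consonant group and nothing else, so it is not what ICU, XeTeX or any real shaping engine does.

```python
import unicodedata

I_MATRA = "\u093f"  # DEVANAGARI VOWEL SIGN I
VIRAMA = "\u094d"   # DEVANAGARI SIGN VIRAMA

def toy_visual_order(text):
    """Move each I-matra in front of the consonant group it follows.

    Toy model only: ligatures, the RA-above-the-syllable case and all
    other shaping rules are ignored.
    """
    out = []
    for ch in text:
        if ch == I_MATRA:
            # Walk back over the consonant group: C (VIRAMA C)*
            i = len(out) - 1
            while i >= 2 and out[i - 1] == VIRAMA:
                i -= 2
            out.insert(max(i, 0), ch)
        else:
            out.append(ch)
    return "".join(out)

# Logical (character) order: KA + I-matra + TA + AA-matra + BA
kitab = "\u0915\u093f\u0924\u093e\u092c"
# Logical order: PA + VIRAMA + RA + I-matra + YA
priy = "\u092a\u094d\u0930\u093f\u092f"

for word in (kitab, priy):
    visual = toy_visual_order(word)
    print([unicodedata.name(c) for c in word])
    print([unicodedata.name(c) for c in visual])
```

A PDF that stores only the second sequence of each pair hands back text that starts with a dependent vowel, which is exactly the ordering bug described above. Undoing it requires knowing where the consonant group ends, which a per-glyph toUnicode entry cannot express.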
> We discuss nonbreakable spaces while we are not yet able to convert
> properly printable glyphs to characters when doing copy&paste from
> Zdeněk Wagner