[XeTeX] Whitespace in input

Sat Nov 19 13:51:40 CET 2011

2011/11/19 Keith J. Schultz <keithjschultz at web.de>:
> Hi Zdenek,
>
>        I do not think anybody disputes the fact that characters are not glyphs.
>
>        The confusion arises because a character in CS is well defined and has a history.
>        To be more exact, it is just one byte in size, so there can be only 256 characters.
>
>        Unicode has changed all this, and we now have Unicode characters whose size differs
>        depending on the Unicode encoding used.
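
The size difference is easy to check. A minimal Python sketch, using U+0915 (DEVANAGARI LETTER KA) purely as an arbitrary example character; the code point stays the same and only its encoded length changes:

```python
# One code point, three encoded sizes. U+0915 (DEVANAGARI LETTER KA) is
# an arbitrary example character, written as an escape to keep this ASCII.
ka = "\u0915"

sizes = {enc: len(ka.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}
print(sizes)    # {'utf-8': 3, 'utf-16-le': 2, 'utf-32-le': 4}
print(len(ka))  # 1: still a single character (code point)
```
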
>
>        It gets even hairier, as in Unicode several Unicode characters can be combined (composed).
>        The result to be output is known as a glyph!
>
>        The average user considers a glyph to be the same as a "letter" and thereby a character.
>
>        Now, in order to process the glyphs with a computer, they must be decomposed back to Unicode.
>        How well this is done depends on the system itself. If the system is not fully Unicode-aware and
>        does not implement it properly, there will be problems. What adds to the complexity of the problem is that
>        not all fonts used for displaying Unicode contain all code points, thereby creating your many-to-many
>        decomposition.
>
No, conversion of a sequence of glyphs to a sequence of Unicode
code points has little to do with fonts. The position of the RU ligature
in the font may differ, but this is handled easily by the toUnicode map.
The conjunct STA may also occupy a different position in different fonts,
but it can always be printed using two glyphs, half-SA + TA. In general,
the half forms should be decoded as the full form followed by VIRAMA.
This makes the toUnicode table smaller and leads to correct results.
The only problem is the correct ordering of a few characters.
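
As a rough illustration of the half-form rule, here is a Python sketch of a toUnicode-style table. The glyph names (`half_sa`, `sta_conjunct`, `ru_ligature`, ...) and the dictionary itself are hypothetical; a real PDF maps font-specific glyph codes in a CMap, not names. The sketch only shows that 1:1 and 1:many entries are enough for these cases and that the result does not depend on which glyphs a particular font happens to use:

```python
# Hypothetical glyph-name -> code-point table in the spirit of a toUnicode map.
SA, TA, RA = "\u0938", "\u0924", "\u0930"   # DEVANAGARI LETTER SA, TA, RA
VIRAMA = "\u094D"                           # DEVANAGARI SIGN VIRAMA
U_MATRA = "\u0941"                          # DEVANAGARI VOWEL SIGN U

TO_UNICODE = {
    "ta": TA,                               # 1:1
    "half_sa": SA + VIRAMA,                 # half form = full form + VIRAMA
    "sta_conjunct": SA + VIRAMA + TA,       # 1:many
    "tra_conjunct": TA + VIRAMA + RA,       # 1:many
    "ru_ligature": RA + U_MATRA,            # 1:many
}


def decode(glyphs):
    """Map a glyph sequence to code points, one glyph at a time."""
    return "".join(TO_UNICODE[g] for g in glyphs)


# A font with a dedicated STA conjunct and a font that prints half-SA + TA
# decode to the same character sequence.
print(decode(["sta_conjunct"]) == decode(["half_sa", "ta"]))  # True
```

With the half form decoded this way, a separate entry for every conjunct that starts with a half form is not needed, which is why the table stays small.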

>        As for getting junk when copying Unicode, just copy between two texts using different fonts, where one font does
>        not contain the glyph.
>
When performing copy&paste or text search in PDF, I am not interested
in glyphs but in characters. I do not care what glyphs will be
displayed. If I copy the text to OpenOffice, I can change the font
later and if the codepoints were transferred correctly, I will see the
text (it was true even with OpenOffice 1.x, I tried many years ago).
If I copy the text to gedit, fontconfig will automatically find a font
for displaying the characters not present in the current font. I still
have to read the fontconfig manual in order to find out how to configure
its search algorithm. Arabic fonts may be a problem, especially if
you wish to use Arabic, Persian and Urdu. Now I know that I have to
force fontconfig to select SIL Scheherazade automatically because it
contains all the characters. I can thus use both U+0643 and U+06A. When
writing Akbar, I can write it both in Arabic and in Urdu/Farsi.

>        The only true way to master this problem would be for the computer world to go completely Unicode, with
>        fonts supporting the full Unicode code set!
>
>        That is impractical for the time being.
>
fontconfig currently has the solution and usually works out of the box.

>        The only advice I can give is to choose your tools wisely.
>
>        regards
>                Keith.
>
> On 18.11.2011 at 23:51, Zdenek Wagner wrote:
>
>> 2011/11/18 maxwell <maxwell at umiacs.umd.edu>:
>>> On Fri, 18 Nov 2011 13:52:56 +0100, Zdenek Wagner
>>> <zdenek.wagner at gmail.com>
>>> wrote:
>>>> 2011/11/18 Philip TAYLOR <P.Taylor at rhul.ac.uk>:
>>>>> Is it safe to assume that these "code listings"
>>>>> are restricted to the ASCII character set ?  If
>>>>> so, yes, spaces are likely to be a problem, but
>>>>> if the code listing can also include ligature-
>>>>> digraphs, then these are likely to prove even
>>>>> more problematic.
>>>>>
>>>> If the code listing is typeset in a fixed width font, it is usually no
>>>> problem. I copied a few code samples from books in PDF, most of them
>>>> were typeset by TeX. If I want to copy text in Devanagari, it is
>>>> almost impossible.
>>>
>>> Besides TeX, Dr. Knuth also invented Literate Programming.  In our own
>>> project, we use LP to extract the code listings from the original source
>>> code, rather than from the PDF.  One advantage is that in addition to the
>>> re-ordering at the character level (mentioned in part of Zdenek's email
>>> that I didn't copy over), this allows re-ordering at any arbitrary level,
>>
>> This is a demonstration that glyphs are not the same as characters. I
>> will start with a simpler case and will not put Devanagari into the
>> mail message. If you wish to write a syllable RU, you have to add a
>> dependent vowel (matra) U to a consonant RA. There is a ligature RU,
>> so in PDF you will not see RA consonant with U matra but a RU glyph.
>> Similarly, TRA is a single glyph representing the following
>> characters: TA+VIRAMA+RA. The toUnicode map supports 1:1 and 1:many
>> mappings thus it is possible to handle these cases when copying text
>> from a PDF or when searching. A more difficult case is the I matra (short
>> dependent vowel I). As a character it must always follow a consonant
>> (this is a general rule for all dependent vowels) but visually (as a
>> glyph) it precedes the consonant group after which it is pronounced.
>> The sample word was kitab (it means a book). In Unicode (as
>> characters) the order is KA+I-matra+TA+A-matra(long)+BA. Visually
>> I-matra precedes KA. XeTeX (knowing that it works with a Devanagari
>> script) runs the character sequence through ICU and the result is the
>> glyph sequence. The original sequence is lost so that when the text is
>> copied from PDF, we get (not exactly) i*katab. Microsoft suggested
>> what additional characters should appear in Indic OpenType fonts. One
>> of them is a dotted ring which denotes a missing consonant. I-matra
>> must always follow a consonant (in character order). If it is moved to
>> the beginning of a word, it is wrong. If you paste it to a text
>> editor, the OpenType rendering engine should display a missing
>> consonant as a dotted ring (if it is present in the font). In
>> character order the dotted ring will precede I-matra but in visual
>> (glyph) order it will be just the opposite. Thus the asterisk shows the
>> place where you will see the dotted circle. This is just one simple
>> case. I-matra may follow a consonant group, such as in the word PRIY
>> (dear) which is PA+VIRAMA+RA+I-matra+YA or STRIYOCIT (good for women)
>> which is SA+VIRAMA+TA+VIRAMA+RA+I-matra+YA+O-matra+CA+I-matra+TA. Both
>> words will start with the I-matra glyph. The latter will contain two
>> ordering bugs after copy&paste. Consider also the word MURTI (statue)
>> which is a sequence of characters
>> MA+U-matra(long)+RA+VIRAMA+TA+I-matra. Visually the long U-matra will
>> appear as an accent below the MA glyph. The next glyph will be I-matra
>> followed by TA followed by RA shown as an upper accent at the right
>> edge of the syllable. Generally in RA+VIRAMA+consonant+matra the RA
>> glyph appears at the end of the syllable although logically (in
>> character order) it belongs to the beginning. These cases cannot be
>> solved by toUnicode map because many-to-many mappings are not allowed.
>> Moreover, a huge number of mappings would be needed. It would be better
>> to do the reverse processing independent of toUnicode mappings, to use
>> ICU or Pango or Uniscribe or whatever to analyze the glyphs and
>> convert them to characters. The rules are unambiguous but AR does not
>> do it.
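
The reverse processing for the I-matra case described above can be sketched in a few lines of Python. This is a simplified heuristic under stated assumptions, not what ICU, Pango or Uniscribe actually do: it assumes the glyphs have already been mapped to code points in visual order (with conjunct glyphs expanded through 1:many entries), it handles only the I-matra, and it does not handle the RA that is drawn as an upper accent in the MURTI example. The function and variable names are made up for the illustration.

```python
# Simplified sketch: undo the visual reordering of the Devanagari I-matra.
# Input is a string of code points in visual (glyph) order, i.e. what a
# glyph-by-glyph toUnicode lookup would yield. Not a full Indic analyser.
I_MATRA = "\u093F"   # DEVANAGARI VOWEL SIGN I
VIRAMA = "\u094D"    # DEVANAGARI SIGN VIRAMA


def is_consonant(ch):
    # Devanagari consonants KA..HA occupy U+0915..U+0939.
    return "\u0915" <= ch <= "\u0939"


def visual_to_logical(visual):
    out = []
    i, n = 0, len(visual)
    while i < n:
        if visual[i] != I_MATRA:
            out.append(visual[i])
            i += 1
            continue
        # Find the consonant group (consonants joined by VIRAMA) that
        # follows the I-matra glyph, then emit the matra after it.
        j = i + 1
        while j < n and is_consonant(visual[j]):
            j += 1
            if j + 1 < n and visual[j] == VIRAMA and is_consonant(visual[j + 1]):
                j += 1
            else:
                break
        out.append(visual[i + 1:j])
        out.append(I_MATRA)
        i = j
    return "".join(out)


KA, TA, BA, PA, RA, YA = "\u0915", "\u0924", "\u092C", "\u092A", "\u0930", "\u092F"
AA_MATRA = "\u093E"  # DEVANAGARI VOWEL SIGN AA (long A-matra)

kitab_visual = I_MATRA + KA + TA + AA_MATRA + BA
kitab_logical = KA + I_MATRA + TA + AA_MATRA + BA
priy_visual = I_MATRA + PA + VIRAMA + RA + YA
priy_logical = PA + VIRAMA + RA + I_MATRA + YA

print(visual_to_logical(kitab_visual) == kitab_logical)  # True
print(visual_to_logical(priy_visual) == priy_logical)    # True
```

The point of the sketch is only that the I-matra rule is mechanical once the consonant group is identified; a real implementation would work on glyphs and cover the other reordered forms as well.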
>>
>> We discuss nonbreakable spaces while we are not yet able to convert
>> properly printable glyphs to characters when doing copy&paste from
>> PDF...
>>
>>
>> --
>> Zdeněk Wagner
>> http://hroch486.icpf.cas.cz/wagner/
>> http://icebearsoft.euweb.cz
>>
>>
>>
>> --------------------------------------------------
>> Subscriptions, Archive, and List information, etc.:
>>  http://tug.org/mailman/listinfo/xetex


-- 
Zdeněk Wagner
http://hroch486.icpf.cas.cz/wagner/
http://icebearsoft.euweb.cz