[XeTeX] Whitespace in input

Sat Nov 19 09:39:15 CET 2011

Keith J. Schultz wrote:

> 	I do not think anybody disputes the fact that characters are not glyphs.
>
> 	The confusion arises that a character in CS is well defined and has a history.
> 	To be more exact it is just one byte in size so that there can be only 256 characters.

Sorry, Keith, this is patently untrue.  Replace "is" by "was once" and
you get a little closer to the truth, but you still completely ignore
issues such as the difference between (say) EBCDIC and ASCII.  CDC machines
used a 60-bit word, and one character was six bits, not eight.  And before
the advent of the extended (eight-bit) character sets, an ASCII character
consisted of seven bits plus a parity bit, thus yielding at most 128
characters, of which 32 (33 if one counts DEL) were reserved for control
functions.
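
To make the arithmetic concrete, here is a minimal sketch in Python.  The
names (code_count and friends) are invented purely for illustration, and
"cp037" is simply one common EBCDIC code page that Python happens to ship;
other EBCDIC variants assign different values.

```python
# Illustrative only: how many distinct codes fit in an n-bit character.
def code_count(bits: int) -> int:
    return 2 ** bits

six_bit = code_count(6)        # six-bit character, as on the CDC machines
seven_bit = code_count(7)      # seven data bits; the eighth bit was parity
eight_bit = code_count(8)      # one eight-bit byte

chars_per_cdc_word = 60 // 6   # six-bit characters packed into a 60-bit word
non_control_7bit = seven_bit - 32  # codes left after the 32 control codes

# The same character has different byte values in ASCII and in EBCDIC
# (cp037 is an assumption: one EBCDIC code page among several).
ascii_a = "A".encode("ascii")
ebcdic_a = "A".encode("cp037")

print(six_bit, seven_bit, eight_bit, chars_per_cdc_word, non_control_7bit)
print(ascii_a.hex(), ebcdic_a.hex())
```

So "one byte, 256 characters" is only one point in a rather larger space:
the size of a character and the value assigned to it have both varied from
system to system.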

> 	The average user considers a glyph to be the same as a "letter" and thereby a character.

It is rarely safe to believe that one knows what the average user thinks ...

> 	Now, in order to process the glyphs with a computer it must be decomposed back to unicode.

But one rarely, if ever, "processes glyphs"; the glyphs are the end result,
not the input.  Glyph processing does become necessary in languages such
as Arabic, where context has a major impact on the way in which the
individual glyphs are presented, but in Western languages the nearest we
get to "glyph processing" is in the formation of ligature digraphs.
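
A small, hedged illustration of the many-glyphs-to-one-character point,
using Python's standard unicodedata module.  Unicode carries a few
"presentation form" code points for compatibility; normally a font or
shaping engine selects such glyphs without any change to the underlying
characters, so this is only a sketch of the mapping, not of how a
typesetting system actually does it.

```python
import unicodedata

# U+FB01 is the compatibility code point for the "fi" ligature glyph.
fi_ligature = "\ufb01"
fi_plain = unicodedata.normalize("NFKC", fi_ligature)

# U+FE8F..U+FE92 are the isolated, final, initial and medial presentation
# forms of the Arabic letter BEH (U+0628): four shapes, one character.
beh_forms = ["\ufe8f", "\ufe90", "\ufe91", "\ufe92"]
beh_chars = {unicodedata.normalize("NFKC", form) for form in beh_forms}

print(repr(fi_plain), len(fi_plain))
print([unicodedata.name(c) for c in beh_chars])
```

Compatibility normalisation maps each presentation form back to the
character or characters it stands for, which is the direction of travel
when text is recovered from something glyph-like.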

> 	How well this is done depends of the system its self. If the system is not fully unicode aware and
> 	implements in properly then there will be problems. What adds to the complexity of the problem is that
> 	not all fonts used for displaying unicode contain all code points, Thereby, creating your many to many
> 	decomposition.
>
> 	As for getting junk when copying unicode, just copy between to text using different fonts, where one font does
> 	not contain the glyph.
>
> 	The only true way to master this problem is if the computer world would go completely full unicode with
> 	fonts support the full unicode code set!

I personally hope that this does not happen, and that before then
we have an "Omnicode consortium" to review the mistakes of Unicode
and to address them in a future, more orthogonal, more consistent,
specification.

Philip Taylor