[XeTeX] Whitespace in input

Mon Nov 14 17:30:11 CET 2011

I think this discussion is bogging down because several different
questions are getting mixed together.  Here's what I see as the major
issues:

1. Does Unicode specify a single correct way of representing white space?

2. If an input file to XeTeX contains currently less common Unicode
whitespace code points, such as U+00A0, what should XeTeX do?

3. Should users be encouraged, or even required, to include those code
points in input to XeTeX, in order to achieve typesetting goals that in
older TeX engines were achieved by other means?

4. Since many editing environments make it inconvenient to process
currently less common Unicode whitespace code points, what should users do
if the answer to #3 is "yes"?

Now, separate from identifying what the questions are, here's what I think
are reasonable answers to the questions:

1.  No.  That is not what Unicode is for.  Unicode's goal is to subsume
all reasonable pre-existing encodings.  Some reasonable pre-existing
encodings include a non-breaking space character, so Unicode includes one.
That does not mean Unicode says you should actually use it!  There are
many precedents of Unicode providing multiple ways of representing
things, as a result of including characters from other systems, without
it being reasonable to demand that all Unicode-compatible systems must
support all of them.  For instance, most of the U+FFxx range is devoted
to different kinds of hacks for handling partial-width characters in
Asian-language typesetting; the preferred way to do that nowadays is via
OpenType features, but the code points remain in the standard.  The U+0000
to U+001F range is basically control characters for Teletype machines;
some of those, like U+000A and U+000D, are widely used in modern documents
(but in varying ways by different systems!) and others, like U+001D, are
virtually unheard-of.  Unicode does NOT say everybody has to support them
all, let alone all in the same way.

The U+00A0 code point is not explicitly deprecated in Unicode, but it was
never a principle of Unicode that all implementations have to support all
defined control characters regardless of appropriateness to the particular
purpose.  "Non-breaking space" is, from TeX's point of view, not really a
character at all, but a formatting command; and TeX already has a way of
dealing with formatting commands in general and this one in particular.
It is appropriate to say that the preferred way of handling non-breaking
spaces in TeX input is the existing TeX way; and saying that in NO WAY AT
ALL contradicts anything in Unicode.  Unicode is servant, not master.

2. Inevitably, people will include invalid characters in TeX input; and
U+00A0 is an invalid character for TeX input.  The best way to deal with
it is to treat it like any other invalid character and generate an error
message.  A reasonable alternative would be to say "it is whitespace; it
will be treated like other whitespace."  That would mean ignoring its
breaking/non-breaking-ness, as we have for a long time similarly ignored
the special properties of U+0009 (tab).  Of course, if users want to
define a special meaning for U+00A0 in their own input, they can do so
with the existing mechanisms for redefining the meanings of input
characters; but "U+00A0 is equivalent to U+007E (~)," for instance, should
never be the default and (because of trouble displaying it) shouldn't be
encouraged.
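
To make the two default policies above concrete, here is a rough sketch of
them as a check run over the input text outside of TeX.  It is only an
illustration, not how XeTeX behaves or is implemented: the names
check_input and is_uncommon_space are made up for this example, and
"uncommon whitespace" is assumed to mean any code point in Unicode general
category Zs other than U+0020.

```python
import unicodedata

# Whitespace that TeX input has always contained.
ORDINARY = {" ", "\t", "\n", "\r"}


def is_uncommon_space(ch):
    """True for Unicode space separators (category Zs) other than U+0020."""
    return ch not in ORDINARY and unicodedata.category(ch) == "Zs"


def check_input(text, policy="error"):
    """Apply one of the two policies from answer 2 to a string of input.

    policy="error": raise ValueError at the first uncommon space.
    policy="space": replace each uncommon space with U+0020, discarding
                    any non-breaking property it had.
    """
    if policy not in ("error", "space"):
        raise ValueError(f"unknown policy: {policy!r}")
    out = []
    for lineno, line in enumerate(text.splitlines(keepends=True), start=1):
        for col, ch in enumerate(line, start=1):
            if is_uncommon_space(ch):
                if policy == "error":
                    raise ValueError(
                        f"line {lineno}, column {col}: "
                        f"invalid character U+{ord(ch):04X}"
                    )
                ch = " "  # hypothetical "treat it like other whitespace"
            out.append(ch)
    return "".join(out)
```

With policy="space", U+00A0 is handled the same way tab already is: it
separates words and nothing more.  Input that uses ~ for its non-breaking
spaces passes through unchanged under either policy.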

3. No.  Better to keep everything visible and backward compatible.  U+007E
(~) should remain the preferred way of doing non-breaking space.

4. Not applicable because of the answer to #3.  Users who do insist on
putting U+00A0 in their input presumably have *already* got their own
reasons to think that it's more convenient for them, including solutions
satisfactory to themselves for how to type it on keyboards and see it on
screens, so that's their business and not a problem we need to solve.

-- 
Matthew Skala
mskala at ansuz.sooke.bc.ca                 People before principles.
http://ansuz.sooke.bc.ca/