[XeTeX] Whitespace in input

Tobias Schoel liesdiedatei at googlemail.com
Mon Nov 14 18:42:29 CET 2011



On 14.11.2011 18:30, mskala at ansuz.sooke.bc.ca wrote:
> 1.  No.  That is not what Unicode is for.  Unicode's goal is to subsume
> all reasonable pre-existing encodings.
Unicode is even more than that. Look at all the annexes to Unicode 6.0.

> Some reasonable pre-existing
> encodings include a non-breaking space character, so Unicode includes one.
> That does not mean Unicode says you should actually use it!  There are
> many precedents of Unicode providing multiple ways of representing
> things, as a result of including characters from other systems, without
> it being reasonable to demand that all Unicode-compatible systems must
> support all of them.  For instance, most of the U+FFxx range is devoted
> to different kinds of hacks for handling partial-width characters in
> Asian-language typesetting; the preferred way to do that nowadays is via
> OpenType features, but the code points remain in the standard.  The U+0000
> to U+001F range is basically control characters for Teletype machines;
> some of those, like U+000A and U+000D, are widely used in modern documents
> (but in varying ways by different systems!) and others, like U+001D, are
> virtually unheard-of.  Unicode does NOT say everybody has to support them
> all, let alone all in the same way.
Hmm, I have difficulty understanding the conformance chapter of
Unicode 6.0 exactly ( http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf
), but it seems to me that claiming Unicode support is a very strong
statement.

>
> The U+00A0 code point is not explicitly deprecated in Unicode, but it was
> never a principle of Unicode that all implementations have to support all
> defined control characters regardless of appropriateness to the particular
> purpose.  "Non-breaking space" is, from TeX's point of view, not really a
> character at all, but a formatting command; and TeX already has a way of
> dealing with formatting commands in general and this one in particular.
> It is appropriate to say that the preferred way of handling non-breaking
> spaces in TeX input is the existing TeX way; and saying that in NO WAY AT
> ALL contradicts anything in Unicode.  Unicode is servant, not master.
I think it's more like math being servant _and_ master of natural sciences.
>
> 2. Inevitably, people will include invalid characters in TeX input; and
> U+00A0 is an invalid character for TeX input.  The best way to deal with
> it is to treat it like any other invalid character and generate an error
> message.  A reasonable alternative would be to say "it is whitespace; it
> will be treated like other whitespace."  That would mean ignoring its
> breaking/non-breaking-ness, as we have for a long time similarly ignored
> the special properties of U+0009 (tab).  Of course, if users want to
> define a special meaning for U+00A0 in their own input, they can do so
> with the existing mechanisms for redefining the meanings of input
> characters; but "U+00A0 is equivalent to U+007E (~)," for instance, should
> never be the default and (because of trouble displaying it) shouldn't be
> encouraged.
Now we come to the trouble that Unicode specifies a line-breaking
algorithm ( http://www.unicode.org/reports/tr14/tr14-26.html ), which
probably isn't exactly TeX's. I'm not familiar with these algorithms, so
I can't compare them, but I would ask some master of this art to speak up
about this conflict.

>
> 3. No.  Better to keep everything visible and backward compatible.  U+007E
> (~) should remain the preferred way of doing non-breaking space.
Should and is … (see other posts).
>
> 4. Not applicable because of the answer to #3.  Users who do insist on
> putting U+00A0 in their input presumably have *already* got their own
> reasons to think that it's more convenient for them, including solutions
> satisfactory to themselves for how to type it on keyboards and see it on
> screens, so that's their business and not a problem we need to solve.
>
I'm personally trying hard to find a correct way. So far, I have
found a very simple way to input special whitespace characters.
(On Linux, this is easy with ibus.) Alas, I haven't found any editor
better suited to my TeX needs than Kile, and I haven't yet managed to
get it to highlight these special whitespace characters properly.
=> Some experts can do all these things. That doesn't mean everyone
else should stick to "stupid old" 7-bit ASCII.
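
As a stopgap for the missing editor highlighting, a small script can at
least list where such characters occur in a source file. The following is
only a minimal sketch in Python, not something taken from this thread: the
function name find_special_spaces and the sample string are made up for
illustration, and it only looks at characters in the Unicode general
category Zs (space separators) other than the plain ASCII space.

```python
import unicodedata


def find_special_spaces(text):
    """Yield (line, column, code point, name) for each non-ASCII space separator.

    Hypothetical helper for illustration; line and column are 1-based.
    """
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            # General category "Zs" covers space separators such as U+00A0
            # (no-break space) and U+2009 (thin space). The ordinary ASCII
            # space U+0020 is also Zs, so it is skipped explicitly.
            if ch != " " and unicodedata.category(ch) == "Zs":
                yield line_no, col, f"U+{ord(ch):04X}", unicodedata.name(ch)


# Made-up sample input containing a no-break space and a thin space.
sample = "Dr.\u00a0Knuth wrote 10\u2009pt\nplain text"
for hit in find_special_spaces(sample):
    print(hit)
```

Tabs and other control characters are not in category Zs, so this sketch
does not report them; it would need an extra check if those matter too.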

bye

Toscho

