[XeTeX] Whitespace in input

Thu Nov 17 22:54:32 CET 2011

Hi Phil,

On 18/11/2011, at 6:56 AM, Philip TAYLOR wrote:

> Ross, I do not dispute your arguments : I was answering
> Keith's question in an honest way.  I (personally) do not
> think of a space in TeX output as a character at all,
> because I am steeped in TeX philosophy; but I am quite
> willing to accept that /if/ the objective is not to
> produce output for the sake of output, but output for
> subsequent processing as input by another program, then
> there /may/ be an argument for outputting a space as a
> variable-width glyph.
> 
> However, I do think that what appears in the output stream
> is a secondary consideration; far more important (IMHO) is
> how we represent that space /within XeTeX/.  

Do you realise how XeTeX works?
Especially when handling non-Latin-based languages?

Essentially it does *nothing at all* about glyph selection or
placement after macro expansion.

Instead it passes strings of characters (tokens are converted back 
to characters) to an external process --- namely the font-handling
aspects provided by the computer's operating system, or other
software. What returns is a piece of PDF output, along with 
height/depth/width of this piece (i.e. a TeX-like box). 

It is this external software that has been designed to encode the
knowledge of how the particular language script is structured.
It makes all the detailed decisions about character placement,
perhaps using information contained within the font itself.

Indeed for many fonts, there are no such decisions left to make,
since the font itself handles the placement. All that is needed is
to place the character string in the most appropriate position on
the page.

XeTeX does play a role in determining whether the box fits on the
line being built. If not, then hyphenation points come into play,
so that alternative break-ups of the character string into smaller
pieces must be considered.

Why am I describing this in such detail? ...

> There is, I am
> sure, not a suggestion on the table that we start to treat
> a conventional space in XeTeX other than as TeX has traditionally
> treated it, and therefore the real question is (to my mind),
> "do we adopt an extension of this traditional TeX treatment
> for non-breaking space, thin-space, and any of the other
> not-quite-standard spaces that Unicode encompasses,

 ... 
Well what if those "not-quite-standard" space characters
actually play a vital role in the layout of a language script?

Indeed some of them do. For instance, other threads on this
XeTeX list are talking about ZWJ and ZWNJ, and I've already
mentioned things like the LTR and RTL indicators.

Almost certainly many of the other characters are handled
specially already by the OS software that XeTeX passes the
main decisions to. So changing this at input level for XeTeX
could completely change the visual appearance of the output,
in ways that TeX software has no way to fix.

In other words, those extra space "characters" are programming
instructions for other non-TeX-based software. XeTeX needs to 
pass them on unchanged, if that software is to give back to
XeTeX the high-quality typeset output building blocks that 
it needs to position on the page.

By accepting Unicode input, and passing it along to other
software, TeX has inherited the ability to handle many, many
more languages and scripts than it ever could do properly before.
It has also made a much richer set of fonts available
for use in XeTeX-produced PDFs.

It does these things by piggy-backing on the work of others, 
developed by people who might have absolutely no idea of what TeX 
is, nor how it works, and probably would not care even if they did.
It is a win-win all round --- something that is very rare these days. 

But this does come with a price.
It means that XeTeX-produced output can be OS-dependent, 
unlike that of other TeX software!

Also, successful compilation to the desired output can be
dependent on having the correct version of a font installed.
Many posts on the XeTeX list have been about such issues.

> or do
> we look for an alternative model which /might/ be glyph-
> or character-based ?".

My view is "no, we should not", at least not as
the default way that XeTeX handles its input.

By all means write packages that can be used in particular 
situations where such characters are producing observable
unwanted effects on the final output.
But this should be done at the package level 
(e.g. by a \catcode change and a macro definition).

Then the source document will have a line in the preamble
that indicates that there could be a deviation from default
behaviours. This is an indication that there is something
special about the source stream, and someone with appropriate
knowledge has worked out how to deal with it.
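
As a rough sketch of the kind of preamble-level change meant here
(not a tested package, just an illustration), the lines below make
one character active and give it a macro definition. The choice of
U+00A0 (NO-BREAK SPACE) and of mapping it to LaTeX's \nobreakspace
are assumptions made for the example; which characters to treat,
and how, depends entirely on the document.

```latex
% Sketch only: U+00A0 and \nobreakspace are illustrative choices,
% not something prescribed in this thread.
\documentclass{article}

% Make U+00A0 (NO-BREAK SPACE) an active character ...
\catcode"00A0=\active
% ... and define it to behave like LaTeX's tie (~).
% ^^^^00a0 is XeTeX's ASCII notation for the character U+00A0.
\protected\def^^^^00a0{\nobreakspace}

\begin{document}
% The ^^^^00a0 below stands for a literal U+00A0 in the source.
Figure^^^^00a01 is kept together on one line.
\end{document}
```

Anyone reading the preamble then sees straight away that this one
character is being handled by TeX itself, while every other
code-point still goes through unchanged.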

But for general (default) usage, the non-ASCII characters
representing Unicode code-points that go in should be treated
as exactly those Unicode code-points. 

Alternatively, use the editor to change the unwanted characters 
to ordinary spaces, or whatever else works well with TeX processing.

This is actually my preference in these situations, as there is 
a definite advantage in keeping the (La)TeX input source clean.
At some time you might want to use it with a different processor,
which might not have an easy in-built way to handle the problematic
characters. 

> 
> ** Phil.

Hope this helps clarify any misconceptions,

	Ross

------------------------------------------------------------------------
Ross Moore                                       ross.moore at mq.edu.au 
Mathematics Department                           office: E7A-419      
Macquarie University                             tel: +61 (0)2 9850 8955
Sydney, Australia  2109                          fax: +61 (0)2 9850 8114
------------------------------------------------------------------------