[XeTeX] Whitespace in input

Ross Moore ross.moore at mq.edu.au
Wed Nov 16 00:53:56 CET 2011


Hi Phil,

On 16/11/2011, at 10:08 AM, Zdenek Wagner wrote:

>> How do you explain to somebody the need to do something really,
>> really special to get a character that they can type, or copy/paste?
>> 
>> There is no special role for this character in other vital aspects
>> of how TeX works, such as there is for $ _ # etc.
>> 
>> 
>>>> 
>>>> In TeX ~ *simulates* a non-breaking space visually, but there is
>>>> no actual character inserted.
>>> 
>>> And I don't agree that a space is a character, non-breaking or not !
>> 
>> In this view you are against most of the rest of the world.
>> 
> TeX NEVER outputs a space as a glyph. Text extraction tools usually
> interpret horizontal spaces of sufficient size as U+0020.

I never said that it did, nor that it was necessary to do so.

Those text extraction tools do a pretty reasonable job, but don't
always get it right. Besides, they rely on a heuristic, which can
fail, especially when content is typeset in a very small font size.
And what about at line-ends? They can get that wrong too.

Such a reliance is rather against the TeX way of doing things,
don't you think?

Better is for TeX itself to apply the heuristic, since it knows
the current font size and the separation between bits of words.
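
As a rough illustration of the kind of gap heuristic under discussion,
here is a minimal Python sketch. It is not the code of any actual
extraction tool: the names Glyph, layout and extract_text, the fixed
half-em glyph width and the 0.2 ratio are all assumptions made up for
this example.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float      # left edge of the glyph, in points
    width: float  # advance width, in points
    size: float   # font size, in points


def layout(words, size, word_gap):
    """Place words left to right on one line.

    Every glyph gets a hypothetical advance of 0.5 * size, and
    `word_gap` points of plain horizontal movement separate the words.
    No space glyph is emitted, as with TeX's interword glue.
    """
    glyphs, x = [], 0.0
    for word in words:
        for ch in word:
            glyphs.append(Glyph(ch, x, 0.5 * size, size))
            x += 0.5 * size
        x += word_gap
    return glyphs


def extract_text(glyphs, ratio=0.2):
    """Rebuild a string, guessing a space wherever the gap to the
    previous glyph exceeds `ratio` times the font size."""
    out, prev = [], None
    for g in glyphs:
        if prev is not None:
            gap = g.x - (prev.x + prev.width)
            if gap > ratio * g.size:
                out.append(" ")
        out.append(g.char)
        prev = g
    return "".join(out)


# A 3pt gap at 10pt is recognised as a word boundary ...
print(extract_text(layout(["no", "break"], 10.0, 3.0)))
# ... but a 1.5pt gap, as on a tightly set line, is not.
print(extract_text(layout(["no", "break"], 10.0, 1.5)))
```

The extractor sees only positions, so whether a boundary is found
depends entirely on the threshold it happens to choose. TeX, by
contrast, knows where the interword glue was.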

> (The exception to the above mentioned "never" is the verbatim mode.)

That isn't good enough for TeX to produce PDF/A.
Go and watch the videos that I pointed you to.


Lower down I give a run-down of how a variant of TeX handles
this problem, to very good effect.

> 
>> If the output is intended to be PDF, as it really has to be with
>> XeTeX, then the specifications for the modern variants of PDF
>> need to be consulted.
>> 
>> With PDF/A and PDF/UA and anything based on ISO-32000 (PDF 1.7)
>> there is a requirement that the included content should explicitly
>> provide word boundaries. Having a space character inserted is by
>> far the most natural way to meet this specification.
> 
> A space character is a fixed-width glyph. If you insist on it, you
> will never be able to typeset justified paragraphs, you will move back
> to the era of mechanical typewriters.

Absolutely wrong!

I'm not insisting on it being included as the natural way to 
separate words within the PDF, though it certainly is a possible
way that is used by other software.

>> (This does not mean that having such a character in the output
>> need affect TeX's view of typesetting.)

Clearly you never even read this parenthetical statement ...

>> 
>> Before replying to anything in the above paragraph, please
>> watch the video of my recent talk at TUG-2011.

 ... and certainly you don't seem to have followed up on this
piece of advice, to get a better perspective of what I'm talking
about.

>> 
>>  http://river-valley.tv/further-advances-toward-tagged-pdf-for-mathematics/
>> 
>> or similar from earlier years where I also talk a bit about such things.



Here is how you get *both* TeX-quality typesetting and explicit
spaces as word-boundaries inside the PDF, with no loss of quality.

What the experimental tagged-pdfTeX does is to use a font (called
"dummy-space") that contains just a single character at code U+0020,
at a size that is almost zero -- it cannot be exactly zero, else
PDF browsers may not select it for copy/paste, or other text-extraction.

These extra spaces are inserted into the PDF content stream, *after*
TeX has determined the correct positioning for high-quality typesetting.
That is, it is *not* done by macros or widgets or suchlike, but is
done internally by the pdfTeX engine at shipout time.

The almost-zero size has no perceptible effect on the visual output.
But the existence of these extra space characters means that all
text-extraction methods work much more reliably.
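
To make the idea concrete, here is a minimal Python sketch of what a
page content stream with such dummy spaces might look like. It is not
the output of tagged-pdfTeX: the resource names /F1 and /Fsp, the size
0.001 and the function names are assumptions chosen for illustration.
Only the operators BT, ET, Tf, Tm and Tj are standard PDF, and both
fonts would have to be declared in the page's resources.

```python
def escape_pdf(text):
    """Escape backslashes and parentheses for a PDF literal string."""
    return (text.replace("\\", "\\\\")
                .replace("(", "\\(")
                .replace(")", "\\)"))


def content_stream(words, positions, y, font="/F1", size=10.0,
                   space_font="/Fsp", space_size=0.001):
    """Emit text operators for one line of already-positioned words.

    `positions` holds the x coordinate chosen by the typesetter for
    each word. Between words, a real space character is shown in a
    nearly zero-size font, so it does not disturb the layout.
    """
    ops = ["BT"]
    for i, (word, x) in enumerate(zip(words, positions)):
        if i > 0:
            # Tiny explicit space, placed right after the previous word.
            ops.append(f"{space_font} {space_size} Tf")
            ops.append("( ) Tj")
        ops.append(f"{font} {size} Tf")
        # Absolute placement: the word lands exactly where it was set.
        ops.append(f"1 0 0 1 {x} {y} Tm")
        ops.append(f"({escape_pdf(word)}) Tj")
    ops.append("ET")
    return "\n".join(ops)


print(content_stream(["Hello", "world"], [72.0, 101.5], 700.0))
```

Each word is still placed at the exact position computed by the
typesetter; the space character is there only for text extraction.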

There *are* extra primitives that can be used to turn this off and on
in places where such extra spaces are not wanted; e.g. in math.
And there is a primitive to insert such a space, in case it is required
manually, for whatever reason. All of these primitives are used
extensively when generating tagged PDF of mathematical expressions,
and are thus available for other usage too.


>> 
>>> 
>>> ** Phil.

Hope this helps,

	Ross

------------------------------------------------------------------------
Ross Moore                                       ross.moore at mq.edu.au 
Mathematics Department                           office: E7A-419      
Macquarie University                             tel: +61 (0)2 9850 8955
Sydney, Australia  2109                          fax: +61 (0)2 9850 8114
------------------------------------------------------------------------





