[XeTeX] Whitespace in input

Ross Moore ross.moore at mq.edu.au
Thu Nov 17 20:36:27 CET 2011


Hi Phil,

On 17/11/2011, at 23:53, Philip TAYLOR <P.Taylor at Rhul.Ac.Uk> wrote:

> Keith J. Schultz wrote:
>>> 
>>> You mention in a later post that you do consider a space as a printable character.
>>    This line should read as:
>>          You mention in a later post that you consider a space as a non-printable character.
> 
> No, I don't think of it as a "character" at all, when we are talking
> about typeset output (as opposed to ASCII (or Unicode) input).  

This is fine when all that you require of your output is that it be visible on
a printed page. But modern communication media go well beyond that.
A machine needs to be able to tell where words and lines end, to reflow paragraphs when appropriate, and to produce a flat extraction of all the text, perhaps also with some indication of the purpose of that text (e.g. by structural tagging).

In short, what is output for one format should also be able to serve as input for another.

Thus the space certainly does play the role of an output character – though the presence of a gap in the positioning of visible letters may serve this role in many, but not all, circumstances.

> Clearly
> it is a character on input, but unless it generates a glyph in the
> output stream (which TeX does not, for normal spaces) then it is not
> a character (/qua/ character) on output but rather a formatting
> instruction not dissimilar to (say) end-of-line.

But a formatting instruction for one program cannot serve as reliable input for another.
A heuristic is then needed to infer that a programming instruction must have been used, and to guess what kind of instruction it might have been. Such guessing is not 100% reliable, so it is deprecated in modern methods of data storage and document formats.
XML-based formats use tagging, rather than programming instructions. This is the modern way, which is used extensively for communicating data between different software systems.
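
To illustrate the kind of heuristic I mean, here is a rough sketch in Python. It is only an illustration: the glyph records, the function name `extract_text`, and the threshold `gap_ratio` are all made up for this example, and real extraction software will differ in its details. It rebuilds a line of text from positioned glyphs, guessing that a space was intended wherever the horizontal gap is wide enough.

```python
from typing import List, Tuple

# Hypothetical glyph record: (character, x of left edge, advance width).
# All values are in the same units, and a single baseline is assumed.
Glyph = Tuple[str, float, float]

def extract_text(glyphs: List[Glyph], gap_ratio: float = 0.25) -> str:
    """Rebuild one line of text, guessing a space wherever the gap is wide."""
    out = []
    prev_end = None
    for char, x, width in glyphs:
        if prev_end is not None:
            gap = x - prev_end
            # Assumed rule: a gap wider than gap_ratio times the width of
            # the current glyph is taken to be an inter-word space.
            if gap > gap_ratio * width:
                out.append(" ")
        out.append(char)
        prev_end = x + width
    return "".join(out)

# "to be" set with a comfortable word space (gap of 3 units).
normal = [("t", 0.0, 3.0), ("o", 3.0, 5.0), ("b", 11.0, 5.0), ("e", 16.0, 5.0)]

# The same words on a tightly set line (gap shrunk to 1 unit).
tight = [("t", 0.0, 3.0), ("o", 3.0, 5.0), ("b", 9.0, 5.0), ("e", 14.0, 5.0)]

print(extract_text(normal))  # prints "to be"
print(extract_text(tight))   # prints "tobe"
```

The second line loses its word boundary simply because the space was shrunk below the threshold. Whatever threshold is chosen, some combination of tight setting, letter-spacing or kerning can defeat it, which is why a guess of this kind cannot replace an explicit space in the output.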

> 
> ** Phil.

TeX's strength is in its superior ability to position characters on the page for maximum visual effect. This is done by producing detailed programming instructions within the content stream of the PDF output. However, this is not enough for formats such as EPUB, for non-visual reading software, for archival formats, for searchability, and for other such requirements.
Tagged PDF can be viewed as Adobe's response to these requirements, extending the PDF format beyond its visual aspects. It is a direction in which TeX can (and surely must) move, to stay relevant within the publishing industry of the future.


Hope this helps,

     Ross