[XeTeX] Whitespace in input

Thu Nov 17 21:49:11 CET 2011

2011/11/17 Ross Moore <ross.moore at mq.edu.au>:
> Hi Phil,
> On 17/11/2011, at 23:53, Philip TAYLOR <P.Taylor at Rhul.Ac.Uk> wrote:
>
> Keith J. Schultz wrote:
>
> You mention in a later post that you do consider a space as a printable
> character.
>
>    This line should read as:
>
>          You mention in a later post that you consider a space as a
> non-printable character.
>
> No, I don't think of it as a "character" at all, when we are talking
> about typeset output (as opposed to ASCII (or Unicode) input).
>
> This is fine when all that you require of your output is that it be visible
> on a printed page. But modern communication media go well beyond that.
> A machine needs to be able to tell where words and lines end, to reflow
> paragraphs when appropriate, and to produce a flat extraction of all the
> text, perhaps also with some indication of the purpose of that text (e.g. by
> structural tagging).
> In short, what is output for one format should also be able to serve as
> input for another.
> Thus the space certainly does play the role of an output character - though
> the presence of a gap in the positioning of visible letters may serve this
> role in many, but not all, circumstances.
>
> Clearly
> it is a character on input, but unless it generates a glyph in the
> output stream (which TeX does not, for normal spaces) then it is not
> a character (/qua/ character) on output but rather a formatting
> instruction not dissimilar to (say) end-of-line.
>
> But a formatting instruction for one program cannot serve as reliable input
> for another.
> A heuristic is then needed, to attempt to infer that a programming
> instruction must have been used, and guess what kind of instruction it might
> have been. This is not 100% reliable, so is deprecated in modern methods of
> data storage and document formats.
> XML-based formats use tagging, rather than programming instructions. This is
> the modern way, which is used extensively for communicating data between
> different software systems.
>
Yes, that's the point. The goal of TeX is nice typographical
appearance. The goal of XML is easy data exchange. If I want to send
structured data, I send XML, not PDF.
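
To make the heuristic mentioned above concrete, here is a minimal sketch in
Python of how an extraction tool might guess word spaces from glyph positions
alone. The `Glyph` record, the `extract_text` function, and the 0.25 threshold
are all hypothetical, chosen only for illustration; real extractors use more
elaborate rules.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str     # the character this glyph is assumed to represent
    x: float      # left edge on the line, in points
    width: float  # advance width, in points


def extract_text(glyphs, font_size=10.0, gap_ratio=0.25):
    """Rebuild one line of text, guessing a space wherever the horizontal
    gap between neighbouring glyphs exceeds gap_ratio * font_size.
    The threshold is an assumption, not a value taken from any tool."""
    out = []
    prev = None
    for g in sorted(glyphs, key=lambda g: g.x):
        if prev is not None:
            gap = g.x - (prev.x + prev.width)
            if gap > gap_ratio * font_size:
                out.append(" ")
        out.append(g.char)
        prev = g
    return "".join(out)


# A line set with a 3.3pt interword gap: the guess succeeds.
normal = [Glyph("a", 0.0, 5.0), Glyph("b", 5.0, 5.0),
          Glyph("c", 13.3, 5.0), Glyph("d", 18.3, 5.0)]

# The same words on a tightly set line, where the interword glue has
# shrunk to 2pt: the guess fails and the two words run together.
tight = [Glyph("a", 0.0, 5.0), Glyph("b", 5.0, 5.0),
         Glyph("c", 12.0, 5.0), Glyph("d", 17.0, 5.0)]

normal_text = extract_text(normal)
tight_text = extract_text(tight)
```

With these made-up numbers the first line is recovered with its space and the
second is not, which is the sense in which a gap-based guess is not 100%
reliable.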

> ** Phil.
>
> TeX's strength is in its superior ability to position characters on the page
> for maximum visual effect. This is done by producing detailed programming
> instructions within the content stream of the PDF output. However, this is
> not enough to meet the needs of formats such as EPUB, non-visual reading
> software, archival formats, searchability, and other needs.
> Tagged PDF can be viewed as Adobe's response to address these requirements
> as an extension of the visual aspects of the PDF format. It is a direction
> in which TeX can (and surely must) move, to stay relevant within the
> publishing industry of the future.
>
> Hope this helps,
>      Ross
>
No, it does not help. Remember that the last (almost) portable version
of PDF is 1.2. If you open a tagged PDF, or even a PDF with a
ToUnicode map or a colorspace other than RGB or CMYK, in Acrobat Reader
3, it displays a fatal error and dies. I reported it to Adobe in March
2001 and they did nothing. I even reported another fatal bug in
January 2001. I sent sample files but nothing happened; Adobe just
stopped development of Acrobat Reader at the buggy version 3 for some
operating systems. Why do you rely so much on Adobe? When exchanging
structured documents I will always do it in XML and never create
tagged PDF, because I know that some users will be unable to read it
with Adobe Acrobat Reader. I do not wish to make them dependent on
Ghostscript and similar tools.


-- 
Zdeněk Wagner
http://hroch486.icpf.cas.cz/wagner/
http://icebearsoft.euweb.cz