[XeTeX]   in XeTeX

Ross Moore ross.moore at mq.edu.au
Sun Nov 13 23:14:28 CET 2011

Hi all,

On 14/11/2011, at 7:55 AM, Zdenek Wagner wrote:

> Before typing a document one should think what will be the purpose of
> it. If the only purpose is to have it typeset by (La)TeX, I would just
> use well known macros and control symbols (~, $, &, %, ^, _). If the
> text should be stored in a generic database, I cannot use ~ because I
> do not know whether it will be processed by TeX. I cannot use
> &nbsp; because I do not know whether it will be processed by
> HTML-aware tools. I cannot even use &#xa0; because the tool used
> for processing the exported data may not understand entities at
> all. In such a case I
> must use U+00a0 and make sure that the tool used for processing the
> data knows how to handle it, or I should plug in a preprocessor. 

This is exactly correct.
Text will be entered using whatever tools are at hand, for storing data.
Such text may well contain characters (rightly or wrongly) that
have not traditionally been used in (La)TeX typesetting.

Thus the problem is: "what should be the default (Xe)TeX behaviour
when encountering such characters in the input stream?"

Currently there is no part of building the XeTeX formats
that handles these, apart from:

   "00A0  (= U+00A0)  being set to have \catcode 12;
        see the coding of  xetex.ini .

Nothing sets any properties of characters in the range:

   U+2000 --> U+200F ,  U+2028 --> U+202F

apart from perhaps in  bidi.sty  which needs the RTL and
LTR marks, ZWNJ and maybe some others.
But  bidi.sty  is optionally loaded by the user, so does not 
count here as the *default* behaviour for XeTeX-based formats.

The result is that these characters just pass through to the 
output, as part of a character string within the PDF,
*provided* the font supports them.

However, the traditional .tfm-based TeX fonts just treat these 
as missing characters, contributing zero to the metric width.
There'll be a message in the .log file:

>>> Missing character: There is no   in font cmr10!
>>> Missing character: There is no   in font cmr10!
>>> Missing character: There is no   in font cmr10!
>>> Missing character: There is no   in font cmr10!

This seems like a reasonable default behaviour, especially
in light of the lack of consensus to do anything else.

One slight problem is that those "Missing character" messages
do not go to the "console" window, but only to the .log  file.
Many users will not notice this.
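
For authors who do want these warnings echoed to the terminal,
classic TeX already provides knobs for this, inherited by XeTeX.
A minimal sketch (put it early in the document or preamble):

```tex
\tracinglostchars=1   % report missing characters in the .log file
\tracingonline=1      % echo diagnostic output to the terminal as well
```

Note that \tracingonline affects *all* diagnostic output, so it can
make the terminal rather noisy in combination with other tracing.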

Although this is just following TeX's design, and was quite sensible 
when TeX was just using its own CMR fonts, I think that XeTeX 
should have directed such warning messages also to the Console. 

XeTeX has stepped out of the tightly controlled environment 
of traditional TeX jobs, so it should also have re-thought 
what counts as an "error", a "warning", or extra technical 
information, and how relevant each of these is to users/authors.

The point here is that users might simply not notice that 
some of the characters in their input may not have been 
processed in the best possible way.
This would be particularly the case for characters that have
no visible rendering, but just insert extra space, as are
being discussed in this thread.

>>> Where would such a default take place:
>>> - XeTeX engine
>>> - XeLaTeX format
>>> - some package (xunicode, fontspec, some new package)

xunicode  doesn't handle the meaning of non-ascii input.
It is designed primarily for mapping legacy ascii-style input
(via macro-names) to the best-possible Unicode code-point(s).

fontspec  isn't right either, as we are talking about spacing,
not actual printed characters from a font.

>>> - my own package/preamble template
>> None of these ?  In a stand-alone file that can be \input
>> by Plain XeTeX users, by XeLaTeX users, and by XeLaTeX
>> package authors.

I think that this counts as a "package", just using a .tex
(or other) suffix, rather than necessarily .sty .
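
As a sketch of what such a stand-alone file might look like
(the filename and the choice of characters are hypothetical;
it uses only primitives available in both plain XeTeX and XeLaTeX,
so it can be \input from either):

```tex
% nbspchars.tex (hypothetical name) -- handle some invisible characters.
\catcode"00A0=\active
\def^^^^00a0{\penalty10000\ }   % U+00A0: unbreakable inter-word space
\catcode"200B=\active
\def^^^^200b{\hskip0pt\relax}   % U+200B: zero-width break opportunity
\endinput
```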

A TEC-kit mapping file is another place where these characters
can be processed; e.g. removed, if there is no need for them
to be part of the final PDF output.

However, this inhibits the possibility of applying logic
earlier, to test the context in which these special
characters occur and act accordingly.
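
For context, such a mapping is compiled (with teckit_compile) and
attached per font; e.g. in plain XeTeX's extended font syntax, where
the font name and the mapping name below are placeholders only:

```tex
% Attach a compiled TECkit mapping (strip-invisibles.tec, hypothetical)
% to a font, so the mapping is applied to all text set in that font.
\font\body="Charis SIL:mapping=strip-invisibles" at 10pt
\body Some sample text.
```

Under XeLaTeX, fontspec exposes the same mechanism via its
Mapping=... font option.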

>> In a future XeTeX variant (if such a thing comes to exist),
>> the functionality could be built into the engine.

Certainly some default behaviour could be included.
But what is best?

Assigning a \catcode of 10 would be appropriate in
some situations, for some characters.
Making some characters active, then giving an expansion,
would be appropriate in other situations.
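
Concretely, the two approaches might look like this (the choice of
characters here is just for illustration):

```tex
% Option 1: have U+00A0 scanned just like an ordinary space token.
\catcode"00A0=10

% Option 2: make a character active and give it an expansion,
% e.g. U+2009 THIN SPACE producing TeX's \thinspace.
\catcode"2009=\active
\def^^^^2009{\thinspace}
```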

Packages could be written for these situations.

But then, as always, it is up to the users to recognise
the issues, for their own particular data and their
own output requirements, then choose packages accordingly.

>> My EUR 0,02 (while we still have one).
>> ** Phil.

Is there a EUR 0,01 coin?   :-)
We lost our AUD 0.01 and 0.02 coins long ago.
There is even talk now of dropping the 0.05 one.

> -- 
> Zdeněk Wagner
> http://hroch486.icpf.cas.cz/wagner/
> http://icebearsoft.euweb.cz



Ross Moore                                       ross.moore at mq.edu.au 
Mathematics Department                           office: E7A-419      
Macquarie University                             tel: +61 (0)2 9850 8955
Sydney, Australia  2109                          fax: +61 (0)2 9850 8114
