[XeTeX] A small tip

Thu Nov 17 06:54:06 CET 2005

Hi Will,

On 17/11/2005, at 12:04 AM, Will Robertson wrote:

> Hello,
>
> Don't know how I've used XeTeX for so long without realising the  
> following.
> Since LaTeX can now handle UTF-8 with the inputenc package, it's  
> possible to create a smart em-dash (OPT SHIFT -) that looks good in  
> the source document, only has thin spaces surrounding it, and will  
> allow line breaks after but not before it:
>
> \documentclass{article}
> \def\dash{\unskip\nobreak\thinspace\textemdash\thinspace\ignorespaces}
>
> \expandafter\ifx\csname XeTeXversion\endcsname\relax
>   % if pdfLaTeX:
>   \usepackage[utf8]{inputenc}
>   \DeclareUnicodeCharacter{2014}{\dash}
> \else
>   % if XeLaTeX:
>   \usepackage{fontspec,xunicode}
>   \catcode`\^^^^2014=\active  % or just \catcode`\—=\active
>   \let^^^^2014\dash           % or just \let—\dash
>   \setromanfont{Georgia}
> \fi

Yes; it is a useful aspect of any TeX-based system that this
kind of thing can be done.

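But there is a mistake here, if your \DeclareUnicodeCharacter expects
a decimal argument. In that case it should be:

    \DeclareUnicodeCharacter{8212}{\dash}

using 8212, the decimal form of hexadecimal 2014. (The version supplied
by inputenc's own utf8 option reads its argument as hexadecimal, so
with that version 2014 is already correct.)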

Furthermore, for LaTeX, there are minor flaws with your coding.
That is, with just a bit more work you can turn the above into
something more robust, that should continue to work well, even
as new features are added (to XeTeX, say) in the future.

Firstly, a short macro-name such as \dash  is inadvisable,
for a command that a user is never meant to type directly.
There is too big a chance of this name being chosen by an author
as a user-defined macro for something else.

It would be better to use a longer name, such as \smartemdash ,
which matches the concept that you are trying to implement.

In fact, I'd prefer a name that is even more descriptive of
what is being done, such as  \emdashwithspacing .

Secondly, consider what happens when you use the emdash
within a section heading; e.g.,

  \begin{document}
  \tableofcontents
  \section{meow — meow}
   meow — meow
  \end{document}

Look in the .aux file. You get this mess, from expanding non-robust  
commands:

\@writefile{toc}{\contentsline {section}{\numberline {1}meow \unskip  
\penalty \@M \kern .16667em \textemdash \kern .16667em \ignorespaces   
meow}{1}}

By changing your header to the following:

\DeclareRobustCommand{\emdashwithspacing}%
{\unskip\nobreak\thinspace\textemdash\thinspace\ignorespaces}

\expandafter\ifx\csname XeTeXversion\endcsname\relax
   % if pdfLaTeX:
   \usepackage[utf8]{inputenc}
   \DeclareUnicodeCharacter{8212}{\emdashwithspacing}  % decimal; use 2014 if read as hex
\else
   % if XeLaTeX:
   \usepackage{fontspec,xunicode}
   \catcode`\^^^^2014=\active  % or just \catcode`\—=\active
   \let^^^^2014\emdashwithspacing  % or just \let—\emdashwithspacing
   \setromanfont{Georgia}
\fi

... you get a much cleaner result in the .aux file (and .toc file):

\@writefile{toc}{\contentsline {section}{\numberline {1}meow  
\emdashwithspacing   meow}{1}}

The point is that now the .aux and .toc files get the macro-name
that expresses the concept, rather than its implementation.
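
To see the difference directly, here is a small sketch (the name
\fragiledash is made up for the comparison) that writes the
top-level meaning of each macro to the log:

    \documentclass{article}
    % plain \def: its implementation is exposed when written to a file
    \def\fragiledash{\unskip\nobreak\thinspace\textemdash\thinspace\ignorespaces}
    % robust: the top-level macro is only a protected wrapper
    \DeclareRobustCommand{\emdashwithspacing}{%
      \unskip\nobreak\thinspace\textemdash\thinspace\ignorespaces}
    \begin{document}
    \typeout{fragile: \meaning\fragiledash}
    \typeout{robust:  \meaning\emdashwithspacing}
    \end{document}

The first line should show the whole implementation, while the
second should show only \protect followed by an internal macro
of (nearly) the same name, which is why the name survives in the
.aux file.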

This may seem trivial, since the displayed result appears the same.
It would not be trivial, however, when these auxiliary files are
reused by other applications; e.g., to construct indexes or
hyperlinked tables of contents spanning several such documents.

And speaking of hyperlinking, look what happens with  hyperref.sty .
Generating bookmarks via the  .out  file, we see:

   \BOOKMARK [1][]{section.1}{meow\204meow}{}

This comes from a declaration in  pd1enc.def :
      \DeclareTextCommand{\textemdash}{PD1}{\204} % emdash
and may not be what you want here.

With hyperref loaded in the pdfLaTeX branch, you can add a declaration:

  \pdfstringdefDisableCommands{\renewcommand{\emdashwithspacing}{ -- }}

to get instead the following:

   \BOOKMARK [1][]{section.1}{meow -- meow}{}

With XeTeX, this doesn't work, as there's not yet a proper
driver  hxetex.def  for  hyperref  use with XeTeX.
(That's something that I'll try to provide sometime.)
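
For a single heading under pdfLaTeX, hyperref's \texorpdfstring is
another route: it takes the typeset form and the bookmark form as
two separate arguments. A minimal sketch, typing the macro
explicitly rather than relying on the active character:

    \documentclass{article}
    \usepackage{hyperref}
    \DeclareRobustCommand{\emdashwithspacing}{%
      \unskip\nobreak\thinspace\textemdash\thinspace\ignorespaces}
    \begin{document}
    \tableofcontents
    % first argument is typeset, second goes into the PDF bookmark
    \section{meow \texorpdfstring{\emdashwithspacing}{--} meow}
    meow \emdashwithspacing meow
    \end{document}

This is only convenient for the odd heading; the global
\pdfstringdefDisableCommands declaration covers every use at once.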

>
> \begin{document}
> meow — meow
> \end{document}
>
> Yay. I should do something similar for real unicode curly quotes  
> and Philipp Lehman's great csquotes package...

On another thread, the issue of a Unicode line-separator is being
discussed. What use is it?

Suppose you have another app that spits out data according to some
standard that requires (or desires) this character, and you want
to process that data with (La)TeX. Then one way to handle those
characters is to make them active, expanding into a (robust) macro
which in turn expands into standard (La)TeX coding.

That is exactly what the above discussion of the emdash is doing.
The only difference is that the line-separator does not need to
leave a visible mark on the page --- just whitespace appropriate
to the formatting of the data.
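
As a rough sketch for XeLaTeX only, following the same pattern as
the emdash code above (the macro name \unicodelinesep and the
choice of \newline as the formatting are my assumptions, not
anything settled on the other thread):

    \documentclass{article}
    % hypothetical name; ends the line only when inside a paragraph
    \DeclareRobustCommand{\unicodelinesep}{%
      \ifhmode\unskip\newline\fi\ignorespaces}
    \catcode`\^^^^2028=\active   % U+2028 LINE SEPARATOR
    \let^^^^2028\unicodelinesep
    \begin{document}
    first line^^^^2028second line
    \end{document}

In real data the character would appear literally; the ^^^^2028
notation is used here only so that it stays visible in the source.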

Hope this helps,

	Ross

>
> Will

------------------------------------------------------------------------
Ross Moore                                         ross at maths.mq.edu.au
Mathematics Department                             office: E7A-419
Macquarie University                               tel: +61 +2 9850 8955
Sydney, Australia  2109                            fax: +61 +2 9850 8114
------------------------------------------------------------------------