[XeTeX] Re: XeTeX & Unicode vs. standard LaTeX

Jonathan Kew jonathan_kew at sil.org
Sun Oct 10 21:26:23 CEST 2004


Hi Zsolt,

Thanks for your message. A couple of comments below. (Copied to XeTeX 
list with Zsolt's permission, as I think the response will be of wider 
interest.)

On 9 Oct 2004, at 9:32 pm, Zsolt Kiraly wrote:

> Hi Jonathan,
>
> I saw on the mailing list that there is some discussion on whether 
> XeTeX should be LaTeX compatible regarding curly quotes, dashes, 
> apostrophes, etc. Some people would like complete compatibility, and 
> others think that we would be better off writing our text in pure 
> Unicode with Unicode quotes, Unicode dashes, and so on. But you know 
> all of this.
>
> For me the problem of writing Unicode documents lies in the keyboard. 
> The current Mac keyboards are not built to write Unicode curly quotes 
> and dashes. It is inconvenient to look up the code table for every 
> apostrophe and endash.

The Mac U.S. English keyboard (and other keyboards, I assume) has had 
conventions for entering these characters for a long time: option-[ and 
option-] for the opening double and single curly quotes, and 
shift-option-[ and shift-option-] for the closing versions; and 
option-hyphen and shift-option-hyphen for en- and em-dashes. But I'm 
sure many users are unaware of these. Programs like MS Word tend to 
"auto-correct" simple ASCII typing with a "smart quotes" feature, etc., 
and TeX users, of course, are familiar with TeX's ASCII-based 
conventions, which are often more convenient to type than the 
modifier-key combinations used in the MacRoman layouts.

> Maybe the solution would be the use of a preprocessor that 
> converted standard LaTeX quotes and dashes, etc., into their Unicode 
> equivalents and gave its output to XeTeX to process. People who wanted 
> LaTeX compatibility would be happy, and people who wanted straight 
> Unicode would have the ability to turn off the preprocessor.
>
> The T1.enc file has a set of standard LaTeX ligatures to enforce, 
> although the ' apostrophe would still need to be mapped to the curly 
> apostrophe.
>
> All of this must be transparent to the user, and a simple option to 
> the XeTeX executable should be enough to turn the preprocessor on or 
> off. This way \include-ed files and BibTeX and index files would also 
> be automatically preprocessed if the option is on.
>
> Do you think this would solve a lot of people's problems? I'd be 
> interested in any thoughts you might have on this subject.

I don't think a preprocessor is the right way to solve this. For one 
thing, it would be impossible for a preprocessor (unless it included a 
full TeX parser and macro system!) to know whether there might be 
instances of "--", for example, that *shouldn't* be converted to 
\char"2013. Would this be a problem in practice? Yes! Imagine 
typesetting a document that includes fragments of C/C++ source code; 
"--" is a common C operator.
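
To make the C example concrete, here is a minimal Python sketch of the 
kind of font-unaware preprocessor being proposed. It is only an 
illustration: the names REPLACEMENTS and naive_preprocess are made up 
here, and a real tool could be cleverer, but without a TeX parser it 
has no way to tell prose from a code listing.

```python
# Hypothetical preprocessor: plain text substitution, with no knowledge
# of TeX macros, verbatim material, or which font is in use.
REPLACEMENTS = [
    ("---", "\u2014"),  # em dash; must be tried before "--"
    ("--", "\u2013"),   # en dash
    ("``", "\u201C"),   # opening double quote
    ("''", "\u201D"),   # closing double quote
]


def naive_preprocess(source: str) -> str:
    """Apply every replacement everywhere in the source text."""
    for ascii_seq, unicode_char in REPLACEMENTS:
        source = source.replace(ascii_seq, unicode_char)
    return source


prose = "pages 10--20"
c_code = r"\begin{verbatim} while (n--) sum += n; \end{verbatim}"

# The prose comes out as intended, with an en dash between the numbers.
print(naive_preprocess(prose))
# The C decrement operator is also rewritten to an en dash, which
# corrupts the listing.
print(naive_preprocess(c_code))
```

The second call is the failure case: the substitution cannot know that 
the "--" inside the verbatim material must be left alone.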

These TeX conventions are actually implemented as ligatures, and the 
right place to solve the problem is where ligatures are defined: at the 
font level. It would be possible for AAT or OpenType fonts to include 
ligature rules for these typical TeX conventions. (Note, incidentally, 
that not all the standard TeX fonts implement the same set of 
ligatures; there's no "--" ligature in cmtt, for example. This is also 
a clue that a preprocessor, which would be unaware of fonts, is not the 
answer.)

However, we obviously cannot expect mainstream font vendors to add 
support for TeX's unique keying conventions to their font tables. 
Therefore, I have just implemented a "font mapping" scheme (this was 
first suggested on the XeTeX list by Ross Moore, IIRC), which allows an 
arbitrary mapping of Unicode character sequences to be associated with 
a particular font. So having defined a mapping "tex-text" that includes 
entries such as:

     U+002D U+002D         >  U+2013 ; endash
     U+002D U+002D U+002D  >  U+2014 ; emdash
     U+0060 U+0060         >  U+201C ; opening double quote
     ; etc....

I can then load a font with a command like

     \font\pal = "Palatino:mapping=tex-text" at 12pt

and whenever this font is used, XeTeX will pass the Unicode character 
sequence to be typeset (at the lowest level, after all macro expansion, 
etc.) through this mapping, and the standard TeX ligatures will work as 
expected.
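
As a rough illustration of the idea (not XeTeX's actual code), the 
Python sketch below applies such a table to a string of code points. 
It assumes that the longest matching sequence wins, so that "---" 
becomes an em dash rather than an en dash followed by a hyphen. The 
names TEX_TEXT, FONT_MAPPINGS, apply_mapping and typeset are invented 
for the example, and the table holds only the three entries shown 
above.

```python
# Illustrative mapping table: code-point sequences -> one code point.
TEX_TEXT = {
    (0x002D, 0x002D): 0x2013,          # endash
    (0x002D, 0x002D, 0x002D): 0x2014,  # emdash
    (0x0060, 0x0060): 0x201C,          # opening double quote
}

# Hypothetical font table: each font optionally names a mapping.
FONT_MAPPINGS = {
    "pal": TEX_TEXT,  # like "Palatino:mapping=tex-text"
    "mono": None,     # a font loaded without any mapping
}


def apply_mapping(text, mapping):
    """Replace mapped sequences in text, preferring the longest match."""
    if not mapping:
        return text
    codes = [ord(ch) for ch in text]
    max_len = max(len(seq) for seq in mapping)
    out = []
    i = 0
    while i < len(codes):
        # Try the longest candidate sequence first, then shorter ones.
        for length in range(min(max_len, len(codes) - i), 0, -1):
            target = mapping.get(tuple(codes[i:i + length]))
            if target is not None:
                out.append(target)
                i += length
                break
        else:
            # Nothing matched here: copy one code point unchanged.
            out.append(codes[i])
            i += 1
    return "".join(chr(c) for c in out)


def typeset(font, text):
    """Run text through whatever mapping the chosen font carries."""
    return apply_mapping(text, FONT_MAPPINGS[font])


print(typeset("pal", "``pages 10--20"))  # mapped text
print(typeset("mono", "while (n--)"))    # left untouched
```

Because the mapping belongs to the font, text set in a font without it 
(a typewriter font used for code, say) passes through unchanged, which 
is exactly what a preprocessor cannot arrange.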

This was just implemented on Friday, and seems to be working well. It 
will be present in the next release of XeTeX (along with that OpenType 
ligature bug-fix, and perhaps another feature or two). Stay tuned! :-)

Jonathan


