[XeTeX] Input encodings, etc., was...

Will Robertson will at guerilla.net.au
Fri Nov 18 07:32:01 CET 2005


On 18/11/2005, at 1pm, Ross Moore wrote:
> The  \DeclareUTFcharacter  produces the UTF8 (Unicode) point for
> a character, using the same macro-name that you were already using.

But don't you think it's confusing that \DeclareUTFcharacter and
\DeclareUnicodeCharacter do somewhat opposite things? In xunicode,
you say \textsection is actually the UTF8 character 00A7 (output/font
encoding):

   \DeclareUTFcharacter[\UTFencname]{x00A7}{\textsection}

In  utf8enc.dfu , they say that a UTF8 character (input encoding) is  
actually a \textsection:

   \DeclareUnicodeCharacter{00A7}{\textsection}

My point is that the two commands look very similar, but don't do the  
same thing; my original suggestion was no good, now that I understand  
what's going on. Ideally, I'd call them something like  
\DeclareUnicodeOutputChar and \DeclareUnicodeInputChar  
(respectively), but we can't change the inputenc definition, of course.
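
To see the mismatch side by side (both declarations quoted from above):

   % inputenc (utf8enc.dfu): input code point 00A7  ->  LICR command
   \DeclareUnicodeCharacter{00A7}{\textsection}
   % xunicode: LICR command  ->  code point 00A7 in the output font
   \DeclareUTFcharacter[\UTFencname]{x00A7}{\textsection}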

As an aside, wouldn't it be cleaner to write
   \DeclareUTFcharacter{x00A7}{\textsection}
instead, since the optional argument is [\UTFencname] by default?
I'm happy to make any of the changes I'm suggesting, but I don't know
how you'd feel about that...

> Furthermore, there is no documentation on these macro names, since
> ideally *all of them* are deprecated --- ultimately the strategy
> for new documents with XeTeX is to type the UTF characters directly
> into the document source.

Just a question about this last point: how do people actually enter
most of the non-ASCII characters directly into their source? I'm all
for being able to type a literal alpha in my math sources to get a  
math alpha, but unless the text editor "auto-completes" this for me,  
along the lines of how iTeXMac 2 does (although only for viewing), I  
don't see how this is actually convenient to use...

(Will we, in the future, have multiple keyboards on our desks, each
covering a different subset of Unicode?)

> Furthermore, \DeclareUTFcharacter  does its work "nicely".
> That is, you only get the UTF8 character when the input-encoding
> is 'U'. So you need to use something like what  \setromanfont
> does, to state that you have a font with more capabilities than
> the old 8-bit fonts used by traditional TeX systems.
>
>  xunicode  does not supply these fonts, nor access to such fonts,
> whereas the inputenc-modules *do* supply both macros and the way
> to access requisite fonts.

Are you getting mixed up here between input and output (font) encodings?

inputenc doesn't provide the mechanism to access fonts; it provides a  
mapping from the input encoding to the LICR. From the LICR, the font  
encoding then determines how to get the correct glyphs (or built-up  
characters) in the current font.

And again, the font encoding is smart in that it adapts its
definitions to the fonts being used: for old-style (OT1) TeX fonts, it
will fake an accent with the \accent primitive; for a newer T1 font it
can insert the pre-drawn glyph.
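
For instance (these declarations are from memory of utf8enc.dfu,
ot1enc.def and t1enc.def, so treat the slot numbers as illustrative):

   % input encoding: the UTF8 bytes for 00E4 map to the LICR form \"a
   \DeclareUnicodeCharacter{00E4}{\"a}
   % OT1 font encoding: \" is an accent, built at run time with \accent
   \DeclareTextAccent{\"}{OT1}{127}
   % T1 font encoding: a-dieresis is a single pre-drawn glyph in the font
   \DeclareTextComposite{\"}{T1}{a}{228}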

xunicode performs a job that's much more similar (or even identical)  
to the font encoding. Whether the macros that it's converting are  
actual macros from the LICR or more convenient user shortcuts (\S vs.  
\textsection, for example) is less relevant.
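
Roughly, the parallel I mean is (the T1 line is from t1enc.def, again
from memory; the xunicode line is the one quoted earlier):

   % LICR -> slot in an 8-bit font:
   \DeclareTextSymbol{\textsection}{T1}{159}
   % LICR -> Unicode code point in the output font:
   \DeclareUTFcharacter[\UTFencname]{x00A7}{\textsection}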

> <snip>
>
> [ucsencs.def] gives macro-names for a lot more characters, such as  
> modern Greek and Hebrew.
>
> These *should* be added to  xunicode.sty .
> But in doing so, it would indeed be useful to have a modular  
> structure,
> with options for loading just those bits which are needed.
> This is what you asked for, so I may up its priority.

I think further developments should look at how, if at all,
multiple .fd files can be used and loaded for various subsets of the
Unicode space. But I'm not so convinced that people actually *input*
Greek and Hebrew using these macros. As far as I understand, they are
used to implement the decoupling of input and output encodings in
TeX, which is much less relevant with XeTeX since Unicode is
supported the whole way through.
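
On the .fd idea, I'm imagining something along these lines (the names
here are entirely made up, just to show the shape of the thing; they
aren't anything xunicode or XeTeX actually defines):

   % purely hypothetical: a family that would be loaded only when
   % characters from, say, the Greek block are requested
   \DeclareFontFamily{U}{unigreek}{}
   \DeclareFontShape{U}{unigreek}{m}{n}{<-> "Gentium"}{}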

The conversation we had with the LaTeX3 people just over a year ago  
helped me very much to understand what was going on with input and  
font encodings, but I'm still working out the ramifications in my head.

For example, it doesn't seem useful at all to make all higher-plane
characters active in XeTeX so that they can be converted to
TeX-normalised macros, which in turn refer to Unicode characters in
the output. The only advantage of this scheme might be to implement a
font fall-back mechanism, but I think that would be done more
efficiently within XeTeX itself.
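
Just so it's clear what I mean, the scheme I'm arguing against would
look roughly like this (only a sketch; \textalpha is a made-up
stand-in for whatever the normalised macro would be, and I'm assuming
XeTeX's extended \catcode range and ^^^^ notation):

   \catcode"03B1=13          % make U+03B1 (Greek alpha) active
   \def^^^^03b1{\textalpha}  % expand it to a normalised macro, which then
                             % has to map back to a Unicode output character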

I've ended up talking about much more than I intended; some of it was  
explanation for my own sake, but hopefully it all made sense... What
are your thoughts?

Will


