# [XeTeX] Input encodings, etc., was...

Will Robertson will at guerilla.net.au
Fri Nov 18 07:32:01 CET 2005

On 18/11/2005, at 1pm, Ross Moore wrote:
> The  \DeclareUTFcharacter  produces the UTF8 (Unicode) point for
> a character, using the same macro-name that you were already using.

But you don't think that having \DeclareUTFcharacter and
\DeclareUnicodeCharacter do somewhat opposite things is confusing? In
xunicode, you say \textsection is actually the UTF8 character 00A7
(output/font encoding):

\DeclareUTFcharacter[\UTFencname]{x00A7}{\textsection}

In  utf8enc.dfu , they say that a UTF8 character (input encoding) is
actually a \textsection:

\DeclareUnicodeCharacter{00A7}{\textsection}

My point is that the two commands look very similar, but don't do the
same thing; my original suggestion was no good, now that I understand
what's going on. Ideally, I'd call them something like
\DeclareUnicodeOutputChar and \DeclareUnicodeInputChar
(respectively), but we can't change the inputenc definition, of course.

As an aside, wouldn't it be cleaner to write
\DeclareUTFcharacter{x00A7}{\textsection}
instead, since the optional argument defaults to [\UTFencname]?
I'm happy to make any of the changes I'm suggesting, but I
don't know how you'd feel about that...

> Furthermore, there is no documentation on these macro names, since
> ideally *all of them* are deprecated --- ultimately the strategy
> for new documents with XeTeX is to type the UTF characters directly
> into the document source.

Just a question about this last point: how do people actually enter
most of the non-ASCII characters directly into their source? I'm all
for being able to type a literal alpha in my math sources to get a
math alpha, but unless the text editor "auto-completes" this for me,
along the lines of how iTeXMac 2 does (although only for viewing), I
don't see how this is actually convenient to use...

(Will we, in the future, have multiple keyboards on our desks that
contain different subsets of unicode?)

> Furthermore, \DeclareUTFcharacter  does its work "nicely".
> That is, you only get the UTF8 character when the input-encoding
> is 'U'. So you need to use something like what  \setromanfont
> does, to state that you have a font with more capabilities than
> the old 8-bit fonts used by traditional TeX systems.
>
>  xunicode  does not supply these fonts, nor access to such fonts,
> whereas the inputenc-modules *do* supply both macros and the way
> to access requisite fonts.

Are you getting mixed up here between input and output (font) encodings?

inputenc doesn't provide the mechanism to access fonts; it provides a
mapping from the input encoding to the LICR. From the LICR, the font
encoding then determines how to get the correct glyphs (or built-up
characters) in the current font.
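To make the division of labour concrete, here's a minimal sketch (the
slot number is the one I remember from the standard T1 definitions, so
treat it as illustrative):

```latex
% Input side (inputenc): map the bytes for U+00A7 to the LICR macro.
% This says nothing about fonts -- it only normalises the *input*.
\DeclareUnicodeCharacter{00A7}{\textsection}

% Output side (font encoding): map the LICR macro to a glyph slot
% in the current font encoding.  In T1, the section sign sits at 159.
\DeclareTextSymbol{\textsection}{T1}{159}
```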

And again, the font encoding is smart in that it adapts its
definitions to the fonts being used: for old-style TeX fonts, it'll
fake an accent with the \accent primitive; for a newer T1 font it can
insert the pre-drawn glyph.
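For instance (quoting the relevant lines of ot1enc.def and t1enc.def
from memory):

```latex
% OT1 has no precomposed e-acute, so the accent is built up at
% typesetting time from the acute glyph at slot 19:
\DeclareTextAccent{\'}{OT1}{19}

% T1 *does* carry the precomposed glyph, so \'e is replaced outright
% by the single character at slot 233:
\DeclareTextComposite{\'}{T1}{e}{233}
```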

xunicode performs a job that's much more similar (or even identical)
to the font encoding. Whether the macros that it's converting are
actual macros from the LICR or more convenient user shortcuts (\S vs.
\textsection, for example) is less relevant.

> <snip>
>
> [ucsencs.def] gives macro-names for a lot more characters, such as
> modern Greek and Hebrew.
>
> These *should* be added to  xunicode.sty .
> But in doing so, it would indeed be useful to have a modular
> structure,
> with options for loading just those bits which are needed.
> This is what you asked for, so I may up its priority.

I think further developments should look at how, if at all,
multiple .fd files can be used and loaded for various subsets of the
unicode space. But I'm not so convinced that people actually *input*
Greek and Hebrew using these macros. They are used, as far as I
understand, to implement the decoupling of input and output encodings
in TeX, which is much less relevant with XeTeX since unicode is
supported the whole way through.
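Something like the following .fd fragment is what I have in mind; the
family name, file name and font are made up for illustration, and I
haven't tested whether XeTeX's extended \font syntax passes through
\DeclareFontShape unchanged:

```latex
% Hypothetical file  umyfont.fd -- names chosen only for this sketch.
\DeclareFontFamily{U}{myfont}{}
\DeclareFontShape{U}{myfont}{m}{n}{<-> "Charis SIL"}{}
\DeclareFontShape{U}{myfont}{m}{it}{<-> "Charis SIL/I"}{}
```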

The conversation we had with the LaTeX3 people just over a year ago
helped me very much to understand what was going on with input and
font encodings, but I'm still working out the ramifications in my head.

For example, it doesn't seem useful at all to make all higher-plane
characters active in XeTeX so they can be converted to TeX-normalised
macros, which in turn refer to unicode characters in the output. The
only advantage to this scheme might be to implement a font fall-back
mechanism, but I think this would be more efficiently done within
XeTeX itself.

I've ended up talking about much more than I intended; some of it was
explanation for my own sake, but hopefully it all made sense...