[texhax] The details of \csname, in this specific case

Mon Feb 25 19:35:22 CET 2013

Doug is essentially correct; I am just trying 
to explain it in a different way.

On Friday, 22.02.2013, at 20:27 -0500, Patrick Rutkowski wrote: 
> Now, onto my actual questions. The below TeX code works, but I don't
> exactly know how. I understand what the \catcode is doing, and I
> understand \expandafter is doing. Naturally, I'm also very familiar
> with how UTF-8 internals work, with variable length sequences and all
> that good stuff. What I don't quite understand is what is inside of
> the \csname.
> 
> 1) I would have expected to have to encode c4 and c5 as something like
> `^^c4 and `^^c5 inside of the \csname, but somehow that is not
> required.

Doug did not answer this, and I find it difficult too ... 
"Encoding c4" may be too much of a metaphor; the matter should be 
addressed more directly.

> 2) I thought that \csname took only "character tokens," but wouldn't
> something like "c4" be two separate character tokens, first "c" and
> then "4"?

The \csname construct indeed gets two character tokens from input "c4" 
as well as from input "c5".
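
To see that \csname just collects character tokens into a name, 
here is a small sketch for plain TeX. The name `c4:x' and the 
replacement text `hello' are made up for illustration only:

    \expandafter\def\csname c4:x\endcsname{hello}
    % \string turns the resulting token back into its characters;
    % with the default \escapechar this should write \c4:x
    \message{\expandafter\string\csname c4:x\endcsname}
    % \meaning shows what was defined: macro:->hello
    \message{\expandafter\meaning\csname c4:x\endcsname}

The `c', the `4' and the `:' are ordinary character tokens here; 
nothing about them refers to a byte value.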

> 3) Moreover, what is that colon doing in between the c4 and the #1?

Doug named the same two reasons that came to my mind.

> 4) How exactly does TeX come to interpret the #1 as a "character
> token," aren't things above value 127 by default labeled "invalid?"

I would guess so too, and I had to take care of this in the fifinddo 
package with a loop that turns those character codes into "other". 
Your code may work because earlier context does the same. 
I have seen such a loop in the graphics package. From a glance at 
inputenc.sty (LaTeX) I guess that it turns them into "active" 
instead.

I cannot quickly find an answer in The TeXbook about TeX's 
(INITEX's) default (that would require reading many pages). 
AFAIK The TeXbook was originally written 
when TeX took 7-bit characters. I would not be surprised 
if the default were left to the installation. 
From what I have experienced and seen, I would say that 
it is safe (at least) to "fix" those catcodes before 
trying your code.
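
A loop of that kind could look as follows in plain TeX. This is 
only a sketch: the use of the scratch register \count255 is my 
choice, and the loop has to run before any of the codes is made 
active, because it would reset that again:

    % give character codes 128..255 category 12 ("other")
    \count255=128
    \loop
      \catcode\count255=12
      \ifnum\count255<255
        \advance\count255 by 1
    \repeat

If the codes are "other" already, the loop does no harm.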

> 5) And finally, why exactly is the single quote needed before the ^^c4
> for the \catcode, but not in the \def?

The \catcode "function" or "command" here takes two arguments, 
the first one is a character code. The backquote is a means to 
get that character code from an ensuing token whose name string 
consists of a single character, by taking the latter's character 
code.

There would have been other ways ...

=== [ BEGIN PASTE ] === 
> \catcode`\^^c4=13
> \catcode`\^^c5=13

    \catcode"C4=13 \catcode"C5=13

is a little more straightforward with the same effect 
(provided the double quote has not been given a special meaning 
 before, e.g. made active as with german.sty).
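
The notations can be compared directly. The following sketch for 
plain TeX stores the same number three times in the scratch 
register \count255 (my choice of register) and reports it with 
\message; each line should write 196:

    \count255=`\^^c4 \message{\the\count255}% backquote notation
    \count255="C4 \message{\the\count255}%    hexadecimal notation
    \count255=196 \message{\the\count255}%    decimal notation

Note that the hexadecimal notation with the double quote wants 
the capital `C', whereas the `^^c4' notation wants the small `c'.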

TeX perceives each of your characters with macron ("ā" etc.) 
as two character tokens. Call them [X] and [Y]. 
[X] has hex code either C4 or C5. The above code makes 
the character code associated with [X] active from then on.

(With heavier typographical means at hand, I would use a 
 different notation.)

> \def^^c4#1#2{\expandafter\def\csname c4:#1\endcsname{#2}}
> \def^^c5#1#2{\expandafter\def\csname c5:#1\endcsname{#2}}

    \def^^c4#1{\expandafter\def\csname c4:#1\endcsname}

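would have sufficed, by the way: the brace group that follows 
[X][Y] in the input then serves directly as the replacement 
text of the inner \def.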

After these two definitions, [X] is a macro that turns its 
first argument into a macro definition. If [X] has hex 
character code C4, the macro that will be defined has a name 
consisting of the three characters `c', `4', `:', and (as 
fourth) the character associated with the character token 
following [X] when [X] acts as a macro.

I.e., if [X] (having hex character code C4) finds [Y] 
that is the character token formed from the character [Z] 
(assuming [Z] was not a "funny space"), the name of the 
macro that [X] will define will be "c4:[Z]".

[Z] is assumed to be unique for [Y] here, so there is
a mapping from pairs ([X],[Y]) of character tokens to 
"control sequence tokens" [X|Y] such that [X|Y] is a 
token with name "c4:[Z]" above.
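
As a worked illustration of this mapping, take the line `ā{\=a}' 
from your code. In UTF-8, `ā' consists of the bytes C4 81, which 
I write in ^^ notation here (the notation is my choice, your file 
contains the raw bytes):

    % input:           ^^c4^^81{\=a}
    % [X] expands to:  \expandafter\def\csname c4:^^81\endcsname{\=a}
    % \csname yields:  \def[X|Y]{\=a}   with [X|Y] named "c4:^^81"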

This choice of the macro name is quite arbitrary, 
but an intelligible choice.

Actually your idea to use the backquote in the macro name 
reminds me that [Y] might be an active character token, 
in which case 

    \def^^c4#1#2{\expandafter\def\csname c4:\string#1\endcsname{#2}}
    \def^^c5#1#2{\expandafter\def\csname c5:\string#1\endcsname{#2}}

would be safer.
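
The effect of \string can be tried with a character that is 
active in plain TeX anyway, the tilde. This is only a sketch; the 
tilde merely stands in for an active [Y], and the replacement 
text is made up:

    % \string~ yields the plain character `~' (category 12),
    % so \csname never tries to expand the active tilde
    \expandafter\def\csname c4:\string~\endcsname{tilde entry}
    % with the default \escapechar this should write \c4:~
    \message{\expandafter\string\csname c4:\string~\endcsname}

Without the \string, \csname would expand the active tilde, run 
into \penalty and complain about a missing \endcsname.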

`c4' and `c5' in the macro names are very arbitrary, 

    \def^^c4#1#2{\expandafter\def\csname AXXXX\string#1\endcsname{#2}} 
    \def^^c5#1#2{\expandafter\def\csname AXXXXX\string#1\endcsname{#2}}

would have done as well, but may render debugging more difficult. 
The two prefixes just have to be different.

> ā{\=a}
> ē{\=e}
> ī{\=\i}

E.g., TeX perceives the `ī' starting the previous line 
as [X][Y] where [X] has hex character code C4 and [Y] has AB; see

    http://www.utf8-chartable.de/unicode-utf8-table.pl?number=512 

[X] defines [X|Y] (whose name is the string "c4:[Z]", where 
[Z] has hex code AB), so [X|Y] will expand to the sequence 
of the two tokens associated with \= and \i.

I generally consider the usual notation like `\=' here a source of 
confusion. TeX typically turns the two input characters `\' and 
`=' into a "control sequence token"(?) whose name is `='. 
It is common to denote this "control sequence token" by `\=', 
but in my view you then always have to wonder whether `\=' 
is a string of two characters or a single token. 
The TeXbook starts with the first interpretation on page 7 
and then switches to the second interpretation on page 39: 
"A control sequence is considered to be a single object 
that is no longer composed of a sequence of symbols". 
Up to that point the control sequence was a sequence of symbols, 
and it is a kind of sarcastic joke that those non-sequences 
are now called "control sequences".

> ō{\=o}
> ū{\=u}
> Ā{\=A}
> Ē{\=E}
> Ī{\=I}
> Ō{\=O}
> Ū{\=U}

Next, [X] is redefined so that the sequence "[X][Y]" 
of character tokens expands to [X|Y], which in turn expands 
to that sequence of two tokens as above, the first one for 
the macron accent.

> \def^^c4#1{\csname c4:#1\endcsname}
> \def^^c5#1{\csname c5:#1\endcsname}

With the above alternatives for defining [X], this might better be

    \def^^c4#1{\csname c4:\string#1\endcsname}
    \def^^c5#1{\csname c5:\string#1\endcsname}

or 

    \def^^c4#1{\csname AXXXX\string#1\endcsname}
    \def^^c5#1{\csname AXXXXX\string#1\endcsname}

> āēīōūĀĒĪŌŪ

TeX perceives the previous line as consisting of [X][Y] pairs. 
For `ī', it sees the active character token with hex character 
code C4, followed by a character token with hex code AB. 
That pair expands to the control sequence token with name string 
"c4:[Z]" where [Z] is the character whose hex character code is AB. 
That token was referred to as [X|Y], and it expands to the two 
tokens tokenized from the string `\=\i'. The `i' case is somewhat 
special: with `a' we have just `\=a', but for `i' we need the 
variant without the dot, `\i'.
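
Putting the pieces together, a reduced version of your file might 
look like this. It is a sketch under some assumptions of mine: 
plain TeX on an 8-bit engine, only the C4 lead byte, only two 
table entries, the \string variant from above, and the bytes 
written in ^^ notation so that the example does not depend on 
the encoding of this mail:

    \catcode"C4=13
    % phase 1: [X] stores the entry for [X][Y] under "c4:[Z]"
    \def^^c4#1#2{\expandafter\def\csname c4:\string#1\endcsname{#2}}
    ^^c4^^81{\=a}%   bytes C4 81, UTF-8 for a with macron
    ^^c4^^ab{\=\i}%  bytes C4 AB, UTF-8 for i with macron
    % phase 2: [X] now looks the entry up instead of storing it
    \def^^c4#1{\csname c4:\string#1\endcsname}
    ^^c4^^81 ^^c4^^ab
    \bye

The last text line should then typeset an `a' and a dotless `i', 
each with a macron.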

Cheers,

    Uwe.