[latex3-commits] [git/LaTeX3-latex3-latex3] master: Add to l3term-glossary a description of tokens [ci skip] (d7893f0fa)

Wed Jul 15 12:01:12 CEST 2020

Repository : https://github.com/latex3/latex3
On branch  : master
Link       : https://github.com/latex3/latex3/commit/d7893f0fa2053968d54dc9788a9277889626a9e4

>---------------------------------------------------------------

commit d7893f0fa2053968d54dc9788a9277889626a9e4
Author: Bruno Le Floch <bruno at le-floch.fr>
Date:   Wed Jul 15 12:01:12 2020 +0200

    Add to l3term-glossary a description of tokens [ci skip]


>---------------------------------------------------------------

d7893f0fa2053968d54dc9788a9277889626a9e4
 l3kernel/doc/l3term-glossary.tex | 93 +++++++++++++++++++++++++++++++++++++++-
 l3kernel/l3token.dtx             | 40 +++++++++++------
 2 files changed, 119 insertions(+), 14 deletions(-)

diff --git a/l3kernel/doc/l3term-glossary.tex b/l3kernel/doc/l3term-glossary.tex
index 99464b67c..76ad32b9b 100644
--- a/l3kernel/doc/l3term-glossary.tex
+++ b/l3kernel/doc/l3term-glossary.tex
@@ -53,7 +53,98 @@ beginning of a line.
 
 \section{Structure of tokens}
 
-Copy there the section ``Description of all possible tokens'' from \texttt{l3token}.
+We refer to the documentation of \texttt{l3token} for a complete
+description of all \TeX{} tokens.  We distinguish the meaning of the
+token, which controls the expansion of the token and its effect on
+\TeX{}'s state, and its shape, which is used when comparing token lists
+such as for delimited arguments.  At any given time two tokens of the
+same shape automatically have the same meaning, but the converse does
+not hold, and the meaning associated with a given shape changes when
+doing assignments.
+
+Apart from a few exceptions, a token has one of the following shapes.
+\begin{itemize}
+  \item A control sequence, characterized by the sequence of characters
+    that constitute its name: for instance, \cs{use:n} is a five-letter
+    control sequence.
+  \item An active character token, characterized by its character code
+    (between $0$ and $1114111$ for \LuaTeX{} and \XeTeX{} and less for
+    other engines) and category code~$13$.
+  \item A character token such as |A| or |#|, characterized by its
+    character code and category code (one of $1$, $2$, $3$, $4$, $6$,
+    $7$, $8$, $10$, $11$ or~$12$ whose meaning is described below).
+\end{itemize}
+
+The meaning of a (non-active) character token is fixed by its category
+code (and character code) and cannot be changed.  We call these tokens
+\emph{explicit} character tokens.  Category codes that a character token
+can have are listed below by giving a sample output of the \TeX{}
+primitive \tn{meaning}, together with their \LaTeX3 names and most
+common example:
+\begin{itemize}
+  \item[1] begin-group character (|group_begin|, often |{|),
+  \item[2] end-group character (|group_end|, often |}|),
+  \item[3] math shift character (|math_toggle|, often |$|), % $
+  \item[4] alignment tab character (|alignment|, often |&|),
+  \item[6] macro parameter character (|parameter|, often |#|),
+  \item[7] superscript character (|math_superscript|, often |^|),
+  \item[8] subscript character (|math_subscript|, often |_|),
+  \item[10] blank space (|space|, often character code~$32$),
+  \item[11] the letter (|letter|, such as |A|),
+  \item[12] the character (|other|, such as |0|).
+\end{itemize}
+Category code~$13$ (|active|) is discussed below.  Input characters can
+also have several other category codes which do not lead to character
+tokens for later processing: $0$~(|escape|), $5$~(|end_line|),
+$9$~(|ignore|), $14$~(|comment|), and $15$~(|invalid|).
+
+The meaning of a control sequence or active character can be identical
+to that of any character token listed above (with any character code),
+and we call such tokens \emph{implicit} character tokens.  The meaning
+is otherwise in the following list:
+\begin{itemize}
+  \item a macro, used in \LaTeX3 for most functions and some variables
+    (|tl|, |fp|, |seq|, \ldots{}),
+  \item a primitive such as \tn{def} or \tn{topmark}, used in \LaTeX3
+    for some functions,
+  \item a register such as \tn{count}|123|, used in \LaTeX3{} for the
+    implementation of some variables (|int|, |dim|, \ldots{}),
+  \item a constant integer such as \tn{char}|"56| or
+    \tn{mathchar}|"121|, used when defining a constant using
+    \cs{int_const:Nn},
+  \item a font selection command,
+  \item undefined.
+\end{itemize}
+Macros can be \tn{protected} or not, \tn{long} or not (the opposite of
+what \LaTeX3 calls |nopar|), and \tn{outer} or not (unused in \LaTeX3).
+Their \tn{meaning} takes the form
+\begin{quote}
+  \meta{prefix} |macro:|\meta{argument}|->|\meta{replacement}
+\end{quote}
+where \meta{prefix} is among \tn{protected}\tn{long}\tn{outer},
+\meta{argument} describes parameters that the macro expects, such as
+|#1#2#3|, and \meta{replacement} describes how the parameters are
+manipulated, such as~|\int_eval:n{#2+#1*#3}|.  This information can be
+accessed by \cs{cs_prefix_spec:N}, \cs{cs_argument_spec:N},
+\cs{cs_replacement_spec:N}.
+
+When a macro takes an undelimited argument, explicit space characters
+(with character code $32$ and category code $10$) before it are ignored.  If the
+following token is an explicit character token with category code $1$
+(begin-group) and an arbitrary character code, then \TeX{} scans ahead
+to obtain an equal number of explicit character tokens with category
+code $1$ (begin-group) and $2$ (end-group), and the resulting list of
+tokens (with outer braces removed) becomes the argument.  Otherwise, a
+single token is taken as the argument for the macro: we call such single
+tokens \enquote{N-type}, as they are suitable to be used as an argument
+for a function with the signature~\texttt{:N}.
+
+When a macro takes a delimited argument, \TeX{} scans ahead until finding
+the delimiter (outside any pairs of begin-group/end-group explicit
+characters), and the resulting list of tokens (with outer braces
+removed) becomes the argument.  Note that explicit space characters at
+the start of the argument are \emph{not} ignored in this case (and they
+prevent brace-stripping).
 
 \section{Quantities and expressions}
 
diff --git a/l3kernel/l3token.dtx b/l3kernel/l3token.dtx
index d12c5deee..a047c4dd1 100644
--- a/l3kernel/l3token.dtx
+++ b/l3kernel/l3token.dtx
@@ -992,11 +992,7 @@
 %     other engines) and category code~$13$.
 %   \item A character token, characterized by its character code and
 %     category code (one of $1$, $2$, $3$, $4$, $6$, $7$, $8$, $10$,
-%     $11$ or~$12$ whose meaning is described below).\footnote{In
-%     \LuaTeX{}, there is also the case of \enquote{bytes}, which behave as
-%     character tokens of category code $12$~(other) and character code
-%     between $1114112$ and~$1114366$.  They are used to output
-%     individual bytes to files, rather than UTF-8.}
+%     $11$ or~$12$ whose meaning is described below).
 % \end{itemize}
 % There are also a few internal tokens.  The following list may be
 % incomplete in some engines.
@@ -1017,6 +1013,19 @@
 %   \item Tricky programming might access a frozen |\endwrite|.
 %   \item Some frozen tokens can only be accessed in interactive
 %     sessions: |\cr|, |\right|, |\endgroup|, |\fi|, |\inaccessible|.
+%   \item In \LuaTeX{}, there is also the strange case of \enquote{bytes}
+%     |^^^^^^1100|$x$$y$ where $x,y$ are any two lowercase hexadecimal
+%     digits, so that the hexadecimal number ranges from
+%     $"\text{110000}=1114112$ to~$"\text{1100ff}=1114367$.  These are
+%     used to output individual bytes to files, rather than UTF-8.  For
+%     the purposes of token comparisons they behave like non-expandable
+%     primitive control sequences (\emph{not characters}) whose
+%     \tn{meaning} is \verb*|the character | followed by the given byte.
+%     If this byte is in the range |80|--|ff| this gives an ``invalid
+%     utf-8 sequence'' error: applying \cs{token_to_str:N} or
+%     \cs{token_to_meaning:N} to these tokens is unsafe.  Unfortunately,
+%     they don't seem to be detectable safely by any means except perhaps
+%     Lua code.
 % \end{itemize}
 %
 % The meaning of a (non-active) character token is fixed by its category
@@ -1028,7 +1037,7 @@
 % \begin{itemize}
 %   \item[1] begin-group character (|group_begin|, often |{|),
 %   \item[2] end-group character (|group_end|, often |}|),
-%   \item[3] math shift character (|math_toggle|, often |$|),
+%   \item[3] math shift character (|math_toggle|, often |$|), % $
 %   \item[4] alignment tab character (|alignment|, often |&|),
 %   \item[6] macro parameter character (|parameter|, often |#|),
 %   \item[7] superscript character (|math_superscript|, often |^|),
@@ -1058,18 +1067,16 @@
 %   \item a font selection command,
 %   \item undefined.
 % \end{itemize}
-% Macros be \tn{protected} or not, \tn{long} or not (the opposite of
+% Macros can be \tn{protected} or not, \tn{long} or not (the opposite of
 % what \LaTeX3 calls |nopar|), and \tn{outer} or not (unused in
 % \LaTeX3).  Their \tn{meaning} takes the form
 % \begin{quote}
-%   \meta{properties} |macro:|\meta{parameters}|->|\meta{replacement}
+%   \meta{prefix} |macro:|\meta{argument}|->|\meta{replacement}
 % \end{quote}
-% where \meta{properties} is among \tn{protected}\tn{long}\tn{outer},
-% \meta{parameters} describes parameters that the macro expects, such as
+% where \meta{prefix} is among \tn{protected}\tn{long}\tn{outer},
+% \meta{argument} describes parameters that the macro expects, such as
 % |#1#2#3|, and \meta{replacement} describes how the parameters are
-% manipulated, such as~|#2/#1/#3|.
-%
-% ^^A todo Bruno: discuss here some other subtleties of space tokens? when looking for numbers, when looking for equal signs in let, in expressions, etc.
+% manipulated, such as~|\int_eval:n{#2+#1*#3}|.
 %
 % Now is perhaps a good time to mention some subtleties relating to
 % tokens with category code $10$ (space).  Any input character with this
@@ -1087,6 +1094,13 @@
 % single tokens \enquote{N-type}, as they are suitable to be used as an
 % argument for a function with the signature~\texttt{:N}.
 %
+% When a macro takes a delimited argument, \TeX{} scans ahead until
+% finding the delimiter (outside any pairs of begin-group/end-group
+% explicit characters), and the resulting list of tokens (with outer
+% braces removed) becomes the argument.  Note that explicit space
+% characters at the start of the argument are \emph{not} ignored in this
+% case (and they prevent brace-stripping).
+%
 % \end{documentation}
 %
 % \begin{implementation}
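
Illustration (not part of the commit): the difference between undelimited
and delimited argument grabbing described in the added paragraphs can be
checked with a small test file. This is only a sketch: the function names
\demo_undelimited:n and \demo_delimited:w are hypothetical, and it assumes
a LaTeX format recent enough to have expl3 preloaded (2020-02-02 or later).
\c_space_tl is expanded first so that an explicit space token really
precedes the braces when the argument is grabbed.

```latex
\documentclass{article}
\ExplSyntaxOn
% Hypothetical helper: one undelimited argument, written to the terminal.
\cs_new_protected:Npn \demo_undelimited:n #1
  { \iow_term:n { [#1] } }
% Hypothetical helper: one argument delimited by \q_stop.
\cs_new_protected:Npn \demo_delimited:w #1 \q_stop
  { \iow_term:n { [#1] } }
% The leading space token is skipped and the braces are removed.
\exp_after:wN \demo_undelimited:n \c_space_tl { abc }
% No leading space: the outer braces are removed.
\demo_delimited:w { abc } \q_stop
% The leading space is kept, and it prevents brace-stripping.
\exp_after:wN \demo_delimited:w \c_space_tl { abc } \q_stop
\ExplSyntaxOff
\begin{document}
\end{document}
```

Under these assumptions the first two calls should write [abc] to the
terminal, and the third should write [ {abc}], with the space and the
braces both retained.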