[luatex] Hash tokens meaning

Tue May 28 15:04:24 CEST 2013

On 5/28/13 1:01 PM, Arthur Reutenauer wrote:

>> I am trying to analyze what TeX produces (dumping the contents of
>> tex.hashtokens() by the way).
>
>    Oh, so that was it ...  Well, I can say with confidence that there
> were probably only three to five people in the world who had any chance
> of understanding what you meant by "hash tokens", in your original
> email, and none of them is contributing to this discussion (but one of
> them definitely is subscribed to this list ;-)

I assumed it was clear, because I posted to the luatex list asking about
hash tokens. Obviously I misled people! :)

>    So part of what I said earlier doesn't apply, you really are looking
> at TeX's hash table.  This is yet different, and happens at a very low
> level.  The documentation for that is in tex.web, and the change files
> for the different extensions.  You're not really doing yourself a favour
> by starting with LuaTeX; better to start with Knuth's TeX, in my
> opinion.  Its source code, along with the comments, actually is
> published as a book.

Good to know. I thought I could start by getting a low-level
impression, the same way I first list the symbols in an object
file and then take a look at the disassembly.

>> So I am only looking at the tokens produced by TeX, feeding a LaTeX
>> file: I know what LaTeX does, but since it uses TeX as an engine, I
>> wanted to know what TeX does with my document structure (labels,
>> chapters, floats, bibliographies, ...).
>
>    Which is absurd: shouldn't you look at the source code of *LaTeX*
> first, before looking at a dump of TeX's memory?  It's almost like you
> want to be confused.

That is not my purpose. And by the way, yes, I do look at the memory when
trying to figure out how a piece of software works (it's part of my
job), especially when I have to assume that the source code is not
available.

>> As before, I just dump to file what tex.hashtokens() contains. I can
>> attach the file if needed.
>
>    Yes, obviously, we need the source file.  Did you really imagine that
> we could say anything substantial about random bits of TeX's memory
> without knowing what the input was?

I attached parts of it, since the symbol table is 70K.

>
>> ===BEGIN===
>> sffamily
>> ^A
>> tracingoutput
>>                      <=== THIS IS A TAB
>> ^H
>> ^K
>> macc@palette
>> ^M
>> ^L
>> ^N
>> @currdir
>> makesm@sh
>> pdftrue
>> ?\textless
>> @@MP:P:curveto
>> ^Y
>> ^[
>> ^Z
>> luatexUroot
>> !
>>                      <=== THIS IS A SPACE
>> ====END====
>
>    OK, so you meant white space.  Blank is indeed a misleading word to
> call these strings.  Yes, there may be white space.  Why does it bother
> you?  "\ " actually is a pretty common user-level command of TeX.

Because it's new to me. I thought of the backslash as an escape, as in
C: a sort of protection for the next character, as in \% (the same way
C uses \").
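
A minimal sketch of how such a dump could be made readable, so that
names consisting of a tab, a space or a control character show up
explicitly. It is written in Python rather than Lua, and it assumes the
names have already been read into ordinary strings; the function name
visible() and the <TAB>, <SPACE> and <U+FFFF> markers are made up for
illustration.

```python
def visible(name):
    """Return a printable form of a dumped name.

    Control characters are shown in caret notation (^A, ^H, ^[ ...),
    while tab, space and U+FFFF get explicit, hypothetical markers.
    """
    out = []
    for ch in name:
        cp = ord(ch)
        if ch == "\t":
            out.append("<TAB>")
        elif ch == " ":
            out.append("<SPACE>")
        elif cp < 0x20:
            # Caret notation: 0x01 -> ^A, 0x08 -> ^H, 0x1B -> ^[
            out.append("^" + chr(cp + 0x40))
        elif cp == 0x7F:
            out.append("^?")
        elif cp == 0xFFFF:
            out.append("<U+FFFF>")
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    for name in ["sffamily", "\x01", "\t", " ", "\uffff,"]:
        print(visible(name))
```

With this, an entry that is a single space no longer looks like an
empty line in the dump.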

>> ===BEGIN===
>> pagecolor
>> ,                 <=== THIS IS A WEIRD ONE
>> skipemptyMPgraphictrue
>>
>> ====END====
>>
>> With a hex editor, I find that the second line is EF BF BF 2C.
>
>    This is perfectly valid UTF-8, it's the byte sequence for two
> characters: U+FFFF and U+002C.  The former is not supposed to be used in
> files, and usually appears as a replacement of an invalid character, and
> the latter is simply a comma.

Yes, I saw that once I opened the hash file with a hex editor. I knew
TeX didn't have support for Unicode, and I thought that lualatex
translated its input into TeX, which then produced the output. So a
Unicode string was unexpected, and I thought I had messed up my dump code.
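
For reference, a minimal sketch of checking those four bytes by hand,
assuming Python 3 is available; the helper name codepoints() is made up
for illustration and has nothing to do with LuaTeX itself.

```python
def codepoints(raw):
    """Decode raw bytes as UTF-8 and list the code points as U+XXXX."""
    text = raw.decode("utf-8")
    return ["U+{:04X}".format(ord(ch)) for ch in text]


if __name__ == "__main__":
    # The four bytes seen in the hex editor: EF BF BF 2C
    print(codepoints(bytes.fromhex("EFBFBF2C")))
    # prints ['U+FFFF', 'U+002C']
```

The decoding succeeds without error, which agrees with the remark that
the sequence is valid UTF-8 for U+FFFF followed by a comma.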

>> It seems to me that TeX is using a very low level encoding, which I
>> find again weird (or wrong, in the sense that I don't know how to
>> correctly dump the tokens).
>
>    You may have dumped the tokens correctly, there is a lot of low-level
> stuff in TeX.  What's surprising to me is that you find it weird!

Pardon me, but I'm used to writing code in C, assembly, C++ (mainly
those three, in that order) or whatever other programming language. TeX
is very, very different.

>> Yes, I imagined it was related to the Narnian way of encoding fonts,
>> but I don't know how it encodes it (I found a document by Rahtz on
>> TUG, but I see no mention of "<>").
>
>    Look again, then.  The long string you quoted (<5><6> etc.) clearly is
> the fifth argument to \DeclareFontShape, one of the standard NFSS
> commands.  It's part of LaTeX2e and is documented in several places,
> for example the LaTeX Companion, or, for a free resource,
> doc/latex/base/fntguide.pdf in most TeX distributions.

Good!

>> You don't have this interest, it's ok, but I really do! I like to
>> know how something works! ;)
>
>    You're missing the point.  Producing \r@something is *one of the many*
> things that happens when you type \label{something}; it's probably the
> control sequence whose name is most obviously related to the label you
> created, but there is nothing special about that particular control
> sequence.  That's why I remarked that it's not an interesting fact, and
> you probably wouldn't have noticed it, had it not been for your biased
> approach of looking at static memory dumps.
>
>    Far more interesting are the different commands defined by LaTeX when
> \label is called, look for "ltxref.dtx" in latex.ltx.  The letter "r"
> (in \r@something) is introduced in a macro called \newlabel (line 3881
> of my copy of latex.ltx), and "@" in \@newl@bel, one line above it.

That is awesome, I now have a place to start!

Anyway, at some point there *is* a static version of the code somewhere,
otherwise there would be no output. Yes, I am biased by my job and
education, but I find it hard to grasp the opposition to this approach.

You look "top down", I use the "bottom up" approach :)