[texhax] Can I Parsing with TeX ?

Toby Cubitt tsc25 at cantab.net
Wed Aug 15 13:54:02 CEST 2007


wa2n wrote:
> Hi all
> Can TeX do parsing or regexps?
> 
> or the question is
> how can I replace something in tex source (.tex) while I'm compiling the
> source (make it .dvi) ?

The short answer is: not really. The long answer is: perhaps, depending 
on what exactly you want to do, but coding it will probably involve a 
lot of pain. You're almost certainly better off using an external 
utility designed for this task, such as sed, awk, or perl.

However, if you really, really can't avoid doing it in TeX, here's my 
understanding of why it's so difficult (others will probably be able to 
correct/improve on this). TeX is a macro language, which means it works 
simply by expanding macros, expanding the results of that expansion, 
expanding the results of that, and so on until there's nothing left to 
expand (i.e. only unexpandable tokens remain, such as primitives and 
ordinary characters). When TeX reads a file, it reads the characters 
into its input stream and converts them into a sequence of tokens 
according to the current category codes ("catcodes"), then expands 
these tokens as necessary. There's a better description of this whole 
process in Victor Eijkhout's book "TeX by Topic" (available online).
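
As a minimal illustration of "expanding the results of that expansion", 
here is a plain TeX sketch. The macro names \first, \second and \third 
are made up for this example.

```tex
% Plain TeX: each macro expands to the next one until only
% unexpandable tokens (here, ordinary letters) are left.
\def\first{\second}
\def\second{\third}
\def\third{done}

\first  % expands to \second, then to \third, then to the letters "done"
\bye
```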

To get TeX to replace one string (that matches a regexp, say) with 
another, you would have to make that string expandable in some way. This 
would probably involve changing the catcodes of those characters, since 
normal letters are otherwise parsed as single, unexpandable tokens. 
However, the catcodes take effect when the input stream is parsed, so 
you'd have to change the catcodes before the string you intend to 
replace gets parsed. If you wanted to do general string replacement on 
an input file, you'd essentially have to turn TeX into a 
string-replacement machine by redefining catcodes and defining the 
necessary macros right at the start, before the file gets read into the 
input stream. This might be possible in principle, but it will 
undoubtedly be difficult. Then you'll need to undo all this messing 
around with TeX internals, and feed everything back into TeX to have it 
process the string-replaced file normally. However, the catcodes of its 
characters were already fixed when the file was tokenized the first 
time. In Knuth's original TeX, the only way to fully "reset" them is to 
write the string-replaced text to disk, revert the catcodes, then read 
the file back in (e-TeX's \scantokens primitive does much the same 
thing without an explicit file).
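
The timing problem can be seen in a few lines of plain TeX. This is only 
a sketch: the choice of "!" as the character to replace, "?" as its 
replacement, and the macro name \tryreplace are all assumptions made for 
the example.

```tex
% Define what an active "!" expands to. The catcode change is kept
% local to the group; \gdef makes the definition itself global.
\begingroup
  \catcode`\!=\active
  \gdef!{?}
\endgroup

% Too late: the argument is tokenized before \catcode is executed,
% so its "!" is an ordinary character and is typeset unchanged.
\def\tryreplace#1{\begingroup \catcode`\!=\active #1\endgroup}
\tryreplace{Too late!}\par

% In time: the catcode changes before the text is tokenized,
% so "!" is active and is replaced by "?".
\begingroup \catcode`\!=\active In time!\endgroup
\bye
```

The first line of output keeps its "!", while the second is typeset as 
"In time?".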

I expect you'd rather not have to code all this :)

If you want to see an example of this kind of thing in action, you could 
have a look at the code that implements the "poorman" option in my 
"cleveref" package (on CTAN). I used a very simplified form of the above 
to do very, very simple string replacement: replacing certain single 
characters in a file with "escaped" versions of those characters. As 
you'll see if you look at the code, it works by changing the catcodes of 
the relevant characters to turn them into active characters (single 
characters that TeX expands into something else), reading the file into 
the TeX input stream and processing it, thereby replacing the active 
characters with their expansions, and writing the result back out to file.
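
The general idea can be sketched in plain TeX as follows. This is not 
the cleveref code, just a minimal illustration under stated assumptions: 
the file names in.txt and out.txt, the macro names \copyfile and 
\copyloop, and the replacement of "!" by "?" are all hypothetical, and 
the input is assumed to be plain ASCII text.

```tex
\newread\infile
\newwrite\outfile

% Replacement rule: an active "!" expands to "?".
\begingroup
  \catcode`\!=\active
  \gdef!{?}
\endgroup

% Read one line at a time and write it back out. \write expands the
% active characters; everything else is copied as it stands.
% Limitations: runs of spaces collapse to a single space, and the
% read attempt at end of file may add one blank line to the output.
\def\copyloop{%
  \ifeof\infile \else
    \read\infile to\currentline
    \immediate\write\outfile{\currentline}%
    \expandafter\copyloop
  \fi}

\def\copyfile{%
  \begingroup
    % Make TeX's special characters ordinary so they are copied verbatim.
    \catcode`\\=12 \catcode`\{=12 \catcode`\}=12 \catcode`\$=12
    \catcode`\&=12 \catcode`\#=12 \catcode`\^=12 \catcode`\_=12
    \catcode`\%=12 \catcode`\~=12
    \catcode`\!=\active  % the character to be replaced
    \endlinechar=-1      % do not append an end-of-line character
    \copyloop
  \endgroup}

\openin\infile=in.txt
\immediate\openout\outfile=out.txt
\copyfile
\closein\infile
\immediate\closeout\outfile

% Catcodes are back to normal here, so the result can be re-read.
\input out.txt
\bye
```

The catcode changes are wrapped in a macro that is defined beforehand, 
because once the backslash and braces have been made ordinary characters 
no further commands can be typed directly.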

Hope that helps. I'm still a relative novice at TeX programming, so some 
of what I said might be inaccurate, but TeXperts on this list should be 
able to correct it.

Toby Cubitt
