[l2h] extracting bag-of-words from latex

Hakan Kuecuekyilmaz hakan at mysql.com
Wed Feb 15 00:52:55 CET 2006

On Tue, 2006-02-14 at 13:45 -0700, Hamilton Link wrote:
> Hi, I'm about to try to process a large set of LaTeX files.  What I 
> would like is to strip the files of equations, formatting, comments, 
> etc. to produce a text file of "just the words," so to speak.  As far 
> as I can tell the ways of potentially doing this would be:
> - compile the latex to ps or pdf and then run a word extractor on that
> - run latex2rtf or latex2html and do word extraction from that
> Does anyone on the list know of a better way, or have any suggestions 
> as to how I might proceed using latex2html as far as configurations or 
> settings that might ease the process etc.?

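You could try untex [1]. It removes LaTeX commands from its input and
writes plain text, so you would not need to compile the files or convert
them to HTML first.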

[1] http://ftp.tu-clausthal.de/pub/mirror/ctan/support/untex
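
If untex keeps too much or too little for your files, a rough filter is
also easy to write yourself. Below is a minimal sketch in Python, assuming
a regex pass is good enough for a bag-of-words. It is not a real LaTeX
parser and it is not how untex works internally. The names strip_latex and
bag_of_words, and the lists of environments and commands, are my own
choices for illustration; user-defined macros and nested braces will not
be handled correctly.

```python
import re
from collections import Counter

# Environments whose whole body is dropped (assumed list, extend as needed).
MATH_ENVS = r"equation|align|eqnarray|displaymath|math"
# Commands whose brace argument is not prose (assumed list, extend as needed).
NON_PROSE = r"label|ref|cite|input|include|includegraphics|usepackage|documentclass"


def strip_latex(source: str) -> str:
    """Return source with comments, math and markup replaced by spaces."""
    # Comments: an unescaped % up to the end of the line.
    text = re.sub(r"(?<!\\)%.*", "", source)
    # Display math environments, including their contents.
    text = re.sub(
        r"\\begin\{(" + MATH_ENVS + r")\*?\}.*?\\end\{\1\*?\}",
        " ", text, flags=re.DOTALL,
    )
    # \[...\], $$...$$, \(...\) and $...$ math.
    text = re.sub(r"\\\[.*?\\\]", " ", text, flags=re.DOTALL)
    text = re.sub(r"\$\$.*?\$\$", " ", text, flags=re.DOTALL)
    text = re.sub(r"\\\(.*?\\\)", " ", text, flags=re.DOTALL)
    text = re.sub(r"(?<!\\)\$.*?(?<!\\)\$", " ", text, flags=re.DOTALL)
    # Commands such as \label{...} or \cite[p. 3]{...}, argument included.
    text = re.sub(
        r"\\(" + NON_PROSE + r")\*?(\[[^\]]*\])*\{[^}]*\}", " ", text
    )
    # Remaining \begin{...} and \end{...} markers (contents are kept).
    text = re.sub(r"\\(begin|end)\{[^}]*\}", " ", text)
    # Any other command name plus one optional [...] argument;
    # the text inside a following {...} is kept.
    text = re.sub(r"\\[a-zA-Z]+\*?(\[[^\]]*\])?", " ", text)
    # Escaped single characters (\%, \&, \\) and leftover grouping characters.
    text = re.sub(r"\\.", " ", text)
    text = re.sub(r"[{}~]", " ", text)
    return text


def bag_of_words(source: str) -> Counter:
    """Count lowercase alphabetic words left after stripping markup."""
    plain = strip_latex(source).lower()
    return Counter(re.findall(r"[a-z]+(?:'[a-z]+)?", plain))


if __name__ == "__main__":
    import sys

    # Usage: python bagofwords.py file1.tex file2.tex ...
    total = Counter()
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8", errors="replace") as handle:
            total += bag_of_words(handle.read())
    for word, count in total.most_common():
        print(count, word)
```

The order of the substitutions matters: comments and math go first so
that their contents never reach the word counter, and the generic command
rule runs last so that text inside \emph{...} or \section{...} survives.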

Regards, Hakan
Hakan Kuecuekyilmaz
