[l2h] extracting bag-of-words from latex

Hamilton Link hamlink at cs.unm.edu
Tue Feb 14 21:45:00 CET 2006


Hi, I'm about to process a large set of LaTeX files.  What I 
would like is to strip the files of equations, formatting, comments, 
etc. to produce a text file of "just the words," so to speak.  As far 
as I can tell, the possible ways of doing this are:

- compile the LaTeX to PS or PDF and then run a word extractor on that
- run latex2rtf or latex2html and do word extraction on the output

Does anyone on the list know of a better way?  If I go with latex2html, 
are there any configurations or settings that would make the process 
easier?
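
To make the target output concrete, here is a minimal sketch of a third, 
more direct route: a regex-only pass over the LaTeX source in Python.  It 
is not how latex2html or latex2rtf work, and it is only a heuristic (it 
does not expand macros, follow \input, or understand verbatim text).  The 
function names strip_latex and bag_of_words, the list of math environments, 
and the list of commands whose arguments are dropped are all assumptions 
made for illustration.

```python
import re
from collections import Counter

# Assumed list of math environments whose whole body is discarded.
MATH_ENVS = r"equation|align|eqnarray|displaymath|math|gather|multline"

# Assumed list of commands whose braced argument is not prose.
NON_TEXT_CMDS = (r"documentclass|usepackage|label|ref|eqref|cite|input|include|"
                 r"includegraphics|bibliography|bibliographystyle")


def strip_latex(source: str) -> str:
    """Crudely reduce LaTeX source to plain text using regex heuristics."""
    # Comments: an unescaped % up to the end of the line.
    text = re.sub(r"(?<!\\)%.*", "", source)
    # Math environments, starred or not, including their contents.
    text = re.sub(
        r"\\begin\{(" + MATH_ENVS + r")\*?\}.*?\\end\{\1\*?\}",
        " ", text, flags=re.DOTALL)
    # Display math: $$...$$ and \[...\].
    text = re.sub(r"\$\$.*?\$\$|\\\[.*?\\\]", " ", text, flags=re.DOTALL)
    # Inline math: $...$ and \(...\).
    text = re.sub(r"(?<!\\)\$.*?(?<!\\)\$|\\\(.*?\\\)", " ", text,
                  flags=re.DOTALL)
    # Commands whose argument is a key or file name, not prose.
    text = re.sub(
        r"\\(" + NON_TEXT_CMDS + r")\*?(\[[^\]]*\])?\{[^}]*\}", " ", text)
    # Remaining \begin{...} and \end{...} markers.
    text = re.sub(r"\\(begin|end)\{[^}]*\}", " ", text)
    # Other command names and optional [..] arguments; braced text is kept.
    text = re.sub(r"\\[A-Za-z]+\*?(\[[^\]]*\])?", " ", text)
    # Escaped single characters (\\, \&, \%) and leftover grouping symbols.
    text = re.sub(r"\\.", " ", text)
    text = re.sub(r"[{}~]", " ", text)
    return text


def bag_of_words(source: str) -> Counter:
    """Count lowercased alphabetic words left after stripping the markup."""
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", strip_latex(source).lower())
    return Counter(words)
```

Calling bag_of_words(open(path).read()) on each file (path being whatever 
file is processed) gives one word-count table per document.  Keeping the 
braced argument of unknown commands is a deliberate choice: it preserves 
section titles and emphasized text, at the cost of occasionally letting 
a stray key or option through.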

Please copy me on the response; I'm not subscribed to the latex2html 
mailing list.

thanks in advance,
hamilton


