[l2h] extracting bag-of-words from latex
Hamilton Link
hamlink at cs.unm.edu
Tue Feb 14 21:45:00 CET 2006
Hi, I'm about to try to process a large set of LaTeX files. What I
would like is to strip the files of equations, formatting, comments,
etc. to produce a text file of "just the words," so to speak. As far
as I can tell, the possible ways of doing this are:
- compile the LaTeX to PS or PDF and then run a word extractor on that
- run latex2rtf or latex2html and do word extraction from that
Does anyone on the list know of a better way, or have suggestions for
latex2html configurations or settings that might ease the process?
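
To make the goal concrete, here is a minimal Python sketch of the kind of
"just the words" output I am after, done as a crude regex pass directly over
the LaTeX source. It is only an illustration and not something latex2html
provides: the function name latex_to_words, the list of math environments,
and the list of commands whose arguments get dropped are all assumptions, and
it ignores user-defined macros, nested braces, verbatim blocks, and \input
files.

```python
import re
from collections import Counter

# Hypothetical helper: naive reduction of LaTeX source to lowercase words.
# The environment and command lists below are assumptions, not exhaustive.
MATH_ENVS = r"equation|align|eqnarray|displaymath|gather|multline"
DROP_ARG_CMDS = r"cite[a-z]*|ref|eqref|label|includegraphics|usepackage|documentclass"

def latex_to_words(source):
    # Remove comments: an unescaped % through the end of the line.
    text = re.sub(r"(?<!\\)%.*", "", source)
    # Remove display-math environments together with their contents.
    text = re.sub(
        r"\\begin\{(" + MATH_ENVS + r")\*?\}.*?\\end\{\1\*?\}",
        " ", text, flags=re.DOTALL,
    )
    # Remove \[...\], $$...$$ and inline $...$ math.
    text = re.sub(r"\\\[.*?\\\]", " ", text, flags=re.DOTALL)
    text = re.sub(r"\$\$.*?\$\$", " ", text, flags=re.DOTALL)
    text = re.sub(r"(?<!\\)\$.*?(?<!\\)\$", " ", text, flags=re.DOTALL)
    # Remove commands whose braced argument is a key or file name, not prose.
    text = re.sub(
        r"\\(" + DROP_ARG_CMDS + r")\*?(\[[^\]]*\])*\{[^}]*\}", " ", text
    )
    # Remove the \begin{...} / \end{...} markers of other environments,
    # keeping the text between them.
    text = re.sub(r"\\(begin|end)\{[^}]*\}", " ", text)
    # Remove remaining command names (and one optional [..] argument),
    # keeping the text of their braced arguments.
    text = re.sub(r"\\[a-zA-Z]+\*?(\[[^\]]*\])?", " ", text)
    # Remove escaped single characters such as \% or \&.
    text = re.sub(r"\\.", " ", text)
    # Whatever alphabetic runs remain are treated as the words.
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def bag_of_words(source):
    # Word -> count mapping for one document.
    return Counter(latex_to_words(source))
```

For example, calling bag_of_words(open("paper.tex").read()) on each file
(paper.tex is a placeholder name) would give one word-count table per
document. Going through latex2html or a PS/PDF extractor should handle
macros and included files far better than this; the sketch is only meant to
show the target output.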
Please copy me on the response, as I'm not subscribed to the latex2html
mailing list.
Thanks in advance,
hamilton