[texworks] Wordcount

Mojca Miklavec mojca.miklavec.lists at gmail.com
Sat Aug 1 11:27:34 CEST 2009


On Sat, Aug 1, 2009 at 06:24, CB wrote:
>> It's a reasonable request, though doing a word-count for any kind of (La)TeX
>> document can be a rather ill-defined affair -- what exactly is a "word", and
>> which ones count as being part of the document content? These things are
>> often not clear-cut.
>
> True enough, evidenced by the fact that I've tried a few LaTeX
> word-counting tools over the years, but haven't ever had 2 that give
> the same result. One possibility might be to count the words in the
> pdf output rather than the source. That way the user could decide
> which bits to typeset and thus have included in the count. I don't
> know if this is technically feasible, but if so it would get around
> many of the issues.

Counting words in the resulting PDF probably works a bit better, but even then:
- there are page numbers as well as headers & footers
- there are section numbers
- there are footnote numbers
- there are math formulas (how many words are
\sqrt{a+b+\sin\alpha=\hbox{something}}?), formula numbers and
fractions that get split into multiple numbers in PDF
- there are tables with numbers; numbers are sometimes separated with
dots, sometimes with commas (almost impossible to guess whether a comma
is a thousands/decimal separator or a separator between numbers)
- you can create a TikZ/MetaPost graphic and place some labels on the
figure; whether those labels count or not depends on the tool
that you use for figures (precompiled or not); worse - you can
probably even include tables as existing PDF figures
- there are hyphenated words, words like \alpha-helix, \gamma-rays
- there are accented letters that PDF viewers are not able to handle properly
- this must be a bug in an Apple library, but when I copy-paste text from
a PDF, the accents are lost and words are split in two before the
letter j
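
To illustrate a few of the points above, here is a minimal Python sketch.
The "extracted" string is made up (it stands in for text copied out of a
PDF; no real PDF library is involved), and count_pdf_words is a
hypothetical helper, not part of any existing tool. It only shows how
much the total depends on what you decide to do with numbers.

```python
import re

def count_pdf_words(text: str, skip_numbers: bool = False) -> int:
    """Count whitespace-separated tokens in text copied from a PDF."""
    tokens = text.split()
    if skip_numbers:
        # Treat tokens made only of digits, dots and commas as numbers
        # (section numbers, page numbers, table entries, ...).
        tokens = [t for t in tokens if not re.fullmatch(r"[\d.,]+", t)]
    return len(tokens)

# Made-up sample: a section number, one sentence, and a page number.
extracted = "2.1 Results\nThe yield was 1,234 kg in 2009.\n17"

print(count_pdf_words(extracted))                     # every token counts
print(count_pdf_words(extracted, skip_numbers=True))  # numbers dropped
```

The two calls give 10 and 6 for the same text, and neither of them knows
whether "1,234" was meant as one number or two.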

Counting the words from the source document is mission impossible. I mean
- you can apply some heuristics, but as soon as you start using
    \def\test{this is a long sequence}\test\test\test
the word count in the source will fail considerably unless you reimplement
TeX in it. And it doesn't even take that much: {\v c} alone can
cause enough confusion for a word counter.
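
A rough Python sketch of what such a heuristic counter might look like
(naive_source_word_count and its regular expressions are my own
assumptions for illustration, not how any existing tool works). It
strips control sequences and braces without expanding anything:

```python
import re

def naive_source_word_count(tex: str) -> int:
    """Count words in TeX source by crude heuristics, without expansion."""
    # Drop comments: an unescaped % up to the end of the line.
    text = re.sub(r"(?<!\\)%.*", "", tex)
    # Drop control sequences such as \def, \test, \v (and \<symbol>).
    text = re.sub(r"\\[A-Za-z]+|\\.", " ", text)
    # Braces are treated as word separators.
    text = re.sub(r"[{}]", " ", text)
    # A "word" here is a run of letters.
    return len(re.findall(r"[^\W\d_]+", text))

macro_source = r"\def\test{this is a long sequence}\test\test\test"
accent_source = r"{\v c}e{\v s}tina"

print(naive_source_word_count(macro_source))   # the typeset text has 15 words
print(naive_source_word_count(accent_source))  # the typeset text has 1 word
```

The first call reports 5 words where the typeset output has 15, and the
second reports 4 words for a single accented word.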

If you just need an informative word count, anything will do (copy from
the PDF and paste into Word, or use "wc" on the command line), but if you
need to write an article with exactly some number of words or if you need to
charge a client for translation of a text that's X words long ... those
statistics can be highly misleading and I would not rely on them.

In any case: if you want to do a character count and a simplistic word
count in the editor, that should be done by a Lua script in my opinion (so
that it becomes more flexible).
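
Just to show how little such a script would have to do, here is the
logic as a small Python sketch. It is plain Python, not the TeXworks
scripting API, and simple_counts is a made-up name; a Lua version would
have the same shape.

```python
def simple_counts(text: str) -> dict:
    """Return simplistic character and word counts for a piece of text."""
    return {
        # All characters, including spaces and line breaks.
        "characters": len(text),
        # Characters that are not whitespace.
        "characters_no_spaces": sum(1 for ch in text if not ch.isspace()),
        # Whitespace-separated tokens; TeX commands count as "words" too.
        "words": len(text.split()),
    }

selection = "two words\n"
print(simple_counts(selection))
```

Everything TeX-specific (macros, accents, math) is ignored on purpose,
which is exactly why the result is only informative.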


One idea did come to my mind, though. It wouldn't solve any of the
above-mentioned problems with accuracy, but could come closest to it
... asking the author of SyncTeX to ship some statistics about
character and word count. I have no idea how SyncTeX works, but it
knows a bit about both the TeX source and the PDF, so among all the possible
tools ... that one could have the most clue about what's happening in the
background and could serve most users at once. The problem is still
that even if you got those statistics, they would remain highly
inaccurate no matter how hard you try. The real problem comes when
people start believing in those statistics blindly.

Mojca

