[tex-live] apparent bug in detex

cfrees at imapmail.org cfrees at imapmail.org
Mon Nov 8 20:18:20 CET 2010

Thanks to all the people who provided suggestions regarding my unusably
slow copies of pdftotext, pdfinfo etc. I now have *two* versions of
these tools which work admirably quickly. (At least, pdftotext does.)

In case anybody is interested, one of these is installed from xpdf.
This compiled relatively easily. (I didn't want to use the binaries
because I would have lost xpdf itself.)

The second is from an updated compile of poppler. This is more
problematic both because I'm worried about what might be depending on
poppler and because compiling it turned out to be a considerable pain.
I suspect I may have broken gimp... More specifically, compiling the
recommended openjpeg turned out to be far from straightforward. But the
resulting pdftotext does seem to work. The trick, for anybody who reads
this and decides to experiment, is to download the two patch files for
openjpeg provided by macports.  (You don't have to use macports to use
their patch files, fortunately.)

[I sometimes wonder why the amendments found necessary by the macports
people never seem to make it back upstream. This is especially true in
this case where one of the two files requiring a patch is specific to
mac os x/darwin. Neither of the patches has anything to do with
macports specifically. They are just what's needed to get it to compile
on os x. In any case, I wasted a lot of time before thinking to check
for macports patches.]

Having done all this, I'm not convinced this is the best way to get an
accurate word count. The output from the xpdf version of pdftotext is
strange and includes many non-ascii character codes. (For example, it
seems unable to deal with ligatures.) The output from the poppler
version is definitely more readable. Even so, it isn't that suitable
for word count purposes as it includes, for example, the header, footer
and page number from each page. Especially in a draft, I have quite a
lot of information in the footer so the repetition of this per page
would skew the word count considerably.
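
For anybody who wants to try working around the repeated header/footer
problem, here is a rough sketch in Python. It is only a heuristic, not
something I would trust blindly. It relies on pdftotext separating pages
with a form feed character, and it assumes that a line which recurs on
most pages (once digits are masked, so that page numbers compare equal)
is a header, footer or page number. The function name count_words and
the repeat_threshold parameter are made up for illustration.

```python
import re
from collections import Counter


def count_words(text, repeat_threshold=0.5):
    """Count words in pdftotext output, skipping lines repeated across pages.

    Assumption: a line appearing on at least `repeat_threshold` of the pages
    (and on at least two pages) is a header, footer or page number.
    """
    # pdftotext puts a form feed between pages; drop empty trailing chunks.
    pages = [p for p in text.split("\f") if p.strip()]

    def norm(line):
        # Mask digits so "Page 3" and "Page 4" are treated as the same line.
        return re.sub(r"\d+", "#", line.strip())

    # Count on how many pages each normalised line occurs (once per page).
    seen = Counter()
    for page in pages:
        seen.update({norm(l) for l in page.splitlines() if l.strip()})

    # Never treat a line as repeated unless it occurs on at least two pages.
    limit = max(2, repeat_threshold * len(pages))

    total = 0
    for page in pages:
        for line in page.splitlines():
            key = norm(line)
            if not key or seen[key] >= limit:
                continue  # blank line, or looks like a header/footer
            total += len(line.split())
    return total
```

To use it, something like "pdftotext draft.pdf draft.txt" (draft.pdf
being whatever your file is called) and then pass the contents of
draft.txt to count_words. Note that this will also throw away any body
line that happens to recur on most pages, and it does nothing about
the references.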

It also includes the references. I'm honestly not certain if these are
meant to be counted or not so I'm not sure whether to consider this a
feature or a bug.

I suspect texcount is giving me a more accurate count than any of the
other methods right now, though it is hard to be certain about this.


