distribution of bibtex scraping methods, an end of year futility review lol

Mike Marchywka marchywka at hotmail.com
Thu Dec 24 21:16:32 CET 2020


From time to time I've posted about the problems of scraping bibtex from webpages. My script to
do this has evolved, and I keep threatening to move the logic into C++ while still using mostly bash
invocations to do the real work. I was trying to determine which of the methods I use
are really worth the effort to maintain. In retrospect, I'm not sure how many publisher-specific
methods are really worthwhile, as scraping the DOI and using Crossref seems to work
well most of the time, although Crossref is also throttling requests, which is annoying while trying
to do research because I have to wait for the response and manually inspect it.
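
For illustration, the DOI route can be sketched in a few lines of shell. This is a minimal sketch and not the actual script: the function names extract_doi and doi_to_bibtex are made up here, the DOI pattern is a loose heuristic that can pick up trailing punctuation, and the retry count and delays are arbitrary. It assumes curl is installed and relies on DOI content negotiation (asking doi.org for application/x-bibtex), which Crossref-registered DOIs generally support.

```shell
# Hypothetical helpers; the names are illustrative only.

# Print the first DOI-looking string found on stdin.
# The pattern is a heuristic: "10.", 4 to 9 digits, "/", then a suffix
# running up to the next whitespace, quote, or angle bracket.
extract_doi() {
  grep -oE '10\.[0-9]{4,9}/[^[:space:]"<>]+' | head -n 1
}

# Ask the DOI resolver for BibTeX via content negotiation.
# On failure (for example, when throttled), pause and retry with a
# doubling delay. The default of 3 tries and the 2 second starting
# delay are arbitrary choices.
doi_to_bibtex() {
  local doi="$1" tries="${2:-3}" delay=2 i=1
  while [ "$i" -le "$tries" ]; do
    if curl -sSfL -H 'Accept: application/x-bibtex' "https://doi.org/${doi}"; then
      return 0
    fi
    if [ "$i" -lt "$tries" ]; then
      sleep "$delay"
      delay=$((delay * 2))
    fi
    i=$((i + 1))
  done
  return 1
}
```

Something like doi=$(curl -sL "$url" | extract_doi) followed by doi_to_bibtex "$doi" would then cover the common case, with the result still needing a manual look before it goes into a bib file.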

Just from the bib files I have right now, and since I made the comments uniform (only recently),
here is a list of the most common bibtex sources. The "autofetched" entries are from citations
named with a PMID or PMC number and fetched from PubMed during the build.
"handle" refers to a publisher-agnostic scraping system. The "guess" words tend to
be named after specific domains for given publishers. These sites tend to change often
and have various kinds of anti-automation features. I had not realized how often
I was scraping from pdf files, however. Some of the publisher scrapers are worthwhile, though,
because the manual process is harder than maintaining the script and the DOI may not
be prominent lol. fwiw

grep "med2bib comment\|autofetch" `find .. -name "*.bib"| grep pmc ` | sed -e 's/at .*//g' |  mjm zed 1 | sort | uniq -c | sort -r -g | sed -e 's/med2bib comment://g' 
    689 % autofetched
    265  handledoi
     90  handlepdf
     41  handlehighwire
     39  guessscidirect
     23  guesswiley
     22  guessspringer
     21  biomedcentral.com
     17  guessresearchgate2
     16  handlegsmeta
     15  guessoup
     15  guessjbc
     13  guessplos
     11  guesscitmgr
     11  guesscambridge
      8  guesssemantic
      7  highwire/asm.org
      6  guessepmc
      6  guessahajournal
      5  guessnature
      5  guesskarger
      4  handlepdfexif
      4  asm.org
      3  rawdoi
      3  guesstandf
      3  guesskidint
      3  guessjci
      2  handlespring
      2  guessuridc
      2  guesssemanticnu
      2  guessmdpi
      2  guesslibert
      2  guessfuture
      2  guesselife
      2  guessarxivthree
      2 autofetched
      1  guesssci
      1  guessnejm
      1  guessjlr
      1  guessfcklww
      1  guessfasebh doi=10.1096/fasebj.2019.33.1_supplement.719.14&downloadFileName=dummy&include=abs&format=bibtex&direct=
      1  guesscab
      1  embropress.org
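
A tally along these lines can also be sketched with standard tools only. This is a rough sketch under an assumed comment format, "% med2bib comment: <source> at <date>", which is a guess based on the output above; the function name tally_sources is made up, and the sketch skips the "autofetched" lines and whatever the local "mjm zed 1" step does.

```shell
# Count the sources named in "med2bib comment:" lines of the given .bib files.
# Assumed (hypothetical) line format: "% med2bib comment: <source> at <date>".
tally_sources() {
  # -h drops the file name prefix grep adds when given several files.
  grep -h 'med2bib comment:' "$@" |
    # Keep only the first word after the "med2bib comment:" tag.
    sed -e 's/.*med2bib comment:[[:space:]]*//' -e 's/[[:space:]].*//' |
    # Count identical source names, most frequent first.
    sort | uniq -c | sort -rn
}
```

It would be run as something like tally_sources $(find .. -name "*.bib"), which breaks on file names containing spaces.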







More information about the texhax mailing list.