distribution of bibtex scraping methods, an end of year futility review lol
Mike Marchywka
marchywka at hotmail.com
Thu Dec 24 21:16:32 CET 2020
From time to time I've posted about the problems of scraping bibtex from webpages. My script to
do this has evolved, and I keep threatening to move the logic into c++ while still using mostly bash
invocations to do the real work. I was trying to determine which of the methods I use
are really worth the effort to maintain. In retrospect I'm not sure how many publisher-specific
methods are worthwhile, as scraping the doi and using crossref seems to work
well most of the time, although they are also throttling requests, which is annoying while trying
to do research because I have to wait for the response and then manually inspect it.
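For illustration, the doi-then-crossref route can be sketched in bash roughly as follows. This is a minimal sketch and not the actual script: the function names extract_doi and fetch_bibtex are hypothetical, the DOI pattern is only a heuristic, and it assumes doi.org still answers content negotiation for "application/x-bibtex". The fixed pause is a crude stand-in for real throttling handling.

```shell
#!/usr/bin/env bash
# Sketch only: extract_doi and fetch_bibtex are hypothetical names,
# not taken from the script described in this post.

extract_doi() {
    # Reads HTML or plain text on stdin and prints the first string that
    # looks like a DOI. The pattern is a heuristic and stops at whitespace,
    # quotes, angle brackets and '&', so unusual DOIs can be missed.
    grep -o -E '10\.[0-9]{4,9}/[^[:space:]"<>&]+' | head -n 1
}

fetch_bibtex() {
    # $1 = DOI, $2 = optional pause in seconds before the request (default 1).
    # Assumption: the resolver honours the BibTeX Accept header.
    local doi="$1" pause="${2:-1}"
    sleep "$pause"
    curl -sSL -H 'Accept: application/x-bibtex' "https://doi.org/${doi}"
}
```

Something like `doi=$(extract_doi < page.html) && fetch_bibtex "$doi" 2` would then print the entry for a saved page (page.html is a placeholder name), and the result still needs a manual look before it goes into a bib file.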
Just from the bib files I have right now, and since I made the comments uniform (only recently),
here is a list of the most common bibtex sources. The "autofetched" entries are from citations
named with a PMID or PMC number and fetched from pubmed during the build.
"handle" refers to a publisher-agnostic scraping system. The "guess" methods tend to
be named after the specific domains of given publishers. These sites tend to change often
and have various kinds of anti-automation features. I had not realized how often
I was scraping from pdf files, however. Some of the publisher scrapers are still worthwhile
because the manual process is harder than maintaining the script and the doi may not
be prominent lol. fwiw
grep "med2bib comment\|autofetch" `find .. -name "*.bib"| grep pmc ` | sed -e 's/at .*//g' | mjm zed 1 | sort | uniq -c | sort -r -g | sed -e 's/med2bib comment://g'
689 % autofetched
265 handledoi
90 handlepdf
41 handlehighwire
39 guessscidirect
23 guesswiley
22 guessspringer
21 biomedcentral.com
17 guessresearchgate2
16 handlegsmeta
15 guessoup
15 guessjbc
13 guessplos
11 guesscitmgr
11 guesscambridge
8 guesssemantic
7 highwire/asm.org
6 guessepmc
6 guessahajournal
5 guessnature
5 guesskarger
4 handlepdfexif
4 asm.org
3 rawdoi
3 guesstandf
3 guesskidint
3 guessjci
2 handlespring
2 guessuridc
2 guesssemanticnu
2 guessmdpi
2 guesslibert
2 guessfuture
2 guesselife
2 guessarxivthree
2 autofetched
1 guesssci
1 guessnejm
1 guessjlr
1 guessfcklww
1 guessfasebh doi=10.1096/fasebj.2019.33.1_supplement.719.14&downloadFileName=dummy&include=abs&format=bibtex&direct=
1 guesscab
1 embropress.org
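To compare the families rather than the individual methods, a tally like the one above can be collapsed by name prefix. The sketch below is an assumption about how one might do that, not part of the original pipeline: family_totals is a hypothetical helper, and anything not starting with "autofetch", "handle" or "guess" is lumped into "other".

```shell
#!/usr/bin/env bash
# Sketch only: family_totals is a hypothetical helper name.

family_totals() {
    # stdin: "count label" lines such as "265 handledoi" or
    # "689 % autofetched". Extra trailing fields are ignored.
    awk '
        NF < 2 { next }                      # skip blank or malformed lines
        {
            label = $2
            if (label == "%") label = $3     # skip a leading comment marker
            if (label ~ /^autofetch/)    fam = "autofetched"
            else if (label ~ /^handle/)  fam = "handle"
            else if (label ~ /^guess/)   fam = "guess"
            else                         fam = "other"
            total[fam] += $1
        }
        END {
            for (fam in total) printf "%d %s\n", total[fam], fam
        }
    ' | sort -k1,1nr -k2,2                   # largest family first
}
```

Pasting the tally into a file and running `family_totals < tally.txt` (tally.txt being a placeholder name) gives one line per family, which makes it easier to see how much the publisher-specific "guess" scrapers contribute overall.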
note new address
Mike Marchywka 306 Charles Cox Drive Canton, GA 30115
2295 Collinworth Drive Marietta GA 30062. formerly 487 Salem Woods Drive Marietta GA 30067 404-788-1216 (C)<- leave message 989-348-4796 (P)<- emergency