html scraping for bibtex seems to be more unreliable, a case with Zotero

Mike Marchywka marchywka at hotmail.com
Mon Jul 4 21:42:49 CEST 2022


I was curious about this, 

https://www.mdpi.com/2072-6643/14/3/639

with the Zotero web form returning this, 

@article{grant_narrative_2022,
	title = {A narrative review of the evidence for variations in serum 25-hydroxyvitamin d concentration thresholds for optimal health},
	volume = {14},
	copyright = {http://creativecommons.org/licenses/by/3.0/},
	issn = {2072-6643},
	url = {https://www.mdpi.com/2072-6643/14/3/639},
	doi = {10.3390/nu14030639},
	abstract = {Vitamin D3 has many important health benefits. Unfortunately, these benefits are not widely known among health care personnel and the general public. As a result, most of the world’s population has serum 25-hydroxyvitamin D (25(OH)D) concentrations far below optimal values. This narrative review examines the evidence for the major causes of death including cardiovascular disease, hypertension, cancer, type 2 diabetes mellitus, and COVID-19 with regard to sub-optimal 25(OH)D concentrations. Evidence for the beneficial effects comes from a variety of approaches including ecological and observational studies, studies of mechanisms, and Mendelian randomization studies. Although randomized controlled trials (RCTs) are generally considered the strongest form of evidence for pharmaceutical drugs, the study designs and the conduct of RCTs performed for vitamin D have mostly been flawed for the following reasons: they have been based on vitamin D dose rather than on baseline and achieved 25(OH)D concentrations; they have involved participants with 25(OH)D concentrations above the population mean; they have given low vitamin D doses; and they have permitted other sources of vitamin D. Thus, the strongest evidence generally comes from the other types of studies. The general finding is that optimal 25(OH)D concentrations to support health and wellbeing are above 30 ng/mL (75 nmol/L) for cardiovascular disease and all-cause mortality rate, whereas the thresholds for several other outcomes appear to range up to 40 or 50 ng/mL. The most efficient way to achieve these concentrations is through vitamin D supplementation. Although additional studies are warranted, raising serum 25(OH)D concentrations to optimal concentrations will result in a significant reduction in preventable illness and death.},
	language = {en},
	number = {3},
	urldate = {2022-07-04},
	journal = {Nutrients},
	author = {Grant, William B. and Al Anouti, Fatme and Boucher, Barbara J. and Dursun, Erdinç and Gezen-Ak, Duygu and Jude, Edward B. and Karonova, Tatiana and Pludowski, Pawel},
	month = jan,
	year = {2022},
	keywords = {Alzheimer’s disease, cancer, cardiovascular disease, COVID-19, diabetes, hypertension, Mendelian randomization, vitamin D, 25-hydroxyvitamin D},
	pages = {639},
}

but the "jan" month appears to be wrong. This may be an issue with the "real"
print date vs. the date on the journal, but probably not in
this case. The history from the html:

Received: 7 January 2022 / Revised: 25 January 2022 / Accepted: 28 January 2022 / Published: 2 February 2022

I was puzzled because I had to fix some TooBib date code recently.
The first TooBib hit is below, but using the "-all" mode
I found the source of the "jan" month by grepping the 14 candidates
TooBib found. It looks like the html contains a citation_publication_date
entry of 2022/1, making this another instance where the html from the
site seems incorrect.

grep -i "handler\|publicat"   xxxx 
% mjmhandler: toobib guesssmdpi (xref) 
% mjmhandler: toobib guesssmdpi
final_assembly ={ TooBib handler guesssmdpi},
[...]
publication-date = {2022-02-02},
final_assembly ={ TooBib handler handledoixml ( quality i=0 szr=3 goods=1 )(crossref)},
% mjmhandler: toobib handleldjson2(all)
final_assembly ={ TooBib handler handleldjson2(all)},
% mjmhandler: toobib handleadhochtml<-citation
publication_date = {2022/1},
final_assembly ={ TooBib handler handleadhochtml},
% mjmhandler: toobib handleadhochtml<-DC
final_assembly ={ TooBib handler handleadhochtml},
% mjmhandler: toobib handleadhochtml<-og
final_assembly ={ TooBib handler handleadhochtml},
% mjmhandler: toobib handleadhochtml<-all
citation_publication_date = {2022/1},
prism.publicationdate = {2022-02-02},
prism.publicationname = {Nutrients},

... etc ... 
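
The same cross-check can be sketched outside TooBib. This is a minimal
Python example, not how TooBib or Zotero do it: it collects every meta tag
whose name mentions "date" and reports them when they disagree on year and
month. The names MetaDateCollector, year_month and conflicting_dates are
made up for illustration, and the sample html is a hypothetical fragment
that only reuses the two values from the grep output above.

```python
from html.parser import HTMLParser


class MetaDateCollector(HTMLParser):
    """Collect <meta name=... content=...> pairs whose name mentions 'date'."""

    def __init__(self):
        super().__init__()
        self.dates = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # Some sites use property= instead of name=; accept either.
        name = (attr.get("name") or attr.get("property") or "").lower()
        content = attr.get("content")
        if "date" in name and content:
            self.dates[name] = content.strip()


def year_month(value):
    """Return (year, month) from strings like '2022/1' or '2022-02-02'."""
    parts = [p for p in value.replace("/", "-").split("-") if p]
    year = int(parts[0])
    month = int(parts[1]) if len(parts) > 1 else None
    return year, month


def conflicting_dates(html):
    """Map each date-like meta tag to (year, month); empty dict if all agree."""
    collector = MetaDateCollector()
    collector.feed(html)
    parsed = {k: year_month(v) for k, v in collector.dates.items()}
    return parsed if len(set(parsed.values())) > 1 else {}


# Hypothetical fragment; only the two values come from the output above.
sample = """
<meta name="citation_publication_date" content="2022/1">
<meta name="prism.publicationDate" content="2022-02-02">
"""
print(conflicting_dates(sample))
```

A non-empty result only says the page contradicts itself; deciding which
tag to trust still needs another source such as the crossref record.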

The first TooBib hit used crossref to look up the scraped DOI. From the
local copy of the PDF it did about the same thing,  


 toobib -local -clip
toobib set to ../toobib/toobib.out -devel
mjm>clip xxxx
./toobib.h546  cmd=clip p1=xxxx p2= flags=18 x.flag_to_string(flags,0)=show_trial paste_citation 
./mjm_med2bib_guesses.h982  uin=https://www.mdpi.com/2072-6643/14/3/639 dest=xxxx flags=18
./mjm_med2bib_guesses.h1136 % mjmhandler: toobib guesssmdpi (xref) 
% date 2022-07-04:15:38:15 Mon Jul 4 15:38:15 EDT 2022
% srcurl: https://www.mdpi.com/2072-6643/14/3/639
% citeurl: http://api.crossref.org/works/10.3390/nu14030639
@article{Grant_Anouti_Boucher_Narrative_Review_2022,
X_TooBib = {date: 02/02/2022},
X_TooBib = {year: 2022,  infield_fix_dates },
X_TooBib = {month: 02,  infield_fix_dates },
X_TooBib = {day: 02,  infield_fix_dates },
X_TooBib = {urldate: FixBeKvp s= cmd=date "+%Y-%m-%d" d=2022-07-04 dn=urldate},
X_TooBib = {author: Grant , William B. and Anouti , Fatme Al and Boucher , Barbara J. and Dursun , Erdin\c{c} and Gezen-Ak , Duygu and Jude , Edward B. and Karonova , Tatiana and Pludowski , Pawel},
abbrvjrnl = {Nutrients},
abstract = {<jats:p>{Vitamin} {D3} has many important health benefits. {Unfortunately,} these benefits are not widely known among health care personnel and the general public. {As} a result, most of the world's population has serum 25-hydroxyvitamin {D} {(25(OH)D)} concentrations far below optimal values. {This} narrative review examines the evidence for the major causes of death including cardiovascular disease, hypertension, cancer, type {2} diabetes mellitus, and {COVID-19} with regard to sub-optimal {25(OH)D} concentrations. {Evidence} for the beneficial effects comes from a variety of approaches including ecological and observational studies, studies of mechanisms, and {Mendelian} randomization studies. {Although} randomized controlled trials {(RCTs)} are generally considered the strongest form of evidence for pharmaceutical drugs, the study designs and the conduct of {RCTs} performed for vitamin {D} have mostly been flawed for the following reasons: they have been based on vitamin {D} dose rather than on baseline and achieved {25(OH)D} concentrations; they have involved participants with {25(OH)D} concentrations above the population mean; they have given low vitamin {D} doses; and they have permitted other sources of vitamin {D.} {Thus,} the strongest evidence generally comes from the other types of studies. {The} general finding is that optimal {25(OH)D} concentrations to support health and wellbeing are above {30} {ng/mL} {(75} {nmol/L)} for cardiovascular disease and all-cause mortality rate, whereas the thresholds for several other outcomes appear to range up to {40} or {50} {ng/mL.} {The} most efficient way to achieve these concentrations is through vitamin {D} supplementation. {Although} additional studies are warranted, raising serum {25(OH)D} concentrations to optimal concentrations will result in a significant reduction in preventable illness and death.</jats:p>},
abstract_as_rcvd = {<jats:p>Vitamin D3 has many important health benefits. Unfortunately, these benefits are not widely known among health care personnel and the general public. As a result, most of the world’s population has serum 25-hydroxyvitamin D (25(OH)D) concentrations far below optimal values. This narrative review examines the evidence for the major causes of death including cardiovascular disease, hypertension, cancer, type 2 diabetes mellitus, and COVID-19 with regard to sub-optimal 25(OH)D concentrations. Evidence for the beneficial effects comes from a variety of approaches including ecological and observational studies, studies of mechanisms, and Mendelian randomization studies. Although randomized controlled trials (RCTs) are generally considered the strongest form of evidence for pharmaceutical drugs, the study designs and the conduct of RCTs performed for vitamin D have mostly been flawed for the following reasons: they have been based on vitamin D dose rather than on baseline and achieved 25(OH)D concentrations; they have involved participants with 25(OH)D concentrations above the population mean; they have given low vitamin D doses; and they have permitted other sources of vitamin D. Thus, the strongest evidence generally comes from the other types of studies. The general finding is that optimal 25(OH)D concentrations to support health and wellbeing are above 30 ng/mL (75 nmol/L) for cardiovascular disease and all-cause mortality rate, whereas the thresholds for several other outcomes appear to range up to 40 or 50 ng/mL. The most efficient way to achieve these concentrations is through vitamin D supplementation. Although additional studies are warranted, raising serum 25(OH)D concentrations to optimal concentrations will result in a significant reduction in preventable illness and death.</jats:p>},
affiliation = {},
alternative-id = {nu14030639},
author = {Grant , William B. and Anouti , Fatme Al and Boucher , Barbara J. and Dursun , Erdin\c{c} and Gezen-Ak , Duygu and Jude , Edward B. and Karonova , Tatiana and Pludowski , Pawel},
author_as_rcvd = {William B. Grant and Fatme Al Anouti and Barbara J. Boucher and Erdinç Dursun and Duygu Gezen-Ak and Edward B. Jude and Tatiana Karonova and Pawel Pludowski},
author_orig = {William B. Grant and Fatme Al Anouti and Barbara J. Boucher and Erdin\c{c} Dursun and Duygu Gezen-Ak and Edward B. Jude and Tatiana Karonova and Pawel Pludowski},
bib-source = {Crossref},
content-domain = {false},
date = {02/02/2022},
date-created = {2022-02-03T10:42:33Z},
date-deposited = {2022-02-032022-02-03T11:26:24Z},
date-indexed = {2022-06-24T04:24:50Z},
date-issued = {2022-02-02},
date-journal-issue = {2022-02},
date-license = {2022-02-022022-02-02T00:00:00Z},
date-published-online = {2022-02-02},
date_orig = { 2022-02       2022-02-02 },
day = {02},
deposited = {1643887584000},
doi = {10.3390/nu14030639},
is-referenced-by-count = {8},
issn = {2072-6643},
issn-type = {2072-6643, electronic},
issue = {3},
journal = {Nutrients},
journal-issue = {3},
language = {en},
license = {1643760000000, vor, 0, https://creativecommons.org/licenses/by/4.0/},
link = {https://www.mdpi.com/2072-6643/14/3/639/pdf, unspecified, vor, similarity-checking},
member = {1968},
month = {02},
page = {639},
prefix = {10.3390},
publication-date = {2022-02-02},
publisher = {MDPI AG},
reference = {deleted for space},
reference-count = {135},
references-count = {135},
resource = {https://www.mdpi.com/2072-6643/14/3/639},
score = {1},
subject = {Nutrition and Dietetics},
title = {A Narrative Review of the Evidence for Variations in Serum 25-Hydroxyvitamin D Concentration Thresholds for Optimal Health},
type = {journal-article},
url = {http://dx.doi.org/10.3390/nu14030639},
urldate = {2022-07-04},
volume = {14},
year = {2022},
final_assembly ={ TooBib handler guesssmdpi (xref) },
srcurl={https://www.mdpi.com/2072-6643/14/3/639},
xsrcurl={https://www.mdpi.com/2072-6643/14/3/639},
citeurl={http://api.crossref.org/works/10.3390/nu14030639}

}
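
For comparison, here is a rough Python sketch of taking the month from a
crossref work record instead of the scraped html. It assumes the usual
crossref JSON layout, where date fields such as "published-online" and
"issued" hold a "date-parts" list; the preference order in DATE_FIELDS and
the function name bibtex_date are my own choices for illustration, and the
record is hand-built from the dates in the entry above rather than fetched.

```python
MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]

# Crossref date fields to try, in an assumed order of preference.
DATE_FIELDS = ("published-online", "published-print", "issued")


def bibtex_date(message):
    """Return (year, month_macro_or_None) from a crossref work record."""
    for field in DATE_FIELDS:
        date_parts = message.get(field, {}).get("date-parts", [[]])
        parts = date_parts[0] if date_parts else []
        # Crossref may report an unknown date as [[None]]; skip those.
        if parts and parts[0] is not None:
            year = parts[0]
            month = MONTHS[parts[1] - 1] if len(parts) > 1 else None
            return year, month
    return None, None


# Hand-built record using the dates reported in the entry above.
message = {
    "published-online": {"date-parts": [[2022, 2, 2]]},
    "issued": {"date-parts": [[2022, 2, 2]]},
}
year, month = bibtex_date(message)
print(f"year = {{{year}}},\nmonth = {month},")
```

With the record above this gives month = feb, in line with the
"Published: 2 February 2022" history on the article page.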


I'd post this on the Zotero forum but AFAICT they are blocking
my contributions to humanity and this tedious task does
encounter a lot of problems :) I am getting a bit sarcastic
because this scraping is detracting from more important
stuff, but it should be fixed...
With a URL and DOI it is unlikely to matter if a month is
wrong, and I tend to prefer to credit the authors
with the earliest useful date, but that is another
issue if you are trying to establish the evolution of
thought on some topic or just who got there first...





-- 

mike marchywka
306 charles cox
canton GA 30115
USA, Earth 
marchywka at hotmail.com
404-788-1216
ORCID: 0000-0001-9237-455X



More information about the texhax mailing list.