www type bibtex entries - generating bibtex for webpages + prior theme.

Peter Flynn peter at silmaril.ie
Sun Sep 15 12:14:59 CEST 2019

On 14/09/2019 22:51, Mike Marchywka wrote:
> In a prior thread I was describing some reasons to prefer latex-like
> document "source" over things like html or explicit xml. 

I'm not clear what "explicit" XML is (as opposed to what?)

> Someone offered the CELT site below as an example of an experiment
> related to this topic. 

That would be me :-)

> In the link in the sample bibtex below, there is a link to xml
> described as the "source document",

Where did you see http://research.ucc.ie/celt/document/E590001-007 
described as the "source document"?

> %2019-09-14:17:16:49
> %autogenerated by toobib
> @www{CELTprojectBriefeucc,
> authors = {},
> title = {CELT project: A Briefe description of Ireland: made in this year, 1589, By Robert Payne | University College Cork},
> url = {http://research.ucc.ie/celt/document/E590001-007},
> urldate = {2019-09-14:17:16:49},
> year = {}
> }
> so called "source document":

That's some kind of auto-generated bib file about the web page. The CELT
project does not call this a source document. For the source
document you can look in the web page and click on "Header" and then
"Source" where you will find the BibTeX:

   editor 	 = {Aquilla Smith},
   title 	 = {A Brife description of Ireland: made in this yeere. 1589. 
By Robert Payne. vnto xxv. of his partners for whom he is undertaker 
there. Truely published verbatim, according to his letters, by Nich. 
Gorsan one of the said partners, for that he would his countrymen should 
be partakers of the many good Notes therein conteined. With diuers Notes 
taken out of others the Authoures letters written to his said partners, 
sithenes the first Impression, well worth the reading.},
   booktitle 	 = {Tracts relating to Ireland, printed for the Irish 
Archaeological Society.},
   address 	 = {Dublin},
   publisher 	 = {University Press, Graisberry and Gill},
   date 	 = {1841},
   volume 	 = {1},
   note 	 = {v–viii; 3–14 (separate pagination)}

> http://research.ucc.ie/celt/document/E590001-007.xml
> While it is quite true that this xml provides good explicit structure
> and is "human readable" it does not quite "flow" like simple latex
> source code.

I'm not clear what "flow" means in this context. The XML document is an
accurate representation of the original book from 1841. It begins
like this:

       <div0 type="description" lang="en">
	<head>A Brife description of Ireland: made in this yeere.
	  1589. By Robert Payne [...]</head>
	<pb n="3"/>
	<div1 type="section" n="1">
	  <p><text type="letter">
		<p>Let not the reportes of those that haue spent all
		  their owne and what they could by any meanes get
		  from others in England, discourage you from
		  Irela<ex>n</ex>d, although they and such others by
		  bad dealinges haue wrought a generall discredite to
		  all English men, in that countrie which are to the
		  Irishe vnknowen.</p>

I'm not sure that there is any other meaningful way to do it: the
objective of the project is to capture the text and *accurate* structure
of the original, so there's a divisional container, a heading, a
page-break, a numbered sub-container, with a quoted letter with its own
internal structure, etc.

> That is you could read most latex source as if it was meant to be
> understood versus html or this xml.

Correct. XML is a file storage format. It contains information that
LaTeX does not have by default (e.g. nested containers).

> The latex just provides 
> logical structure without a lot of verbosity 

Correct. XML is for *storing* the metadata — in this case for posterity 
— it makes no judgment about how you or anyone else will use it.

> and allows a renderer to define layout info for the latex things.

Right. The project could have used LaTeX (it was seriously considered 
back when it was starting in 1989) but wiser heads prevailed.

You can already see in the extract above that an editor has annotated 
her corrections wherever she expanded a word to complete the spelling, 
with the <ex> element type. In print, this would be rendered [n] or 
perhaps an italic n or an underlined n — that's a formatting decision 
for the publisher. Using XML, you don't specify *how* it looks, only 
that it exists. Scholars need the non-committal format so they can do 
things like studying the scriptorial or linguistic aspects of editions, 
so being able to retrieve all occurrences of editorial interventions in 
their context is important to them, much more so than how to typeset it.
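
As a rough illustration of that kind of retrieval, here is a minimal Python sketch that lists every <ex> expansion together with the paragraph it occurs in. The fragment is a hypothetical, shortened sample modelled on the extract above, not the real CELT file, and the function name is made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened fragment modelled on the extract quoted above;
# the real CELT file is much larger and uses more element types.
SAMPLE = """<div1 type="section" n="1">
  <p>discourage you from Irela<ex>n</ex>d, although they and such others</p>
  <p>all English men, in that countrie</p>
</div1>"""


def expansions_in_context(xml_text):
    """Return (expansion, paragraph text) for every <ex> element."""
    root = ET.fromstring(xml_text)
    hits = []
    for para in root.iter("p"):
        for ex in para.iter("ex"):
            # itertext() yields the paragraph text with the markup removed;
            # split/join collapses runs of whitespace and line breaks.
            context = " ".join("".join(para.itertext()).split())
            hits.append((ex.text, context))
    return hits


for expansion, context in expansions_in_context(SAMPLE):
    print(f"[{expansion}] in: {context}")
```

Because the markup only records that an expansion exists, the same data can be printed as [n], in italics, or not at all. Note that where paragraphs are nested, as in the real document, this simple loop would report an expansion once per enclosing <p>.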

> Anyway, the point in posting this time is to ask about citing web pages.

Use biblatex for formatting, not BibTeX, because the older BibTeX styles
tend not to have the right fields for citing web pages. See also


> For most articles intended to be cited, I had ways to scrape bibtex
> off the pages containing an abstract- if the link is on the
> clipboard the script can usually find a bibtex entry or a doi and
> call crossref.

Right. Scrapers are usually unreliable, even those in Zotero and Mendeley.
Most journal pages have a download, often including a .bib file, but even
if they only have RIS, you can still open that in JabRef and get the data
saved in BibTeX format.

> However, I need to make some arguments contrasted to "popular" or
> maybe news sites or cite commercial products that were mentioned in a
> work. Few of these provide bibtex for their pages although plenty
> have "share"  features. 

In those cases the only answer is to copy and paste into JabRef or 
whatever you use to manage your bibliography.

> AFAICT, even the CELT site did not provide much in the way of "how to
> cite" which is odd for their academic work and indeed confusing as
> you want to credit their work with displaying some other classic
> work. 

Yes, it's something missing which is on the list to implement. As I 
said, it's a new format and not everything is in place yet. However, 
very few people would ever need to cite the CELT *web page* itself. They 
would cite the quoted edition (which is why BibTeX is provided in every
document), and just add the URL as their link. The CELT editions can be 
treated exactly as the paper editions would be.

> Is there some obvious way anyone here would create a bibtex
> entry for the page above,

At the moment, only manually. But given your impetus, I can bump the 
priority level for providing this up a few notches. It's fairly complex
because decisions have to be made about, e.g., which version of the
title to use and how many of the editors to cite (some documents have
dozens).

> and as an example of the commercial site, for example,
> ./toobib.h608  m_bib.format()=%2019-09-14:17:45:00
> %autogenerated by toobib
> @www{ZincCapsHighPotencylifeextension,
> authors = {},
> title = {Zinc Caps High Potency, 50 mg 90 capsules | Life Extension    },
> url = {https://www.lifeextension.com/vitamins-supplements/item01813/zinc-caps-high-potency},
> urldate = {2019-09-14:17:45:00},
> year = {}
> }

I would make that something like:

author = {{Life Extension Foundation}},
title = {Zinc Caps High Potency, 50 mg 90 capsules},
url = 
urldate = {2019-09-14T17:45:00},
year = {2019},
address = {Fort Lauderdale, FL}
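
Two of those changes are mechanical and could be scripted. The Python sketch below renames the scraped `authors` field to `author` (the name BibTeX and biblatex styles actually read) and rewrites the scraped timestamp in ISO 8601 form with a `T` between date and time. The function name and the fallback of taking the year from the access date are assumptions for illustration only, not part of toobib.

```python
from datetime import datetime


def normalise_entry(fields):
    """Return a copy of a scraped field dict with standard names and dates."""
    out = dict(fields)
    # Styles read 'author'; a field called 'authors' is silently ignored.
    if "authors" in out:
        out["author"] = out.pop("authors")
    # The scraper writes YYYY-MM-DD:HH:MM:SS; ISO 8601 puts 'T' before the time.
    stamp = datetime.strptime(out["urldate"], "%Y-%m-%d:%H:%M:%S")
    out["urldate"] = stamp.isoformat()
    # Assumption: with no publication year on the page, use the access year.
    if not out.get("year"):
        out["year"] = str(stamp.year)
    return out


scraped = {
    "authors": "",
    "title": "Zinc Caps High Potency, 50 mg 90 capsules",
    "urldate": "2019-09-14:17:45:00",
    "year": "",
}
print(normalise_entry(scraped))
```

The author name and address still have to be supplied by hand, since nothing on the page states them in a machine-readable way.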

> The bibtex above is what I could scrape from the link using some code I wrote
> to do it automatically from the link itself, html fields like "title" and any
> "meta" it can find. 

Unless the page owner is aware of things like citation, that's probably 
all you'll ever get.
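
For comparison, this is roughly what such a scraper amounts to, sketched in Python with only the standard library. The page content, class name and function name are hypothetical; this is not the toobib code. It collects the <title> text and any name/content pairs from <meta> elements and writes them out as a biblatex @online entry.

```python
from html.parser import HTMLParser

# Hypothetical page: a title and a description, but no author metadata.
PAGE = """<html><head>
<title>Zinc Caps High Potency, 50 mg 90 capsules | Life Extension    </title>
<meta name="description" content="Hypothetical product description">
</head><body></body></html>"""


class TitleMetaParser(HTMLParser):
    """Collect the <title> text and name/property -> content pairs of <meta>."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            key = attrs.get("name") or attrs.get("property")
            if key and attrs.get("content"):
                self.meta[key.lower()] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def to_online_entry(key, html, url, urldate):
    """Format whatever the page declares as a biblatex @online entry."""
    parser = TitleMetaParser()
    parser.feed(html)
    # Prefer an Open Graph title if the page has one, else the <title> text.
    title = parser.meta.get("og:title") or parser.title.strip()
    author = parser.meta.get("author", "")
    return (
        f"@online{{{key},\n"
        f"  author = {{{author}}},\n"
        f"  title = {{{title}}},\n"
        f"  url = {{{url}}},\n"
        f"  urldate = {{{urldate}}}\n"
        "}\n"
    )


print(to_online_entry(
    "ZincCaps",
    PAGE,
    "https://www.lifeextension.com/vitamins-supplements/item01813/zinc-caps-high-potency",
    "2019-09-14",
))
```

With a page like this one the author field comes out empty, because the page simply does not say who wrote it.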

> Eventually I could chase down doi's or other cues, that is why I went
> from bash to c++, but hopefully it does not become that big a mess

I would have stuck with bash because of the huge range of facilities
designed for text manipulation, like tidy and the LTxml2 utilities.

> I guess if this worked well it would be nice to let publishers or 
> site owners use a similar tool to provide bibtex in a "how to cite" 
> button next to all the sharing stuff.

I doubt if they would be interested, to be honest.

> Google scholar probably did something like this to create their 
> bibtex but I was not sure if any of that is public or if other 
> mechanisms exist so I wrote my own code but it could be quite 
> involved and I'm not even sure how to use some of the fields. Is 
> there a style guide with this in it somewhere?

You can ask them :-)

