Project: Dublin Core metadata interface

This is a project idea for TUG's anticipated participation in Google's Summer of Code 2009; see the main TUG/GSoC page. The mentors would be Peter Flynn (University College Cork, Ireland) and Matthew Leingang (Courant Institute of Mathematical Sciences, New York). If you're interested, please contact the TUG GSoC mailing list.

“The Dublin Core Metadata Initiative is an open organization engaged in the development of interoperable online metadata standards that support a broad range of purposes and business models.” They have developed an abstract framework for metadata and several machine-readable representations of metadata statements, among them one in the Resource Description Framework (RDF). RDF itself has several XML representations. Dublin Core has a set of elements to describe resources (for example, author, date created, format), but a key feature is modularity: applications can devise their own elements and vocabularies to plug right in. It's also crucial that as much as possible (element names and value vocabularies) be named with URIs to avoid ambiguities.

One large user of RDF metadata is Adobe, creator of the PDF file format. Adobe's eXtensible Metadata Platform (XMP) allows PDF creators to embed arbitrary metadata into a PDF file. This metadata is visible to Adobe applications and a growing number of other search and archive tools, including Mac OS X's Spotlight. XMP is implemented in an XML representation of RDF.

At the same time, online bibliographic data-gathering utilities like Zotero let users capture bibliographic metadata in Z39.88 or RDF format embedded in web pages (or PDF documents) directly into their personal database, from which it can be saved in BibTeX and other reference formats for immediate use. A PDF document can thus contain not only its own metadata, but also the metadata of all external references which it cites (its bibliography and hyperlinks, for example).

It's currently possible, perhaps using Scott Pakin's hyperxmp package, for a LaTeX author familiar with RDF to write XMP packets and embed them in the PDF file produced by pdflatex. But the quantity and quality of metadata increase with the ease of the interface used to enter it, and this workflow is not very easy. Contrast this with the simple LaTeX commands \title, \author, \date, which are very easy to remember and use. There's also a lot of metadata that could be automatically discovered, such as document structure and references to other resources.

The key deliverables for this project would be

  1. an implementation of the Dublin Core Abstract Model in TeX
  2. methods to export metadata from the abstract model to external files in various formats, most importantly RDF+XML, maybe also DC-TEXT and N3
  3. in the case of pdflatex, automatic embedding of XMP packets into the product PDF with a default minimum of the XMP expression of the Z39.88 OpenURL COinS fields, both for the document's own metadata and for all references cited and external hyperlinks.
  4. a user-friendly interface for making metadata statements
  5. in the absence of specific author declarations in a pdflatex document, as much metadata should be embedded as can be detected automatically
  6. methods for package authors to declare new metadata element sets and vocabularies, in order for authors to write metadata specific to their field of interest. Personally, I am thinking of Learning Object Metadata (LOM); however, the mapping of LOM to Dublin Core is problematic.

Use case #1: A LaTeX author wants to write a document about calculus. She inserts metadata commands into the document preamble such as

\title{Graphing functions}
\author{Ms. Understood}
\subject{Calculus}
\subject[LCSH]{Mathematics -- Calculus -- Differential}

The first \subject statement would be readable by more agents (for instance, people) but less structured. The second conforms to the Library of Congress Subject Headings so a metadata reader aware of that vocabulary could index the resource as such.
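
A minimal sketch of how such an interface might be implemented, assuming the standard article class (which leaves \subject undefined). The internal names \dc@statements, \dc@add and \dc@statement are hypothetical, as is the choice to store each statement as a property URI, an optional vocabulary name and a value. A real package would map the vocabulary name to a URI and pass the statements to its exporters rather than print them to the log.

```latex
\documentclass{article}
\makeatletter
% Hypothetical store: one \dc@statement{property}{vocabulary}{value}
% is appended for every metadata declaration.
\newcommand*\dc@statements{}
\newcommand\dc@add[3]{%
  \g@addto@macro\dc@statements{\dc@statement{#1}{#2}{#3}}}
% \subject[<vocabulary>]{<value>}; an empty vocabulary means a plain literal.
\newcommand\subject[2][]{%
  \dc@add{http://purl.org/dc/terms/subject}{#1}{#2}}
% For demonstration only: report each stored statement in the log.
\newcommand\dc@statement[3]{%
  \typeout{property=#1, vocabulary=#2, value=#3}}
\AtEndDocument{\dc@statements}
\makeatother

\subject{Calculus}
\subject[LCSH]{Mathematics -- Calculus -- Differential}

\begin{document}
Graphing functions.
\end{document}
```

Keeping the statements in a single accumulator macro means the same data could later be replayed with a different definition of \dc@statement for each output format.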

Upon LaTeX-ing the document, one or more RDF files are produced which can be published on the web. An XML representation of the statements above could be

<rdf:RDF
     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
     xmlns:dcterms="http://purl.org/dc/terms/"
     xmlns:dcam="http://purl.org/dc/dcam/">
   <rdf:Description rdf:about="http://uri/for/document">
     <dcterms:creator>Ms. Understood</dcterms:creator>
     <dcterms:title xml:lang="en">Graphing functions</dcterms:title>
     <dcterms:subject xml:lang="en">Calculus</dcterms:subject>
     <dcterms:subject rdf:parseType="Resource">
       <dcam:memberOf rdf:resource="http://loc.gov/LCSH" />
       <rdf:value>Mathematics -- Calculus -- Differential</rdf:value>
     </dcterms:subject>
   </rdf:Description>
</rdf:RDF>
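
The export step might look roughly like the following sketch, which writes a small RDF/XML file named after \jobname. It assumes the article class, whose \title and \author store plain text in \@title and \@author; the names \dc@rdf, \dc@line and \dc@hash are hypothetical. Values are written without XML escaping, so characters such as & or < in a title would need extra handling in a real implementation.

```latex
\documentclass{article}
\makeatletter
% A "#" with category code 12, so the RDF namespace URI can be written out.
\begingroup
  \catcode`\#=12
  \gdef\dc@hash{#}
\endgroup
\newwrite\dc@rdf
% Write one line to the RDF file; the argument is expanded by \write.
\newcommand*\dc@line[1]{\immediate\write\dc@rdf{#1}}
% Export at \begin{document}, before \maketitle clears \@title and \@author.
\AtBeginDocument{%
  \immediate\openout\dc@rdf=\jobname.rdf\relax
  \dc@line{<rdf:RDF}%
  \dc@line{\space\space xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns\dc@hash"}%
  \dc@line{\space\space xmlns:dcterms="http://purl.org/dc/terms/">}%
  \dc@line{\space<rdf:Description rdf:about="http://uri/for/document">}%
  \dc@line{\space\space<dcterms:creator>\@author</dcterms:creator>}%
  \dc@line{\space\space<dcterms:title>\@title</dcterms:title>}%
  \dc@line{\space</rdf:Description>}%
  \dc@line{</rdf:RDF>}%
  \immediate\closeout\dc@rdf
}
\makeatother

\title{Graphing functions}
\author{Ms. Understood}

\begin{document}
\maketitle
\end{document}
```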

If pdflatex is used, XMP packets are written and embedded in the PDF. An XMP packet looks something like

<?xpacket begin="&#xFEFF;" id="W5M0MpCehiHzreSzNTczkc9d" ?>
   <x:xmpmeta xmlns:x="adobe:ns:meta/">
     <rdf:RDF ...>
     <!-- RDF document from above -->
     </rdf:RDF>
   </x:xmpmeta>
<?xpacket end="w"?>

This would make the resources described in the LaTeX document easier for machines to find and index.
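
For the embedding step, one possible approach under pdflatex is to store the packet as an uncompressed metadata stream and reference it from the PDF catalog, using the pdfTeX primitives \pdfobj and \pdfcatalog. The sketch below assumes pdfTeX in PDF mode; the file name example.xmp and the packet contents are illustrative only, and a real package would generate the packet from the author's declarations.

```latex
% Hypothetical packet file, written out when the document is compiled.
\begin{filecontents*}{example.xmp}
<?xpacket begin="&#xFEFF;" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
    <rdf:Description rdf:about="">
      <dc:title>
        <rdf:Alt>
          <rdf:li xml:lang="x-default">Graphing functions</rdf:li>
        </rdf:Alt>
      </dc:title>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>
\end{filecontents*}
\documentclass{article}
\begingroup
  % Keep the stream uncompressed so tools scanning the PDF can find the packet.
  \pdfcompresslevel=0
  % Create the metadata stream object from the packet file.
  \immediate\pdfobj stream attr {/Type /Metadata /Subtype /XML} file {example.xmp}
  % Point the document catalog at the object just created.
  \pdfcatalog{/Metadata \the\pdflastobj\space 0 R}
\endgroup

\begin{document}
Graphing functions.
\end{document}
```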

Use case #2: A Dublin Core implementation of a new metadata application profile is produced, for instance LOM. A LaTeX package author creates a package which provides simple commands connecting the user interface to the abstract model. An example of a new metadata element might be the LOM element “Educational.Difficulty”, which could be implemented as the simple control sequence \difficulty.
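
A package author's side of this might be sketched as follows. Everything here is hypothetical: the declaration command \DeclareMetadataElement, the internal macros \dc@statements, \dc@add and \dc@statement, and the property URI under example.org, which stands in for whatever URI a LOM application profile would actually assign. The sketch does not check whether the new command name is already taken.

```latex
\documentclass{article}
\makeatletter
% Hypothetical core: store each statement as property URI plus value.
\newcommand*\dc@statements{}
\newcommand\dc@add[2]{%
  \g@addto@macro\dc@statements{\dc@statement{#1}{#2}}}
% Hypothetical declaration for package authors:
% \DeclareMetadataElement{<command name>}{<property URI>}
% defines \<command name>{<value>} as a one-argument user command.
\newcommand*\DeclareMetadataElement[2]{%
  \@namedef{#1}##1{\dc@add{#2}{##1}}}
% What a LOM support package might contain (placeholder URI).
\DeclareMetadataElement{difficulty}{http://example.org/lom/educational/difficulty}
% For demonstration only: report each stored statement in the log.
\newcommand\dc@statement[2]{\typeout{property=#1, value=#2}}
\AtEndDocument{\dc@statements}
\makeatother

\difficulty{difficult}

\begin{document}
Graphing functions.
\end{document}
```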

Use case #3: A LaTeX author is writing a document, and wants to make it easily citable by others [this may eventually be a requirement of universities, in order to increase their citation profile]. He uses the conventional minimum:

\author{AN Other}
\title{All we know about Gnus}

but also uses \usepackage{autocite} (or whatever it gets called), which will generate XMP metadata for the author, title, implicit date (\today in ISO 8601 format), the document name (\jobname), the document extent (bytes, if not words), document type (book/article/etc.), and as much else as can be detected or deduced, including the metadata from all citations made, and from any hyperlinks to external resources. Shims for popular document classes like memoir, kluwer, elsevier, etc., should be developed so that metadata from their additional fields (e.g., \submissiondate, \journalname) can be added.
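
A few of these values are already available to TeX without any author input. The sketch below, which assumes the article class and plain-text \title and \author arguments, collects them at \begin{document} and prints them to the log; the macro name \dc@isodate and the field labels are hypothetical, and extent, document type and citation metadata are left out.

```latex
\documentclass{article}
\makeatletter
% ISO 8601 calendar date (YYYY-MM-DD) from TeX's \year, \month and \day.
\newcommand*\dc@isodate{%
  \the\year-\two@digits{\month}-\two@digits{\day}}
% Collect at \begin{document}, before \maketitle clears \@title and \@author.
\AtBeginDocument{%
  \typeout{creator = \@author}%
  \typeout{title = \@title}%
  \typeout{date = \dc@isodate}%
  \typeout{name = \jobname}%
}
\makeatother

\author{AN Other}
\title{All we know about Gnus}

\begin{document}
\maketitle
\end{document}
```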


$Date: 2009/02/17 17:30:21 $;