[tug-summer-of-code] GSOC project proposal

Matthew Leingang leingang at math.harvard.edu
Tue Mar 18 15:59:42 CET 2008


Begin forwarded message:

> If you are a TeX developer and have an idea for your package or
> program that could be usefully implemented by a student working
> full-time over the summer, and are (hopefully) willing to mentor the
> project, please email summer-of-code at tug.org with the details.
> See http://tug.org/gsoc/ for the ideas we've posted so far.

Hello,

I don't know if I could be described as a TeX developer but I have  
had 15 years of TeX and LaTeX experience and have programmed a  
little.  I have a project idea, which I could mentor with help.  I  
hope this isn't too long, but I've been thinking about it for a  
number of years.

Project: Dublin Core metadata interface

"The Dublin Core Metadata Initiative is an open organization engaged  
in the development of interoperable online metadata standards that  
support a broad range of purposes and business models." [1]  They  
have developed an abstract framework for metadata [2] and several
machine-readable representations of metadata statements, among them
one in the Resource Description Framework (RDF) [3] [4].  RDF itself
has several XML representations [5].  Dublin Core has a set of
elements to describe resources (for example, author, date created,
format), but a key feature is modularity: applications can devise
their own elements and vocabularies to plug right in.  It's also
crucial that as much as possible (element names and value
vocabularies) be named with URIs to avoid ambiguities.

One large user of RDF metadata is Adobe, creator of the PDF file  
format.  Adobe's eXtensible Metadata Platform (XMP) [6] allows PDF  
creators to embed arbitrary metadata into a PDF file.  This metadata  
is visible to Adobe applications and a growing number of other search  
and archive tools, including Mac OS X's Spotlight.  XMP is  
implemented in an XML representation of RDF.

It's currently possible, perhaps using Scott Pakin's hyperxmp package  
[7], for a LaTeX author familiar with RDF to write XMP packets and  
embed them in the PDF file produced by pdflatex.  But the quantity
and quality of metadata increase with the ease of the interface, and
this workflow is not very easy.  Contrast this with the simple LaTeX
commands \title, \author, \date, which are very easy to remember and  
use.  There's also a lot of metadata that could be automatically  
discovered, such as document structure and references to other  
resources.

The key deliverables for this project would be

* an implementation of the Dublin Core Abstract Model in TeX
* methods to export metadata from the abstract model to external
files in various formats, most importantly RDF+XML, maybe also
DC-TEXT [8] and N3 [9]
* in the case of pdflatex, automatic embedding of XMP packets into
the output PDF
* a user-friendly interface for making metadata statements
* methods for package authors to declare new metadata element sets
and vocabularies, so that authors can write metadata specific to
their field of interest.  Personally, I am thinking of Learning
Object Metadata [10]; however, the mapping of LOM to Dublin Core is
problematic [11].
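
As a rough illustration of the export deliverable, TeX's standard
file-writing primitives are already enough to get a statement out to
an external file.  This is only a sketch under assumptions: the
stream name \dcfile and the .rdf extension are placeholders of my
own, and the statement is hard-coded rather than taken from an
abstract model.

\documentclass{article}
% \dcfile is a hypothetical name for the output stream.
\newwrite\dcfile
% Open \jobname.rdf and write one statement to it immediately.
\immediate\openout\dcfile=\jobname.rdf
\immediate\write\dcfile{<dcterms:title>Graphing functions</dcterms:title>}
\immediate\closeout\dcfile
\begin{document}
Hello.
\end{document}

A real implementation would loop over the stored statements and
would also have to emit the enclosing rdf:RDF element and its
namespace declarations.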

Use case #1: A LaTeX author wants to write a document about  
calculus.  She inserts metadata commands into the document preamble  
such as

\title{Graphing functions}
\author{Ms. Understood}
\subject{Calculus}
\subject[LCSH]{Mathematics -- Calculus -- Differential}

The first \subject statement would be readable by more agents (for  
instance, people) but less structured.  The second conforms to the  
Library of Congress Subject Headings so a metadata reader aware of  
that vocabulary could index the resource as such.
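
To make the interface concrete, here is a minimal sketch of how
\subject might record statements for later export.  The names
\dcstatements, \dcstatement and \dcentry are hypothetical, and the
sketch assumes a document class that does not already define
\subject (KOMA-Script classes, for example, do).

% In the preamble.  Statements accumulate in a token list.
\newtoks\dcstatements
% #1 = element name, #2 = vocabulary (may be empty), #3 = value
\newcommand{\dcstatement}[3]{%
  \global\dcstatements=\expandafter{\the\dcstatements
    \dcentry{#1}{#2}{#3}}}
% The optional argument names the vocabulary; it defaults to empty.
\newcommand{\subject}[2][]{\dcstatement{subject}{#1}{#2}}

\subject{Calculus}
\subject[LCSH]{Mathematics -- Calculus -- Differential}

After these two calls the token list holds two \dcentry items.  An
exporter could then define \dcentry to write one RDF statement and
execute \the\dcstatements.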

Upon LaTeX-ing the document, one or more RDF files are produced which  
can be published on the web.  An XML representation of the statements  
above could be

<rdf:RDF
     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
     xmlns:dcterms="http://purl.org/dc/terms/"
     xmlns:dcam="http://purl.org/dc/dcam/">
   <rdf:Description rdf:about="http://uri/for/document">
     <dcterms:creator>Ms. Understood</dcterms:creator>
     <dcterms:title xml:lang="en">Graphing functions</dcterms:title>
     <dcterms:subject xml:lang="en">Calculus</dcterms:subject>
     <dcterms:subject>
       <dcam:memberOf rdf:resource="http://loc.gov/LCSH" />
       <rdf:value>Mathematics -- Calculus -- Differential</rdf:value>
     </dcterms:subject>
   </rdf:Description>
</rdf:RDF>


If pdflatex is used, XMP packets are written and embedded in the
PDF.  An XMP packet looks something like [12]

<?xpacket begin="&#xFEFF;" id="W5M0MpCehiHzreSzNTczkc9d" ?>
   <x:xmpmeta xmlns:x="adobe:ns:meta/">
     <rdf:RDF ...>
     <!-- RDF document from above -->
     </rdf:RDF>
   </x:xmpmeta>
<?xpacket end="w"?>

This would allow machines to find and index the resources described
in the LaTeX document more effectively.
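
For the embedding step itself, pdfTeX's \pdfobj and \pdfcatalog
primitives can attach a metadata stream to the document catalog.
The sketch below assumes pdflatex in PDF mode and assumes the packet
has already been written to a file named \jobname.xmp (a file name
of my own choosing).

% In the preamble.  Assumes \jobname.xmp already exists.
% Create a stream object from the file and mark it as XML metadata.
\immediate\pdfobj stream attr {/Type /Metadata /Subtype /XML}
  file {\jobname.xmp}
% Point the document catalog at the object just created.
\pdfcatalog{/Metadata \the\pdflastobj\space 0 R}

As far as I understand, the metadata stream should be stored
uncompressed so that tools can find the packet by scanning the file;
this sketch does not take care of that.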

Use case #2: A Dublin Core implementation of a new metadata  
application profile is produced, for instance LOM.  A LaTeX package  
author creates a package which provides simple commands connecting  
the user interface to the abstract model.  An example of a new  
metadata element might be the LOM element  
"Educational.Difficulty" [13], which could be implemented as the  
simple control sequence \difficulty.
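
A package author's side of this might look like the following
sketch.  \DeclareMetadataElement is a hypothetical name, the URI is
a placeholder rather than a real LOM identifier, and for
illustration the generated command only prints the statement to the
log instead of storing it in the abstract model.

% In the preamble.
% #1 = command to define, #2 = URI naming the element
\newcommand{\DeclareMetadataElement}[2]{%
  \newcommand{#1}[1]{\typeout{Metadata: <#2> = ##1}}}

\DeclareMetadataElement{\difficulty}{http://example.org/lom/difficulty}
\difficulty{medium}

The doubled ##1 is what lets the inner \newcommand receive its own
argument; a real package would replace \typeout with a call into the
abstract model.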

---

I think I had better quit.  As you can see I've thought a lot about  
this.  What's been keeping me from trying to code this project  
myself (besides time) is the data structure.  I'm not yet at the
stage where I quite understand all the \expandafter's and  
\futurelet's etc. needed to implement such things.  As for time, I'm  
switching jobs on July 1 so some of my summer will be spent on that  
transition.  So another mentor or two would help a lot.

Thanks for the opportunity to collect my thoughts.  I hope the  
project seems feasible and worthwhile.

--Matthew Leingang

[1] http://dublincore.org/
[2] http://dublincore.org/documents/abstract-model/
[3] http://www.w3.org/TR/rdf/
[4] http://dublincore.org/documents/dc-rdf/
[5] http://www.w3.org/TR/rdf-syntax-grammar/
[6] http://www.adobe.com/products/xmp/
[7] http://www.ctan.org/get/macros/latex/contrib/hyperxmp/hyperxmp.pdf
[8] http://dublincore.org/documents/dc-text/index.shtml
[9] http://www.w3.org/DesignIssues/Notation3.html
[10] http://en.wikipedia.org/wiki/Learning_object_metadata
[11] http://dublincore.org/educationwiki/DCMIIEEELTSCTaskforce
[12] http://www.adobe.com/devnet/xmp/pdfs/xmp_specification.pdf
[13] http://ltsc.ieee.org/wg12/files/LOM_1484_12_1_v1_Final_Draft.pdf

--
Matthew Leingang
Preceptor in Mathematics
Harvard University

http://www.math.harvard.edu/~leingang/vCard.vcf
