[tug-summer-of-code] GSOC project proposal
Matthew Leingang
leingang at math.harvard.edu
Tue Mar 18 15:59:42 CET 2008
Begin forwarded message:
> If you are a TeX developer and have an idea for your package or program
> that could be usefully implemented by a student working full-time over
> the summer, and are (hopefully) willing to mentor the project, please
> email summer-of-code at tug.org with the details. See
> http://tug.org/gsoc/ for the ideas we've posted so far.
Hello,
I don't know whether I could be described as a TeX developer, but I
have 15 years of TeX and LaTeX experience and have programmed a
little. I have a project idea, which I could mentor with help. I
hope this isn't too long; I've been thinking about it for a number
of years.
Project: Dublin Core metadata interface
"The Dublin Core Metadata Initiative is an open organization engaged
in the development of interoperable online metadata standards that
support a broad range of purposes and business models." [1] They
have developed an abstract framework for metadata [2] and several
machine-readable representations of metadata statements, among them
one in the Resource Description Framework (RDF) [3] [4]. RDF itself has
several XML representations [5]. Dublin Core has a set of elements
to describe resources (for example, author, date created, format),
but a key feature is modularity: applications can devise their own
elements and vocabularies and plug them right in. It's also crucial
that as much as possible (element names and value vocabularies) be
named with URIs to avoid ambiguity.
One large user of RDF metadata is Adobe, creator of the PDF file
format. Adobe's eXtensible Metadata Platform (XMP) [6] allows PDF
creators to embed arbitrary metadata into a PDF file. This metadata
is visible to Adobe applications and a growing number of other search
and archive tools, including Mac OS X's Spotlight. XMP is
implemented in an XML representation of RDF.
It's currently possible, perhaps using Scott Pakin's hyperxmp package
[7], for a LaTeX author familiar with RDF to write XMP packets and
embed them in the PDF file produced by pdflatex. But the quantity
and quality of metadata increase with the ease of the interface for
entering it, and this workflow is not very easy. Contrast this with
the simple LaTeX commands \title, \author, and \date, which are very
easy to remember and use. There's also a lot of metadata that could
be discovered automatically, such as document structure and
references to other resources.
The key deliverables for this project would be
* an implementation of the Dublin Core Abstract Model in TeX
* methods to export metadata from the abstract model to external
files in various formats, most importantly RDF/XML, maybe also
DC-TEXT [8] and N3 [9]
* in the case of pdflatex, automatic embedding of XMP packets into
the resulting PDF
* a user-friendly interface for making metadata statements
* methods for package authors to declare new metadata element sets
and vocabularies, so that authors can write metadata specific to
their field of interest. Personally, I am thinking of Learning
Object Metadata [10]; however, the mapping of LOM to Dublin Core is
problematic [11].
Use case #1: A LaTeX author wants to write a document about
calculus. She inserts metadata commands into the document preamble
such as
    \title{Graphing functions}
    \author{Ms. Understood}
    \subject{Calculus}
    \subject[LCSH]{Mathematics -- Calculus -- Differential}
The first \subject statement would be readable by more agents (for
instance, people) but less structured. The second conforms to the
Library of Congress Subject Headings so a metadata reader aware of
that vocabulary could index the resource as such.
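As a minimal sketch of how a \subject command with an optional vocabulary might record its statements, assuming nothing about the eventual design: the internal names \dc@statements, \dc@add and \dc@statement and the user command \showmetadata are all hypothetical, and the flat list of records is only a stand-in for the real abstract model.

```latex
\documentclass{article}
\makeatletter
% Hypothetical store: a macro holding a sequence of
% \dc@statement{<element>}{<vocabulary>}{<value>} records.
\newcommand*\dc@statements{}
% Append one record; \g@addto@macro does not expand the author's text.
\newcommand\dc@add[3]{%
  \g@addto@macro\dc@statements{\dc@statement{#1}{#2}{#3}}}
% \subject[<vocabulary>]{<value>}; the vocabulary is empty when omitted.
\newcommand\subject[2][]{\dc@add{subject}{#1}{#2}}
% Demonstration only: typeset each stored record on its own line.
\newcommand\showmetadata{%
  \begingroup
  \def\dc@statement##1##2##3{\par\noindent\texttt{##1} [##2]: ##3}%
  \dc@statements
  \endgroup}
\makeatother
\subject{Calculus}
\subject[LCSH]{Mathematics -- Calculus -- Differential}
\begin{document}
\showmetadata
\end{document}
```

Storing unexpanded records and giving \dc@statement a meaning only at output time means the same list could later be replayed by different exporters.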
Upon LaTeX-ing the document, one or more RDF files are produced which
can be published on the web. An XML representation of the statements
above could be
    <rdf:RDF
        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
        xmlns:dcterms="http://purl.org/dc/terms/"
        xmlns:dcam="http://purl.org/dc/dcam/">
      <rdf:Description rdf:about="http://uri/for/document">
        <dcterms:creator>Ms. Understood</dcterms:creator>
        <dcterms:title xml:lang="en">Graphing functions</dcterms:title>
        <dcterms:subject xml:lang="en">Calculus</dcterms:subject>
        <dcterms:subject rdf:parseType="Resource">
          <dcam:memberOf rdf:resource="http://loc.gov/LCSH" />
          <rdf:value>Mathematics -- Calculus -- Differential</rdf:value>
        </dcterms:subject>
      </rdf:Description>
    </rdf:RDF>
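Writing such a file from within a LaTeX run could be sketched as follows. This is an illustration under stated assumptions, not a proposed implementation: the command \ExportDublinCore, the helper \dc@hash, and the output file name \jobname.rdf are hypothetical, only the title and creator are exported, and the values are written as-is with no XML escaping.

```latex
\documentclass{article}
\makeatletter
% A literal "#" with catcode 12 for the RDF namespace URI; an ordinary
% "#" token would come out doubled when passed through \write.
\begingroup
  \catcode`\#=12
  \gdef\dc@hash{#}
\endgroup
\newwrite\dc@out
% Hypothetical exporter: #1 = resource URI, #2 = title, #3 = creator.
\newcommand\ExportDublinCore[3]{%
  \immediate\openout\dc@out=\jobname.rdf\relax
  \immediate\write\dc@out{<rdf:RDF}%
  \immediate\write\dc@out{\space
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns\dc@hash"}%
  \immediate\write\dc@out{\space
    xmlns:dcterms="http://purl.org/dc/terms/">}%
  \immediate\write\dc@out{<rdf:Description rdf:about="#1">}%
  \immediate\write\dc@out{\space<dcterms:title>#2</dcterms:title>}%
  \immediate\write\dc@out{\space<dcterms:creator>#3</dcterms:creator>}%
  \immediate\write\dc@out{</rdf:Description>}%
  \immediate\write\dc@out{</rdf:RDF>}%
  \immediate\closeout\dc@out}
\makeatother
\ExportDublinCore{http://uri/for/document}{Graphing functions}{Ms. Understood}
\begin{document}
Graphing functions
\end{document}
```

Because \write expands its argument, values containing fragile commands or the characters <, > and & would need protection and escaping that this sketch leaves out.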
If pdflatex is used, XMP packets are written and embedded in the
PDF. An XMP packet looks something like [12]
    <?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d" ?>
    <x:xmpmeta xmlns:x="adobe:ns:meta/">
      <rdf:RDF ...>
        <!-- RDF document from above -->
      </rdf:RDF>
    </x:xmpmeta>
    <?xpacket end="w"?>
This would let machines find and index the resources present in the
LaTeX document more effectively.
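For the embedding step, a rough sketch with pdfTeX's own primitives might look like the following. It assumes pdflatex in PDF output mode; the command \EmbedXMP and the helper \dc@hash are hypothetical, the packet contains a single title statement for brevity, and whether XMP-aware tools would pick up a dcterms property written this way has not been checked.

```latex
\documentclass{article}
\makeatletter
% A literal "#" with catcode 12, so the namespace URI is not written
% with a doubled "#".
\begingroup
  \catcode`\#=12
  \gdef\dc@hash{#}
\endgroup
% Hypothetical helper: store #1 as an uncompressed metadata stream and
% point the PDF catalog at it.
\newcommand\EmbedXMP[1]{%
  \begingroup
    \pdfcompresslevel=0
    \immediate\pdfobj stream attr {/Type /Metadata /Subtype /XML} {#1}%
    \pdfcatalog{/Metadata \the\pdflastobj\space 0 R}%
  \endgroup}
% Line breaks inside the argument become spaces in the stream.
\EmbedXMP{%
  <?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d" ?>
  <x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns\dc@hash"
    xmlns:dcterms="http://purl.org/dc/terms/">
  <rdf:Description rdf:about="http://uri/for/document">
  <dcterms:title xml:lang="en">Graphing functions</dcterms:title>
  </rdf:Description>
  </rdf:RDF>
  </x:xmpmeta>
  <?xpacket end="w"?>}
\makeatother
\begin{document}
Graphing functions
\end{document}
```

The compression level is set to zero locally so that the packet stays readable to tools that scan the PDF for plain-text metadata.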
Use case #2: A Dublin Core implementation of a new metadata
application profile is produced, for instance LOM. A LaTeX package
author creates a package which provides simple commands connecting
the user interface to the abstract model. An example of a new
metadata element might be the LOM element
"Educational.Difficulty" [13], which could be implemented as the
simple control sequence \difficulty.
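A package author's declaration could be sketched like this. The declaration command \DeclareMetadataElement, the store \dc@statements, the logging command \listmetadata, and the example.org property URI are all hypothetical placeholders; no real LOM binding is implied, and the value "medium" is only illustrative.

```latex
\documentclass{article}
\makeatletter
% Hypothetical store of \dc@statement{<property URI>}{<value>} records.
\newcommand*\dc@statements{}
% \DeclareMetadataElement{<command name>}{<property URI>} defines
% \<command name>{<value>}, which appends one record to the store.
\newcommand\DeclareMetadataElement[2]{%
  \expandafter\newcommand\csname #1\endcsname[1]{%
    \g@addto@macro\dc@statements{\dc@statement{#2}{##1}}}}
% Demonstration only: write the stored records to the log, unexpanded.
\newcommand\listmetadata{%
  \typeout{\detokenize\expandafter{\dc@statements}}}
\makeatother
% What a LOM support package might contain (placeholder URI):
\DeclareMetadataElement{difficulty}%
  {http://example.org/lom/educational/difficulty}
% What a document author would then write:
\difficulty{medium}
\begin{document}
\listmetadata
Graphing functions
\end{document}
```

Tying each control sequence to a URI at declaration time keeps the author-facing command short while the exported statement stays unambiguous.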
---
I think I had better quit. As you can see I've thought a lot about
this. What's been keeping me from trying to code this project
myself (besides time) is the data structure. I'm not yet at the
stage where I quite understand all the \expandafter's and
\futurelet's etc. needed to implement such things. As for time, I'm
switching jobs on July 1 so some of my summer will be spent on that
transition. So another mentor or two would help a lot.
Thanks for the opportunity to collect my thoughts. I hope the
project seems feasible and worthwhile.
--Matthew Leingang
[1] http://dublincore.org/
[2] http://dublincore.org/documents/abstract-model/
[3] http://www.w3.org/TR/rdf/
[4] http://dublincore.org/documents/dc-rdf/
[5] http://www.w3.org/TR/rdf-syntax-grammar/
[6] http://www.adobe.com/products/xmp/
[7] http://www.ctan.org/get/macros/latex/contrib/hyperxmp/hyperxmp.pdf
[8] http://dublincore.org/documents/dc-text/index.shtml
[9] http://www.w3.org/DesignIssues/Notation3.html
[10] http://en.wikipedia.org/wiki/Learning_object_metadata
[11] http://dublincore.org/educationwiki/DCMIIEEELTSCTaskforce
[12] http://www.adobe.com/devnet/xmp/pdfs/xmp_specification.pdf
[13] http://ltsc.ieee.org/wg12/files/LOM_1484_12_1_v1_Final_Draft.pdf
--
Matthew Leingang
Preceptor in Mathematics
Harvard University
http://www.math.harvard.edu/~leingang/vCard.vcf