Google Summer of Code and TUG

TUG will not participate in Google's Summer of Code program as a “mentoring organization” this year; our application has been rejected. Anyone who is interested can still consult the list of ideas we put together and take inspiration from it. There is still so much development to do in the TeX world!

In 2008, three students participated in SoC with TUG. You can see their projects and the code they produced, as well as TUG's announcement for 2008.

All organizational SoC-related discussions for TUG happen on the summer-of-code@tug.org mailing list; feel free to subscribe or peruse the archives.

Project ideas

- Accessible PDF from TeX
- Dublin Core metadata and TeX
- Handwritten LaTeX symbol recognition
- Hyperlinked syntax highlighting for TeX
- New document templates for LaTeX
- LuaTeX port of fontspec
- LaTeX3 microkernel


1. Accessible PDF from TeX

This project aims to enrich pdfTeX with the ability to produce "tagged PDF" in accordance with the PDF/A, PDF/UA, ISO 32000 and (as yet unpublished) ISO-32000-2 specifications. This includes both Structure and Content tagging, for use with screen readers (for the visually impaired) and with plug-in software that can display and enhance the structure and content of mathematical formulas.

Successful completion of this project will include the following tasks:

a) Identify all places within the standard LaTeX codebase where tagging for structure and/or content is appropriate.

b) Create file(s) containing alternate macro definitions which include software "hooks" that will enable tags to be placed at appropriate places within the output stream; i.e., into TeX's "vertical list".

c) Identify those places where modified macros are insufficient to give the best way to include structure or content tags. This will indicate a need to add extra capabilities to the underlying processing engine.

d) Produce a file of (La)TeX macro definitions which provide all the possible structure, content and mathematical tags that are supported in the PDF 1.7 (or later) specifications, and that will be supported in ISO-32000-2.

e) Produce "support" files that will allow the "hooks" in (b) to be bound to the macros defined in (d). Several such support files may be needed, providing different levels of support to meet the requirements of different levels of tagging; e.g. for PDF/X, PDF/A (various versions), PDF/UA and full tagging of mathematical formulas.

f) Produce "driver" files that control the way the bound hooks in (e) place the tags into the output produced by specific processing engines. Initially only a single engine (pdfTeX) needs to be supported, but the coding should be sufficiently modular that alternative engines (e.g., XeTeX and LuaTeX) can also be supported with changes only to the "driver" file. All the other coding modules should be usable completely unmodified.

g) Produce appropriate documentation to describe what the various coding modules are supposed to do, and how they are used.
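
As a rough illustration of how the layers in tasks (b), (e) and (f) might fit together, here is a minimal sketch for pdfLaTeX. All the \Acc... macro names are hypothetical and belong to no existing package; \pdfliteral is the pdfTeX primitive, and BMC/EMC are the PDF marked-content operators. Marked content alone does not make a tagged PDF (a structure tree is also needed), so this only shows the division of labour, not a working tagging solution.

```latex
% Compile with pdflatex. All \Acc... names are hypothetical.
\documentclass{article}

% "Driver" layer (task f): the only engine-specific code.
% \pdfliteral writes raw operators into the PDF content stream.
\newcommand*{\AccDriverBegin}[1]{\pdfliteral direct{/#1 BMC}}
\newcommand*{\AccDriverEnd}{\pdfliteral direct{EMC}}

% "Hook" layer (task b): empty by default, so output is unchanged.
\newcommand*{\AccHookBegin}[1]{}
\newcommand*{\AccHookEnd}{}

% "Support" layer (task e): bind the hooks to the driver.
\newcommand*{\AccEnableTagging}{%
  \renewcommand*{\AccHookBegin}[1]{\AccDriverBegin{##1}}%
  \renewcommand*{\AccHookEnd}{\AccDriverEnd}}

% An alternate macro definition that calls the hooks.
\newcommand{\TaggedEmph}[1]{\AccHookBegin{Span}\emph{#1}\AccHookEnd}

\AccEnableTagging
\begin{document}
Some \TaggedEmph{emphasised} text.
\end{document}
```

Under this arrangement, supporting another engine would in principle mean replacing only the two \AccDriver... definitions.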

Completing these tasks will be a combined effort involving all the project participants (see below), not just the student. Certainly the student will need to be familiar with TeX macro programming techniques, and the way these are used with LaTeX internal macros.

Experience with other aspects of programming, such as PDF/PostScript, Tangle & Weave, or coding for Adobe plug-ins, would be regarded as a bonus that may prove useful, but is not essential.

The tex list at River Valley Technologies hosts discussions on this issue.

Personnel/Mentors:


2. Dublin Core metadata and TeX

The full text of the proposal is available; a summary follows.

Project summary:

The Dublin Core Metadata Initiative is an open organization engaged in the development of interoperable online metadata standards that support a broad range of purposes and business models. They have developed an abstract framework for metadata and several machine-readable representations of metadata statements, among them in the Resource Description Framework (RDF).

One large user of RDF metadata is Adobe, creator of the PDF file format. Adobe's eXtensible Metadata Platform (XMP) allows PDF creators to embed arbitrary metadata into a PDF file. This metadata is visible to Adobe applications and a growing number of other search and archive tools, including Mac OS X's Spotlight. XMP is implemented in an XML representation of RDF.

The key deliverables for this project would be

  1. an implementation of the Dublin Core Abstract Model in TeX
  2. methods to export metadata from the abstract model to external files in various formats, most importantly RDF+XML, maybe also DC-TEXT and N3
  3. in the case of pdflatex, automatic embedding of XMP packets into the product PDF with a default minimum of the XMP expression of the Z39.88 OpenURL COinS fields, both for the document's own metadata and for all references cited and external hyperlinks.
  4. a user-friendly interface for making metadata statements
  5. in the absence of specific author declarations in a pdflatex document, as much metadata should be embedded as can be detected automatically
  6. methods for package authors to declare new metadata element sets and vocabularies, in order for authors to write metadata specific to their field of interest. Personally, I am thinking of Learning Object Metadata; however the mapping of LOM to Dublin Core is problematic.
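
To make deliverables 2 and 4 more concrete, the following is a minimal sketch of what a user-level statement macro and an RDF+XML export might look like. The macro names (\DCstatement and the internal \DC@queue, \DCout, \DChash) are hypothetical, only the simple Dublin Core element set is assumed, and no XML escaping or handling of fragile commands is attempted.

```latex
% Writes \jobname.rdf at the end of the run. All \DC... names are hypothetical.
\documentclass{article}
\makeatletter
\edef\DChash{\string#}% a literal "#" that is safe to use inside \write
\newwrite\DCout
\newcommand*{\DC@queue}{}
% \DCstatement{<element>}{<value>}: queue one Dublin Core statement.
\newcommand{\DCstatement}[2]{%
  \g@addto@macro\DC@queue{%
    \immediate\write\DCout{\space\space<dc:#1>#2</dc:#1>}}}
% Export all queued statements as RDF+XML.
\AtEndDocument{%
  \immediate\openout\DCout=\jobname.rdf
  \immediate\write\DCout{<rdf:RDF}%
  \immediate\write\DCout{ xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns\DChash"}%
  \immediate\write\DCout{ xmlns:dc="http://purl.org/dc/elements/1.1/">}%
  \immediate\write\DCout{<rdf:Description rdf:about="">}%
  \DC@queue
  \immediate\write\DCout{</rdf:Description>}%
  \immediate\write\DCout{</rdf:RDF>}%
  \immediate\closeout\DCout}
\makeatother

\DCstatement{title}{A Sample Document}
\DCstatement{creator}{A. N. Author}

\begin{document}
Body text.
\end{document}
```

A real implementation would also need to embed the result as an XMP packet in the PDF, which this sketch does not attempt.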

Project mentors would be Peter Flynn and Matthew Leingang.


3. Recognition of hand-written LaTeX symbols

This project has no assigned mentor yet. Anyone interested in acting as a mentor is heartily invited to contact the mailing list.

The LaTeX typesetting system provides commands for typesetting thousands of different symbols needed to prepare documents in the fields of linguistics, mathematics, music, engineering, physics, and many others. A challenge for someone writing a document is to find the LaTeX name for a given glyph. Currently, the best solution is to refer to the Comprehensive LaTeX Symbol List, a collection of symbol tables organized into ad hoc categories and indexed by LaTeX symbol name. The problem with this approach is that different users associate different names with the same glyph, making searching difficult. Consider, for example, trying to find the LaTeX name for a circle with a dot in the middle. An astronomer may search for “sun”; a mathematician may search for “circumference”; a linguist may search for “click consonant”; a mapmaker may search for “city center”; someone writing about alchemy may search for “gold”. (In fact, an entire Wikipedia page is devoted to listing the various meanings for this symbol.) Non-English speakers are at a further disadvantage because most LaTeX symbols are named by English speakers.
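
For reference, here is a small example of how the circle-with-dot glyph can be typeset. It assumes the wasysym package is installed; \odot and \bigodot are standard math-mode commands, while \astrosun comes from wasysym. None of these names matches most of the search terms listed above, which is exactly the difficulty.

```latex
\documentclass{article}
\usepackage{wasysym}% provides \astrosun
\begin{document}
% Three commands that all produce a circle with a dot in the middle.
$\odot$ \quad $\bigodot$ \quad \astrosun
\end{document}
```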

We believe that a great aid to LaTeX users would be a Web-based symbol-search tool based on text recognition. That is, we imagine a Web page at which a user could draw a symbol and then be shown a list of the LaTeX symbols (commands and rendered output) that best match the user's drawing. A student would need to evaluate the numerous options for recognizing hand-drawn symbols, find a suitable internal representation for the thousands of LaTeX symbols and associated metadata, construct a suitable user interface to interact with the symbol recognizer, and ensure that the resulting software is maintainable, especially given the frequency with which new symbols are added to LaTeX.

This is no doubt a challenging idea to implement. However, it is bound to be an exciting, rewarding experience because of the abundance of technologies involved: TeX/LaTeX, text recognition, various Web technologies, and probably multiple programming languages. There is much to learn, and TUG is eager to mentor a student with the interest and abilities to pull off a handwriting-to-LaTeX-symbol project.

(Proposal from Scott Pakin, but needs a mentor.)


4. Hyperlinked syntax highlighting for TeX code

It is now common, when listing code on a web page, to provide syntax highlighting. Indeed, this is done on Google Code.

This project is to provide syntax highlighting for TeX code – both documents and macros – with an extra feature. Each highlighted command is also a hyperlink which offers tooltip help, and which when clicked brings up further documentation.

Two of the leading syntax highlighters are

This project has three parts. The first is providing enhanced syntax highlighting for TeX code. The second is creating a commands database. The third is linking together the first and second parts.

Depending on difficulties encountered this project might be too large. If so, we'd expect the student to do just a part of it.

Project mentor would be Jonathan Fine.


5. New document templates for LaTeX

Lamport provided LaTeX with a number of document templates: book, article, etc. Even today, a large percentage of LaTeX documents use these, resulting in a recognizable "LaTeX-ey look."

Authors also have available classes for specialized use, such as for a journal, for a particular conference, or for a thesis from a particular university. But these classes are quite often adapted from Lamport's templates, and continue the look. On the other hand, surveying existing classes that do provide a substantially different look, such as koma-script's scr* and the French Mathematical Society's smfart, would be a useful part of this project.

This project is to provide alternative templates for broad usage. These may also be suitable for books and articles, or may be suitable for other purposes. Ideally they would come with guidance for potential users as to circumstances suggesting their use (e.g., for books largely without mathematics, or for automatically-generated texts).

Project mentor would be Jim Hefferon.


6. LuaTeX port of fontspec

XeTeX is a very popular extension of TeX that makes it possible to use TrueType and OpenType fonts without resorting to Type 1 or TFM files. Part of its success is due to Will Robertson's fontspec package which gives XeTeX's font-loading primitives a LaTeX interface. The goal of this project is to port fontspec to LuaTeX.

Because XeTeX and LuaTeX represent very different paradigms in extending TeX, the LuaTeX part of fontspec would be significantly different from the already existing code for XeTeX: while the latter uses a lot of additional system libraries to add advanced font support to TeX, the former provides a bridge between TeX and Lua and makes it possible to hook Lua code into TeX's typesetting engine. Therefore, most of the work would be to implement advanced font features in Lua. This is what the ConTeXt “Mark IV” version already does.

Several aspects need to be addressed:

Font lookup: In order for LuaTeX to be able to use system fonts like XeTeX does, we need to implement a lookup mechanism for those. Several options are possible:

These abilities would of course come in addition to the kpathsea library, which can also be called from Lua. We would therefore have a dual situation very similar to XeTeX's.

Font loading: At the engine level, LuaTeX's \font primitive behaves in roughly the same way as pdfTeX's, and can't load TrueType or OpenType font files. It expects a TFM file by default. We need to “overload” the primitive in order to emulate XeTeX's behaviour.

OpenType layout: There are also several options:

LaTeX interface: Ideally, the LuaTeX port of fontspec would have the same high-level interface as the current package.
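
For orientation, this is roughly what that high-level interface looks like today under XeLaTeX; the aim would be for the same document to compile unchanged with the LuaTeX port. The font names are assumptions: any OpenType or TrueType fonts that the engine can find would do.

```latex
% Compile with xelatex. The font names below are assumed to be installed.
\documentclass{article}
\usepackage{fontspec}
% Select the main text font by name and request a font feature.
\setmainfont[Ligatures=Common]{TeX Gyre Pagella}
% Define a switch for a second font family.
\newfontfamily\headingfont{TeX Gyre Heros}
\begin{document}
{\headingfont A heading in a sans-serif OpenType font}

Body text in the main font, with ligatures: office, affine.
\end{document}
```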

Project mentors would be Will Robertson and Arthur Reutenauer.


7. Initial LaTeX3 microkernel

The LaTeX typesetting system has for many years meant LaTeX2e. Recent developments on the successor system, LaTeX3, have focussed mainly on a new low-level programming system for TeX. As this low-level work reaches maturity, applying the new coding ideas to higher level work is becoming possible.
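
To give a flavour of that low-level programming system, here is a minimal sketch using the expl3 package. The module name "demo" and both macro names are hypothetical. Note that this example still runs on top of the LaTeX2e kernel, which is precisely what the microkernel described next would avoid.

```latex
\documentclass{article}
\usepackage{expl3}
\ExplSyntaxOn
% Naming convention: \<module>_<description>:<argument specification>.
% Inside \ExplSyntaxOn, spaces in the code are ignored.
\cs_new_protected:Npn \demo_emph:n #1
  { \group_begin: \itshape #1 \group_end: }
% Give the code-level function a document-level name.
\cs_new_eq:NN \DemoEmph \demo_emph:n
\ExplSyntaxOff
\begin{document}
\DemoEmph{Hello World!}
\end{document}
```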

The aim of the LaTeX3 “microkernel” project is to begin to examine how the low-level system can be applied to providing a system which can be used to typeset simple LaTeX2e-like documents without needing to load on top of the current LaTeX2e kernel. As an initial target, the basic document

\documentclass{minimal}
\begin{document}
\emph{Hello World!}
\end{document}
would be used as a test case.

The microkernel described here is not intended to be a complete implementation of the LaTeX2e kernel (latex.ltx). There are a number of as-yet unanswered questions concerning the user interface for LaTeX3. The project's aim is to build a basic stand-alone kernel, which can then be extended slowly to implement more features (most probably from latex.ltx, but possibly taken from LaTeX add-on packages). Additions such as basic sectioning commands and environments (lists, alignment, etc.) would be obvious steps to undertake after successfully producing a system capable of working with the test document. Certain areas can also be ruled out: the New Font Selection Scheme, complex output routines and floating content are all beyond the scope of the project.

Project mentor would be Joseph Wright.



$Date: 2009/03/19 14:18:52 $;