[tex-live] [LONG] Improving TeX package classification and the associated documentation

George N. White III gnwiii at gmail.com
Mon Jul 2 14:35:51 CEST 2007


On 7/2/07, Florent Rougon <f.rougon at free.fr> wrote:

> Norbert Preining <preining at logic.at> wrote:
>
> > On Mon, 02 Jul 2007, Florent Rougon wrote:
> >>   Each package (in the sense of CTAN package, not Debian) contains an
> >>   XML file that specifies the following:
> >
> > How will we ever get at this?
>
> Well, probably not, but I don't think it's a big problem:
>   - if package maintainers write their own metadata, well, that's great
>   - the most important packages (geometry, graphicx, etc.) will get
>     tagged in a reasonable timeframe by either their maintainer or by
>     volunteers (yeah, could be me, but preferably some LaTeX guru :).
>   - each package that is not tagged is simply not accessible through the
>     tag-based search or browsing facilities in the tool I'm thinking
>     about. We can still have an alphabetic classification and a poor
>     man's search that simply looks at package names. These packages are
>     just (much) less easy to find. If someone likes one of them and
>     finds it's too difficult to find, then he can tag it, dammit!
>
> Basically, the database need not be complete to be useful. Having only
> the "important" packages tagged would already be very helpful to users,
> I think.

Agreed.  TeX Live contains both "current best practice" packages and old,
unmaintained packages that legacy documents still use.  Documenting the
latter is less critical, since you can work back to the documentation from
the package or macro file if a legacy document has a problem.

> > Furthermore I would prefer *not* to have new stuff on CTAN.
>
> I'm not sure what the problem with "new stuff" is, but...

You have to be careful about generating more work for current maintainers
and more confusion for Google users who get hits on files they don't
understand, e.g., when searching for "hyperref manual".

> > I would suggest to add this information in some way
> > to the TeX Catalogue
>
> That can be done. But then, the burden of tagging the packages relies on
> the shoulders of the sole catalogue maintainers, and I fear that this
> way, we cannot ever have any significant part of CTAN tagged (I'm not
> implying that the catalogue maintainers are lazy, rather that CTAN is
> huge).

The TeX Catalogue has a lot of the metadata, so it makes sense to look for
ways to leverage this effort.

> That's mainly why I wanted the metadata to be part of CTAN packages:
> careful package authors will tag their packages in a good way. That
> would make their packages easier to find without causing any additional
> work for the catalogue maintainers.

There are several ways to add tags:

1) preferred -- packages provide tag information according to some standard
that allows automated inclusion in the catalogue.

2) least effort -- generate best-guess defaults from information already in
the catalogue (including a "not yet tagged" marker).

3) last resort -- for important packages, add tags manually when the
other two mechanisms have failed.
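
To make the fallback order concrete, here is a minimal Python sketch.  It
is only an illustration: the function name, the source labels and the
"not-yet-tagged" marker are made up, and I am assuming that manually added
tags should win over guessed defaults when both exist.

```python
from typing import List, Optional, Tuple

# Hypothetical marker for packages that nobody has tagged yet.
NOT_YET_TAGGED = "not-yet-tagged"


def resolve_tags(author_tags: Optional[List[str]],
                 manual_tags: Optional[List[str]],
                 guessed_tags: Optional[List[str]]) -> Tuple[str, List[str]]:
    """Return (source, tags), taking the first source that has any tags.

    Assumed precedence: author-provided, then manually added,
    then best-guess defaults derived from the catalogue.
    """
    if author_tags:
        return "author", sorted(set(author_tags))
    if manual_tags:
        return "manual", sorted(set(manual_tags))
    if guessed_tags:
        return "guessed", sorted(set(guessed_tags))
    # Nothing available: record that explicitly instead of leaving a gap.
    return "none", [NOT_YET_TAGGED]
```

Recording the source next to the tags lets a search tool show how far a
given set of tags can be trusted.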

> If you really want the metadata to be part of the catalogue, there's
> still a way to have it updated "collectively": using a web interface or
> a custom network client. But doing it properly requires validation of
> the tagging by some devoted soul, so this is complex and requires quite
> some additional work. Besides, web interfaces are not my area; I am not
> volunteering to write anything like that. A custom client can easily be
> made portable to current systems, yes, but this is real work (well,
> actually, something like that already exists in debtags and the client
> sends "tag patches", but I'm not sure the client is portable enough and
> can be reused for cataloguing CTAN instead of Debian).
>
> > (where it belongs)
>
> Hmmm, depends on the POV. :)
>
> Yes, if you look at the current state of catalogue implementation, that
> is where it belongs. But you can't say it's not natural to have the
> metadata embedded in each package (basically, you're telling that we
> should upload our Debian packages without debian/control and then copy
> debian/control ourselves for each upload somewhere on
> ftp-master.debian.org. Ugh! :).

You have to work with packages as they come; the best you can do is:

1) encourage authors to include metadata (in a useful form) -- I suspect most
would be happy to fill in some template if they knew it would be used.

2) failing author-provided metadata, avoid creating multiple metadata
repositories (duplication of effort, confusion when they diverge,
etc.)

> > Another problem is that sometimes packages on CTAN don't directly ship
> > documentation files, but they have to be created.
>
> Ah, that is indeed a problem I hadn't thought about. In these cases,
> what happens on the catalogue side?
>
>   (1) No doc is listed on the web interface.
>
> or
>
>   (2) The catalogue or CTAN maintainers build the doc themselves, store
>       it in CTAN and point to it from the catalogue.

The user will get the doc when they install the package -- the metadata just
needs to provide enough so they can judge whether it is worthwhile to install
the package.

> If (1), then we (TeX Live) are on our own and have to build the doc
> ourselves. I won't develop this case for now because this becomes a bit
> messy and I have the impression that a better solution would be to enforce
> that each CTAN upload has the full documentation built. But if there are
> good reasons against this, I can devise solutions.

Again, you have to work with what you get -- no standards are enforced.

> If (2), then I think the CTAN package should be stored in what I'll call
> "definitive form" *with* its documentation, and the metadata (be it in
> the package or the catalogue) could then point to the various doc files
> present in the package. Then, we're back to square one and can follow my
> proposal.
>
> This has an important implication: that TeX Live adopts the same format
> for documentation as CTAN. Yes, I know this won't make everyone happy,
> but:
>
>   - I believe this is the simplest and cleanest way from the POV of
>     information structure (package/documentation/metadata);
>
>   - In many cases, there is an optimal format for a given documentation:
>     if there are figures, DVI is ruled out and we need either PS or PDF.
>     If there are links from one file to another, I believe PS is ruled
>     out too. Moreover, PDF files with the navigation table (hierarchical
>     bookmarks) on the left are far more convenient than PS files without
>     such a table for not-so-short doc files IMHO. So, you should be able
>     to guess my preferred format for most doc. :)
>
> > If we restrict ourselves to TeX Live we can use the
> >       docfiles
> > entry in the TeX Live database, but there are no tags on these files in
> > any way (and currently no way to tag them, but this can be changed).
>
> I believe it would be a shame to restrict ourselves to TeX Live, but if
> we don't adopt the same doc format as on CTAN, I think this will have to
> happen.
>
> > If we have to tag all the stuff that would be impossible. OTOH we cannot
> > urge the package writers to write some tag specification.

If there is a template and some information showing how and why the data
will be useful, some authors will provide it, but you certainly have to
deal with cases where it is not provided or is somehow broken.

> As said previously, the database need not be complete to be useful...

It will be more useful if it can be trusted to include the important,
widely used packages and offers fallbacks for the rest.

> > Furthermore there are those packages which are not supported anymore,
> > i.e., no one is responsible for them.
>
> These can be tagged! You can then trivially avoid cluttering your search
> with obsolete packages.
>
> Unmaintained packages are a slightly different case. Often, you will
> prefer a maintained package to an unmaintained one providing the same
> functionality, but that is not always the case. So, in this case, the
> tag is not necessarily used as a binary filter, but can be used as a
> hint among others helping the user make his choice.
>
> > One way out of this dilemma I see is:
> > - we migrate the CTAN upload procedure to the experimental one currently
> >   in testing phase
>
> Sorry, I don't know what this experimental procedure consists of.
>
> > - we encourage package writers at upload time to add some tags (drop
> >   down lists, checkboxes, whatever)

Better to have a one-time mechanism where the tag information is included
once and the author doesn't have to do anything more unless there is a
major change in the package status.   I'd suggest a special line in the
README file for the author's tags (while still allowing ctan2tl, etc.
to add more tags).
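
As a rough sketch of what reading such a line could look like (Python,
purely illustrative): the "Tags:" prefix and the comma-separated format
are assumptions of mine, not an existing convention, and both function
names are made up.

```python
import re
from typing import List

# Hypothetical convention: one README line of the form
#   Tags: page-layout, margins
TAG_LINE = re.compile(r"^\s*tags\s*:\s*(.*)$", re.IGNORECASE)


def read_author_tags(readme_text: str) -> List[str]:
    """Return the tags from the first "Tags:" line, or [] if there is none."""
    for line in readme_text.splitlines():
        match = TAG_LINE.match(line)
        if match:
            tags: List[str] = []
            for item in match.group(1).split(","):
                tag = item.strip().lower()
                if tag and tag not in tags:  # skip blanks and duplicates
                    tags.append(tag)
            return tags
    return []


def merge_tags(author_tags: List[str], extra_tags: List[str]) -> List[str]:
    """Keep the author's tags first, then append tags added by other tools."""
    merged = list(author_tags)
    for tag in extra_tags:
        if tag not in merged:
            merged.append(tag)
    return merged
```

A README without the line just yields an empty list, so packages that
never adopt the convention keep working and fall through to whatever
defaults are available.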

> Yes, that can be done. But I have the impression you're seeing it as a
> web interface, and I repeat I'm not the one who will code it. That said,
> I'm not opposed to such an interface, if done correctly.
>
> > - documentation files could be gathered automatically from the ctan2tl
> >   script which would be executed in the background
>
> I suppose this is the magic script that will be able to tell me where
> each doc file is installed. OK. What we need is:
>
>   - installation path for each file;
>
>   - the CTAN package it belongs to;
>
>   - ideally, some way to link each file to its metadata, so that we can
>     know the language each doc file is written in and can tell when a
>     file is an index (entry point) in the list of doc files installed by
>     a package.
>
>     As said above, this should be easy to do if the files installed are
>     the same as in the (definitive) CTAN packages, because then the
>     metadata can be part of the catalogue or of the CTAN package;
>     otherwise (e.g., if CTAN has no doc or the doc in other formats),
>     matching the metadata from CTAN with the files installed in TL
>     becomes messy IMO.

Messy, but unavoidable.  I think CTAN gets whatever an author wants
to provide, so it is up to each distribution to create missing doc files
in the preferred format for the distro.

> > This way slowly stuff would get tagged, and for old packages one of us
> > would have to take the work to go through them.
>
> Well, I don't think some volunteer will ever tag the whole CTAN. Too
> much boring work. A set of volunteers could probably, yes, but it is
> more likely that old unmaintained packages will remain untagged. That
> said, it is not a regression from the current state, and I don't think
> it's a real problem. Obsolete/obscure packages are more difficult to
> find[1], so what?...

Some heuristics can be applied to generate default tags using the information
that is readily available (one tag can indicate that only default tags were
applied).
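
A minimal sketch of such a heuristic, in Python and only as an
illustration: the keyword table, the tag names and the
"default-tags-only" marker are all invented here, and a real version
would presumably work from the catalogue's description fields.

```python
from typing import Dict, List

# Hypothetical keyword -> tag table; a real one would be much larger.
KEYWORD_TAGS: Dict[str, str] = {
    "margin": "page-layout",
    "font": "fonts",
    "graphic": "graphics",
    "bibliograph": "bibliography",
}

# Hypothetical marker saying that no human has reviewed these tags.
DEFAULT_ONLY = "default-tags-only"


def guess_default_tags(description: str) -> List[str]:
    """Guess tags from a package description by simple substring matching."""
    text = description.lower()
    tags: List[str] = []
    for keyword, tag in KEYWORD_TAGS.items():
        if keyword in text and tag not in tags:
            tags.append(tag)
    # Always add the marker so users and tools can discount the guesses.
    tags.append(DEFAULT_ONLY)
    return tags
```

The marker tag is the important part: it lets a search tool rank guessed
tags lower, and it gives volunteers a list of packages still waiting for
proper tagging.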

> As far as TL (as opposed to CTAN) is concerned, that might be a bit
> different, as the number of packages is, I think, more manageable. But
> probably the same will happen (obsolete/obscure packages untagged, at
> least during the first years), just on a smaller scale.

If you start with TL you will be accomplishing something important in making
TL a viable successor to teTeX, which had the advantages of a more uniform
platform and fewer packages.

There needs to be some thought given to integrating local packages.  With
teTeX, some administrator just copied files to the local texmf tree.  It is
getting to the point where (with lots of Linux workstations, many
controlled by users with only a vague understanding of texmf trees) we
need a way to create local packages that can be installed/removed
using the TL tools.

-- 
George N. White III <aa056 at chebucto.ns.ca>
Head of St. Margarets Bay, Nova Scotia

