JCDL 2006: Opening Information Horizons

Metadata Tools for Digital Resource Repositories Workshop

June 15, 2006, Chapel Hill, NC, USA

A Tiny Retrieval Protocol: THUMP and Kernel Metadata

John Kunze, Preservation Technologist for the California Digital Library
John Kunze is a preservation technologist for the California Digital Library and has a background in computer science and mathematics. His current work focuses on archiving websites, creating long-term durable digital references (ARKs) to information objects, and specifying lightweight (kernel) metadata. Prior work includes major contributions to the standardization of URLs, Dublin Core metadata, and the Z39.50 search and retrieval protocol. In an earlier life he designed, wrote, and ran UC Berkeley's first campus-wide information system, which was an early rival and client of the World Wide Web. Before that he was a BSD Unix hacker whose work survives in today's Linux and Apple systems.
Kevin A. Gamiel, Research Programmer, Renaissance Computing Institute
Kevin Gamiel is a research software developer with the Renaissance Computing Institute (RENCI), a collaborative venture of Duke University, North Carolina State University, the University of North Carolina at Chapel Hill, and the state of North Carolina. His current work includes an NIH-funded multidisciplinary approach to exploratory genetic analysis, leadership of the NC and TeraGrid Bioportal projects, performance analysis of high-performance codes on hundreds of compute nodes, and a number of other RENCI efforts. Past work includes contributions to the Dublin Core metadata effort and the Z39.50 standard. He was also co-chair of the Networked Information Retrieval (NIR) and Integration of Internet Information Resources (IIIR) IETF working groups.

Web information retrieval designs cycle naturally between periods of expanding functionality and contracting complexity. This talk presents a contraction-phase design that tries to retain the best features of modern retrieval designs while being very easy to implement. Leveraging existing search systems, it calls for an extra external interface but otherwise requires no internal system changes.
The new interface is specified by THUMP (The HTTP URL Mapping Protocol), a very lightweight protocol that can be used both for focused, known-item retrievals and for broad search engine queries. To keep implementation barriers low, the interface can be thoroughly tested with ubiquitous tools such as web browsers and the telnet remote login software. The talk will address implementation experiences in a scientific computing context.
THUMP returns information in the form of an Electronic Resource Citation (ERC), a simple, compact, and printable record designed to hold data associated with information objects. By design, the ERC is a metadata format that balances the needs for expressive power, very simple machine processing, and direct human manipulation. The ERC uses a "kernel" subset of four required metadata elements defined by a working group of the Dublin Core Metadata Initiative.
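To illustrate the claim of very simple machine processing, here is a minimal sketch of reading an ERC-like record in Python. It rests on several assumptions that go beyond this abstract: the record is taken to be a simplified series of "label: value" lines, the four kernel elements are taken to be who, what, when, and where (the names used in the kernel metadata work, which the abstract does not list), and the sample record, its URL, and the function names are invented for illustration.

```python
# Assumed kernel element names; the abstract only says there are four.
KERNEL_ELEMENTS = ("who", "what", "when", "where")


def parse_erc(text):
    """Parse simplified 'label: value' lines into a dict.

    Indented lines are treated as continuations of the previous value.
    This is an illustrative simplification, not a full ERC parser.
    """
    record = {}
    last = None
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        if line[0] in " \t" and last is not None:
            record[last] += " " + line.strip()
            continue
        # Split on the first colon only, so values may contain colons.
        label, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines that are not 'label: value'
        last = label.strip().lower()
        record[last] = value.strip()
    return record


def missing_kernel_elements(record):
    """Return the assumed kernel elements that are absent or empty."""
    return [name for name in KERNEL_ELEMENTS if not record.get(name)]


# Hypothetical record; the 'where' URL is a placeholder.
SAMPLE = """erc:
who: Kunze, John
what: A Tiny Retrieval Protocol: THUMP and Kernel Metadata
when: 2006
where: http://example.org/thump-talk
"""

if __name__ == "__main__":
    record = parse_erc(SAMPLE)
    for name in KERNEL_ELEMENTS:
        print(name, "=", record.get(name))
    print("missing:", missing_kernel_elements(record))
```

Because each element sits on its own labeled line, the same record stays readable and editable by hand, which is the balance between machine processing and direct human manipulation that the abstract describes.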
