C. Mic Bowman, Peter B. Danzig, Darren R. Hardy, Udi Manber and Michael F. Schwartz. The Harvest Information Discovery and Access System. Proceedings of the Second International World Wide Web Conference, pp. 763-771, Chicago, Illinois, October 1994.
C. Mic Bowman, Peter B. Danzig, Darren R. Hardy, Udi Manber, Michael F. Schwartz, and Duane P. Wessels. Harvest: A Scalable, Customizable Discovery and Access System. Technical Report CU-CS-732-94, Department of Computer Science, University of Colorado, Boulder, August 1994 (revised March 1995). Submitted for publication.
Abstract: Rapid growth in data volume, user base, and data diversity render Internet-accessible information increasingly difficult to use effectively. In this paper we introduce Harvest, a system that provides an integrated set of customizable tools for gathering information from diverse repositories, building topic-specific content indexes, flexibly searching the indexes, widely replicating them, and caching objects as they are retrieved across the Internet. The system interoperates with WWW clients and with HTTP, FTP, Gopher, and NetNews information resources. We discuss the design and implementation of Harvest and its subsystems, give examples of its uses, and provide measurements indicating that Harvest can significantly reduce server load, network traffic, and space requirements when building indexes, compared with previous systems. We also discuss several popular indexes we have built using Harvest, underscoring the customizability and scalability of the system.
C. Mic Bowman, Peter B. Danzig, Udi Manber and Michael F. Schwartz. Scalable Internet Resource Discovery: Research Problems and Approaches. Communications of the ACM, 37(8), pp. 98-107, August 1994.
Abstract: Over the past several years, a number of information discovery and access tools have been introduced in the Internet, including Archie, Gopher, Netfind, and WAIS. These tools have become quite popular, and are helping to redefine how people think about wide-area network applications. Yet, they are not well suited to supporting the future information infrastructure, which will be characterized by enormous data volume, rapid growth in the user base, and burgeoning data diversity. In this paper we indicate trends in these three dimensions and survey problems these trends will create for current approaches. We then suggest several promising directions of future resource discovery research, along with some initial results from projects carried out by members of the Internet Research Task Force Research Group on Resource Discovery and Directory Service.
Darren R. Hardy and Michael F. Schwartz. Customized Information Extraction as a Basis for Resource Discovery. Technical Report CU-CS-707-94, Department of Computer Science, University of Colorado, Boulder, March 1994 (revised February 1995). To appear, ACM Transactions on Computer Systems.
Abstract: Indexing file contents is a powerful means of helping users locate documents, software, and other types of data among large repositories. In environments that contain many different types of data, content indexing requires type-specific processing to extract information effectively. In this paper we present a model for type-specific, user-customizable information extraction, and a system implementation called Essence. This software structure allows users to associate specialized extraction methods with ordinary files, providing the illusion of an object-oriented file system that encapsulates indexing methods within files. By exploiting semantics of common file types, Essence generates compact yet representative file summaries that can be used to improve both browsing and indexing in resource discovery systems. Essence can extract information from most of the types of files found in common file systems, including files with nested structure (such as compressed "tar" files). Essence interoperates with a number of different search/index systems (such as WAIS and Glimpse), as part of the Harvest system.
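For flavor, here is a minimal Python sketch of the kind of type-specific summarizer dispatch the Essence model describes; the file types, summarizer logic, and size limits are illustrative assumptions, not Harvest's actual extraction code.

    import os, re, tarfile

    def summarize_c_source(path):
        # Keep only #include lines and function-like definitions.
        keep = []
        with open(path, errors="replace") as f:
            for line in f:
                if line.startswith("#include") or re.match(r"\w[\w\s\*]*\(", line):
                    keep.append(line.strip())
        return keep[:50]

    def summarize_text(path):
        # Keep the first few lines as a crude summary.
        with open(path, errors="replace") as f:
            return [line.strip() for _, line in zip(range(5), f)]

    def summarize_tar(path):
        # Nested structure: summarize an archive by listing its member names.
        with tarfile.open(path) as tf:
            return tf.getnames()[:50]

    SUMMARIZERS = {".c": summarize_c_source, ".h": summarize_c_source,
                   ".txt": summarize_text, ".tar": summarize_tar}

    def summarize(path):
        # Dispatch on file type; fall back to the plain-text summarizer.
        ext = os.path.splitext(path)[1].lower()
        return SUMMARIZERS.get(ext, summarize_text)(path)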
Darren R. Hardy. Scalable Internet Resource Discovery Among Diverse Information. Technical Report CU-CS-650-93, Department of Computer Science, University of Colorado, Boulder, May 1993. M.S. Thesis.
Abstract: Internet users need tools to help them discover resources that are available throughout the network. Unfortunately, current Internet resource discovery systems are inadequately prepared for the Internet's rapid growth. In particular, scalable techniques that adapt to growth in information diversity will provide a basis for future Internet resource discovery systems. To explore some of these techniques, we have built two resource discovery systems: Dynamic WAIS, which provides seamless information-level access to remote search systems through the Wide Area Information Servers (WAIS) interface; and Essence, which exploits file semantics to index a variety of types of data.
William G. Camargo. The Harvest Broker. M.S. Thesis, Computer Science Department, Pennsylvania State University, October 1994.
Abstract: The Internet currently encompasses thousands of hosts and terabytes of publicly accessible information. However, it is growing exponentially, and with this expansion comes complexity. As a result, finding useful information becomes more difficult. What's more, the load on popular archives increases with the number of users, which reduces quality of service and performance. Obviously, improving network and server performance is imperative. The World-Wide Web provides an architecture for browsing and accessing information on the network, but lacks a search facility for locating information. Harvest is a distributed architecture that supports locating objects (files, packages, etc.) on the Internet. The Harvest architecture consists of four parts, as seen in Figure 1.1: the Gatherer, the Broker, the Object Cache, and the Replication Manager. The Gatherer summarizes information at archive sites and creates object summaries called Summary Object Interchange Format (SOIF) objects. The Broker provides a network-accessible object database, formed by collecting SOIF objects from Gatherers and other Brokers, and maintains loose consistency between the Broker databases and the files found on gathered archives. Harvest users search for archived objects through a query interface to the Broker's database of object summaries. To decrease network and server load, the Replication Manager maintains several identical copies of a Broker. The Object Cache maintains local copies of popular objects to improve access performance. This paper describes the Harvest Broker.
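As a rough illustration of the summary objects the Gatherer emits, the following Python sketch formats a SOIF-style record; the template type, attribute names, and URL are assumptions for illustration, and real Gatherer output carries many more attributes.

    # Rough sketch of emitting a SOIF-style summary object as a Gatherer
    # might; the template type and attribute names here are illustrative.
    def soif_record(url, attributes, template="FILE"):
        lines = ["@%s { %s" % (template, url)]
        for name, value in attributes.items():
            data = value.encode("utf-8")
            # SOIF prefixes each value with its byte length in braces.
            lines.append("%s{%d}:\t%s" % (name, len(data), value))
        lines.append("}")
        return "\n".join(lines)

    print(soif_record("http://harvest.cs.colorado.edu/",
                      {"Title": "Harvest Home Page",
                       "Keywords": "discovery indexing caching"}))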
Udi Manber and Sun Wu. GLIMPSE: A Tool to Search Through Entire File Systems. Proceedings of the USENIX Winter Conference, pp. 23-32, San Francisco, California, January 1994.
Abstract: GLIMPSE, which stands for GLobal IMPlicit SEarch, provides indexing and query schemes for file systems. The novelty of glimpse is that it uses a very small index (in most cases 2-4% of the size of the text) and still allows very flexible full-text retrieval, including Boolean queries, approximate matching (i.e., allowing misspelling), and even searching for regular expressions. In a sense, glimpse extends agrep to entire file systems, while preserving most of its functionality and simplicity. Query times are typically slower than with inverted indexes, but they are still fast enough for many applications. For example, it took 5 seconds of CPU time to find all 19 occurrences of "Usenix AND Winter" in a file system containing 69MB of text spanning 4300 files. Glimpse is particularly designed for personal information, such as one's own file system. The main characteristic of personal information is that it is non-uniform and includes many types of documents. An information retrieval system for personal information should support many types of queries, flexible interaction, low overhead, and customization. All these are important features of glimpse.
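A toy Python sketch of the two-level search idea (a small index narrows the candidates, then an exhaustive scan confirms matches) follows; it indexes whole files rather than glimpse's fixed-size blocks and omits approximate matching and regular expressions.

    # Sketch of the two-level search idea behind glimpse (not its actual
    # implementation): a small index maps each word to the files that
    # contain it; an exhaustive scan runs only over those candidates.
    import re
    from collections import defaultdict

    def build_index(paths):
        index = defaultdict(set)          # word -> set of file paths
        for path in paths:
            with open(path, errors="replace") as f:
                for word in set(re.findall(r"[A-Za-z]+", f.read().lower())):
                    index[word].add(path)
        return index

    def search(index, *words):
        # Boolean AND: intersect candidate sets, then confirm by scanning.
        candidates = set.intersection(*(index.get(w.lower(), set()) for w in words))
        hits = []
        for path in candidates:
            with open(path, errors="replace") as f:
                for lineno, line in enumerate(f, 1):
                    if all(w.lower() in line.lower() for w in words):
                        hits.append((path, lineno, line.rstrip()))
        return hits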
C. Mic Bowman, Chanda Dharap, Mrinal Baruah, Bill Camargo and Sunil Potti. A File System for Information Management. Proceedings of the Conference on Intelligent Information Management Systems, Washington, DC, June 1994.
Abstract: Nebula is unique in that it integrates the functionality of file systems with the support provided by information services. The Nebula file system explicitly supports information management across a wide-area file system. Nebula is different from traditional systems in three important ways. First, Nebula implements files as structured, typed file objects, represented as sets of attributes. Each attribute describes some property of the file such as owner, protection, functions defined, sections specified, project, or file type. Second, Nebula provides associative access to files within a scoped index. Finally, Nebula replaces traditional directories with views, which dynamically classify information according to the needs of the user. A view is a query that selects objects from a scoped index. Nebula views provide a powerful new file location technique that combines browsing and searching, using well-known and well-understood browsing and searching commands like cd, ls, and cat.
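The following Python sketch illustrates the view idea with invented attributes and file names: file objects are attribute sets, and a "directory listing" is the result of evaluating a stored query over a scoped index.

    # Minimal sketch of files as attribute sets and views as stored queries;
    # the attributes and query form are illustrative, not Nebula's own.
    files = [
        {"name": "sched.c", "owner": "chanda", "project": "nebula", "type": "source"},
        {"name": "notes.txt", "owner": "mic", "project": "nebula", "type": "text"},
        {"name": "Makefile", "owner": "mic", "project": "harvest", "type": "build"},
    ]

    def view(index, **query):
        # A view dynamically classifies objects: it selects every file object
        # whose attributes match the query, rather than listing a directory.
        return [f for f in index if all(f.get(k) == v for k, v in query.items())]

    # "cd project=nebula; ls" becomes evaluating a view over the scoped index.
    for f in view(files, project="nebula"):
        print(f["name"])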
Peter Danzig, Katia Obraczka, Dante DeLucia and Naveed Alam. Massively Replicating Services in Autonomously Managed Wide-Area Internetworks. Technical Report, University of Southern California, January 1994.
Abstract: Current and future Internet services will provide a large, rapidly evolving, highly accessed, yet autonomously managed information space. Internet news, perhaps, is the closest existing precursor to such services. It permits autonomous updates, is replicated at thousands of autonomously managed sites, and manages a large database. It gets its performance through massive replication. This paper proposes a scalable mechanism for replicating wide-area, autonomously managed services. We target replication degrees of tens of thousands of weakly consistent replicas. For efficiency, our mechanism probes the network and computes a good logical topology over which to send updates. For scalability, we organize replicas into hierarchical replication groups, analogous to the Internet's autonomous routing domains. We argue that efficient, massive replication does not have to rely on internet multicast.
Katia Obraczka. Massively Replicating Services in Wide-Area Internetworks. Ph.D. Dissertation, Technical Report USC-CS 94-595, Computer Science Department, University of Southern California, Los Angeles, California, December 1994.
Abstract: Existing and future Internet information services will provide large, rapidly evolving, highly accessed, yet autonomously managed information spaces. Internet news, perhaps, is the closest existing precursor to such services. It permits asynchronous updates, is replicated at thousands of autonomously managed sites, manages a large database, yet presents adequate response time. It gets its performance through massive replication. Other Internet services, such as archie and WWW/Mosaic, must massively replicate their data to deliver reasonable performance. The problem of replicating information that can be partitioned into autonomously managed subspaces has well-known solutions. Naming services such as the Domain Name Service (DNS) and Grapevine use administrative boundaries to partition their hierarchical name space into domains. These domains need only be replicated in a handful of services for adequate performance. On the other hand, efficient massive replication of wide-area, flat, yet autonomously managed data is yet to be demonstrated, since existing replication algorithms do not address the scale and autonomy of today's internetworks. They either ignore network topology issues entirely, or rely on system administrators to hand-configure replica topologies over which updates travel. Lampson's widely cited Global Name Service (GNS) propagates updates over manually configured topologies. However, the complexity of manually configuring fault-tolerant update topologies that use the underlying physical network efficiently increases with the scale of today's internetworks. Furthermore, GNS-like replication mechanisms also manage single, flat groups of replicas. While this is appropriate for applications with 20 to 30 replicas that operate within single administrative boundaries, it is unrealistic for wide-area, massively replicated services whose replicas spread throughout the Internet's thousands of administrative domains.
This research investigates scalable replication mechanisms for wide-area, autonomously managed services. We propose a new architecture that extends existing replication mechanisms to explicitly address scalability and autonomy. We target replication degrees of thousands or even tens of thousands of weakly-consistent replicas. Imitating the Internet's administrative domain routing hierarchy, the proposed architecture organizes replicas into multiple replication groups. Organizing replicas into multiple groups limits the size of the consistency state each replica keeps. It also preserves autonomy, since it insulates replication groups from other groups' administrative decisions. We argue that efficient replication algorithms keep replicas weakly consistent by flooding data between them. Our architecture automatically builds a logical update flooding topology over which replicas propagate updates to other replicas in their group. Replicas in a group estimate the underlying physical topology. Using the estimated network topology, we compute a logical update topology that is k-connected for resilience, tries to use the physical network efficiently, and yet restricts update propagation delays. Replication groups compute and adopt a new update topology every time they detect changes in group membership and network topology. We describe in detail our architecture and its implementation as a hierarchical, network-cognizant replication tool, which we implemented as part of the Harvest resource discovery system. Through simulations, we investigate the benefits of using hierarchical replication groups and distributing updates over the logical update topologies our architecture computes. We present the results from the wide-area experiments we conducted to evaluate our network topology estimation strategy. We also present our logical topology calculation algorithm in detail, and the results obtained when running our algorithm over randomly generated topologies. Finally, we explore the use of internet multicast as the underlying update propagation mechanism. We argue that since having a single multicast group with all replicas of a service will not scale, each replication group can be mapped to a multicast group. Through corner replicas, updates in one group make their way to all groups. Lampson's Global Name Service has been widely accepted as the solution to the problem of replicating wide-area, distributed, weakly-consistent applications. However, almost 10 years ago, when Lampson wrote "... The system I have in mind is a large one, large enough to encompass all the computers in the world and all the people that use them...", he did not anticipate that the Internet would grow as fast as it did. The main contribution of this dissertation is to make GNS-like services work in today's exponentially-growing, autonomously-managed internetworks.
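As a toy illustration only (not the dissertation's algorithm), the Python sketch below derives a logical update topology from a made-up distance matrix by linking each replica to its k nearest peers, then floods an update over the resulting graph; the real system estimates distances from network probes, enforces k-connectivity and delay bounds, and organizes replicas into hierarchical groups.

    # Toy sketch: build a logical update topology from estimated distances
    # and flood an update over it. Replica names and distances are invented.
    def build_topology(dist, k=2):
        topo = {r: set() for r in dist}
        for r, row in dist.items():
            for peer in sorted(row, key=row.get)[:k]:
                topo[r].add(peer)
                topo[peer].add(r)       # keep links bidirectional
        return topo

    def flood(topo, origin, deliver):
        seen, frontier = {origin}, [origin]
        while frontier:
            node = frontier.pop()
            for peer in topo[node]:
                if peer not in seen:    # each replica applies the update once
                    seen.add(peer)
                    deliver(peer)
                    frontier.append(peer)

    dist = {"A": {"B": 1, "C": 4, "D": 9},
            "B": {"A": 1, "C": 2, "D": 8},
            "C": {"A": 4, "B": 2, "D": 3},
            "D": {"A": 9, "B": 8, "C": 3}}
    flood(build_topology(dist), "A", lambda r: print("update reaches", r))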
James D. Guyton and Michael F. Schwartz. Locating Nearby Copies of Replicated Internet Servers. Technical Report CU-CS-762-95, Department of Computer Science, University of Colorado, Boulder, Colorado. Submitted for publication.
Abstract: In this paper we consider the problem of choosing among a collection of replicated servers, focusing on the question of how to make choices that segregate client/server traffic according to network topology. We explore the cost and effectiveness of a variety of approaches, ranging from those requiring routing layer support (e.g., anycast) to those that build location databases using application-level probe tools like traceroute. We uncover a number of tradeoffs between effectiveness, network cost, ease of deployment, and portability across different types of networks. We performed our experiments using a simulation parameterized by a topology collected from 7 survey sites across the United States, exploring a global collection of Network Time Protocol servers.
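A minimal sketch of the application-level flavor of this approach follows: rank candidate replicas by probe-measured distance and pick the closest. The server names and hop counts below are invented for illustration.

    # Choose the replica with the smallest measured distance (e.g. hop
    # counts gathered by traceroute-style probes). Numbers are made up.
    measured_hops = {
        "ntp1.example.edu": 7,
        "ntp2.example.org": 12,
        "ntp3.example.net": 4,
    }

    def nearest(servers):
        # Smallest measured distance wins; ties are broken arbitrarily.
        return min(servers, key=servers.get)

    print("closest replica:", nearest(measured_hops))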
Anawat Chankhunthod, Peter B. Danzig, Chuck Neerdaels, Michael F. Schwartz and Kurt J. Worrell. A Hierarchical Internet Object Cache. Technical Report 95-611, Computer Science Department, University of Southern California, Los Angeles, California, March 1995. Also, Technical Report CU-CS-766-95, Department of Computer Science, University of Colorado, Boulder, Colorado. Submitted for publication.
Abstract: This paper discusses the design and performance of a proxy-cache designed to make Internet information systems scale better. A hierarchical arrangement of caches mirroring the topology of a wide-area internetwork can help distribute load away from server hot spots raised by globally popular information objects, reduce access latency, and protect the network from erroneous clients. We present performance figures indicating that the cache significantly outperforms other popular Internet cache implementations at highly concurrent load and that indicate the effect of hierarchy on cache access latency. We also summarize the results of experiments comparing TTL-based consistency with an approach that fans out invalidations through the cache hierarchy. Finally, we present experiences derived from fitting the cache into the increasingly complex and operational world of Internet information systems, including issues related to security, transparency to cache-unaware clients, and the role of file systems in support of ubiquitous wide-area information systems.
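A simplified Python sketch of hierarchical miss resolution with TTL-based freshness follows; it is not the Harvest cache's code, and the class structure, TTL value, and origin fetch are placeholders.

    # Simplified sketch: a miss is forwarded to the parent cache, and the
    # object is stored at each level on the way back down.
    import time

    class Cache:
        def __init__(self, name, parent=None, ttl=3600):
            self.name, self.parent, self.ttl, self.store = name, parent, ttl, {}

        def fetch_from_origin(self, url):
            return "<contents of %s>" % url      # stand-in for a real retrieval

        def get(self, url):
            obj = self.store.get(url)
            if obj and time.time() - obj[1] < self.ttl:   # TTL-based freshness
                return obj[0]
            data = self.parent.get(url) if self.parent else self.fetch_from_origin(url)
            self.store[url] = (data, time.time())
            return data

    root = Cache("national")
    leaf = Cache("campus", parent=root)
    leaf.get("http://example.com/big-object")   # fills both caches on a miss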
Peter B. Danzig, Richard S. Hall and Michael F. Schwartz. A Case for Caching File Objects Inside Internetworks. Proceedings of SIGCOMM '93, pp. 239-248, San Francisco, California, September 1993.
Abstract: This paper presents evidence that several, judiciously placed file caches could reduce the volume of FTP traffic by 42%, and hence the volume of all NSFNET backbone traffic by 21%. In addition, if FTP client and server software automatically compressed data, this savings could increase to 27%. We believe that a hierarchical architecture of whole file caches, modeled after the existing name server's caching architecture, could become a valuable part of any internet.
We derived these conclusions by performing trace driven simulations of various file caching architectures, cache sizes, and replacement policies. We collected the traces of file transfer traffic employed in our simulations on a network that connects the NSFNET backbone to a large, regional network. This particular regional network is responsible for about 5 to 6% of NSFNET traffic.
While this paper's analysis and discussion focus on caching for FTP file transfer, the proposed caching architecture applies to caching objects from other internetwork services.
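In the same trace-driven spirit (though with an invented trace and cache size), the short Python sketch below replays (file, size) requests through an LRU cache and reports the fraction of bytes served from cache.

    # Toy trace-driven cache simulation; the trace and capacity are made up.
    from collections import OrderedDict

    def byte_hit_rate(trace, capacity):
        cache, hit_bytes, total_bytes = OrderedDict(), 0, 0
        for name, size in trace:
            total_bytes += size
            if name in cache:
                hit_bytes += size
                cache.move_to_end(name)          # mark as most recently used
            else:
                cache[name] = size
                while sum(cache.values()) > capacity:
                    cache.popitem(last=False)    # evict least recently used
        return hit_bytes / total_bytes

    trace = [("gnu.tar.gz", 40), ("paper.ps", 5), ("gnu.tar.gz", 40), ("x11.tar", 60)]
    print("byte hit rate:", byte_hit_rate(trace, capacity=100))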
Kurt Jeffery Worrell. Invalidation in Large Scale Network Object Caches. Department of Computer Science, University of Colorado, Boulder, Colorado, December 1994.
Abstract: The Internet is growing very rapidly, and the growth is outstripping the capabilities of the network links and servers. One way to combat this growth is to deploy a caching system on the Internet as an additional part of the infrastructure. The system must be organized hierarchically in order to scale with the growth. A hierarchical topology for the caching system provides the opportunity to experiment with non-traditional consistency schemes. The most common consistency scheme is based on per-object timeouts. These timeouts are selected based on heuristics developed in the past. The new growth brings not only new users but new access methods and new mechanisms for publishing and viewing data. This growth puts the old heuristics to the test. This thesis looks specifically at using invalidation protocols as the consistency mechanism for hierarchical network object caches. A study is performed to determine the upper and lower bounds on object revisions. This data is used to drive a simulation comparing the timeout and the invalidation protocol as consistency mechanisms. The comparison is made in terms of network bandwidth consumed, load placed on network servers, and the percentage of time that out-of-date data is supplied to the end user. The invalidation protocol significantly reduces the percentage of cache references fulfilled with stale data while costing no more than the timeout-based scheme for reasonable timeout values. A prototype design and implementation of an invalidation protocol was produced and tested on a small scale.
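A minimal sketch of invalidation fan-out through a cache hierarchy follows; the thesis's protocol is more involved, and the node names and object URL here are invented.

    # When an object changes, the parent notifies its children, which drop
    # their copies and notify their own children in turn.
    class CacheNode:
        def __init__(self, name):
            self.name, self.children, self.store = name, [], {}

        def add_child(self, child):
            self.children.append(child)

        def invalidate(self, url):
            if url in self.store:
                del self.store[url]              # drop the stale copy
            for child in self.children:
                child.invalidate(url)            # propagate down the hierarchy

    root, regional, campus = CacheNode("root"), CacheNode("regional"), CacheNode("campus")
    root.add_child(regional); regional.add_child(campus)
    campus.store["http://example.com/page"] = "old copy"
    root.invalidate("http://example.com/page")   # fans out to every descendant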
Duane Wessels. Intelligent Caching for World-Wide Web Objects. Interdisciplinary Telecommunications Program, University of Colorado, Boulder, 1995. M.S. Thesis.
Abstract: This thesis describes some software designed to improve access to World-Wide Web (WWW) data on the global Internet. The tools used for retrieving WWW objects allow users to be unaware of where the data actually resides. Huge inefficiencies result when objects are repeatedly transmitted across relatively slow wide area network (WAN) connections. A solution to this problem is to install object caches at strategic places in the network. Caches are implemented on proxy servers which act as intermediaries between local clients and remote servers. Frequently accessed objects will already be in the cache thereby speeding delivery time to clients and reducing WAN traffic.
Caching proxy servers for the World-Wide Web already exist, most notably the original CERN server. The CERN implementation leaves a lot of room for improvement. The software developed and described here is designed to give a better response to clients and impose less of a load on the server host. Internet servers maintain no state information about client accesses. This leads to an NFS-like model of caching where clients are responsible for maintaining cache consistency. This thesis investigates an AFS-like model whereby server sites issue callbacks to the sites which keep cached copies of the server's data. A cache negotiation protocol is described which gives information providers control of how, and for how long, their data may be distributed throughout remote network caches.
The following problems are addressed: the development of a single-process, non-blocking proxy server for lower system load; the maintenance of up-to-date and accurate cache data with minimal network overhead; and the design of efficient and robust algorithms for cache management.
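To illustrate the single-process, non-blocking server structure argued for above, here is a toy Python sketch built on one select loop; it merely echoes request bytes where a real proxy would parse the request and consult its cache, and the port number is arbitrary.

    # One process, one select loop, many clients; no per-client processes.
    import selectors, socket

    sel = selectors.DefaultSelector()

    def accept(server):
        conn, _ = server.accept()
        conn.setblocking(False)
        sel.register(conn, selectors.EVENT_READ, handle)

    def handle(conn):
        data = conn.recv(4096)
        if data:
            conn.sendall(data)                   # placeholder for a cached reply
        else:
            sel.unregister(conn)
            conn.close()

    server = socket.socket()
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(("localhost", 8080))
    server.listen()
    server.setblocking(False)
    sel.register(server, selectors.EVENT_READ, accept)

    while True:                                  # single process serves all clients
        for key, _ in sel.select():
            key.data(key.fileobj)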
Bhavna Chhabra, Darren R. Hardy, Allan Hundhausen, Dave Merkel, John Noble and Michael F. Schwartz. Integrating Complex Data Access Methods into the Mosaic/WWW Environment. Proceedings of the Second International World Wide Web Conference, pp. 909-919, Chicago, Illinois, October 1994.
Abstract: This paper describes object-oriented extensions to the Harvest information discovery and access system that allow users to define type hierarchies and associated access methods for Web objects. These extensions enable Mosaic and the Web to represent, manipulate and display arbitrarily complex data in application-specific ways. Types and access methods are extensible; new ones can be defined and exported from Web servers and imported into Mosaic on demand, without changes to Mosaic and HTML for each new data type. This paper describes our design and prototype system implementation, including an application we built using the system.
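As a loose illustration of runtime-extensible types with associated access methods (the type names and methods below are invented, not Harvest's), consider this small Python registry sketch:

    # New types and their access methods can be registered at runtime, so a
    # client can handle data it had no built-in support for.
    REGISTRY = {}

    def register(type_name, parent=None, **methods):
        REGISTRY[type_name] = {"parent": parent, "methods": methods}

    def invoke(type_name, method, obj):
        entry = REGISTRY.get(type_name)
        while entry:
            if method in entry["methods"]:
                return entry["methods"][method](obj)
            entry = REGISTRY.get(entry["parent"])    # fall back up the hierarchy
        raise KeyError("no %r method for type %r" % (method, type_name))

    register("document", display=lambda o: print("plain view of", o))
    register("spreadsheet", parent="document", plot=lambda o: print("plotting", o))

    invoke("spreadsheet", "plot", "budget.dat")      # type-specific method
    invoke("spreadsheet", "display", "budget.dat")   # inherited from "document"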