
Frequently Asked Questions (and Answers) about Harvest



What is Harvest?

Harvest is an integrated set of tools to gather, extract, organize, search, cache, and replicate relevant information across the Internet. With modest effort users can tailor Harvest to digest information in many different formats, and offer custom search services on the Internet. Moreover, Harvest makes very efficient use of network traffic, remote servers, and disk space.


Where can I get more information about Harvest?

You can learn more about Harvest, and experiment with it, by starting from the Harvest Home Page (which includes a great deal of information and demos).

You can also retrieve the Harvest software.

A comprehensive User's Manual is also available.

We will provide technical assistance for sites setting up Harvest servers as discussed here.


On which platforms does Harvest run?

Anyone with a World Wide Web client (e.g., NCSA Mosaic) can access and use Harvest servers. World Wide Web clients are available for most platforms, including DOS, Windows, OS/2, Macintosh, and UNIX/X-Windows. Most of these clients will work over any high-speed modem (e.g., 9600 baud or better). The WWW group at CERN maintains a list of WWW clients.

If you want to run a Harvest server, you'll need a UNIX machine. Specifically, you'll need GNU cc 2.5.8 or later running on OSF/1 2.0 or 3.0, SunOS 4.1.x, or Solaris 2.3. You will also need Perl 4.0 or 5.0 and the GNU compression programs gzip and gunzip. The sources for GNU cc, GNU gzip, and Perl are available from the GNU FTP server.
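As a quick sanity check, you can verify these prerequisites from a shell (a minimal sketch; it only confirms that the tools are installed and reports their versions):

	% gcc -v	# should report GNU cc 2.5.8 or later
	% perl -v	# should report Perl 4.0 or 5.0
	% gzip -V	# should report GNU gzip (gunzip is part of the same package)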

At present we are concentrating our efforts on supporting OSF/1, SunOS 4.1, and Solaris 2.3. We may eventually support other operating systems (like HP-UX, Ultrix, etc.), but have no immediate plans to do so. Several outside groups are porting Harvest to other operating systems (e.g., FreeBSD), and we incorporate their changes (but cannot support them). To date this includes ports to AIX 3.2 using the AIX C compiler; HP-UX 09.03 using the HP ANSI C compiler A.09.69; Linux 1.1.59; and IRIX 5.3. If you port Harvest to a new system, please notify us via email.


What is the Harvest Server Registry (HSR)?

The Harvest Server Registry provides information on available Harvest servers (including instances of Gatherers, Brokers, Object Caches, and Replication Managers). You can register your Harvest server with the HSR via a forms interface.


Can my Broker run on a different machine than my HTTP server?

Yes. The Broker and httpd typically run on the same machine, but the Broker can run on a different machine than httpd. However, if you want users to be able to view the Broker's object files (the content summaries), then the Broker's files need to be accessible to httpd: you can NFS-mount those files on the httpd machine, or copy them over periodically, as sketched below.
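For example, either of the following would work (a sketch only; the host name and paths are hypothetical and depend on where your Broker actually keeps its files):

	# On the httpd machine: NFS-mount the Broker's directory ...
	% mount broker-host:/usr/local/harvest/brokers /usr/local/harvest/brokers

	# ... or copy it over periodically from cron (here, once an hour):
	0 * * * * rcp -r broker-host:/usr/local/harvest/brokers /usr/local/harvest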


My cache doesn't work with FTP URLs.

cached relies on the ftpget.pl program to retrieve FTP files and directories. Verify that ftpget.pl is in your path when you execute cached, or that cache_ftp_program is correctly set in your cached.conf file. You can verify that ftpget.pl works by running:

	% ftpget.pl - ftp.dec.com / I anonymous harvest@
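If that command works but the cache still fails, check where ftpget.pl lives and, if necessary, point cached at it explicitly (a sketch; the install path shown is hypothetical, and the cached.conf line assumes that file's simple ``variable value'' layout):

	% which ftpget.pl

	# in cached.conf (hypothetical path):
	cache_ftp_program /usr/local/harvest/lib/gatherer/ftpget.pl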


The Broker doesn't compile correctly.

The Broker uses the GNU bison and flex programs (or yacc and lex) to build the grammar for the Query Manager. If you have problems compiling the Broker, then verify that you have flex v2.4.7 and bison v1.22. You can get them here:


  ftp://ftp.gnu.ai.mit.edu/pub/gnu/bison-1.22.tar.gz
  ftp://ftp.gnu.ai.mit.edu/pub/gnu/flex-2.4.7.tar.gz
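If you need to build them yourself, the usual GNU build steps should suffice (a sketch; it assumes these releases ship with a configure script and that you have permission to install into the default prefix):

	% gzip -dc bison-1.22.tar.gz | tar xf -
	% cd bison-1.22
	% ./configure && make && make install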

Here's an example of what a compile problem might look like:

	Making all in broker
	bison -y -d query.y
	flex  query.l
	gcc  -I../common/include -I.  -target sun4 -c  lex.yy.c
	In file included from broker.h:44, from lex.yy.c:45:

	/usr/include/stdlib.h:27: conflicting types for `free'
	lex.yy.c:38: previous declaration of `free'
	/usr/include/stdlib.h:29: conflicting types for `malloc'
	lex.yy.c:37: previous declaration of `malloc'
	*** Error code 1
	make: Fatal error: Command failed for target `lex.yy.o'


I have a Gatherer running, now how do I run a Broker that uses it?

The easiest way to do this is to use the CreateBroker program. CreateBroker will ask for a collection point. Use the host and port on which your Gatherer is running for the collection point. The Broker will then collect the indexing information from your Gatherer.

See the User's Manual for more information.

More typically, though, you would use the RunHarvest command, which creates and runs both the Gatherer and the Broker in one step.
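If you use CreateBroker directly, the interaction is roughly as follows (a sketch; the host name and port are hypothetical - substitute the host and port on which your Gatherer is actually listening):

	% CreateBroker
	# when prompted for the collection point, give your Gatherer's
	# host and port, e.g.:  gatherer.example.com 8500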


I'm running Harvest on HP-UX, but the essence process in the Gatherer takes too much memory.

The regular expression library bundled with Harvest leaks memory on HP-UX, so you need to use the POSIX regular expression library supplied with HP-UX instead. Change the Makefile in src/gatherer/essence to read:

        REGEX_DEFINE    = -DUSE_POSIX_REGEX
        REGEX_INCLUDE   = 
        REGEX_OBJ       = 
        REGEX_TYPE      = posix
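After changing the Makefile, rebuild essence so the Gatherer picks up the POSIX library (a sketch; run it from the top of the Harvest source tree and reinstall the way you originally installed Harvest):

	% cd src/gatherer/essence
	% make clean
	% make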


My Broker (version 1.1) keeps returning 0 results.

There is a bug in glimpseserver (version 1.1) that will cause it to always return 0 results. This bug arises only when your Broker has expired objects and a full indexing has not occurred recently. So, if your broker.out file contains entries where every query returns 0 objects, like the following:

 broker: 950221 15:33:47: Processing Query 862: #USER #desc #index maxresult 100 #index case sensitive #END "texas instruments"

 broker: 950221 15:33:58: Query returned 0 objects.
 broker: 950221 15:33:58: Processing Query 863: #USER #opaque #desc #index error 0 #index maxresult 50 #index case insensitive #index matchword #END Type:Font
 broker: 950221 15:34:00: Query returned 0 objects.
 broker: 950221 15:34:00: Processing Query 864: #USER #opaque #desc #index error 0 #index maxresult 50 #index case insensitive #index matchword #END nvram
 broker: 950221 15:34:02: Query returned 0 objects.

You have a few options to fix this problem:

  1. Disable the glimpseserver by setting the GlimpseServer-Port variable to 0 (zero) in your broker.conf file (see the sketch after this list); OR
  2. Perform an Index corpus operation via the Broker's administrative interface; OR
  3. Reduce the Broker cleaning rate by increasing the value of the Clean-Rate variable in your broker.conf; OR
  4. Apply the following patch to the Harvest v1.1 source distribution, then rebuild and install glimpseserver which is located in the src/broker/indexing/glimpse directory.
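For option 1, the change amounts to a single line in broker.conf, roughly as follows (a sketch; it assumes broker.conf uses ``Attribute: value'' lines - edit the existing GlimpseServer-Port line rather than adding a second one):

	GlimpseServer-Port:	0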


How does Harvest compare to related efforts such as WAIS, Archie, Veronica, GILS, WHOIS++, Web robots, etc.?

  1. Harvest provides a more scalable architecture than any of these systems, in terms of network bandwidth, server load, and disk space. In comparison with previous systems, our measurements indicate that Harvest can reduce FTP/HTTP/Gopher server load by a factor of 4 while extracting indexing information, and by a factor of 6,600 while delivering this information to remote indexers; network traffic by a factor of 59; and index space requirements by a factor of 43. More details are available here.

  2. Harvest defines a structured indexing format called the Summary Object Interchange Format (SOIF) that permits structured queries (e.g., matching keywords only against author or title lines in documents); a sample SOIF record appears after this list. SOIF is more powerful than the format of the IETF Internet Anonymous FTP Archives (IAFA) Working Group: SOIF permits streams of object summaries (IAFA templates hold only individual items), which in turn permit a very efficient Broker/Gatherer stream retrieval protocol, and it allows arbitrary data within the fields. SOIF's support for arbitrary data means it can be used for more complex search applications, such as image and audio searching.

  3. Harvest provides an automated means of populating indexes with structured, high-quality indexing information. WHOIS++ depends on site administrators manually filling out IAFA templates to be indexed, while GILS does not define how index data are collected. Harvest can be used to provide the indexing data for these systems.

  4. FreeWAIS supports only AND/OR queries against potentially structured fields. Harvest uses Glimpse as the default engine, which supports AND/OR queries, approximate searches, regular expressions, case insensitive (or sensitive) searches, the ability to match parts of words, whole words, or multiple word phrases, variable granularity result sets, and other features.

  5. Harvest provides a flexible index/search interface that lets you plug in many search engines, including original (Thinking Machines) WAIS, Commercial WAIS, freeWAIS, Glimpse, and Nebula. Therefore, Harvest can allow users to benefit from the strengths of each engine (and can help prevent users from getting "locked" into one engine). We are currently working with several commercial search and retrieval vendors to integrate support for their engines into Harvest. Doing so involves writing eleven "C" routines, most of which perform simple bookkeeping operations.

  6. FreeWAIS and other indexers support only particular indexing arrangements (full text in the case of WAIS, anchors + HTML pointers in the case of many of the Web robots). These systems deal with particular data formats using a set of hard-coded content extractors - for example, a particular piece of "C" code to extract content from bibtex. In contrast, Harvest allows users to customize what information is extracted and indexed, often using standard UNIX programs (like sed) or easily written regular expressions. (freeWAIS-sf also uses a mechanism that avoids hard-coded C content extractors.) Plus, Harvest provides better default summarizers in many cases (e.g., our PostScript extractor performs better than most because it treats PostScript differently depending on whether it was generated by troff, TeX, WordPerfect, etc.). Note that Harvest does support full text indexing for users who need that functionality, although our measurements indicate that Harvest's customized indexing achieves precision and recall comparable to that of WAIS, at only 3-11% of the space requirements.

  7. Overall, the expressive power of Harvest subsumes the other existing distributed indexing systems. As a brief indication, the following list indicates how one can use Harvest to implement some of the other well-known indexing systems:

    Archie, Veronica, WWWW, etc.: Gatherer configuration + Essence extraction script.
    Content Router: WAIS enumerator + Essence extraction script.
    WAIS: Essence full text extraction + ranking search engine.
    WebCrawler: Gatherer configuration + Essence full text extraction.
    WHOIS++: Gatherer configuration + Essence extraction script for centroids; query front-end to search SOIF records.

    There are several efforts in progress to re-implement some of the above systems on top of Harvest, to provide better performance and simpler implementations than the originals.
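For reference (item 2 above), a SOIF record is a simple attribute-value template, and many such records can be streamed one after another. A minimal sketch looks like the following, where the URL, attribute names, and values are purely illustrative and the numbers in braces give the byte length of each value:

	@FILE { http://www.example.com/paper.ps
	Title{23}:	An Example SOIF Summary
	Author{8}:	J. Smith
	Type{10}:	PostScript
	}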

The main downside to Harvest is that it is more complex - more software, and more choices about how to distribute things, what to customize, and so on. However, you can make basic use of Harvest without understanding the complexities by following the installation instructions in ftp://ftp.cs.colorado.edu/pub/distribs/harvest/INSTRUCTIONS.html. These instructions walk you through the steps needed to install and make basic use of Harvest. A site with httpd already installed should be able to get Harvest installed and running in about 30 minutes, using one of the binary distributions of the software.


What future plans do you have for Harvest?

Our near-term plans include a number of directions.

Over the longer term, we will be taking a number of steps to allow Harvest to be used in much more powerful ways than simply as an indexing engine for informal Internet information. Our plan is to transform Harvest into an architecture that permits network-efficient interoperation among many different "component technologies", of both a free and a commercial nature. For example, Harvest might be used to gather highly structured data, store the contents into a number of different types of database systems, search and access the data through external data processing systems (e.g., to handle geo-spatial queries or audio data searches), and build authenticated services using commercial encryption technologies. Increasingly our goal will be refining the architecture and integrating support for important technologies. Note that we will continue to support the current free/informal information tools as we move to support the more powerful mechanisms outlined above.

We would be interested in any thoughts that Harvest users have about the kinds of component technologies they would find most important. Please give us some brief background about how you would use the technologies and the size of the user base you'd expect to support.


What about resolving hostnames on SunOS?

DNS

In order to gather data from hosts outside of your organization, your system must be able to resolve fully qualified domain names into IP addresses. If your system cannot resolve hostnames, you will see error messages such as ``Unknown Host.'' In that case, check that your system is configured for DNS, as described below.

To verify that your system is configured for DNS, make sure that the file /etc/resolv.conf exists and is readable. Read the resolv.conf(5) manual page for information on this file. You can verify that DNS is working with the nslookup(8) command.
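For example (a sketch; the domain and name server address shown in resolv.conf are placeholders for your site's actual values):

	% cat /etc/resolv.conf
	domain example.edu
	nameserver 192.0.2.1

	% nslookup ftp.cs.colorado.edu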

The Harvest executables for SunOS (4.1.3_U1) are statically linked with the stock resolver library from /usr/lib/libresolv.a. If you seem to have problems with the statically linked executables, please try to compile Harvest from the source code. This will make use of your local libraries which may have been modified for your particular organization.

NIS / Yellow Pages

Some sites may use NIS instead of, or in addition to, DNS. We believe that Harvest works on systems where NIS has been properly configured. The NIS servers (ypwhich(1)) must be configured to query DNS servers for hostnames they do not know about. See the -b option in ypxfr(8).

We would welcome reports of Harvest successfully working with NIS. Please write to: harvest-dvl@cs.colorado.edu.

Firewalls

Harvest currently will not operate across a strict Internet firewall, because Harvest cannot (yet) request objects through a proxy server. You can use Harvest internally, behind a firewall, or you can run it on the firewall host itself.

If you see the ``Host is unreachable'' message, the likely problem is that there is no route from your host to the remote host - for example, a router or firewall between them is blocking the connection.

If you see the ``Connection refused'' message, the likely problem is that no server is listening on the port the Gatherer tried to contact - for example, the remote daemon is down or the URL specifies the wrong port.

Reporting Problems

The Harvest gatherer is essentially a WWW client. You should expect it to work the same as Mosaic, but without proxy support. We would be very interested to hear about problems with Harvest and hostnames under the following condition: the hostname resolves correctly for other WWW clients (such as Mosaic) on the same machine, but not for the Harvest gatherer.


Last-Update: $Date: 1995/02/22 21:33:34 $