Harvest is an integrated set of tools to gather, extract, organize, search, cache, and replicate relevant information across the Internet. With modest effort users can tailor Harvest to digest information in many different formats, and offer custom search services on the Internet. Moreover, Harvest makes very efficient use of network traffic, remote servers, and disk space.
You can learn more about and experiment with Harvest starting with the Harvest Home Page (which includes extensive information and demos).
You can also retrieve the Harvest software.
A comprehensive User's Manual is also available.
We will provide technical assistance for sites setting up Harvest servers as discussed here.
Anyone with a World Wide Web client (e.g., NCSA Mosaic) can access and use Harvest servers. World Wide Web clients are available for most platforms, including DOS, Windows, OS/2, Macintosh, and UNIX/X-Windows. Most of these clients will work over any high-speed modem (e.g., 9600 baud or better). The WWW group at CERN maintains a list of WWW clients.
If you want to run a Harvest server, you'll need a UNIX machine. Specifically, you'll need GNU cc 2.5.8 or later running on OSF/1 2.0 or 3.0, SunOS 4.1.x, or Solaris 2.3. You will also need Perl 4.0 or 5.0 and the GNU compression programs gzip and gunzip. The sources for GNU cc, GNU gzip, and Perl are available at the GNU FTP server.
At present we are concentrating our efforts on supporting OSF/1, SunOS 4.1, and Solaris 2.3. We may eventually support other operating systems (like HP-UX, Ultrix, etc.), but have no immediate plans to do so. Several outside groups are porting Harvest to other operating systems (e.g., FreeBSD), and we incorporate their changes (but cannot support these changes). To date this includes ports to AIX 3.2 using the AIX C compiler; HP-UX 09.03 using the HP ANSI C compiler A.09.69; Linux 1.1.59; and IRIX 5.3. If you port Harvest to a new system, please notify us via email.
The Harvest Server Registry provides information on available Harvest servers (including instances of Gatherers, Brokers, Object Caches, and Replication Managers). You can register your Harvest server with the HSR via a forms interface.
Yes. Typically, the Broker and httpd run on the same machine, but the Broker can run on a different machine than httpd. If you want users to be able to view the Broker's object files (the content summaries), however, then the Broker's files will need to be accessible to httpd. You can NFS-mount those files, or copy them over manually on a periodic basis.
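As a sketch of the periodic-copy option (all directory paths here are hypothetical; substitute your Broker's object directory and your httpd document tree):

```shell
# Hypothetical paths -- adjust for your site.
BROKER_OBJECTS=/tmp/broker/objects      # where the Broker writes content summaries
HTTPD_DOCS=/tmp/httpd/docs/broker       # a directory served by httpd

mkdir -p "$BROKER_OBJECTS" "$HTTPD_DOCS"
echo "sample content summary" > "$BROKER_OBJECTS/obj-1.gs"

# Copy the object files into the httpd tree; run this from cron
# (e.g., hourly) if you choose periodic copying over an NFS mount.
cp -p "$BROKER_OBJECTS"/* "$HTTPD_DOCS"/
```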
cached relies on the ftpget.pl program to retrieve FTP files and directories. Verify that ftpget.pl is in your path when you execute cached, or that cache_ftp_program is correctly set in your cached.conf file. You can verify that ftpget.pl works by running:
% ftpget.pl - ftp.dec.com / I anonymous harvest@
The Broker uses the GNU bison and flex programs (or yacc and lex) to build the grammar for the Query Manager. If you have problems compiling the Broker, then verify that you have flex v2.4.7 and bison v1.22. You can get them here:
ftp://ftp.gnu.ai.mit.edu/pub/gnu/bison-1.22.tar.gz
ftp://ftp.gnu.ai.mit.edu/pub/gnu/flex-2.4.7.tar.gz
Here's an example of what a compile problem might look like:
Making all in broker
bison -y -d query.y
flex query.l
gcc -I../common/include -I. -target sun4 -c lex.yy.c
In file included from broker.h:44,
                 from lex.yy.c:45:
/usr/include/stdlib.h:27: conflicting types for `free'
lex.yy.c:38: previous declaration of `free'
/usr/include/stdlib.h:29: conflicting types for `malloc'
lex.yy.c:37: previous declaration of `malloc'
*** Error code 1
make: Fatal error: Command failed for target `lex.yy.o'
The easiest way to do this is to use the CreateBroker program. CreateBroker will ask for a collection point; use the host and port on which your Gatherer is running. The Broker will then collect the indexing information from your Gatherer. See the User's Manual for more information.
Typically, you should use the RunHarvest command to create and run both the Broker and the Gatherer.
The essence process in the Gatherer takes too much memory. The supplied regular expression library has memory leaks on HP-UX, so you need to use the regular expression library supplied with HP-UX. Change the Makefile in src/gatherer/essence to read:
REGEX_DEFINE = -DUSE_POSIX_REGEX
REGEX_INCLUDE =
REGEX_OBJ =
REGEX_TYPE = posix
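One way to apply this edit mechanically is with sed (a sketch; the sample Makefile fragment below uses hypothetical original values purely for illustration):

```shell
# Create a test Makefile fragment resembling the distributed one
# (the original REGEX_* values here are hypothetical).
cat > Makefile.test <<'EOF'
REGEX_DEFINE = -DUSE_GNU_REGEX
REGEX_INCLUDE = -I../regex
REGEX_OBJ = ../regex/regex.o
REGEX_TYPE = gnu
EOF

# Rewrite the four REGEX_* settings to the POSIX (HP-UX) values.
sed -e 's/^REGEX_DEFINE *=.*/REGEX_DEFINE = -DUSE_POSIX_REGEX/' \
    -e 's/^REGEX_INCLUDE *=.*/REGEX_INCLUDE =/' \
    -e 's/^REGEX_OBJ *=.*/REGEX_OBJ =/' \
    -e 's/^REGEX_TYPE *=.*/REGEX_TYPE = posix/' \
    Makefile.test > Makefile.new
```

After running it, Makefile.new contains the four settings shown above; inspect it and then move it into place over the original.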
There is a bug in glimpseserver (version 1.1) that will cause it to always return 0 results. This bug only arises when your Broker expires objects, and a full indexing has not occurred recently. So, if your broker.out file contains entries where every query results in 0 objects, like the following:
broker: 950221 15:33:47: Processing Query 862: #USER #desc #index maxresult 100 #index case sensitive #END "texas instruments"
broker: 950221 15:33:58: Query returned 0 objects.
broker: 950221 15:33:58: Processing Query 863: #USER #opaque #desc #index error 0 #index maxresult 50 #index case insensitive #index matchword #END Type:Font
broker: 950221 15:34:00: Query returned 0 objects.
broker: 950221 15:34:00: Processing Query 864: #USER #opaque #desc #index error 0 #index maxresult 50 #index case insensitive #index matchword #END nvram
broker: 950221 15:34:02: Query returned 0 objects.
You have a few options for fixing this problem; each involves the glimpseserver program, which is located in the src/broker/indexing/glimpse directory.
There are several efforts in progress to re-implement some of the above systems on top of Harvest, providing better performance and simpler implementations than the originals.
The main downside to Harvest is that it is more complex: more software, and more choices about how to distribute things, what to customize, and so on. However, you can make basic use of Harvest without understanding these complexities by following the installation instructions in ftp://ftp.cs.colorado.edu/pub/distribs/harvest/INSTRUCTIONS.html. These instructions will walk you through the steps needed to install and make basic use of Harvest. A site with httpd already installed should be able to get Harvest installed and running in about 30 minutes, using one of the binary distributions of the software.
Our near-term plans include a number of directions:
Over the longer term, we will be taking a number of steps to allow Harvest to be used in much more powerful ways than simply as an indexing engine for informal Internet information. Our plan is to transform Harvest into an architecture that permits network-efficient interoperation among many different "component technologies", of both a free and a commercial nature. For example, Harvest might be used to gather highly structured data, store the contents into a number of different types of database systems, search and access the data through external data processing systems (e.g., to handle geo-spatial queries or audio data searches), and build authenticated services using commercial encryption technologies. Increasingly our goal will be refining the architecture and integrating support for important technologies. Note that we will continue to support the current free/informal information tools as we move to support the more powerful mechanisms outlined above.
We would be interested in any thoughts that Harvest users have about the kinds of component technologies they would find most important. Please give us some brief background about how you would use the technologies and the size of user base you'd expect to support.
In order to gather data from hosts outside of your organization, your system must be able to resolve fully qualified domain names into IP addresses. If your system cannot resolve hostnames, you will see error messages such as ``Unknown Host.'' In this case, either
To verify that your system is configured for DNS, make sure that the file /etc/resolv.conf exists and is readable. Read the resolv.conf(5) manual page for information on this file. You can verify that DNS is working with the nslookup(8) command.
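For example, a quick check along these lines (a sketch; the hostname in the trailing comment is only illustrative):

```shell
# Check whether the resolver configuration file exists and is readable.
resolv=/etc/resolv.conf
if [ -r "$resolv" ]; then
    dns_config="present"
else
    dns_config="absent"
fi
echo "resolv.conf: $dns_config"

# If present, confirm that lookups actually succeed, e.g.:
#   % nslookup ftp.cs.colorado.edu
```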
The Harvest executables for SunOS (4.1.3_U1) are statically linked with the stock resolver library from /usr/lib/libresolv.a. If you seem to have problems with the statically linked executables, please try to compile Harvest from the source code. This will make use of your local libraries which may have been modified for your particular organization.
Some sites may use NIS instead of, or in addition to, DNS. We believe that Harvest works on systems where NIS has been properly configured. The NIS servers (ypwhich(1)) must be configured to query DNS servers for hostnames they do not know about. See the -b option in ypxfr(8).
We would welcome reports of Harvest successfully working with NIS. Please write to: harvest-dvl@cs.colorado.edu.
Harvest currently will not operate across a strict Internet firewall, because it cannot (yet) request objects through a proxy server. You may use Harvest internally behind a firewall, or run it on the firewall host itself.
If you see the ``Host is unreachable'' message, these are the likely problems:
If you see the ``Connection refused'' message, this is the likely problem:
The Harvest gatherer is essentially a WWW client. You should expect it to work the same as Mosaic, but without proxy support. We would be very interested to hear about problems with Harvest and hostnames under the following condition:
Last-Update: $Date: 1995/02/22 21:33:34 $