If you know of any that aren't on this list, please let me know.
If you're just looking for search engines, you might try CUSI.
Run by Jonathon Fletcher <J.Fletcher@stirling.ac.uk>.
Version I has been in development since September 1993 and has run on several occasions; the last run was between the 8th and the 21st of February.
More information, including access to a searchable database with titles, can be found on The JumpStation.
Identification: It runs from pentland.stir.ac.uk, has "JumpStation" in the User-agent field, and sets the From field.
Version II is under development.
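Most robots on this list identify themselves through the HTTP User-Agent and From request fields, as above. As a rough sketch of what that amounts to (the robot name and contact address below are invented placeholders, not JumpStation's), a robot written in Python might send:

    # Minimal sketch of a robot identifying itself via HTTP request headers.
    # The robot name and contact address are placeholders, not a real robot's.
    import urllib.request

    def fetch(url):
        request = urllib.request.Request(url, headers={
            "User-Agent": "ExampleRobot/0.1",   # name and version of the robot
            "From": "operator@example.org",     # contact address of the operator
        })
        with urllib.request.urlopen(request) as response:
            return response.read()

    page = fetch("http://example.org/")

A server operator can then recognise the robot in the access logs and contact its operator if it misbehaves.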
Run by Dr. David Eichmann <eichmann@rbse.jsc.nasa.gov>. For more information see the Repository Based Software Engineering Project.
Consists of two parts:
Identification: It runs from rbse.jsc.nasa.gov (192.88.42.10), requests "GET /path RBSE-Spider/0.1", and uses RBSE-Spider/0.1a in the User-Agent field.
Seems to retrieve documents more than once.
Identification: It runs from webcrawler.cs.washington.edu, and uses WebCrawler/0.00000001 in the HTTP User-agent field.
It does a breadth-first walk, and indexes content as well as URLs. For more information see the description.
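As a rough illustration of such a breadth-first walk (a sketch only, not WebCrawler's actual code), a crawler can keep a queue of unvisited URLs, index each page's title as it is fetched, and append newly discovered links to the tail of the queue:

    # Sketch of a breadth-first walk that indexes page titles along with URLs.
    # The 100-document limit is an arbitrary choice for the example.
    import re
    from collections import deque
    from urllib.parse import urljoin
    from urllib.request import urlopen

    def crawl(start_url, limit=100):
        queue = deque([start_url])
        seen = {start_url}
        index = {}                                  # URL -> document title
        while queue and len(index) < limit:
            url = queue.popleft()                   # oldest URL first: breadth-first
            try:
                html = urlopen(url).read().decode("latin-1", "replace")
            except OSError:
                continue
            match = re.search(r"<title>(.*?)</title>", html, re.I | re.S)
            index[url] = match.group(1).strip() if match else url
            for href in re.findall(r'href="([^"]+)"', html, re.I):
                link = urljoin(url, href)
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
        return index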
More information, including a search interface, is available on the NorthStar Database. Recent runs (26 April) will concentrate on textual analysis of the Web versus GopherSpace (from the Veronica data), as well as on indexing.
Run from frognot.utdallas.edu, possibly other sites in utdallas.edu, and from cnidir.org.
Now uses HTTP From fields, and sets User-agent to NorthStar.
Run initially in June 1993, its aim is to measure the growth of the web. See details and the list of servers.
User-agent: WWWWanderer v3.0 by Matthew Gray <mkgray@mit.edu>
It is a spider built into Mosaic. There is some documentation online.
Identification: Modifies the HTTP User-agent field. (Awaiting details)
Written in Python. See the overview.
Its aim is to check validity of Web servers. I'm not sure if it has ever been run remotely.
Its aim is to assist maintenance of distributed infostructures (HTML webs). It has its own page.
A mirroring robot. Configured to stay within a directory, sleeps between requests, and the next version will use HEAD to check if the entire document needs to be retrieved.
Identification: Uses User-Agent: HTMLgobble v2.2, and sets the From field. Usually run by the author, from tp70.rz.uni-karlsruhe.de.
The source is now available (but unmaintained).
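As a sketch of the HEAD-based check described for the next version (my illustration, not HTMLgobble's code), a mirroring robot can compare the server's Last-Modified date against the timestamp of its local copy, skip the full retrieval when nothing has changed, and sleep between the requests it does make:

    # Sketch of a mirroring step: HEAD first, full GET only if the document is
    # newer than the local copy. The 5-second delay is an arbitrary example.
    import os
    import time
    import urllib.request
    from email.utils import parsedate_to_datetime

    def mirror(url, local_path, delay=5):
        head = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(head) as response:
            last_modified = response.headers.get("Last-Modified")
        if last_modified and os.path.exists(local_path):
            remote_time = parsedate_to_datetime(last_modified).timestamp()
            if remote_time <= os.path.getmtime(local_path):
                return                              # local copy is still current
        time.sleep(delay)                           # sleep between requests
        with urllib.request.urlopen(url) as response, open(local_path, "wb") as f:
            f.write(response.read())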
Another indexing robot, for which more information is available. Actually has quite flexible search options.
Awaiting identification information (run from piper.cs.colorado.edu?).
It has its own page. Supposed to be compliant with the proposed standard for robot exclusion.
Identification: run from hp20.lri.fr, User-Agent W3M2/0.02, and the From field is set.
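For robots that claim compliance with the proposed standard for robot exclusion, the check boils down to fetching the server's /robots.txt and consulting it before each request. A minimal sketch (the robot name is a placeholder, and a real robot would cache the rules per host):

    # Minimal sketch of a robot exclusion check using Python's standard parser.
    from urllib.parse import urlsplit, urlunsplit
    from urllib.robotparser import RobotFileParser

    def allowed(url, agent="ExampleRobot"):
        parts = urlsplit(url)
        robots_url = urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))
        rules = RobotFileParser(robots_url)
        rules.read()                                # fetch and parse /robots.txt
        return rules.can_fetch(agent, url)

    if allowed("http://example.org/docs/page.html"):
        print("this URL may be fetched")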
A WWW mirror designed for off-line browsing of sections of the web.
Identification: run from ruddles.london.sco.com.
First spotted in mid-February 1994.
Identification: It runs from phoenix.doc.ic.ac.uk. Further information is unavailable.
This is a research program in providing information retrieval and discovery in the WWW, using a finite memory model of the web to guide intelligent, directed searches for specific information needs.
You can search the Lycos database of WWW documents, which currently has information about 390,000 documents in 87 megabytes of summaries and pointers.
More information is available on its home page.
Identification: User-agent "Lycos/x.x", run from fuzine.mt.cs.cmu.edu. Lycos also complies with the latest robot exclusion standard.
Currently under construction, this spider is a CGI script that searches the web for keywords given by the user through a form.
Identification: User-Agent: "ASpider/0.09", with a From field "fredj@nova.pvv.unit.no".
Run since 27 June 1994, for an internal XEROX research project, with some information being made available on SG-Scout's home page.
Does a "server-oriented" breadth-first search in a round-robin fashion, with multiple processes.
Identification: User-Agent: "SG-Scout", with a From field set to the operator. Complies with standard Robot Exclusion. Run from beta.xerox.com.
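One way such a server-oriented round-robin can be organised (a sketch under my own assumptions, not SG-Scout's actual design) is to keep a separate queue of pending URLs per server and let the walker take one URL from each server's queue in turn:

    # Sketch of round-robin scheduling over per-server queues.
    from collections import deque
    from urllib.parse import urlsplit

    queues = {}                                     # host -> queue of pending URLs

    def enqueue(url):
        host = urlsplit(url).hostname
        queues.setdefault(host, deque()).append(url)

    def round_robin():
        while any(queues.values()):
            for host in list(queues):
                if queues[host]:
                    yield queues[host].popleft()    # one URL per server per cycle

    enqueue("http://example.org/")
    enqueue("http://example.com/")
    for url in round_robin():
        print(url)                                  # a worker process would fetch it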
Announced on 12 July 1994, see their page.
Combination of an HTML form and a CGI script that verifies links from a given starting point (with some controls to prevent it going off-site or limitless).
Seems to run at full speed...
Identification: version 0.1 sets no User-Agent or From field. From version 0.2 up the User-Agent is set to "EIT-Link-Verifier-Robot/0.2". Can be run by anyone from anywhere.
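The basic verification step behind such a tool can be sketched as follows (an illustration, not the EIT script itself): fetch the starting page, extract its links, skip anything that leaves the starting host, and report links whose requests fail:

    # Sketch of link verification restricted to the starting host.
    import re
    from urllib.parse import urljoin, urlsplit
    from urllib.request import Request, urlopen

    def verify_links(start_url):
        host = urlsplit(start_url).hostname
        html = urlopen(start_url).read().decode("latin-1", "replace")
        for href in re.findall(r'href="([^"]+)"', html, re.I):
            link = urljoin(start_url, href)
            if urlsplit(link).hostname != host:
                continue                            # do not go off-site
            try:
                urlopen(Request(link, method="HEAD"))
            except OSError as error:
                print("broken link:", link, error)

    verify_links("http://example.org/index.html")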
Owned/Maintained by Bob Olson <olson@mcs.anl.gov>
This robot is gathering data to do a full-text glimpse and provide a Web interface for it. The index, and further information, will appear on ANL's server.
Identification: sets User-agent to "ANL/MCS/SIGGRAPH/VROOM Walker", and From to "olson.anl.gov".
Now follows the exclusion protocol, and doesn't perform rapid fire searches.
It is a tool called 'WebLinker' which traverses a section of web, doing URN->URL conversion. It will be used as a post-processing tool on documents created by automatic converters such as LaTeX2HTML or WebMaker. More information is on its home page.
At the moment it works at full speed, but is restricted to local sites. External GETs will be added, but these will be running slowly.
WebLinker is meant to be run locally, so if you see it elsewhere let the author know!
Identification: User-agent is set to 'WebLinker/0.0 libwww-perl/0.1'.
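In principle the URN-to-URL step can be as simple as substituting known URNs in the generated HTML with their current URLs; the sketch below is my own illustration with an invented mapping, not WebLinker's libwww-perl implementation:

    # Sketch of URN -> URL substitution in a generated HTML document.
    import re

    urn_to_url = {
        "urn:example:report-1994-07": "http://example.org/reports/1994-07.html",
    }

    def resolve_urns(html):
        def replace(match):
            return urn_to_url.get(match.group(0), match.group(0))
        return re.sub(r"urn:[A-Za-z0-9.:-]+", replace, html)

    print(resolve_urns('<a href="urn:example:report-1994-07">the report</a>'))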
This is part of the w3 browser mode for Emacs, and half implements a client-side search for use in batch processing. There is no interactive access to it.
For more info see the Searching section in the Emacs-w3 User's Manual.
I don't know if this is ever actually used by anyone...
The purpose of this run (undertaken by HaL Software) was to collect approximately 10k HTML documents for testing automatic abstract generation. This program will honor the robot exclusion standard and wait 1 minute between requests to a given server.
Identification: Sets User-agent to 'Arachnophilia', runs from halsoft.com.
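The one-minute wait per server amounts to remembering when each server was last contacted and sleeping for the remainder of the interval; a small sketch (not Arachnophilia's code):

    # Sketch of a per-server politeness delay.
    import time
    from urllib.parse import urlsplit

    last_contact = {}                               # host -> time of last request

    def wait_for_turn(url, delay=60):
        host = urlsplit(url).hostname
        elapsed = time.time() - last_contact.get(host, 0.0)
        if elapsed < delay:
            time.sleep(delay - elapsed)             # wait out the rest of the minute
        last_contact[host] = time.time()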
This is a French Keyword-searching robot for the Mac, written in HyperCard. The author has decided not to release this robot to the public.
Awaiting identification details.
A URL checking robot, which stays within one step of the local server, see further information.
Awaiting identification details.
Sets User-Agent to "tarspider <version>", and From to "chakl@fu-berlin.de".
This robot, in Perl V4, commenced operation in August 1994 and is being used to generate an index called MathSearch of documents on Web sites connected with mathematics and statistics. It ignores off-site links, so does not stray from a list of servers specified initially.
Identification: The current version sets User-Agent to Peregrinator-Mathematics/0.7. It also sets the From field.
The robot follows the exclusion standard, and accesses any given server no more often than once every several minutes.
A description of the robot is available.
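Confining a walk to an initial list of servers, as Peregrinator is described as doing, is essentially a host check on every discovered link; a minimal sketch with placeholder host names:

    # Sketch of restricting a walk to a fixed list of servers.
    from urllib.parse import urlsplit

    allowed_hosts = {"maths.example.edu", "stats.example.edu"}   # placeholders

    def on_allowed_server(url):
        return urlsplit(url).hostname in allowed_hosts

    print(on_allowed_server("http://maths.example.edu/papers/index.html"))  # True
    print(on_allowed_server("http://elsewhere.example.org/"))               # False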
Another validation robot.
Sets User-agent to 'checkbot.pl/x.x libwww-perl/x.x' and sets the From field.
A world-wide-web maintenance robot.
Sets User-agent and the From field.
A Resource Discovery Robot, part of the Harvest Project.
Runs from bruno.cs.colorado.edu, sets the User-agent and From fields, and pauses 1 second between requests (by default).
Note that Harvest's motivation is to index community- or topic- specific collections, rather than to locate and index all HTML objects that can be found. Also, Harvest allows users to control the enumeration several ways, including stop lists and depth and count limits. Therefore, Harvest provides a much more controlled way of indexing the Web than is typical of robots.
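As a rough illustration of those controls (not Harvest's actual configuration mechanism), an enumeration can be bounded by a stop list, a depth limit and a count limit:

    # Sketch of a bounded enumeration: stop list, depth limit, count limit.
    # The get_links argument stands in for whatever extracts links from a page.
    from collections import deque

    def enumerate_urls(start_url, get_links, stop_list=(), max_depth=2, max_count=100):
        queue = deque([(start_url, 0)])
        seen = {start_url}
        collected = []
        while queue and len(collected) < max_count:
            url, depth = queue.popleft()
            if any(word in url for word in stop_list):
                continue                            # stop list: skip unwanted URLs
            collected.append(url)
            if depth < max_depth:                   # depth limit
                for link in get_links(url):
                    if link not in seen:
                        seen.add(link)
                        queue.append((link, depth + 1))
        return collected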
I've written a robot to retrieve specific WWW pages: a Mac WWW robot that periodically (typically once a day) walks through the global history file produced by some browsers (Netscape and NCSA Mosaic, for example), checking for documents that have changed since they were last visited. See its information page.
The robot is called Katipo and identifies itself with User-Agent: Katipo/1.0 and From: Michael.Newbery@vuw.ac.nz. It emits _only_ HEAD queries, and prefers to work through proxy (caching) servers.
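The change check it performs can be sketched roughly like this (my illustration, not Katipo's Mac implementation): issue a HEAD request for each remembered URL and compare the Last-Modified header with the value recorded on the previous visit:

    # Sketch of a HEAD-only change check against a recorded Last-Modified value.
    import urllib.request

    def changed_since_last_visit(url, recorded_last_modified):
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request) as response:
            current = response.headers.get("Last-Modified")
        return current is not None and current != recorded_last_modified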
Its purpose is to collect information to use in a "WWW Pages" database in InfoSeek's information retrieval service (for more information on InfoSeek, please send a blank e-mail to info@infoseek.com).
The robot follows all the guidelines listed in "Guidelines for Robot Writers", and we try to run it during off hours.
We will be updating the WWW database daily with new pages and re-load from scratch no more frequently than once per month (probably even longer). Most sites won't get more than 20 requests a month from us since there are only about 100,000 pages in the database.
A robot written in REXX, which downloads a hierarchy of documents with a breadth-first search.
Example usage:
geturl http://info.cern.ch/ -recursive -host info.cern.ch -path /hypertext/#? would restrict the search to the specified host and path.
Source and documentation are available.
The User-Agent field is set to 'GetUrl.rexx v1.0 by burton@cs.latrobe.edu.au'.
Sets User-agent to 'OMW/0.1 libwww/217'
Follows robot exclusion rules, and shouldn't visit any host more than once in 5 minutes.
The TkWWW Robot is described in a paper presented at the WWW94 Conference in Chicago. It is designed to search Web neighborhoods to find pages that may be logically related. The Robot returns a list of links that looks like a hot list. The search can be by key word or all links at a distance of one or two hops may be returned.
For more information see The TkWWW Home Page.