If you know of any that aren't on this list, please let me know.
If you're just looking for search engines, you might try CUSI.
Run by Jonathon Fletcher <J.Fletcher@stirling.ac.uk>.
Version I has been in development since September 1993 and has run on several occasions; the last run was between the 8th and the 21st of February.
More information, including access to a searchable database with titles, can be found on The JumpStation.
Identification: It runs from pentland.stir.ac.uk, has "JumpStation" in the User-agent field, and sets the From field.
Version II is under development.
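Most robots on this list identify themselves through the HTTP User-Agent and From request fields, as above. As a rough sketch of what that amounts to (the robot name and contact address below are invented placeholders, not JumpStation's), a robot written in Python might send:

    # Minimal sketch of a robot identifying itself via HTTP request headers.
    # The robot name and contact address are placeholders, not a real robot's.
    import urllib.request

    def fetch(url):
        request = urllib.request.Request(url, headers={
            "User-Agent": "ExampleRobot/0.1",   # name and version of the robot
            "From": "operator@example.org",     # contact address of the operator
        })
        with urllib.request.urlopen(request) as response:
            return response.read()

    page = fetch("http://example.org/")

A server operator can then recognise the robot in the access logs and contact its operator if it misbehaves.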
Run by Dr. David Eichmann <eichmann@rbse.jsc.nasa.gov>. For more information see the Repository Based Software Engineering Project.
Consists of two parts:
Identification: It runs from rbse.jsc.nasa.gov (192.88.42.10), requests "GET /path RBSE-Spider/0.1", and uses RBSE-Spider/0.1a in the User-Agent field.
Seems to retrieve documents more than once.
Identification: It runs from webcrawler.cs.washington.edu, and uses WebCrawler/0.00000001 in the HTTP User-agent field.
It does a breadth-first walk, and indexes content as well as URLs. For more information see the description.
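As a rough illustration of such a breadth-first walk (a sketch only, not WebCrawler's actual code), a crawler can keep a queue of unvisited URLs, index each page's title as it is fetched, and append newly discovered links to the tail of the queue:

    # Sketch of a breadth-first walk that indexes page titles along with URLs.
    # The 100-document limit is an arbitrary choice for the example.
    import re
    from collections import deque
    from urllib.parse import urljoin
    from urllib.request import urlopen

    def crawl(start_url, limit=100):
        queue = deque([start_url])
        seen = {start_url}
        index = {}                                  # URL -> document title
        while queue and len(index) < limit:
            url = queue.popleft()                   # oldest URL first: breadth-first
            try:
                html = urlopen(url).read().decode("latin-1", "replace")
            except OSError:
                continue
            match = re.search(r"<title>(.*?)</title>", html, re.I | re.S)
            index[url] = match.group(1).strip() if match else url
            for href in re.findall(r'href="([^"]+)"', html, re.I):
                link = urljoin(url, href)
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
        return index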
More information, including a search interface, is available on the NorthStar Database. Recent runs (26 April) will concentrate on textual analysis of the Web versus GopherSpace (from the Veronica data), as well as on indexing.
Run from frognot.utdallas.edu, possibly other sites in utdallas.edu, and from cnidir.org.
Now uses HTTP From fields, and sets User-agent to NorthStar.
Run initially in June 1993, its aim is to measure the growth of the web. See details and the list of servers.
User-agent: WWWWanderer v3.0 by Matthew Gray <mkgray@mit.edu>
It is a spider built into Mosaic. There is some documentation online.
Identification: Modifies the HTTP User-agent field. (Awaiting details)
Written in Python. See the overview.
Its aim is to check validity of Web servers. I'm not sure if it has ever been run remotely.
Its aim is to assist maintenance of distributed infostructures (HTML webs). It has its own page.
A mirroring robot. Configured to stay within a directory, sleeps between requests, and the next version will use HEAD to check if the entire document needs to be retrieved.
Identification: Uses User-Agent: HTMLgobble v2.2, and sets the From field. Usually run by the author, from tp70.rz.uni-karlsruhe.de.
The source is now available (but unmaintained).
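As a sketch of the HEAD-based check described for the next version (my illustration, not HTMLgobble's code), a mirroring robot can compare the server's Last-Modified date against the timestamp of its local copy, skip the full retrieval when nothing has changed, and sleep between the requests it does make:

    # Sketch of a mirroring step: HEAD first, full GET only if the document is
    # newer than the local copy. The 5-second delay is an arbitrary example.
    import os
    import time
    import urllib.request
    from email.utils import parsedate_to_datetime

    def mirror(url, local_path, delay=5):
        head = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(head) as response:
            last_modified = response.headers.get("Last-Modified")
        if last_modified and os.path.exists(local_path):
            remote_time = parsedate_to_datetime(last_modified).timestamp()
            if remote_time <= os.path.getmtime(local_path):
                return                              # local copy is still current
        time.sleep(delay)                           # sleep between requests
        with urllib.request.urlopen(url) as response, open(local_path, "wb") as f:
            f.write(response.read())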
Another indexing robot, for which more information is available. Actually has quite flexible search options.
Awaiting identification information (run from piper.cs.colorado.edu?).
It has its own page. Supposed to be compliant with the proposed standard for robot exclusion.
Identification: run from hp20.lri.fr, User-Agent W3M2/0.02, and the From field is set.
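For robots that claim compliance with the proposed standard for robot exclusion, the check boils down to fetching the server's /robots.txt and consulting it before each request. A minimal sketch (the robot name is a placeholder, and a real robot would cache the rules per host):

    # Minimal sketch of a robot exclusion check using Python's standard parser.
    from urllib.parse import urlsplit, urlunsplit
    from urllib.robotparser import RobotFileParser

    def allowed(url, agent="ExampleRobot"):
        parts = urlsplit(url)
        robots_url = urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))
        rules = RobotFileParser(robots_url)
        rules.read()                                # fetch and parse /robots.txt
        return rules.can_fetch(agent, url)

    if allowed("http://example.org/docs/page.html"):
        print("this URL may be fetched")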
A WWW mirror designed for off-line browsing of sections of the web.
Identification: run from ruddles.london.sco.com.
First spotted in mid-February 1994.
Identification: It runs from phoenix.doc.ic.ac.uk. Further information is unavailable.
This is a research program in providing information retrieval and discovery in the WWW, using a finite memory model of the web to guide intelligent, directed searches for specific information needs.
You can search the Lycos database of WWW documents, which currently has information about 390,000 documents in 87 megabytes of summaries and pointers.
More information is available on its home page.
Identification: User-agent "Lycos/x.x", run from fuzine.mt.cs.cmu.edu. Lycos also complies with the latest robot exclusion standard.
Currently under construction, this spider is a CGI script that searches the web for keywords given by the user through a form.
Identification: User-Agent: "ASpider/0.09", with a From field "fredj@nova.pvv.unit.no".
Run since 27 June 1994, for an internal XEROX research project, with some information being made available on SG-Scout's home page.
Does a "server-oriented" breadth-first search in a round-robin fashion, with multiple processes.
Identification: User-Agent: "SG-Scout", with a From field set to the operator. Complies with standard Robot Exclusion. Run from beta.xerox.com.
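One way such a server-oriented round-robin can be organised (a sketch under my own assumptions, not SG-Scout's actual design) is to keep a separate queue of pending URLs per server and let the walker take one URL from each server's queue in turn:

    # Sketch of round-robin scheduling over per-server queues.
    from collections import deque
    from urllib.parse import urlsplit

    queues = {}                                     # host -> queue of pending URLs

    def enqueue(url):
        host = urlsplit(url).hostname
        queues.setdefault(host, deque()).append(url)

    def round_robin():
        while any(queues.values()):
            for host in list(queues):
                if queues[host]:
                    yield queues[host].popleft()    # one URL per server per cycle

    enqueue("http://example.org/")
    enqueue("http://example.com/")
    for url in round_robin():
        print(url)                                  # a worker process would fetch it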
Announced on 12 July 1994, see their page.
Combination of an HTML form and a CGI script that verifies links from a given starting point (with some controls to prevent it going off-site or limitless).
Seems to run at full speed...
Identification: version 0.1 sets no User-Agent or From field. From version 0.2 up the User-Agent is set to "EIT-Link-Verifier-Robot/0.2". Can be run by anyone from anywhere.
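The basic verification step behind such a tool can be sketched as follows (an illustration, not the EIT script itself): fetch the starting page, extract its links, skip anything that leaves the starting host, and report links whose requests fail:

    # Sketch of link verification restricted to the starting host.
    import re
    from urllib.parse import urljoin, urlsplit
    from urllib.request import Request, urlopen

    def verify_links(start_url):
        host = urlsplit(start_url).hostname
        html = urlopen(start_url).read().decode("latin-1", "replace")
        for href in re.findall(r'href="([^"]+)"', html, re.I):
            link = urljoin(start_url, href)
            if urlsplit(link).hostname != host:
                continue                            # do not go off-site
            try:
                urlopen(Request(link, method="HEAD"))
            except OSError as error:
                print("broken link:", link, error)

    verify_links("http://example.org/index.html")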
Owned/Maintained by Bob Olson <olson@mcs.anl.gov>
This robot is gathering data to do a full-text glimpse and provide a Web interface for it. The index, and further information, will appear on ANL's server.
Identification: sets User-agent to "ANL/MCS/SIGGRAPH/VROOM Walker", and From to "olson.anl.gov".
Now follows the exclusion protocol, and doesn't perform rapid fire searches.
It is a tool called 'WebLinker' which traverses a section of web, doing URN->URL conversion. It will be used as a post-processing tool on documents created by automatic converters such as LaTeX2HTML or WebMaker. More information is on its home page.
At the moment it works at full speed, but is restricted to local sites. External GETs will be added, but these will be running slowly.
WebLinker is meant to be run locally, so if you see it elsewhere let the author know!
Identification: User-agent is set to 'WebLinker/0.0 libwww-perl/0.1'.
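In principle the URN-to-URL step can be as simple as substituting known URNs in the generated HTML with their current URLs; the sketch below is my own illustration with an invented mapping, not WebLinker's libwww-perl implementation:

    # Sketch of URN -> URL substitution in a generated HTML document.
    import re

    urn_to_url = {
        "urn:example:report-1994-07": "http://example.org/reports/1994-07.html",
    }

    def resolve_urns(html):
        def replace(match):
            return urn_to_url.get(match.group(0), match.group(0))
        return re.sub(r"urn:[A-Za-z0-9.:-]+", replace, html)

    print(resolve_urns('<a href="urn:example:report-1994-07">the report</a>'))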
This is part of the w3 browser mode for Emacs, and half implements a client-side search for use in batch processing. There is no interactive access to it.
For more info see the Searching section in the Emacs-w3 User's Manual.
I don't know if this is ever actually used by anyone...
The purpose of this run (undertaken by HaL Software) was to collect approximately 10k HTML documents for testing automatic abstract generation. This program will honor the robot exclusion standard and wait 1 minute between requests to a given server.
Identification: Sets User-agent to 'Arachnophilia', runs from halsoft.com.
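The one-minute wait per server amounts to remembering when each server was last contacted and sleeping for the remainder of the interval; a small sketch (not Arachnophilia's code):

    # Sketch of a per-server politeness delay.
    import time
    from urllib.parse import urlsplit

    last_contact = {}                               # host -> time of last request

    def wait_for_turn(url, delay=60):
        host = urlsplit(url).hostname
        elapsed = time.time() - last_contact.get(host, 0.0)
        if elapsed < delay:
            time.sleep(delay - elapsed)             # wait out the rest of the minute
        last_contact[host] = time.time()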
This is a French Keyword-searching robot for the Mac, written in HyperCard. The author has decided not to release this robot to the public.
Awaiting identification details.
A URL checking robot, which stays within one step of the local server, see further information.
Awaiting identification details.
Sets User-Agent to "tarspider <version>", and From to "chakl@fu-berlin.de".
This robot, in Perl V4, commenced operation in August 1994 and is being used to generate an index called MathSearch of documents on Web sites connected with mathematics and statistics. It ignores off-site links, so does not stray from a list of servers specified initially.
Identification: The current version sets User-Agent to Peregrinator-Mathematics/0.7. It also sets the From field.
The robot follows the exclusion standard, and accesses any given server no more often than once every several minutes.
A description of the robot is available.
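Confining a walk to an initial list of servers, as Peregrinator is described as doing, is essentially a host check on every discovered link; a minimal sketch with placeholder host names:

    # Sketch of restricting a walk to a fixed list of servers.
    from urllib.parse import urlsplit

    allowed_hosts = {"maths.example.edu", "stats.example.edu"}   # placeholders

    def on_allowed_server(url):
        return urlsplit(url).hostname in allowed_hosts

    print(on_allowed_server("http://maths.example.edu/papers/index.html"))  # True
    print(on_allowed_server("http://elsewhere.example.org/"))               # False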
Another validation robot.
Sets User-agent to 'checkbot.pl/x.x libwww-perl/x.x' and sets the From field.
A world-wide-web maintenance robot.
Sets User-agent and the From field.
A Resource Discovery Robot, part of the Harvest Project.
Runs from bruno.cs.colorado.edu, sets the User-agent and From fields, and pauses 1 second between requests (by default).
Note that Harvest's motivation is to index community- or topic- specific collections, rather than to locate and index all HTML objects that can be found. Also, Harvest allows users to control the enumeration several ways, including stop lists and depth and count limits. Therefore, Harvest provides a much more controlled way of indexing the Web than is typical of robots.
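As a rough illustration of those controls (not Harvest's actual configuration mechanism), an enumeration can be bounded by a stop list, a depth limit and a count limit:

    # Sketch of a bounded enumeration: stop list, depth limit, count limit.
    # The get_links argument stands in for whatever extracts links from a page.
    from collections import deque

    def enumerate_urls(start_url, get_links, stop_list=(), max_depth=2, max_count=100):
        queue = deque([(start_url, 0)])
        seen = {start_url}
        collected = []
        while queue and len(collected) < max_count:
            url, depth = queue.popleft()
            if any(word in url for word in stop_list):
                continue                            # stop list: skip unwanted URLs
            collected.append(url)
            if depth < max_depth:                   # depth limit
                for link in get_links(url):
                    if link not in seen:
                        seen.add(link)
                        queue.append((link, depth + 1))
        return collected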
I've written a robot to retrieve specific WWW pages: a Mac WWW robot that periodically (typically once a day) walks through the global history file produced by some browsers (Netscape and NCSA Mosaic, for example), checking for documents that have changed since they were last visited. See its information page.
The robot is called Katipo and identifies itself with User-Agent: Katipo/1.0 and From: Michael.Newbery@vuw.ac.nz. It emits _only_ HEAD queries, and prefers to work through proxy (caching) servers.
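The change check it performs can be sketched roughly like this (my illustration, not Katipo's Mac implementation): issue a HEAD request for each remembered URL and compare the Last-Modified header with the value recorded on the previous visit:

    # Sketch of a HEAD-only change check against a recorded Last-Modified value.
    import urllib.request

    def changed_since_last_visit(url, recorded_last_modified):
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request) as response:
            current = response.headers.get("Last-Modified")
        return current is not None and current != recorded_last_modified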
Its purpose is to collect information to use in a "WWW Pages" database in InfoSeek's information retrieval service (for more information on InfoSeek, please send a blank e-mail to info@infoseek.com).
The robot follows all the guidelines listed in "Guidelines for Robot Writers", and we try to run it during off hours.
We will be updating the WWW database daily with new pages and re-load from scratch no more frequently than once per month (probably even longer). Most sites won't get more than 20 requests a month from us since there are only about 100,000 pages in the database.
A robot written in REXX, which downloads a hierarchy of documents with a breadth-first search.
Example usage:
geturl http://info.cern.ch/ -recursive -host info.cern.ch -path /hypertext/#? would restrict the search to the specified host and path.
Source and documentation are available.
The User-Agent field is set to 'GetUrl.rexx v1.0 by burton@cs.latrobe.edu.au'.
Sets User-agent to 'OMW/0.1 libwww/217'
Follows robot exclusion rules, and shouldn't visit any host more than once in 5 minutes.
The TkWWW Robot is described in a paper presented at the WWW94 Conference in Chicago. It is designed to search Web neighborhoods to find pages that may be logically related. The Robot returns a list of links that looks like a hot list. The search can be by key word or all links at a distance of one or two hops may be returned.
For more information see The TkWWW Home Page.