WebCrawler(TM) News

About WebCrawler

The WebCrawler is a web robot. It is the first product of an experiment in information discovery on the Web. I wrote it because I could never find information when I wanted it and because I do not have time to follow endless links.

The WebCrawler has three different functions:

In addition, the WebCrawler can answer some fun queries. Because it models the world using a flexible, object-oriented approach, the actual graph structure of the Web is available for queries. This allows you, for instance, to find out which sites reference a particular page. It also lets me construct the Web Top 25 List, the list of the most frequently referenced documents that the WebCrawler has found.
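
As a rough illustration of the kind of query a stored link graph makes possible, the following Python sketch keeps a reverse index from each document to the documents that link to it. It is not the WebCrawler's actual code or data model; the function names (build_backlinks, referrers, top_referenced) and the (source, target) pair representation are assumptions made for this example.

```python
from collections import Counter, defaultdict


def build_backlinks(links):
    """Build a reverse index from (source, target) link pairs.

    `links` is an assumed representation: one pair per hyperlink found.
    Returns a mapping: target document -> set of documents linking to it.
    """
    backlinks = defaultdict(set)
    for source, target in links:
        backlinks[target].add(source)
    return backlinks


def referrers(backlinks, page):
    """Return the documents that reference `page`, in sorted order."""
    return sorted(backlinks.get(page, set()))


def top_referenced(backlinks, n=25):
    """Return the n most frequently referenced documents with their counts."""
    counts = Counter({page: len(sources) for page, sources in backlinks.items()})
    return counts.most_common(n)
```

Storing referrers in a set means that a document linking to the same page several times is counted once, which is one plausible way to define "most frequently referenced"; the original list may have counted differently.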

How it Works

The WebCrawler works by starting with a known set of documents (even if it is just one), identifying new places to explore by looking at the outbound links from those documents, and then visiting those links.
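
The discovery loop described above might be sketched in Python as follows. This is only an illustration under stated assumptions, not the WebCrawler's implementation (which is written in C and Objective-C): the names LinkExtractor, outbound_links, and discover are invented here, and fetch is a hypothetical callable that returns the HTML text for a URL.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collect the absolute targets of <a href="..."> tags in one document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop any #fragment part.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.append(absolute)


def outbound_links(base_url, html):
    """Return the outbound links found in `html`, resolved against base_url."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


def discover(seeds, fetch):
    """Visit every document reachable from `seeds`.

    `fetch` is a hypothetical callable: URL -> HTML text.
    Returns the set of all URLs that were discovered.
    """
    to_visit = list(seeds)
    seen = set(seeds)
    while to_visit:
        url = to_visit.pop(0)
        for link in outbound_links(url, fetch(url)):
            if link not in seen:  # never schedule the same document twice
                seen.add(link)
                to_visit.append(link)
    return seen
```

Passing fetch in as a parameter keeps the sketch independent of any particular HTTP library, and lets it be exercised against an in-memory dictionary of pages.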

It is composed of three essential pieces:

Being a Good Citizen

The WebCrawler tries hard to be a good citizen. Its main approach involves the order in which it searches the Web. Some web robots have been known to operate in a depth-first fashion, retrieving file after file from a single site. This kind of traversal is bad because it concentrates a burst of requests on one server. The WebCrawler instead searches the Web in a breadth-first fashion. When building its index of the Web, the WebCrawler will access a site at most a few times a day.
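
To make the difference between the two traversal orders concrete, here is a small Python sketch over an in-memory link graph. The function name crawl_order, the graph dictionary, and the made-up document names are assumptions for illustration only; the sketch says nothing about how the WebCrawler itself stores its frontier.

```python
from collections import deque


def crawl_order(graph, seed, breadth_first=True):
    """Return the order in which documents would be visited.

    `graph` maps each document to the list of documents it links to.
    With breadth_first=True the frontier is a queue; otherwise it is used
    as a stack, which gives a depth-first style traversal.
    """
    frontier = deque([seed])
    seen = {seed}
    order = []
    while frontier:
        url = frontier.popleft() if breadth_first else frontier.pop()
        order.append(url)
        links = graph.get(url, [])
        if not breadth_first:
            # Reverse so the first link on the page is followed first.
            links = reversed(links)
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


# Two made-up sites, "a" and "b", with three and two documents.
graph = {
    "a/1": ["a/2", "b/1"],
    "a/2": ["a/3"],
    "a/3": [],
    "b/1": ["b/2"],
    "b/2": [],
}
print(crawl_order(graph, "a/1", breadth_first=True))
# ['a/1', 'a/2', 'b/1', 'a/3', 'b/2']
print(crawl_order(graph, "a/1", breadth_first=False))
# ['a/1', 'a/2', 'a/3', 'b/1', 'b/2']
```

In this toy graph the depth-first order retrieves all of site "a" before touching site "b", while the breadth-first order alternates between the two sites.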

When the WebCrawler is searching for something more specific, its search may narrow to a relevant set of documents at a particular site. When this happens, the WebCrawler limits its search speed to one document per minute and sets a ceiling on the number of documents that can be retrieved from the host before query results are reported to the user. The WebCrawler also adopts several of the techniques mentioned in the Guidelines for Robot Writers.
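
The two limits described above, one document per minute and a ceiling on documents per host, could be sketched in Python as below. This is an assumed design, not the WebCrawler's code: the class name PoliteFetcher, the fetch callable, and the default ceiling of 10 documents are all hypothetical (the text does not state the actual ceiling).

```python
import time
from urllib.parse import urlsplit


class PoliteFetcher:
    """Wrap a fetch function with a per-host delay and a per-host ceiling."""

    def __init__(self, fetch, min_interval=60.0, max_documents=10,
                 clock=time.monotonic, sleep=time.sleep):
        self.fetch = fetch                  # hypothetical callable: URL -> document
        self.min_interval = min_interval    # seconds between requests to one host
        self.max_documents = max_documents  # assumed ceiling, not from the source
        self.clock = clock
        self.sleep = sleep
        self.last_request = {}              # host -> time of the last request
        self.retrieved = {}                 # host -> documents retrieved so far

    def get(self, url):
        """Fetch `url`, or return None once the host's ceiling is reached."""
        host = urlsplit(url).netloc
        if self.retrieved.get(host, 0) >= self.max_documents:
            return None
        last = self.last_request.get(host)
        if last is not None:
            wait = self.min_interval - (self.clock() - last)
            if wait > 0:
                self.sleep(wait)  # hold back until a full interval has passed
        self.last_request[host] = self.clock()
        self.retrieved[host] = self.retrieved.get(host, 0) + 1
        return self.fetch(url)
```

The clock and sleep functions are injectable so that the behaviour can be checked with a fake clock instead of waiting a real minute between requests.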

Implementation Status

The WebCrawler is written in C and Objective-C for NEXTSTEP. It uses the WWW library from CERN, with several changes to make automation easier.


Copyright © 1995, America Online, Inc.

info@webcrawler.com