Think the Web is too vast to search? I did, but index-and-search engines such as Carnegie Mellon University's Lycos (http://lycos.cs.cmu.edu/) and the University of Washington's WebCrawler (http://webcrawler.com/) proved me wrong. These robotic indexers ceaselessly read and catalog Web pages, and they have so far kept up remarkably well with the Web's explosive growth.
They take simple queries -- a single term or several ANDed together -- and return lists of URLs that could otherwise take you months of point-and-click navigation to assemble. They typically don't do proximity searches (word 1 within so many words of word 2), but InfoSeek (http://www.infoseek.com), a commercial service, does. And the WAIS Inc. server (http://www.wais.com) that powers a number of Web sites can even handle natural-language queries like "What is the capital of Peru?" For an example, see Encyclopedia Britannica at http://www.eb.com.
As good as Web search tools are, when you ask a specific question -- How do I walk a directory tree in Perl? or What's the cheapest laser printer with network support for IP, IPX, and AppleTalk? -- you likely won't find an answer in a hurry, and you may not find one at all. Brute-force searching, even at its best, yields hordes of false positives -- documents that contain the keywords, perhaps even in close proximity, but have nothing to do with the question.
Information providers can help by categorizing documents, so users can look for how-to articles on Perl or reviews of network-ready laser printers. As we move BYTE's content into electronic media, we'll try to provide such clues. But will our categories match those used by other computer magazines? By book publishers? By people who post to Internet newsgroups? As the Web absorbs and extends the world's libraries, authors and editors will find that proper classification of their contributions to the Web will make those documents easier to find and, hence, more valuable. My advice to major Web contributors (and to creators of Web authoring tools) is to hire a library scientist.
Basic Indexing
While you're waiting for a Web equivalent of the Dewey decimal system, you might as well go ahead and add basic indexing to your Web site. Because we're running Windows NT, the EMWAC (European Microsoft Windows NT Academic Consortium) WAIS (Wide Area Information Servers) server and toolkit were the logical place to start. These tools are NT ports of freeWAIS; you can get Intel, Mips, and Alpha versions of them from various places including Microsoft (the Windows NT Resource Kit CD), EMWAC (http://emwac.ed.ac.uk or its mirror sites), and Process Software (http://www.process.com). Versions of freeWAIS for many Unixes are available from CNIDR (the Clearinghouse for Networked Information Discovery and Retrieval) at ftp://ftp.cnidr.org.
The tools come in two packages: wsXXX.zip for the WAIS server and wtXXX.zip for the WAIS toolkit. (Replace XXX with your CPU: i386, Mips, or Alpha.) I grabbed both programs from Process Software's site, thinking that I'd need the toolkit to create indexes and the server to access them. As it turned out, I really needed only the toolkit. It contains both the indexer and a query tool that searches an index and returns an HTML-formatted report listing URLs for documents matching the search. What's the WAIS server for? It enables WAIS clients to bypass your Web server and access your indexed document collection directly, using the WAIS Z39.50 protocol (see the sidebar "What About WAIS?").
You'll need long filenames to use waisindex, the tool that does the indexing. Prior to NT 3.5, that meant you had to run it on an NTFS volume, but now that NT 3.5's FAT (file allocation table) file system supports long names, that's no longer a problem. Here's the command I used to index the January 1994 issue of BYTE:
waisindex -d index -r -a -T html art\9401\*.htm
where -d names the index, -r tells the indexer to recursively index subdirectories, and -a appends to an existing index. The first time through, I skipped the -T html option. When I searched the resulting index, what came back were filenames, not document titles. That meant the search results were cryptic references like "art\9401\sec9\art7.htm" instead of more helpful ones like "January 1994 / Reviews / Low-Cost Laser Printers."
Since the translator that creates our HTML files writes the latter style of reference in the <title> field of every article's HTML header, adding -T html was the quick fix. However, it prompted me to reconsider my sequentially generated URLs (see the sidebar "8.3 Brain Damage").
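In effect, -T html tells the indexer to display each document by its <title> rather than its path. A quick sketch of what that buys you, using a hypothetical file and a sed one-liner to pull the title out:

```shell
# Hypothetical sample document in the archive's directory layout
mkdir -p art/9401/sec9
printf '%s' '<html><head><title>January 1994 / Reviews / Low-Cost Laser Printers</title></head></html>' \
  > art/9401/sec9/art7.htm

# Without -T html, a hit shows the path: art/9401/sec9/art7.htm
# With -T html, it shows the title -- which we can extract the same way:
sed -n 's:.*<title>\(.*\)</title>.*:\1:p' art/9401/sec9/art7.htm
```

The sed command prints "January 1994 / Reviews / Low-Cost Laser Printers" -- the human-readable reference a hit list should carry.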
Once you've got the index built, it's a snap to connect Web clients to it. If you created an index named "index," you can create a form enabling users to search it by simply writing the keyword <isindex> in an HTML document called INDEX.HTM. When viewed in a browser, this document displays the familiar search form "This is a searchable index. Enter search keywords." When the user enters a search term, the Web server passes it to waislook, a program that searches the index and returns HTML-formatted results.
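Here's a minimal INDEX.HTM along the lines described above (the page text is my own placeholder); the <isindex> keyword is what makes the browser render its built-in search form:

```shell
# Write a minimal searchable-index page; everything but <isindex> is ordinary HTML
cat > INDEX.HTM <<'EOF'
<html>
<head>
<title>Search the BYTE archive</title>
<isindex>
</head>
<body>
Enter one or more keywords to search the article index.
</body>
</html>
EOF
```

When a browser loads this page, it supplies the "This is a searchable index" prompt itself; the server only has to hand the submitted keywords to waislook.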
On a pair of NT boxes running EMWAC-derived Web servers -- a 486 with Folio's Infobase Web Server, and an Alpha with Process Software's Purveyor -- these procedures yielded the searchable archive that I'm currently testing. It works, but since multiple search terms combine with OR rather than AND, and there's no phrase search ("SQL catalog") or proximity search ("SQL within/5 catalog"), you depend on the selective power of a single term. An unusual one, like "PnP" or "Z39.50," will often net just the right bunch of articles; that's what makes even this bare-bones indexing system incredibly useful. But it's really just a minimal solution.
WebSite and SWISH
To improve matters on the 486 server, I turned to WebIndex, the tool that comes with O'Reilly & Associates' WebSite server for NT and Win 95. You launch WebIndex from WebSite's GUI administration tool, and it prompts you graphically for URLs to include in the index and begins indexing with a mouse click. Unlike waislook, WebSite's WebFind can at least join terms together with AND so that when you use multiple terms, the result set will shrink rather than grow. For small collections, it's just what it claims to be: a one-button indexer for non-nerds. But when I fed it several thousand documents, hours of disk thrashing ensued until I killed it.
What remained, from a previous run on fewer documents, was a file called index.swish. Swish? That's just the sort of oddball search term that gets great results on the Internet. A WebCrawler search led me to Enterprise Integration Technologies and the Simple Web Indexing System for Humans, which is an alternative to freeWAIS. O'Reilly's WebIndex derives from version 1.0 of EIT's SWISH. I downloaded SWISH 1.1 from ftp://ftp.eit.com, compiled it on our BSDI 2.0 machine, and tried it. SWISH is tuned for HTML -- e.g., it can index just fields tagged as titles or comments. It can also segment a large indexing job into many small ones, then merge the segments.
Since low memory was the likely cause of disk thrashing, I thought I'd try the merge option described in the SWISH 1.1 docs. No luck. WebIndex is a pure GUI tool that doesn't expose that feature.
O'Reilly put me in touch with EIT's director of Web publishing, Jay Weber. I ftp'd the archive to EIT, where Weber successfully indexed it with WebIndex on several test systems. EIT also added a progress meter to WebIndex that revealed speedy progress through 10 of 14 BYTE issues, then suddenly -- molasses. Weber sent me a new, memory-optimized update (available at http://website.ora.com or ftp://ftp.eit.com/pub/website). It did work with my data.
The Folio Alternative
Folio's Infobase Web Server is a completely different way to serve an indexed collection to the Web. It's an EMWAC-based Web server mated to the Folio Views search engine. That means it does everything that normal Web servers do, and it can also convert existing Views infobases to HTML on the fly. If you have infobases on hand, this is just the ticket. Even if you don't, this approach has a lot going for it. Views has a lightning-fast indexer, handles huge data sets, deals with hierarchical documents, and does phrase and proximity searches.
If you're a Views user, you can judge for yourself how well this Web converter reproduces Folio's Windows user interface. And while a series of retrieved Web pages clearly can't be as richly interactive or as responsive as a native application, this technique does inject client/server capability into Folio Views.
Visitors to the BYTE Web site have been trying all three search mechanisms. Folio and WebIndex are more popular than the less-capable freeWAIS, but freeWAIS is faster for single-term queries. Because effective use of the Web requires searching, I'll continue to explore these types of tools.
Diskeeper for Windows NT
Workstation, $199; Server, $399
Executive Software
(818) 547-2050
http://www.earthlink.net/execsoft/
You can't shut down a busy Web server to defragment its disk. Here's the answer: an elegant defragging tool that runs on a background thread and can even schedule itself for periodic execution.
Internetworking with TCP/IP, 3rd edition, $52
by Douglas Comer
Prentice-Hall, 1995
ISBN 0-13-216987-8
Updated with new material on security, IPng, and ATM, Comer's lucid tutorial on Internet plumbing continues to top the charts.