One day this spring, an HTTP request popped out the back of my old Swan 386/25, rattled through our LAN, jumped across an X.25 link to BIX, negotiated its way through three major carriers and a dozen hosts, and made a final hop over a PPP link to its rendezvous with BYTE's newborn Web server, an Alpha AXP 150 located just 2 feet from the Swan.
Thus began the project on which this column will report monthly. Its mission: To engage BYTE in direct electronic communication with the world, retool our content for digital deployment, and showcase emerging products, technologies, and ideas vital to these tasks. We don't have all the answers yet -- far from it. But we're starting to learn how a company can provide and use Internet services in a safe, effective, maintainable, and profitable way.
Cheap Thrills
The first contact with your own WWW (World Wide Web) server is an electric thrill. Experiencing that thrill gets easier every day (see "The Virtual Storefront," January BYTE, and "Build Your Own WWW Server," April BYTE).
In our case, the ingredients for our first prototype included an Alpha workstation, a U.S. Robotics Sportster 28.8 modem, Windows NT 3.5, the EMWAC (European Microsoft Windows NT Academic Centre) Web server, and a dedicated 28.8-Kbps dial-up link to our Internet service provider, MV Communications (Manchester, NH, (603) 429-7428, info@mv.com). Here's how easy it can be:
1. Ftp the server software from its home in Edinburgh, Scotland (http://emwac.ed.ac.uk), or from one of the mirror sites that archie or a Web searcher can find for you. How? I used BIX (see the sidebar "Don't Dis the Host"), but any client-side access kit will do. Spry's Internet In A Box and IBM's Internet Access Kit for OS/2 Warp are two good ones I've used.
2. Configure NT's Remote Access Service to call the service provider and establish a PPP session, using the TCP/IP settings (i.e., IP address and subnet mask, primary and secondary DNS [Domain Name System] servers) supplied by the provider. NT wouldn't talk to our Sportster modem at first, but a new modem initialization file from U.S. Robotics' BBS solved that. Now the Alpha had a full-time link to the Internet.
3. Fire up the Web server and point it at the root of a document collection. If you're itching for that "Hello, world" moment, a single file containing those two words will suffice. Now find a second connection to the Internet, point a browser at your site, and observe it as visitors will.
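As a present-day stand-in for step 3, here is a minimal sketch in Python rather than the EMWAC server. It writes the two-word file and builds a server rooted at that directory. The function names, the index.html file name, the port, and the loopback address are all assumptions chosen for illustration.

```python
import functools
import http.server
import os


def write_hello(doc_root):
    """Create the one-file document collection and return the file's path."""
    os.makedirs(doc_root, exist_ok=True)
    path = os.path.join(doc_root, "index.html")
    with open(path, "w") as f:
        f.write("Hello, world\n")
    return path


def make_server(doc_root, port=8000):
    """Return an HTTP server rooted at doc_root (illustrative names only)."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=doc_root)
    # Bound to the loopback address, so only the local machine can connect.
    return http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
```

Calling write_hello("docs") and then make_server("docs").serve_forever() should let a browser on the same machine fetch http://127.0.0.1:8000/. To observe the site as visitors will, the server would have to listen on a publicly reachable address instead.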
That's all there is to it. Well, not quite. Now that you can say hello to the world, what do you want to tell it? How will you get its attention? What if you get too much attention? How will you ensure the quality and consistency of the information you publish? How do you turn documents into applications that serve business goals? How do you prepare your LAN to meet the demands and mine the opportunities of business-to-business networking? Should you place bets elsewhere than the Internet--on AT&T/Novell NetWare Connect Services or AT&T/Lotus Network Notes? Beats me, but as I find answers, I'll pass them along.
Worm Bait
Fielding well-structured content on the Web is shockingly easy to do nowadays, and publishers are scrambling to figure out how to exploit this opportunity without sandbagging print and CD-ROM revenues. It's tricky, because the Internet is evolving with breathtaking speed. My original plan for the BYTE Web server, for example, was to offer navigational but not search access to the five-year, 8000-article text collection that is also navigable (and searchable) on the BYTE CD-ROM.
I figured we could add searching to the Web site in a leisurely way, after working through the security, billing, and pricing issues. But when I mentioned this to Andy Singleton, BYTE author and president of the Internet services firm Money.com (Cambridge, MA), he shot me down. "If you don't index the collection, someone else will," he said. "Two days after you open the site, the University of Washington WebCrawler will get in there and index everything it can find." He advises caching URLs (uniform resource locators) into a single file at the entrance to your site, a kind of worm bait to spare your server some of the effects of a punishing deep scan.
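One way to read Singleton's advice is as a small script that walks the document tree and writes every URL into one page. The sketch below assumes a collection of .htm and .html files on disk; the function name, the allurls.htm file name, and the example.com base URL in the usage note are hypothetical.

```python
import os
from html import escape
from urllib.parse import quote


def build_url_index(doc_root, base_url, out_name="allurls.htm"):
    """Write one HTML page linking to every .htm/.html file under doc_root."""
    urls = []
    for dirpath, dirnames, filenames in os.walk(doc_root):
        dirnames.sort()  # keep the traversal order deterministic
        for name in sorted(filenames):
            if name == out_name or not name.lower().endswith((".htm", ".html")):
                continue  # skip the index itself and non-HTML files
            rel = os.path.relpath(os.path.join(dirpath, name), doc_root)
            urls.append(base_url.rstrip("/") + "/" + quote(rel.replace(os.sep, "/")))
    lines = ["<html><head><title>All documents</title></head><body><ul>"]
    lines += ['<li><a href="%s">%s</a></li>' % (escape(u), escape(u)) for u in urls]
    lines.append("</ul></body></html>")
    with open(os.path.join(doc_root, out_name), "w") as f:
        f.write("\n".join(lines) + "\n")
    return urls
```

A call such as build_url_index("docs", "http://www.example.com/") would leave docs/allurls.htm ready to link from the site's entrance page. Whether a given crawler actually prefers such a file to a deep scan is up to the crawler.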
Yikes! Our 8000 articles form a mere drop in the vast and growing ocean of documents indexed by the Web searchers. At a concentration of a few parts per million, BYTE articles would appear only sparsely in hit lists. And that's a promotional benefit, isn't it? Sure, but if we plan to serve up content with real commercial value--and it won't be interesting unless we do--we'll likely have to regulate access in some way. That will require user-level security, some applications development, and perhaps even a secure server. For now, we'll offer just the 1994 issues.
Text Wrangling
The text stream that feeds both the Web server and the BYTE CD-ROM is plain ASCII, with simple tags marking such elements as headlines, tables, and author biographies. There are converters all over the Internet that can transform nroff, RTF (Rich Text Format), WordPerfect, TeX, and other structured formats into HTML (Hypertext Markup Language). Good lists of these are at www.yahoo.com and www.stars.com/vlib.
But in real life, archival documents often use proprietary markup, or they must be carved into Web-efficient chunks. Thus, if your Web project involves piles of documents, you'll likely have to write some text filters.
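To make "text filter" concrete, here is a minimal sketch that carves a tagged ASCII stream into per-article chunks. It is written in Python purely for illustration. The @HEAD: tag is invented for this example and is not BYTE's actual markup, and the function name is likewise hypothetical.

```python
import re

# Hypothetical headline tag; a real archive would use its own private markup.
HEADLINE = re.compile(r"^@HEAD:\s*(.+)$")


def split_articles(text):
    """Return a list of (headline, body) pairs from a tagged ASCII stream."""
    articles = []
    title, body = None, []
    for line in text.splitlines():
        match = HEADLINE.match(line)
        if match:
            if title is not None:
                articles.append((title, "\n".join(body).strip()))
            title, body = match.group(1).strip(), []
        elif title is not None:
            body.append(line)  # lines before the first headline are dropped
    if title is not None:
        articles.append((title, "\n".join(body).strip()))
    return articles
```

Each pair can then be written to its own file, which keeps individual pages small enough to load quickly over a slow link.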
The tool of choice for this job is the language in which most HTML converters are written--Perl (for Practical Extraction and Report Language). Larry Wall's language (see "Developing Applications in Perl," April 1994 BYTE) wields the powers of Unix text-processing tools--including sed, awk, and sh--in an interpreted environment that has galvanized the Unix programming community, as Visual Basic fired up Windows development. Good news spreads fast, and Perl is available on many platforms (see the University of Florida Perl archives at ftp://ftp.cis.ufl.edu).
Unfortunately, I'm not (yet) a Perl programmer, and I was in a hurry. So I reached for my favorite text-processing tool, Lugaru Software's EEL (Epsilon Extension Language), the C-oriented language embedded in Epsilon, an EMACS-like text editor for DOS, OS/2, and Unix.
The formatter I wrote in EEL transforms a file of private markup into a three-tiered HTML collection. The issue-level table of contents lists hypertext links to a series of section-level tables of contents, which in turn link to articles (see the screens). It's vital to display navigational cues in a collection of electronic documents.
The style I've adopted uses a row of standard links at the top of each page, but it varies document titles as you go up and down the tree. How do you write or modify thousands of links and titles while ensuring their integrity? You don't. This kind of HTML is like object code. Your job is to invent or adapt a compiler that can generate it.
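The compiler idea can be sketched in a few dozen lines. The example below is not the EEL formatter described here. It is a hypothetical Python stand-in that takes an already-parsed issue (a title plus a list of sections, each holding articles) and emits the three tiers with a generated row of links atop each page. The file-naming scheme (index.htm, s1.htm, s1a1.htm) and the title format are assumptions.

```python
from html import escape


def page(title, nav, body_lines):
    """Wrap body lines in a minimal page with a row of standard links on top."""
    nav_row = " | ".join(
        '<a href="%s">%s</a>' % (href, escape(label)) for label, href in nav)
    head = ["<html><head><title>%s</title></head><body>" % escape(title),
            "<p>%s</p>" % nav_row]
    return "\n".join(head + body_lines + ["</body></html>"])


def compile_issue(issue_title, sections):
    """Return {filename: html} for an issue, its sections, and its articles.

    sections is a list of (section_title, [(article_title, text), ...]).
    """
    pages = {}
    issue_nav = [("Issue contents", "index.htm")]
    issue_items = []
    for s, (section_title, articles) in enumerate(sections, 1):
        section_file = "s%d.htm" % s
        section_nav = issue_nav + [(section_title, section_file)]
        issue_items.append(
            '<li><a href="%s">%s</a></li>' % (section_file, escape(section_title)))
        section_items = []
        for a, (article_title, text) in enumerate(articles, 1):
            article_file = "s%da%d.htm" % (s, a)
            section_items.append(
                '<li><a href="%s">%s</a></li>' % (article_file, escape(article_title)))
            # Article pages link back up to both the section and the issue.
            pages[article_file] = page(
                "%s: %s" % (section_title, article_title), section_nav,
                ["<h1>%s</h1>" % escape(article_title), "<p>%s</p>" % escape(text)])
        pages[section_file] = page(
            "%s: %s" % (issue_title, section_title), issue_nav,
            ["<h1>%s</h1>" % escape(section_title), "<ul>"] + section_items + ["</ul>"])
    pages["index.htm"] = page(
        issue_title, issue_nav,
        ["<h1>%s</h1>" % escape(issue_title), "<ul>"] + issue_items + ["</ul>"])
    return pages
```

Because every link and title comes from the same loop that names the files, a link cannot point at a page the generator did not also produce.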
Barrels of Browsers
Web browsers are a dime a dozen. It's a good idea to keep a variety of them on hand because each renders HTML in a slightly different way. Netscape, for example, makes the document title the window title, and Mosaic derivatives show it in a reserved field.
More important, browsers vary in how strictly they interpret HTML. When I first pointed lynx at my auto-formatted 1994 collection, the authors WERE ALL SHOUTING LIKE THIS. Why? The code that wrote the level-two headings had incorrectly paired the begin tag--<h2>--with the end tag--</h3>. Thus, the all-caps style that lynx uses for level-two heads carried on through the article.
Netscape forgives this error, but that's not necessarily a good thing if you end up creating documents that look silly in other browsers. HTML lint utilities and parsers are common on the Internet, and they're probably useful, but your first line of defense should be a good selection of browsers. NCSA (National Center for Supercomputing Applications) Mosaic derivatives, such as Spyglass's Enhanced Mosaic, interpret HTML rather strictly. By testing your collection against a range of browsers, you'll catch errors, and, just as important, you'll adapt your presentation style to the widest possible audience.
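A check for the specific mistake described above, a heading opened with one level and closed with another, takes only a few lines. This is a minimal sketch built on Python's standard html.parser module. The class and function names are hypothetical, and it checks headings only, so it is no substitute for a full lint tool or a shelf of browsers.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingChecker(HTMLParser):
    """Record heading tags whose begin and end levels do not match."""

    def __init__(self):
        super().__init__()
        self.open_heading = None  # name of the heading tag currently open
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            if self.open_heading:
                self.problems.append("line %d: <%s> opened inside <%s>"
                                     % (self.getpos()[0], tag, self.open_heading))
            self.open_heading = tag

    def handle_endtag(self, tag):
        if tag in HEADINGS:
            if self.open_heading is None:
                self.problems.append("line %d: </%s> with no open heading"
                                     % (self.getpos()[0], tag))
            elif self.open_heading != tag:
                self.problems.append("line %d: <%s> closed by </%s>"
                                     % (self.getpos()[0], self.open_heading, tag))
            self.open_heading = None


def check_headings(html_text):
    """Return a list of problem descriptions (empty if headings pair up)."""
    checker = HeadingChecker()
    checker.feed(html_text)
    checker.close()
    if checker.open_heading:
        checker.problems.append("<%s> never closed" % checker.open_heading)
    return checker.problems
```

Run over generated pages before publishing, a check like this would flag the mismatched heading no matter how forgiving the browser at hand happens to be.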
Literate Programming
Predating the WWW by a decade was Donald Knuth's own WEB. This was a toolkit that joined his typesetting language called TeX with the programming language Pascal, in the service of what Knuth called "literate programming." A literate programmer, he said, simultaneously considers the behavior of a program and the aesthetic rendering of its text and documentation. Years later, Interleaf added a Lisp engine to its publishing system and began talking about active documents that are both texts and applications (or interfaces to applications).
Today's WWW consummates the marriage of documents and programs, and it radically democratizes client/server software development. For the first time, lots of people have ready access to tools that can generate useful and inherently network-aware applications. And because browsers will reveal source code if asked, there are no secrets--HTML is an open book. You can go a long way in building a Web collection using nothing fancier than a browser. If you write links to local content in relative form (e.g., ../../bmark/bmark.htm rather than http://www.byte.com/bmark/bmark.htm), you can seamlessly intermix file-oriented access to your own site with server-based access to remote sites.
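Relative links of this kind can be computed rather than typed. The sketch below uses Python's posixpath module (URL paths use forward slashes regardless of platform). The function name and the /issues/9406/index.htm path in the usage note are hypothetical.

```python
import posixpath


def relative_link(from_page, to_page):
    """Return the relative URL path from one site-rooted page to another."""
    start = posixpath.dirname(from_page) or "/"
    return posixpath.relpath(to_page, start)
```

For a page at /issues/9406/index.htm, relative_link("/issues/9406/index.htm", "/bmark/bmark.htm") yields ../../bmark/bmark.htm, which resolves the same way whether the collection is opened from a local disk or fetched through a server.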
To make the documents into applications--that is, to exploit the CGI (Common Gateway Interface) through which Web documents call programs that search indexes, take credit-card numbers, and do anything else you can think of--is trickier, but this is clearly the next frontier for today's GUI application builders. I'll bet that by the time you read this in late June, components for Visual Basic, Tcl/Tk, and other RAD (rapid application development) environments will already have begun to tame the CGI for the masses.
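At its simplest, a CGI program reads the request from environment variables set by the server and writes headers, a blank line, and a document to standard output. Here is a minimal sketch in Python; the q parameter and the function name are assumptions, and a real search would consult an index rather than echo the term.

```python
import os
import sys
from html import escape
from urllib.parse import parse_qs


def respond(environ):
    """Build a complete CGI response (headers, blank line, body) as a string."""
    params = parse_qs(environ.get("QUERY_STRING", ""))
    term = params.get("q", [""])[0]
    # Escape the term so user input cannot inject markup into the page.
    body = "<html><body><p>You searched for: %s</p></body></html>" % escape(term)
    return "Content-Type: text/html\r\n\r\n" + body


if __name__ == "__main__":
    # The Web server supplies QUERY_STRING and captures standard output.
    sys.stdout.write(respond(os.environ))
```

Keeping the logic in a function that takes the environment as an argument makes it testable without a Web server in the loop.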
In the Pipeline
I'm working with two Web servers. Purveyor ($1995), from Process Software, is the commercial version of the EMWAC server. It's available for Intel- and Alpha-based machines; I'm running it on the Alpha machine. It adds security features that the downloadable EMWAC server lacks. It can also run as a proxy server, connecting Web clients on BYTE's (currently unregistered) internal IP network to Web servers on the Internet.
O'Reilly & Associates' WebSite ($495) is the commercial version of another freely available server, Bob Denny's elegant WinHTTPD, a port of the NCSA server. WebSite is a Win32 program that runs on Windows 95 and NT. It sets a new standard for GUI-based Web-server administration.
By next month, www.byte.com should be open for business, so you can see how these servers perform in real life. OS/2 and Unix servers will take their turn, too, but right now I'm in a hurry to get up and running. For me, NT is the quickest way. I'll also decide which of two modes of client-side Internet access--IPX via Instant Internet (see "Short-Order Internet Access") or IP with proxy servers--gives the best balance of reliability, security, convenience, and performance.
Stick around. This is going to be fun.
WHERE TO FIND
IBM
Armonk, NY
(800) 342-6672
(914) 765-1900
fax: (800) 426-4329
http://www.ibm.net

O'Reilly & Associates
Sebastopol, CA
(800) 998-9938
(707) 829-0515
http://www.ora.com

Process Software
Framingham, MA
(800) 722-7770
(508) 879-6994
fax: (508) 879-0042
http://www.process.com
Info@process.com

Spry, Compuserve Internet Div.
Seattle, WA
(800) 557-9614
(206) 447-0300
fax: (206) 447-9008
http://www.spry.com
The BYTE CD-ROM interface is richer and more dynamic than anything the Web can offer today. But we're hedging our bets. BYTE's CD-ROM, which includes graphics, builds from the same HTML stream that feeds the BYTE Web server.
In automatic HTML, a translator converts a file of private markup into a series of issue-level tables of contents (1) linked to section-level tables of contents (2) linked finally to articles (3).