
Damn Lies

Hit counts don't mean a thing. Here's how to analyze your Web server's logs to understand the size, composition, and behavior of your site's audience.

Jon Udell

There are lies, damn lies, and Web server statistics. The most common metric -- hit counts -- means nothing, as I proved to myself on September 10th when I added a standard row of icons to every page of the BYTE archive. The server log for September 9th recorded 8000 hits. On the 11th, the number nearly tripled -- to 21,000 hits. Was the new iconic interface attracting that much interest? Nope. Real usage of the site hadn't changed: On the 9th, the server delivered 1500 articles from the archive to 1300 IP addresses, and on the 11th, 1400 articles to 1350 IP addresses.

Why the apparent spike? When you fetch a page with, say, five icons, the server logs six hits -- one for the text of the page and one for each icon. Toss in some gratuitous transparent GIFs and you can inflate your server's hit count as much as you want. That's a silly form of job security for a webmaster, of course; the real trick is to define, and then quantify, what constitutes real usage of your site. Let's look at the tools and techniques you'll need to do that.

Mining the Server Log

The de facto standard log format, pioneered by the NCSA server, contains for each hit a record of the date, the time, the URL fetched, and the IP address that fetched it. This raw data piles up quickly. The BYTE Site now cranks out almost 5 MB of this stuff every day. To begin refining it, you'll need a tool that summarizes hits by IP address and by page. Perl was born to do just this kind of reporting. Roy Fielding's wwwstat (http://www.ics.uci.edu/WebSoft/wwwstat/), a Perl script that boils down server logs, creates many of the statistics reports that you see on the Web. I run it every day on our log.
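If you'd rather roll your own, the core of such a tool fits in a page of Perl. Here's a minimal sketch -- not wwwstat itself, just the basic idea -- that tallies hits by address and by page, assuming the NCSA common log format:

    #!/usr/bin/perl
    # Tally hits by IP address and by URL from an NCSA
    # common-log-format file. A sketch, not wwwstat's actual code.
    use strict;
    use warnings;

    my (%hits_by_addr, %hits_by_url);
    while (my $line = <>) {
        # host ident user [date] "METHOD url PROTO" status bytes
        next unless $line =~
            m{^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST|HEAD) (\S+)};
        $hits_by_addr{$1}++;
        $hits_by_url{$2}++;
    }

    for my $url (sort { $hits_by_url{$b} <=> $hits_by_url{$a} }
                 keys %hits_by_url) {
        printf "%8d  %s\n", $hits_by_url{$url}, $url;
    }
    print "\n";
    for my $addr (sort { $hits_by_addr{$b} <=> $hits_by_addr{$a} }
                  keys %hits_by_addr) {
        printf "%8d  %s\n", $hits_by_addr{$addr}, $addr;
    }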

You'll want a version of wwwstat that converts IP addresses (199.125.99.2) to Domain Name System (DNS) names (www.byte.com). Web servers record dotted numeric addresses unless you configure them to do reverse DNS lookups on incoming addresses. Servers typically offer this reverse-lookup feature but discourage its use. Why? The gethostbyaddr function, which hands a numeric address to a name server and gets back a name, works slowly and unpredictably. Meanwhile, the client waits. So it's best to pump raw addresses into the log and convert them later.

The wwwstat I tried first didn't perform this conversion, so I wrote a Perl script (see "Reverse DNS Lookup in Perl") to do the job. Later I found that the NT version of wwwstat that comes with Process Software's Purveyor includes essentially the same code and also caches results to avoid redundant lookups.

The wwwstat code reverses the elements of each DNS name (www.byte.com, for example, becomes com.byte.www) so that sorted lists of these names cluster by domain. It also maps domains (such as au and np) to countries (Australia, Nepal). We all know the Web is global, but it didn't really sink in for me until wwwstat mapped the geographic diversity of the BYTE Site's audience. The reversed names also help sort out usage within domains. Digital Equipment, IBM, and Oracle top the list of commercial users of the BYTE Site.
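The reversal itself is nearly a one-liner. Here's a sketch (the country table is just an excerpt, and the host name is a made-up example):

    # Reverse the elements of a DNS name so sorted lists cluster by domain.
    my %country = (au => 'Australia', np => 'Nepal');  # excerpt of a domain-to-country table

    my $name = 'www.anu.edu.au';                       # example host in Australia
    my $reversed = join '.', reverse split /\./, $name;    # 'au.edu.anu.www'
    my ($tld) = $name =~ /\.([a-z]+)$/;                # last element: 'au'
    print "$reversed\n";                               # sorts next to other au.* names
    print "$country{$tld}\n" if exists $country{$tld}; # Australia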

Mapping the Archive

Reports generated by wwwstat were a good start, but I wanted to know more, so I wrote some Perl scripts to further analyze the wwwstat output. Mapping the URLs in the log to titles of BYTE articles was the first task. Alert readers may remember that I promised to make my HTML generator produce meaningful URLs like /November_1995/Reviews/Enterprise_Data_Managers rather than cryptic ones like /9511/sec9/art2.htm. Well, I still haven't gotten around to it -- nobody's perfect -- so my HTML generator instead writes a file of mappings between URLs and titles. A Perl script merges this file with wwwstat's mapping between URLs and hits in order to relate titles to hits (see "Tracking Usage of the BYTE Archive").

As the script parses the wwwstat output, it uses a regular expression to filter out all nonarchive documents, and also all archive documents that aren't BYTE articles -- for example, table of contents pages. I'm not interested in tracking the use of our site's navigational armature; I want to focus on the use of its content, and for us the relevant unit of content is the article.

The script also builds views by issue (e.g., November 1995) and by section (e.g., Reviews). Just as wwwstat shows us the global composition of our audience, these views show us as never before how that audience uses our content.

Each site has its own fundamental unit of content and method of organization. So you likely won't find an off-the-shelf tool that delivers this level of analysis. Fortunately, Perl's regular-expression searching and dynamic associative arrays make quick work of custom log analysis.
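For a concrete example, here's a compressed sketch of that merge. The file names, the tab-delimited title file, and the assumption that archive URLs look like /9511/sec9/art2.htm are illustrative; your site's scheme will differ:

    #!/usr/bin/perl
    # Merge a URL-to-title file with wwwstat's per-URL hit counts, then
    # roll the counts up by issue and by section. File names, the tab-
    # delimited title format, and the URL pattern are all illustrative.
    use strict;
    use warnings;

    my %title;
    open my $map, '<', 'titles.map' or die "titles.map: $!";
    while (<$map>) {
        chomp;
        my ($url, $title) = split /\t/, $_, 2;
        $title{$url} = $title;
    }
    close $map;

    my (%hits, %by_issue, %by_section);
    open my $stats, '<', 'wwwstat.out' or die "wwwstat.out: $!";
    while (<$stats>) {
        # Keep only archive articles, e.g. "1234  /9511/sec9/art2.htm";
        # tables of contents and other armature fall through the regex.
        next unless /^\s*(\d+)\s+(\/(\d{4})\/(sec\d+)\/art\d+\.htm)/;
        my ($count, $url, $issue, $section) = ($1, $2, $3, $4);
        $hits{$url}           += $count;
        $by_issue{$issue}     += $count;   # view by issue (e.g., 9511)
        $by_section{$section} += $count;   # view by section (e.g., sec9)
    }
    close $stats;

    for my $url (sort { $hits{$b} <=> $hits{$a} } keys %hits) {
        printf "%6d  %s\n", $hits{$url}, $title{$url} || $url;
    }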

IP Addresses and Users

How large is the BYTE Site's audience? Because wwwstat doesn't tally the IP addresses in the log, I wrote another Perl script to read wwwstat's output and add them up. Currently each day's log contains about 2000 IP addresses. Does that correspond to 2000 users? No, because some addresses (e.g., compuserve.com, oracle.com) represent hundreds or thousands of users.

Without mandatory registration, you cannot precisely quantify the number of users, but you can make an estimate. First you separate the DNS names found in one day's log into two groups: those that are clearly gateways to corporate networks or on-line services and those that aren't. Compute the average number of hits per IP address for the latter group of (presumably) individual users. Then use that ratio to estimate the number of users behind each of the corporate addresses in the former group. I did this with a few daily log files and concluded that the BYTE Site's user population exceeds its number of visiting IP addresses by 10 to 15 percent.
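The arithmetic takes only a few lines of Perl. In this sketch, the gateway list is illustrative, and the input is assumed to be lines of hit counts and DNS names like those in wwwstat's output:

    #!/usr/bin/perl
    # Estimate users behind gateway addresses: compute the average hits
    # per (presumably individual) address, then size each gateway by
    # dividing its hits by that average. The gateway list is illustrative.
    use strict;
    use warnings;

    my %is_gateway = map { $_ => 1 } qw(compuserve.com oracle.com);

    my %hits;
    while (<>) {                              # lines of "hits  dns-name"
        my ($count, $name) = /^\s*(\d+)\s+(\S+)/ or next;
        $hits{$name} += $count;
    }

    my ($ind_hits, $ind_addrs) = (0, 0);
    for my $name (keys %hits) {
        next if $is_gateway{$name};
        $ind_hits += $hits{$name};
        $ind_addrs++;
    }
    die "no individual addresses\n" unless $ind_addrs;
    my $avg = $ind_hits / $ind_addrs;         # hits per individual visitor

    my $users = $ind_addrs;
    for my $name (grep { $is_gateway{$_} } keys %hits) {
        $users += int($hits{$name} / $avg + 0.5);
    }
    printf "%d addresses, an estimated %d users\n",
        scalar keys %hits, $users;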

You'll want to measure not only the size of your audience but also its rate of growth. To do that, I wrote another script that reads the daily log files and reports three values for each: the number of unique IP addresses for that day, the cumulative number of unique addresses since the launch of the site, and the number of new addresses for that day (the difference between the first two). The results were shocking. On any given day, more than half the IP addresses in the BYTE Site's log represent new visitors, a pattern that's held constant through the five months of the site's existence. When I began writing this article on November 11th, 80,000 IP addresses had visited the BYTE Site. Today, December 7th, I'm proofreading this article and the number stands at 110,000. I never would have thought that one server on a puny 56-Kbps line could handle so many users. A little data mining proves that it can.
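For the record, here's the shape of that growth report in Perl -- a sketch, not the actual script, assuming one common-log-format file per day, named on the command line in date order:

    #!/usr/bin/perl
    # For each daily log, report unique addresses, cumulative unique
    # addresses since launch, and first-time visitors for that day.
    use strict;
    use warnings;

    my %ever_seen;
    for my $logfile (@ARGV) {
        my %seen_today;
        open my $log, '<', $logfile or die "$logfile: $!";
        while (<$log>) {
            my ($addr) = /^(\S+)/ or next;
            $seen_today{$addr}++;
        }
        close $log;

        # Count the addresses never seen before today, marking them seen.
        my $new = grep { !$ever_seen{$_}++ } keys %seen_today;
        printf "%s: %d unique, %d cumulative, %d new\n",
            $logfile, scalar keys %seen_today,
            scalar keys %ever_seen, $new;
    }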

Page Tracking vs. Link Tracking

Server logs record the documents that users fetch but say nothing about how users interact with those documents. From the Resources page on the BYTE Site, for example, users can jump to any of the dozens of sites listed there. Measuring how often users follow such links is necessary for a commercial site, where the links correspond to ads. But it's useful for noncommercial sites, too. The more you know about the kinds of information your users want, the more effectively you can configure your site to provide it.

To track links, replace standard URLs with CGI URLs that call a link-tracking script. The script can append information to a custom log file, then return a redirection to the original URL. On the BYTE Site's Resources page, for example, a link of the form

<a href="http://web.archive.org/web/199603/http://www.compuserve.com>compuserve</a>%3C/tt">

becomes

<a href="[unarchived-link]">

When a user clicks on this link, loglink.pl opens a file called CompuServe and records in it these items of information about the event: the date, the time, the user's IP address (found in the CGI variable REMOTE_ADDR), and the user's Web browser (the CGI variable HTTP_USER_AGENT). Then, instead of returning the standard HTTP content header ("Content-type: text/html\n\n") and an HTML document, it returns this location header:

Location: http://www.compuserve.com

which redirects the browser to that URL. From the user's perspective, the link to CompuServe is immediate. But the quick detour through loglink.pl stores useful information that otherwise wouldn't be available.
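Here's a sketch of such a script -- the spirit of loglink.pl, though not its actual code. The way it learns the link name and target is illustrative, since those details depend on your CGI setup:

    #!/usr/bin/perl
    # Link-tracking CGI sketch: append a record to a per-link log file,
    # then redirect the browser to the real destination.
    use strict;
    use warnings;

    # Illustrative: a real script might take these from QUERY_STRING
    # or from a table keyed on it.
    my $name   = 'CompuServe';
    my $target = 'http://www.compuserve.com';

    my $when    = localtime;
    my $addr    = $ENV{REMOTE_ADDR}     || '-';
    my $browser = $ENV{HTTP_USER_AGENT} || '-';

    open my $log, '>>', $name or die "$name: $!";
    print $log "$when\t$addr\t$browser\n";
    close $log;

    # No content header; the Location header alone redirects the browser.
    print "Location: $target\n\n";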

Counting Browsers

Once you're set up to track links, you can conduct a browser census. The HTTP_USER_AGENT variable identifies the user's browser and (directly or indirectly) the OS under which it runs. The "Site Browser and Platform Summaries" tables report on the browser population that has visited our site's Resources page. Where do these tables come from? A Perl script reads the user-agent field in the link-tracking log and applies simple heuristics: Mozilla is the reported name of the Netscape browser, WebExplorer runs on OS/2, X11 implies Unix.
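In Perl, those heuristics boil down to a couple of conditional chains. This sketch isn't browsers.pl itself, and it assumes the tab-delimited records written by the link-tracking sketch above, with the user-agent string in the third field:

    #!/usr/bin/perl
    # Tabulate browsers and platforms from the user-agent strings
    # in a link-tracking log.
    use strict;
    use warnings;

    my (%browser, %platform);
    while (<>) {
        chomp;
        my (undef, undef, $agent) = split /\t/;
        next unless defined $agent;

        # Browser heuristics: Mozilla is Netscape's reported name.
        my $br = $agent =~ /^Mozilla/    ? 'Netscape'
               : $agent =~ /Lynx/        ? 'Lynx'
               : $agent =~ /WebExplorer/ ? 'WebExplorer'
               :                           'Other';
        $browser{$br}++;

        # Platform heuristics: X11 implies Unix; WebExplorer means OS/2.
        my $pl = $agent =~ /Win/         ? 'Windows'
               : $agent =~ /Mac/         ? 'Macintosh'
               : $agent =~ /X11/         ? 'Unix'
               : $agent =~ /WebExplorer/ ? 'OS/2'
               :                           'Other';
        $platform{$pl}++;
    }

    for my $table (\%browser, \%platform) {
        print "\n";
        printf "%6d  %s\n", $table->{$_}, $_
            for sort { $table->{$b} <=> $table->{$a} } keys %$table;
    }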

As the HTML standard continues to fragment, a browser summary can help you decide how much of your audience will be inconvenienced by the use of nonstandard HTML extensions. Remember, though, that even a small percentage of a large audience represents a lot of users. Lulled by Lynx's negligible 3 percent of the BYTE Site's browser population, I grew lax about including the ALT= attributes that identify images to nongraphical browsers. The Lynx minority subsequently issued a prompt and vocal protest, to which I am now responding.

The platform summary offers a remarkable view of the composition of our audience. Windows dominates, as you'd expect. But knowing the relative sizes of the Unix, Mac, and OS/2 slices of our audience can help us tune our editorial mix. A detailed version of this summary further decomposes the Unix slice into its AIX, SunOS, HP-UX, Linux, and other components.

A Finger on the Pulse

A fascinating event occurred on the BYTE Site in late October. Tom Thompson's October review "PowerMac Gets PCI" suddenly rose to the top of the charts. Normally the most popular articles on the site are the cover stories; users download the current month's cover story 500 to 600 times a day. But for three days, users read Tom's review at triple that rate. Clearly some influential site (we never found out which one) deemed the article important and was referring people to it. In a week, 6000 copies of that PowerMac article went out over the wire.

In the pre-Web-server era, the same kind of thing probably happened from time to time. But if an article struck a nerve and prompted readers to photocopy and pass it around, we'd never have known. The Web's ability to monitor demand in real time connects information providers to their customers in a way that's exhilarating and also a bit scary. There's nowhere to run, nowhere to hide: If you don't understand your audience, it's only because you don't want to.


TOOLWATCH

wwwstat: http://www.ics.uci.edu/WebSoft/wwwstat/

This tool does the first level of refinement, boiling down a 5-MB server log into a concentrated 200-KB report that summarizes page usage by IP address and URL. It's a cinch to further refine wwwstat's output for detailed and customized analysis of your site's usage.


BOOKNOTE

Public Access to the Internet, edited by Brian Kahin and James Keller

The MIT Press
ISBN: 0-262-61118-X
Price: $20

Will the Internet be available to everyone at an affordable price? Should the government intervene to make it so? A series of essays explores the political, social, and economic issues at stake.


Downloadable Scripts

You'll need to tweak these to match your own site's method of
organization, but they'll point you in the right direction:

articles.pl: Reads a file of URL-to-title mappings and wwwstat output, builds views of archive usage by issue and section.

loglink.pl:  Writes a record to a custom log file, returns a location header.

browsers.pl: Reads a custom log file, tabulates browsers and platforms used.


Reverse DNS Lookup in Perl


Your Web server can probably convert dotted numeric IP addresses to Domain Name System (DNS) names on the fly, but if you let it do this, your users will wait. Instead, do these lookups separately. Here's the essential algorithm in Perl: Read a line from the log, grab a dotted address, split it into an array, pack it into a 4-byte binary structure, and pass that to the gethostbyaddr function that's bound into most implementations of Perl.
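In code, it looks something like this -- a sketch, with a cache added to avoid redundant lookups:

    #!/usr/bin/perl
    # Convert the dotted address at the front of each log line to a DNS
    # name: split the address, pack it into 4 binary bytes, and hand it
    # to gethostbyaddr. A cache avoids redundant (and slow) lookups.
    use strict;
    use warnings;
    use Socket;                 # for AF_INET

    my %cache;
    while (my $line = <>) {
        $line =~ s{^(\d+\.\d+\.\d+\.\d+)}{lookup($1)}e;
        print $line;
    }

    sub lookup {
        my ($dotted) = @_;
        unless (exists $cache{$dotted}) {
            my $packed = pack 'C4', split /\./, $dotted;
            my $name   = gethostbyaddr($packed, AF_INET);
            $cache{$dotted} = defined $name ? $name : $dotted;  # keep the number on failure
        }
        return $cache{$dotted};
    }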


Tracking Usage of the BYTE Archive


A Perl script merges a file of URL-to-title mappings with wwwstat's report of the number of hits for each URL. Then it builds views of the popularity of articles in the BYTE archive by issue and by section.


Site Browser and Platform Summaries


A CGI script that tracks links on a page can also accumulate useful information about the browsers and platforms used to access that page. These tables profile visitors to the BYTE Site's Resources page.


Jon Udell (judell@bix.com) is BYTE's executive editor for new media.
Copyright © 1994-1995