Tuesday, April 24, 2007

Web statistics from the command line

There are a lot of web statistics packages out there. And some of them are good. To name a few, there is Analog (especially when paired with Report Magic), AWStats and Visitors. There are also excellent commercial packages (but they don't pay me to advertise :) ).  Most of these have one particular problem. They generate a number of static reports. So if you just want to see how many hits your site received per day during last week they are excellent. Unfortunately if your question is more like "What are the top 10 pages hit by users with Internet Explorer who were referred to us by Google?" all of these programs become rather useless.
It is possible to record the web access data into a database and then do complex queries on it, but the site I work for receives millions of hits every day, so handling all those logs would require me to build another big system on top of already existing one just to handle logs. There are probably solutions for problems like this out there. They probably cost a lot and require expensive hardware to install and run. And our analysts keep bugging me with questions like the one I mentioned.
So, here comes the poor man's custom web statistics. Lets start with the question I asked about the google referred people using IE. For sake of simplicity I will assume the standard apache combined log format. Mind you, that some perl knowledge is assumed

perl -ne '/MSIE \d/&&/\d{3} (\d+|-) ".*?google\.com.*?"/&&/(POST|GET) (\/.*?) HTTP/&&$s{$2}++;END{foreach $t (keys %s){print "$s{$t}\t$t\n"}}' access.log | sort -gr | head -n 10

Confusing? Let me translate a bit. The -n flag causes Perl to go through the lines of the input performing the supplied code for each line. The other flag -e allows to specify the code on command line. Here is an article from NewsForge with more information on using Perl from the command line. So lets see what we did here. I will properly format the Perl code and comment.

# If line has "MSIE " in it we assume the user agent is IE.
# We are using && everywhere to make sure that if anything fails we stop counting this line
/MSIE \d/ &&
# The referer field in the combined log format goes after the return code and transfered data size
# We are looking for google.com string in the referer field
/\d{3} (\d+|-) ".*?google\.com.*?"/ &&
# The request starts with method and ends with the HTTP version.
# We will catch the requested page by using regex grouping
/(POST|GET) (\/.*?) HTTP/ &&
# Now we use the captured page name as a hash key and add one to the value (count page accesses)
# END block is only invoked after everything else was done
# for each key in the hash
foreach $t (keys %s) {
# print the page name and the number
print "$s{$t}\t$t\n";

Then we sort the Perl output using the UNIX sort command and pipe it through head to only print top 10 pages.
So, as you can see the code consists of the following parts:

  • Set of conditions that need to be satisfied to count current line

  • The regular expression to store the value we need to count

  • The standard ending with the hash and the END block

Here are a few examples of using regular expressions to match fields in the combined log format
/^\d+\.\d+\.\d+\.\d+/ - matches client IP address
/ \[(.*?)\] "(GET|POST)/ - will store the time stamp in $1
/(GET|POST) \/.*? HTTP/ - matches the request line
/\d{3} (\d+|-) "(.*?)"/ - will store the referrer field in $2
/" (\d{3}) (\d+|-) "/ - will store the return HTTP code in $1

By using this shortcut you can obtain a lot of interesting statistics from your web logs, without dealing with expensive software and multi gigabyte databases. Here is another command line. What this one does I will leave as an exercise to the reader :)

perl -ne '/MSIE/||next;/^.*?" "(.*)"$/;$s{$1}++;END{foreach $t (keys %s){print "$s{$t}\t$t\n"}}' access.log | sort -gr | head -n 10