Friday, May 11, 2007

Quest for web log analysis software

I am currently searching for a web log analysis package for our site. I have to say that the more I look at the available options, the more disgusted I get. Basically, what I am looking for is web log analysis software with the following features:
  • Reading data from web server logs (not using custom javascript to record hits)
  • Storing log data in a SQL database, so I can use SQL to generate custom reports
  • Capable of generating custom reports with custom graphs and charts
  • Capable of reading custom log formats (such as Apache LogFormat strings)
  • Able to "drill down/zoom in" into the reports for more information
  • Running on Linux, BSD or Solaris.
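To make the "use SQL to generate custom reports" requirement concrete, here is the kind of thing I mean. This is just an illustrative sketch, not any particular product's schema; the table name and columns are made up:

```python
# Once hits land in a SQL table, any report is just a query away.
# The schema below is hypothetical, purely for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE hits (
        ts TEXT,        -- request timestamp
        path TEXT,      -- requested URL path
        status INTEGER, -- HTTP status code
        bytes INTEGER   -- response size in bytes
    )""")
conn.executemany(
    "INSERT INTO hits VALUES (?, ?, ?, ?)",
    [("2007-05-10 09:15:00", "/index.html", 200, 5120),
     ("2007-05-10 14:02:00", "/about.html", 200, 2048),
     ("2007-05-11 08:30:00", "/index.html", 404, 512)])

# A "hits per day" report -- the canned packages give you this one,
# but with the data in SQL, every variation is just another query.
report = conn.execute("""
    SELECT date(ts) AS day, COUNT(*) AS hits, SUM(bytes) AS total_bytes
    FROM hits
    GROUP BY day
    ORDER BY day""").fetchall()
for day, hits, total in report:
    print(day, hits, total)
```

Swap the GROUP BY for `status`, `path`, or an hour bucket and you have a different report, with no waiting for a vendor to add it.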
It seems that getting all of these is close to impossible. Most open source packages and some commercial ones are too primitive, and the ones that do seem to hold some promise are very good at obscuring the actual functionality they provide. The "too primitive" category consists of packages that read in log files and generate a preset number of common reports, such as hits per day, hourly distribution, web browser distribution etc. If this is enough for you, open source provides adequate solutions.
  • Analog - very configurable. Generates a couple of dozen different reports.
  • Awstats - similar to analog, the reports are a little nicer. Slightly less configurable, but written in perl and therefore should be easy to customize.
  • Visitors - a rather primitive, but very fast log analysis program. Incapable of storing log information, so it needs the complete set of logs every time.
A little more promising is a package called Lire. It is written in perl and distributed under the GPL. Unfortunately it is not without its quirks. To say that the documentation is lacking would not be quite accurate: there is an extensive user manual. But after digging through the manual, even simple questions were left unanswered. How do I import log files manually? How do I generate a report manually? Where are the configuration files? What is the XML schema for the configuration? How do I make a custom log format converter? And so on. I know that documentation has always been a weak spot in many open source projects, but in this case the obstacles are in all the wrong places: as soon as I started to look for something in the docs, it wasn't there. On the positive side, the source is all there, it is written in perl, there are a lot of included examples, and with a bit of patience I was able to figure out how things work.

Then the next set of disappointments came. In order to parse custom log formats you have to write a perl module. Although the process is documented in the developer's manual, this is a bit too much; I would expect the system to either take an Apache LogFormat string or to ask for a regex and field descriptions. Both log file importing and report generation are provided by the same script, lr_cron, which is supposed to run from cron and doesn't take any parameters. As far as I can tell there is no way to generate a particular report from a particular data store. And log file importing is SLOW: a test log file of a couple million lines took close to 4 hours to import. In a few days I will get a new server with dual dual-core Xeons and 8GB of RAM, which should be a bit faster than the box I used in my tests, but I expect to process 8-10 million lines of logs daily, and that is not supposed to take all day.
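The "regex and field descriptions" approach I was hoping for is not a lot of machinery. A minimal sketch (the pattern below matches the standard Apache combined format; a custom format would just be a different pattern with different named groups):

```python
# Sketch of regex-based log parsing. The named groups double as the
# "field descriptions"; only the pattern changes per log format.
import re

COMBINED = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+)'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?')

def parse_line(line):
    """Return a dict of fields for one log line, or None if it doesn't match."""
    m = COMBINED.match(line)
    return m.groupdict() if m else None

sample = ('10.0.0.1 - frank [11/May/2007:10:15:32 -0400] '
          '"GET /index.html HTTP/1.0" 200 2326 '
          '"http://example.com/start" "Mozilla/4.08"')
hit = parse_line(sample)
print(hit["host"], hit["status"], hit["request"])
```

Asking users to supply a pattern like this would be a much lower bar than asking them to write a perl module.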

At this point I decided to turn to commercial solutions. The first few I found were silly Windows programs similar in functionality to the ones I mentioned before (analog and friends), but without as much customization. The next batch consisted of hosted solutions that require you to insert pieces of script into your pages and gather statistics that way. It is a nice technique, but not quite what I had in mind; we already have a solution like that. Unexpectedly, I found it very difficult to figure out whether the software in question is hosted or standalone, and whether it uses logs or scriptlets. It seems that vendors of web analysis software go to extreme measures to hide any and all technical information related to their software. Then I started on the "big boys", and this was even more disappointing than the OSS world.
  • WebTrends - one of the most famous solutions on the market, rumored to be also one of the most expensive and powerful. Unfortunately I was not able to find out. There is no pricing information on the site. To get any information about product features you have to fill out a form with all your information (email, name, address, a bunch of marketing questions etc.), and it will try to sign you up for a few newsletters on the way. The resulting product sheets say a lot about the marketing needs of a modern business, but nothing about the actual features of the product, or at least no technical information. To obtain a trial version you have to fill out a form and supposedly a representative will contact you. Only then did I find out that the package only runs on Windows.
  • ClickTracks Pro - there is no more information on the site than in the case of WebTrends. I have yet to figure out what platforms this software runs on, or whether it is capable of custom reporting and importing custom logs. The pricing info is on the site, though: a Pro version license costs $9,344 and there is no trial version. All they offer is to run a trial report on your data. Before I spend 10K on a software package I expect to become completely familiar with all of its features, requirements, quirks etc. I do not see how I can do that in this case.
  • Omniture SiteCatalyst - same as the previous two. No information on the site. No trial. And to add insult to injury, the link to the product data sheet actually points to the "Contact Us" page. Screw you, I will contact you once I know I might want your product. I am not wasting time on your sales pitches before I find out what exactly you are selling.
  • Unica NetTracker - a much better experience. The product actually uses log files, works on both Windows and UNIX, and uses a database to store the data, with support for several different database products including MS-SQL and MySQL. There is a trial version, but it needs to be requested. I will be able to tell more once I actually get the trial and play with it. I am putting a lot of hope into this one.
  • Sawmill - a nice software package that makes reports out of logs. Reasonably priced, and less geared towards web marketing. There is a downloadable demo and fairly complete documentation. Supports MySQL and an internal database. Runs on Windows and Linux. I have been playing with the demo version for the past couple of days. Custom log formats need to be defined manually by creating a log format definition file; the file format is relatively straightforward and the support is pretty good. The GUI configuration wizard is capable of rendering Apache LogFormat strings into a proper log format definition. The MySQL support is unfortunately lacking: the queries are unoptimized to extremes and take forever to complete, even on very good hardware with a lot of server-level optimizations. The internal database seems reasonable. Log file imports are rather slow and the resulting database takes a lot of space, but the reporting capabilities are fantastic. The development team's response is very good; I have submitted several bug reports, and it is possible that by the time I have to decide on a particular package the database bugs will be fixed.
All in all this research has been pretty disappointing. If anyone has suggestions for other solutions I haven't tried, or has anything good or bad to say about the products I mentioned, please do so in the comments.


  1. You might want to also look at Wusage for a commercial solution. I've used lots of the free and commercial packages over the years and in the end settled on using Webtrends and Wusage. Webtrends we used because it made prettier reports in Microsoft Word and Excel format for administrators who were Microsoft junkies. Wusage we used because it was much faster and easier to administer. Just so you know, Webtrends is a PIG for both memory and disk resources, and we had to dedicate a box just to run stats.

    If you have multiple domains to analyze and millions of hits per X, then the "ISP mode" is better than "Advanced mode" because in ISP mode all sites share a common DNS lookup table.

  2. Thank you. Webtrends seems to be Windows only, and I am not sure good web analytics justify adding a Windows server to my setup. I will take a look at Wusage.