Thursday, May 6, 2010

Namespace trickery in Clojure

As you might have guessed from my last post, I have been playing around with web site scraping lately. This posed an interesting problem unrelated to HTML parsing. Each site needs its own function (or, after refactoring, a bunch of functions) to scrape the data. Generally you want to run these functions on a schedule, so you also want a function that runs all the scrapers. And personally, I like magic, so I wanted to just add scraper functions and have the aggregator function call them without me doing anything else. At first, I kept all my scrapers in a single scrapers.clj file, so I came up with the following solution.
;; add scraper metadata
(defmacro defscraper 
  [name & decls]
  ;; same shape as clojure.core/defn-, plus the :scraper flag
  (list* 'defn- (with-meta name 
                  (assoc (meta name) :scraper true)) decls))

;; compile a list of defined scrapers
(defn- *collect-scrapers* []
  (filter 
   (fn [func] (get (meta (val func)) :scraper false))
   (ns-interns 'com.wombat.web.scrapers)))

;; run all defined scrapers
(defn *run-all-scrapers* []
  (let [scrapers (*collect-scrapers*)
        threads (doall 
                 (for [[name scraper] scrapers] 
                   (future (store-site (scraper)))))]
    (doseq [t threads] (deref t))))
Then I could just use defscraper instead of defn and voila, any function defined using defscraper would be run in parallel by (*run-all-scrapers*).

But after a while, several other issues came up. The scrapers file was getting long. I needed to define other functions to work with scrapers, like individual functions that would store data from a scraper into a database or return information about the web site, etc. So I split the scrapers file and put each scraper into its own file and its own namespace. At first, I wanted to just refer all the scraper namespaces into the main scrapers namespace, but then I had an idea. What if, instead of polluting the main namespace with all the scraper functions, I could keep them in their individual namespaces and find them by a standard name? So I deleted the defscraper macro, changed all scraper function definitions to defn and called them all scraper. Then I changed *collect-scrapers* and *run-all-scrapers* to look like this.
;; compile a list of defined scrapers
(defn- *collect-scrapers* []
  (map 
   #(get (ns-publics %1) 'scraper) 
   (filter #(contains? (ns-publics %1) 'scraper) (all-ns))))

;; run all defined scrapers
(defn *run-all-scrapers* []
  (let [scrapers (*collect-scrapers*)
        threads (doall (for [scraper scrapers] 
                         (future (store-site (scraper)))))]
    (doseq [t threads] (deref t))))
And that is that.
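
For readers more at home in Python, the same find-by-a-standard-name idea can be sketched roughly as follows. This is only an illustration and not part of the project above: it assumes every scraper lives in its own module and exposes a callable named scraper, and the demo module names and the store_site stand-in are made up for the example.

```python
import sys
import types
from concurrent.futures import ThreadPoolExecutor


def collect_scrapers(modules=None):
    """Return every callable named `scraper` found in the given modules."""
    if modules is None:
        # Like (all-ns), this only sees modules that are already loaded.
        modules = list(sys.modules.values())
    found = []
    for module in modules:
        candidate = getattr(module, "scraper", None)
        if callable(candidate):
            found.append(candidate)
    return found


def run_all_scrapers(store_site, modules=None):
    """Run all scrapers in parallel and pass each result to store_site."""
    scrapers = collect_scrapers(modules)
    if not scrapers:
        return []
    with ThreadPoolExecutor(max_workers=len(scrapers)) as pool:
        futures = [pool.submit(lambda s=s: store_site(s())) for s in scrapers]
        # Waiting on every future plays the role of (deref t).
        return [f.result() for f in futures]


def make_module(name, result):
    """Build a throwaway module whose scraper returns a fixed result."""
    module = types.ModuleType(name)
    module.scraper = lambda: result
    return module


# Hypothetical modules: two with a scraper, one without.
demo_modules = [
    make_module("site_a", {"site": "a"}),
    make_module("site_b", {"site": "b"}),
    types.ModuleType("helpers"),
]
stored = run_all_scrapers(lambda data: data["site"], demo_modules)
print(stored)
```

As in the Clojure version, a module without a scraper is simply skipped, so adding a new scraper means adding a new module and nothing else.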

Wednesday, May 5, 2010

How to scrape websites in clojure for fun and profit

Let's say you are hunting for a good deal on a hard drive and you want to monitor prices on newegg.com. You want an internal hard drive of (let's say) over 1TB in size. And you are too lazy to open a browser, so you want to do this in your favorite functional programming language. Well, maybe this is not very plausible, but this is a short primer on parsing web pages using Clojure, so there. You could use a Java-based HTML parser, such as HtmlCleaner. There was recently an excellent article about it. But let's say that you would prefer to do it in a more functional style. Well, this is where Enlive comes in. I will assume that you have emacs, slime, swank-clojure and leiningen all sorted out, so let's start with the meat of the process. The project.clj should be something like this:
(defproject newegg "1.0.0-SNAPSHOT"
  :description "newegg scraping"
  :dev-dependencies [[leiningen/lein-swank "1.2.0-SNAPSHOT"]]
  :dependencies [
                 [org.clojure/clojure "1.1.0"]
                 [org.clojure/clojure-contrib "1.1.0"]
                 [enlive "1.0.0-SNAPSHOT"]])
Now we can start coding. We are going to define selectors for the HTML elements we are interested in and then return a list of the data they contain. In this instance, I am aiming to get the price, a short description and the rating.
(ns newegg
  (:require [clojure.contrib.str-utils2 :as str2])
  (:require [clojure.contrib.json.read :as json])
  (:require [net.cgrand.enlive-html :as html]))

(def *base-url* (str 
                 "http://www.newegg.com/"
                 "Product/ProductList.aspx"
                 "?Submit=ENE&"
                 "N=2010150014%20103530090%201035915133&"
                 "bop=And&"
                 "ShowDeactivatedMark=False&"
                 "Order=RATING&"
                 "Pagesize=100"))

;;pick all div elements of class itemCell
(def *item-list-selector* [:div.itemCell])
;; pick spans of class itemDescription
(def *item-description-selector* [:span.itemDescription])
;; pick hidden inputs
(def *item-price-selector* [[:input (html/attr= :type "hidden")]])
;; pick anchor of class itemRating
(def *item-rating-selector* [:a.itemRating])

(defn html-data []
  (html/html-resource (java.net.URL. *base-url*)))

(defn item-list [] 
  (html/select (html-data) *item-list-selector*))

(defn item-properties [item]
  (list      
   (first 
    (:content 
     (first 
      (html/select item *item-description-selector*))))
   (:value (:attrs (first
                    (html/select item *item-price-selector*))))
   (if (empty? (html/select item *item-rating-selector*))
     ""
     (re-find #"\d+$" 
              (:title 
               (:attrs 
                (first
                 (html/select item *item-rating-selector*))))))))

(defn scrape-and-print []
  (doseq [item (item-list)] (println (str2/join " " (item-properties item)))))

Sunday, May 2, 2010

Why switch from VIM to emacs?

Preface

OK, this topic has been discussed many times, sometimes by much more competent people than myself. So I will quickly reiterate the main reasons one might consider switching and proceed to other issues.

Why not Vim?

Vim is just fine... for some things.

I have been using Vim for years (and was quite adamantly against Emacs). I work as a system administrator and for me, vi is one of the main tools of the trade, since it is on every system. On Linux systems you will mostly get Vim installed as the default vi, so learning and using Vim was natural. Most of my editing tasks involved changing configuration files and writing relatively short scripts. Almost no debugging was involved, and where there was debugging, it was mostly just a run/observe errors/fix script/run again cycle. For this type of use, Vim is perfect. It loads fast, so you can actually quit it every time you are done editing, and most testing/debugging can be accomplished by switching to a terminal window (or even better, to a terminal window in a screen session). It is only when you start spending significant amounts of time writing code that Vim's deficiencies start coming to light. What deficiencies? There are three main ones.

Vim is bad at communicating with external processes

While it is, of course, possible to run shell commands from Vim and even pipe data into the Vim buffer, this is not enough. You need to be able to properly interact with a process such as a debugger. You need to send commands to it and capture their output, not run them and forget. Emacs is excellent at this, but with Vim either there is built-in support for a particular program (like gdb), or you are out of luck, or you will need a lot of hacking (like vimclojure).

Vim is not very good at editing multiple documents

Well, this is not exactly true: Vim supports opening multiple files and recently added tab support. But it is not as convenient and does not feel as natural as in Emacs. Multiple file support in Vim just feels awkward.

Extending Vim is a pain

Vim's internal scripting language is strange, and scripting with other languages compiled into Vim, such as Ruby or Python, is limited and not very portable. While many consider LISP to be strange, I find it to be not nearly as strange as vimscript.

Why Emacs?

Emacs is very good at communicating with external processes

So, you get a lot of benefits of the underlying OS right there in your editor. You also get much better integration with compilers, interpreters, REPL environments etc. You can use IRB and iPython or many other interactive dynamic language environments right out of the editor and get symbol completion and many other niceties. You can use programs like ssh, telnet or rsync to edit files on remote systems. There are too many uses to enumerate here, but I think you get the point.

Emacs is easy to configure

While originally you had to configure Emacs by writing Emacs LISP, this is no longer required. Recent versions of Emacs sport a very powerful customization interface that allows you to change a lot of different aspects of the editor by pointing and clicking on things.

Emacs is old and the community is obsessive

While Vim has been around since 1991 and only got proper scripting support in 1998 (some would say in 2001), Emacs has been around since the 70's. And during these 30-something years, many talented people attempted to teach Emacs to do just about anything you could possibly imagine. So, if you want Emacs to do something, chances are, someone somewhere wrote a cute little bit of lisp that does exactly what you want.

LISP is good for you :)

And if Emacs is not doing something you want, you can change just about anything. And you should. Because anyone who calls himself a programmer should know at least a little bit of some Lisp-like language, and it might as well be Emacs LISP. It will alter your perception of reality, open your mind and chakras, walk your dog, neuter your cat and return your library books on time in under 10 lines of code.

But...

But I am so used to Vim

Emacs has a mode called Viper that makes Emacs behave in a Vimish way. It has different levels, in order to gradually phase out your Vim habits. If you tend to enter a cold pool by first dipping your little toe, you might want to start with Viper. I am more of a dive-in-head-first-while-screaming-obscenities person, so I do not use it.

But Emacs takes forever to load

Well, first, it is not true. A simple Emacs setup loads as fast as a simple Vim setup, and a complicated Vim setup loads as slowly as a complicated Emacs setup. On top of that, Emacs has an autoload ability that lets you load only the minimum at startup and load the rest when it is actually needed. And Emacs LISP can be byte-compiled to speed up loading times. In any case, Emacs is more of a programmer's editor than a sysadmin's (I am having my doubts, but so I heard), so it is not really intended to be closed after every edit. It is intended to be started once at the beginning of the day and left running, possibly stopped when the work is over, but not necessarily.

But all those parentheses are awful!!!

No, they are not. They are beautiful. And if you let Emacs do the indentation and turn on highlight-parentheses-mode, they are even more awesome. And anyway, I think a person who is used to typing things like :g/^"foo.*?"/d and :s/^foo\(.*\)bar$/bar\1foo/ shouldn't complain about syntax.

Thursday, April 29, 2010

Resuming posting

This blog has been on a hiatus for a while, mostly because I was busy or lazy or both. Now I will try to resume occasional posting. I think I will start with some posts on switching from VIM to Emacs (as if that has never been blogged before) and setting up and using Clojure (same for this). And then I will see where that takes me.

Thursday, July 31, 2008

crontab to english translator

A couple of years ago I wrote this script, which takes crontab entries from standard input, parses them and prints an English translation. It is definitely not perfect and will bail on a lot of valid crontab entries, but for what it is worth, here it is.

#!/usr/bin/python

import re
import sys


def ordinal(n):
    """Return n with its English ordinal suffix (1st, 2nd, 3rd, 4th, 11th)."""
    if 10 <= n % 100 <= 20:
        return "%dth" % n
    return "%d%s" % (n, {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th"))


class CronJob:
    """A class describing a scheduled job."""

    def __init__(self, line):
        """
        Generate a new object from a crontab line. We should differentiate
        between the following types of crontabs:
        1. something = something (raise exception)
        2. (classic cron schedule)
        3. [!&]word(arg)[,word(arg)...] (fcron style schedule)
        4. #somestuff (comment, raise exception)
        5. (empty line, raise exception)
        """
        if re.search(r"^\s*$", line):
            raise NotACronJobError("EMPTY")
        elif re.search(r"^\s*#", line):
            m = re.search(r"^\s*#(.*)", line)
            raise NotACronJobError("COMMENT", m.group(1))
        elif re.search(r"^\s*\S+\s*=.+", line):
            m = re.search(r"^\s*(\S+?)\s*=\s*(.+)", line)
            raise NotACronJobError("VARIABLE", m.group(1), m.group(2))
        elif re.search(r"^(\*|\d+)", line) or re.search(r"^[!&]\w+", line):
            # An fcron option line such as "!nice(10)" carries no schedule.
            if re.search(r"^!.+?\)\s*$", line):
                raise NotACronJobError("GARBAGE", line)
            self._parseLine(line)
        else:
            raise NotACronJobError("GARBAGE", line)

    def _parseLine(self, line):
        if re.search(r"^[!&]\w+", line):
            self.type = "fcron"
            # Skip the leading option word, then read the five time fields.
            m = re.search(r"^\S+\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(.+)", line)
        else:
            self.type = "vixie"
            m = re.search(r"^(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(.+)", line)
        if m is None:
            raise NotACronJobError("GARBAGE", line)
        self.min = self._parseDateTime(m.group(1), "min")
        self.hr = self._parseDateTime(m.group(2), "hr")
        self.dom = self._parseDateTime(m.group(3), "dom")
        self.mon = self._parseDateTime(m.group(4), "mon")
        self.dow = self._parseDateTime(m.group(5), "dow")
        self.cmd = self._parseCmd(m.group(6))

    def _parseDateTime(self, dt, field):
        """Expand one time field into a list of numbers, or None for "*"."""
        # Valid values per field; range() excludes its upper bound.
        full = {
            "min": range(0, 60),
            "hr": range(0, 24),
            "dom": range(1, 32),
            "mon": range(1, 13),
            "dow": range(0, 8),  # both 0 and 7 mean Sunday
        }
        if dt == "*":
            return None
        elif re.search(r"^\d+$", dt):
            return [int(dt)]
        elif "," in dt:
            res = []
            for part in dt.split(","):
                values = self._parseDateTime(part, field)
                if values is None:  # a "*" in the list covers everything
                    return None
                res.extend(values)
            return res
        elif "/" in dt:
            m = re.search(r"(.+?)/(.+)", dt)
            r = m.group(1)
            st = m.group(2)
            if r == "*":
                r = full[field]
            else:
                (x, y) = r.split("-")
                r = range(int(x), int(y) + 1)
            return list(range(r[0], r[-1] + 1, int(st)))
        elif "-" in dt:
            m = re.search(r"(\d+)-(\d+)", dt)
            if m is None:
                raise NotACronJobError("GARBAGE", dt)
            return list(range(int(m.group(1)), int(m.group(2)) + 1))
        else:
            raise NotACronJobError("GARBAGE", dt)

    def _parseCmd(self, cmd):
        # System crontabs have a user column; drop a leading "root".
        if re.search(r"^\s*root\s*", cmd):
            cmd = re.sub(r"^\s*root\s*", "", cmd)
        return cmd

    def __str__(self):
        s = "Run %s" % self.cmd
        if self.mon is not None:
            months = ("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
            # Cron months are 1-based, the tuple is 0-based.
            s = s + " in " + ",".join([months[x - 1] for x in self.mon])
        if self.dom is not None:
            s = s + " on " + ",".join([ordinal(x) for x in self.dom]) + " day"
            if self.mon is None:
                s = s + " of every month"
        if self.dow is not None:
            week = ("sunday", "monday", "tuesday", "wednesday", "thursday", "friday", "saturday")
            s = s + " on " + ",".join([week[x % 7] for x in self.dow])
        if self.hr is not None:
            if len(self.hr) == 1 and self.min is not None and len(self.min) == 1:
                s = s + " at %s:%s" % (str(self.hr[0]).zfill(2), str(self.min[0]).zfill(2))
            else:
                s = s + " at " + ",".join([str(x) for x in self.hr])
            if self.dow is None and self.dom is None:
                s = s + " every day"
        elif self.min is not None:
            s = s + " at %s minutes" % ",".join([str(x) for x in self.min]) + " of every hour"
        else:
            s = s + " every minute"
        return s


class NotACronJobError(Exception):
    """An exception raised by CronJob to indicate that the line in question doesn't contain valid cron schedule information."""

    def __str__(self):
        if self.args[0] == "EMPTY":
            return "Empty Line"
        elif self.args[0] == "COMMENT":
            return "A comment: %s" % self.args[1]
        elif self.args[0] == "VARIABLE":
            return "An environment variable: %s = %s" % (self.args[1], self.args[2])
        elif self.args[0] == "GARBAGE":
            return "Uncronish thingamabob: %s" % self.args[1]
        else:
            return "If you don't know how to play with me, go to the other sandbox!"


if __name__ == "__main__":
    for line in sys.stdin:
        try:
            print(CronJob(line.rstrip("\n")))
        except NotACronJobError as err:
            print(err)
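
The heart of the script is turning a single schedule field into a list of numbers. Here is a stripped-down sketch of just that step, separate from the class above. The function name expand_field and its low/high arguments are made up for this illustration, and it only understands plain numbers, ranges, lists and steps (no month or weekday names).

```python
def expand_field(field, low, high):
    """Expand one cron field into a sorted list of integers.

    Returns None for a bare "*", meaning "every value".
    """
    if field == "*":
        return None
    values = set()
    for part in field.split(","):
        # Split off an optional "/step" suffix.
        base, _, step = part.partition("/")
        step = int(step) if step else 1
        if base == "*":
            start, end = low, high
        elif "-" in base:
            first, last = base.split("-")
            start, end = int(first), int(last)
        else:
            start = end = int(base)
        # Cron ranges include both ends, range() excludes the upper one.
        values.update(range(start, end + 1, step))
    return sorted(values)


print(expand_field("*/15", 0, 59))
print(expand_field("1,3,10-12", 1, 12))
```

The first call expands "every 15 minutes" and the second a mixed list of months. The + 1 on the upper bound is the part that is easy to get wrong, because a cron range such as 10-12 includes 12.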

Friday, May 16, 2008

Restoring MySQL databases CLI trick

It is very easy to dump and restore a database using mysql and mysqldump CLI utilities, just

# backup
mysqldump --single-transaction mydb > dump.sql
#restore
mysql mydb < dump.sql

and you are all set. Unfortunately, if your database is several gigabytes and takes a long time to restore, you might want some sort of output to indicate where in the process your backup or restore is. For backup you just add the -v flag to your mysqldump command and it will print some information about which table it is backing up. What about restore? While it is definitely possible to just go and check what table is being restored (mysqldump dumps tables in alphabetical order), I came up with a clever little trick to make the restore progress obvious and similar to mysqldump. Just add perl.

cat dump.sql | perl -ne '/Table structure for table \`(.*?)\`/ && do {chomp($t=`date`); print STDERR $t . " loading $1\n";}; print' | mysql mydb
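
If a one-liner is not your thing, the same filter can be sketched in Python. This is a rough equivalent rather than a drop-in replacement: the function name pass_through and its arguments are made up for the example, and it assumes the dump can be read as text, which may not hold for dumps containing binary data.

```python
import io
import re
import time

# mysqldump writes a comment like this before every table definition.
TABLE_HEADER = re.compile(r"Table structure for table `(.*?)`")


def pass_through(lines, out, log, now=time.ctime):
    """Copy dump lines to `out`, noting each new table on `log`."""
    for line in lines:
        match = TABLE_HEADER.search(line)
        if match:
            log.write("%s loading %s\n" % (now(), match.group(1)))
        out.write(line)


# Small in-memory demonstration with a fake two-line dump.
dump = [
    "-- Table structure for table `users`\n",
    "CREATE TABLE `users` (id int);\n",
]
out, log = io.StringIO(), io.StringIO()
pass_through(dump, out, log)
print(log.getvalue(), end="")
```

To use it as a real filter, call pass_through(sys.stdin, sys.stdout, sys.stderr) after importing sys, and put the script between the dump file and mysql mydb, just like the perl command.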

Friday, March 14, 2008

Why I don't like Debian based distributions.

I have been happily using Fedora for a while now, but I keep a close eye on Ubuntu development, since it is my humble opinion that nothing, at the moment, compares to Ubuntu in ease of use, hardware compatibility and general togetherness. I recommend Ubuntu to people who want to try out Linux, I ran Ubuntu myself for a while, I run beta versions of Ubuntu releases and file bugs (well, when I have time). Now I also have an Eee PC laptop running Ubuntu. I like Ubuntu. But I run Fedora as my main OS. The reason for this is that Ubuntu is a Debian derivative and as such drags with it all the horrible Debian legacy. I honestly wish Ubuntu had chosen a different base. What is horrible about Debian? Well, this post intends to list a few things that annoy the hell out of me and that IMHO should have been fixed ages ago. Yes, I am aware that I blaspheme.
  1. Package installation procedure - when a list of packages is being installed or upgraded, the Debian package manager, DPKG, does this in stages. That is, it will first unpack all the packages, then run all the pre-install scripts, then install the files, then run the post-install scripts etc. (I am not trying to be correct about the exact steps here). And while this behavior might seem logical, therein lies a problem. If, for example, one of the packages' post-install scripts fails, dpkg reports a problem and quits, and all the rest of the packages remain unconfigured. True, dpkg will continue where it left off when the user resolves whatever problem is causing the script to fail or removes the offending package, but this is not the point. Let's consider a case where the updated package itself is broken and its script fails because of a syntax error. Once dpkg fails, the system ends up in a rather strange state. All the services that were to be updated were stopped, but weren't started again (since that happens in the post-install). New libraries were unpacked, but ldconfig wasn't run. A new kernel might have been installed, but a new initrd wasn't generated and the boot manager wasn't updated. Basically we have a broken system that needs careful fixing by a specialist who knows what he is doing. And even if you do know what you are doing, your choices are limited. You can fix the script yourself, repackage and reinstall, but that makes your system somewhat inconsistent. Or you can completely remove the package, rerun dpkg to finish the install/upgrade of the other packages in the queue and try to reinstall the old version, but that might not be possible since the other packages might prevent the old version from being installed, so you need to descend into dependency hell and start selectively uninstalling and downgrading packages to get a working version of whatever software. 
Yes, some of this is also true of RPMs, but at least when one of the RPM installs fails, all the rest of the packages are either NOT installed or installed COMPLETELY. Nothing, except possibly the broken package, is left half done.
  2. Package state markings - as I mentioned in the previous paragraph, when dpkg fails to do some of its tasks it can be rerun and will proceed from the point where it stopped (or fail in the same place). This is done by keeping very granular records of package state. APT seems to like marking packages just a little bit too much and annoys its user. Let's say I have started an install of a package that needs 100MB of dependencies and suddenly I need to go somewhere. So I hit CTRL-C, close the laptop and run out. Later I find that my laptop doesn't have a reader of some sort installed, for example FB reader, and I need it right away to read some document. I run apt-get install fbreader, but suddenly the whole 100MB of stuff starts downloading again. Why? Because APT marked all those packages for installation and will install them unless they are unmarked. Honestly, I don't know how to easily unmark packages marked for install/upgrade short of doing dpkg --get-selections, manually editing the resulting list and piping it into dpkg --set-selections. There may be a way to do this using a GUI such as synaptic, but at a glance I couldn't find it. Another example of this "feature" is when you are trying to remove packages. Sometimes you see a package and think "I don't need this, why is it installed", so you dpkg -P it. And suddenly dpkg tells you that the package is actually a dependency of something or other. But although dpkg proudly reported that it did not remove the package in question because of dependency problems, it DID mark the package as "to be removed", so if the dependencies ever change this package might just disappear without any intervention.
  3. SysV scripts - Debian, like most other Linux distros, uses SysV startup. One feature, though, seems to be specifically designed to annoy the hell out of the user. Every time a service that has a startup script is upgraded, it is automatically set up to start at boot, even if it was manually turned off before. In Fedora I can say chkconfig httpd off and Apache will not start until I say otherwise. On a less sophisticated system I can say something like rm -f /etc/rc?.d/S*httpd to achieve the same result. On Debian I can run update-rc.d -f apache remove, but once an upgrade to the apache package is installed it will reinstate itself on its default runlevels and happily start on boot. As far as I know, there is NO way to prevent this. Ridiculous.
  4. Package management command set - this is not so much a problem as a source of a lot of confusion, and it is not restricted to the package management system. There is just too much legacy in the Debian native commands. The package system provides a very good illustration. In Fedora I generally use two package management commands, yum and rpm. Yum mostly works with remote repositories and handles package installs, upgrades etc. RPM works with locally installed packages and manages installing from a local file, querying the package DB, removing packages, package signing keys etc. In Debian it is not as simple. To install from remote repositories I use apt-get. To search remote repositories I use apt-cache. To install from a local file or remove a package I use dpkg. To query the package database I use dpkg-query. To manage keys I use apt-key. Each of these has its own specific subcommands and flags.
  5. DEFOMA - the Debian Font manager. Basically this is a convoluted something that is supposed to make all the font management automagical. Unfortunately all it seems to do is confuse anyone who tries to figure out what happens to fonts on the system.