Tuesday, May 11, 2010

Short tutorial on extending Leiningen

We all use and love leiningen, the ultimate Clojure build tool. Sometimes, though, we want leiningen to do something it doesn't know how to do. Here is a short and simple tutorial on making your own leiningen tasks. In your project.clj, after the (defproject ...) form, add the following:
(ns leiningen.hello)
(defn hello
  [project]
  (println "Hello Leiningen!")
  (println "ants"))

Now, when you run lein hello, you will see it print out a message to Leiningen from the ants.

So, to make a new leiningen task, all you need to do is define a new namespace under leiningen and define a function by the same name. The project variable passed to the function is a hash map containing all project information. For example, here is a slight modification of the hello task.
(ns leiningen.hello)
(defn hello
  [project]
  (println (format "Hello from %s project!" (:name project))))

This should print you a greeting from your project. To see what other information is in the project variable, I came up with the following task.
(ns leiningen.info
  "Print all project variables and their values"
  (:use [clojure.contrib.pprint :only [pprint pprint-indent]]))
(defn info
  [project]
  (doseq [key (keys project)]
    (println (format "%s:" (name key)))
    (pprint (get project key))))

This is almost all there is to it, but there are a couple of additional notes.
  • All extra arguments after the task name are also passed to the task function, so if you want to handle arguments, define your task handler like this: (defn sometask [project & args] ... )
  • Your new task will not appear in the list of available tasks, and running the help task on it will generate an error. This is because the leiningen help task uses the classpath to look for tasks and will not find anything that is inside the project.clj file. If this is important to you, you can put your task into a separate project, generate a jar file and copy it into the lib directory of your main project.
  • If you do go for the task jar solution, the help task looks for the doc string in your namespace definition and displays it as the help message. So, your namespace definition should look like this:
    (ns leiningen.silly
      "This task does something silly")
    (defn silly
      [project]
      (println "Your project SUCKS!"))
This is it kids.

Thursday, May 6, 2010

Namespace trickery in Clojure

As you might have guessed from my last post, I have been playing around with web site scraping lately. This posed an interesting problem unrelated to HTML parsing. Each site needs its own function (or, after refactoring, a bunch of functions) to scrape the data. And generally you want to run these functions on a schedule, so you want a function that runs all scrapers. And personally, I like magic, so I wanted to just add scraper functions and have the aggregator function call them without me doing anything else. At first, I kept all my scrapers in a single scrapers.clj file, so I came up with the following solution.
;; add scraper metadata
(defmacro defscraper 
  [name & decls]
  (list* 'defn- (with-meta name 
                  (assoc (meta name) :scraper true)) decls))

;; compile a list of defined scrapers
(defn- *collect-scrapers* []
  (filter 
   (fn [func] (get (meta (val func)) :scraper false))
   (ns-interns 'com.wombat.web.scrapers)))

;; run all defined scrapers
(defn *run-all-scrapers* []
  (let [scrapers (*collect-scrapers*)
        threads (doall 
                 (for [[name scraper] scrapers] 
                   (future (store-site (scraper)))))]
    (doseq [t threads] (deref t))))
Then I could just use defscraper instead of defn and voila, any function defined using defscraper would be run in parallel by (*run-all-scrapers*).

But after a while, several other issues came up. The scrapers file was getting long. I needed to define other functions to work with scrapers, like individual functions that would store data from a scraper into a database or return information about the web site etc. So, I split the scrapers file and put each scraper into its own file and its own namespace. At first, I wanted to just refer all the scraper namespaces into the main scrapers namespace, but then I had an idea. What if, instead of polluting the main namespace with all the scraper functions, I could keep them in their individual namespaces and find them by a standard name? So, I deleted the defscraper macro, changed all scraper function definitions to defn and called them all scraper. Then I changed *collect-scrapers* and *run-all-scrapers* to look like this.
;; compile a list of defined scrapers
(defn- *collect-scrapers* []
  (map 
   #(get (ns-publics %1) 'scraper) 
   (filter #(contains? (ns-publics %1) 'scraper) (all-ns))))

;; run all defined scrapers
(defn *run-all-scrapers* []
  (let [scrapers (*collect-scrapers*)
        threads (doall (for [scraper scrapers] 
                         (future (store-site (scraper)))))]
    (doseq [t threads] (deref t))))
And that is that.

Wednesday, May 5, 2010

How to scrape websites in clojure for fun and profit

Let's say you are hunting for a good deal on a hard drive and you want to monitor prices on newegg.com. You want an internal hard drive of (let's say) over 1TB in size. And you are too lazy to open a browser, so you want to do this in your favorite functional programming language. Well, maybe this is not very plausible, but this is a short primer on parsing web pages using Clojure, so there. You could use a Java-based HTML parser, such as HtmlCleaner. There was recently an excellent article about it. But let's say that you would prefer to do it in a more functional style. Well, this is where Enlive comes in. I will assume that you have emacs, slime, swank-clojure and leiningen all sorted out, so let's start with the meat of the process. The project.clj should be something like this:
(defproject newegg "1.0.0-SNAPSHOT"
  :description "newegg scraping"
  :dev-dependencies [[leiningen/lein-swank "1.2.0-SNAPSHOT"]]
  :dependencies [
                 [org.clojure/clojure "1.1.0"]
                 [org.clojure/clojure-contrib "1.1.0"]
                 [enlive "1.0.0-SNAPSHOT"]])
Now we can start coding. We are going to define selectors for the HTML elements we are interested in and then return a list of the data they contain. In this instance, I am aiming to get the price, a short description and the rating.
(ns newegg
  (:require [clojure.contrib.str-utils2 :as str2])
  (:require [clojure.contrib.json.read :as json])
  (:require [net.cgrand.enlive-html :as html]))

(def *base-url* (str 
                 "http://www.newegg.com/"
                 "Product/ProductList.aspx"
                 "?Submit=ENE&"
                 "N=2010150014%20103530090%201035915133&"
                 "bop=And&"
                 "ShowDeactivatedMark=False&"
                 "Order=RATING&"
                 "Pagesize=100"))

;;pick all div elements of class itemCell
(def *item-list-selector* [:div.itemCell])
;; pick spans of class itemDescription
(def *item-description-selector* [:span.itemDescription])
;; pick hidden inputs
(def *item-price-selector* [[:input (html/attr= :type "hidden")]])
;; pick anchor of class itemRating
(def *item-rating-selector* [:a.itemRating])

(defn html-data []
  (html/html-resource (java.net.URL. *base-url*)))

(defn item-list [] 
  (html/select (html-data) *item-list-selector*))

(defn item-properties [item]
  (list      
   (first 
    (:content 
     (first 
      (html/select item *item-description-selector*))))
   (:value (:attrs (first
                    (html/select item *item-price-selector*))))
   (if (empty? (html/select item *item-rating-selector*))
     ""
     (re-find #"\d+$" 
              (:title 
               (:attrs 
                (first
                 (html/select item *item-rating-selector*))))))))

(defn scrape-and-print []
  (doseq [item (item-list)] (println (str2/join " " (item-properties item)))))

Sunday, May 2, 2010

Why switch from VIM to emacs?

Preface

OK, this topic has been discussed many times, sometimes by much more competent people than myself. So, I will quickly reiterate the main reasons one might consider switching and proceed to other issues.

Why not Vim?

Vim is just fine... for some things.

I have been using Vim for years (and was quite adamantly against Emacs). I work as a system administrator and for me, vi is one of the main tools of the trade, since it is on every system. On Linux systems you will mostly get Vim installed as the default vi, so learning and using Vim was natural. Most of my editing tasks involved changing configuration files and writing relatively short scripts. Almost no debugging was involved, and where there was debugging, it was mostly just a run/observe errors/fix script/run again cycle. For this type of use, Vim is perfect. It loads fast, so you can actually quit it every time you are done editing, and most testing/debugging can be accomplished by switching to a terminal window (or even better, to a terminal window in a screen session). It is only when you start spending significant amounts of time writing code that Vim's deficiencies start coming to light. What deficiencies? There are three main ones.

Vim is bad at communicating with external processes

While it is, of course, possible to run shell commands from Vim and even pipe data into the Vim buffer, this is not enough. You need to be able to properly interact with a process such as a debugger. You need to send commands to it and capture their output, not run them and forget. Emacs is excellent at this, but with Vim, either there is built-in support for a particular program (like gdb), or you are out of luck, or you will need a lot of hacking (like vimclojure).

Vim is not very good at editing multiple documents

Well, this is not exactly true: Vim supports opening multiple files and recently added tab support. But it is not as convenient and does not feel as natural as in Emacs. Multiple file support in Vim just feels awkward.

Extending Vim is a pain

Vim's internal scripting language is strange, and scripting with other languages compiled into Vim, such as Ruby or Python, is limited and not very portable. While many consider LISP to be strange, I find it to be not nearly as strange as vimscript.

Why Emacs?

Emacs is very good at communicating with external processes

So, you get a lot of benefits of the underlying OS right there in your editor. You also get much better integration with compilers, interpreters, REPL environments etc. You can use IRB and iPython or many other interactive dynamic language environments right out of the editor and get symbol completion and many other niceties. You can use programs like ssh, telnet or rsync to edit files on remote systems. There are too many uses to enumerate here, but I think you get the point.

Emacs is easy to configure

While originally you had to configure Emacs by writing things in Emacs LISP, this is no longer required. Recent versions of Emacs sport a very powerful customization interface that allows you to change a lot of different aspects of the editor by pointing and clicking on things.

Emacs is old and the community is obsessive

While Vim has been around since 1991 and only got proper scripting support in 1998 (some would say in 2001), Emacs has been around since the 70's. And during these 30-something years, many talented people attempted to teach Emacs to do just about anything you could possibly imagine. So, if you want Emacs to do something, chances are, someone somewhere wrote a cute little bit of lisp that does exactly what you want.

LISP is good for you :)

And if Emacs is not doing something you want, you can change just about anything. And you should. Because anyone who calls himself a programmer should know at least a little bit of some lisp-like language, and it might as well be Emacs LISP. It will alter your perception of reality, open your mind and chakras, walk your dog, neuter your cat and return your library books on time in under 10 lines of code.

But...

But I am so used to Vim

Emacs has a mode called Viper that makes Emacs behave in a Vim-ish way. It has different levels, in order to gradually phase out your Vim habits. If you tend to enter a cold pool by first dipping your little toe, you might want to start with Viper. I am more of a dive-in-head-first-while-screaming-obscenities person, so I do not use it.

But Emacs takes forever to load

Well, first, it is not true. A simple Emacs setup loads as fast as a simple Vim setup, and a complicated Vim setup loads as slowly as a complicated Emacs setup. On top of that, Emacs has an autoload ability that allows you to load only the minimally required stuff at startup and load the rest when it is actually needed. And Emacs LISP can be byte-compiled to speed up loading times. And in any case, Emacs is more of a programmer's editor than a sysadmin's (I am having my doubts, but so I heard), so it is not really intended to be closed after every edit. It is intended to be loaded once at the start of the day and never stopped again, or possibly stopped when the work is over, but not necessarily.

But all those parentheses are awful!!!

No, they are not. They are beautiful. And if you let Emacs do the indentation and turn on highlight-parentheses-mode, they are even more awesome. And anyway, I think a person who is used to typing things like :g/^"foo.*?"/d and :s/^foo\(.*\)bar$/bar\1foo/ shouldn't complain about syntax.

Thursday, April 29, 2010

Resuming posting

This blog has been on hiatus for a while, mostly because I was busy or lazy or both. Now I will try to resume occasional posting. I think I will start with some posts on switching from VIM to Emacs (as if that has never been blogged before) and setting up and using Clojure (same for this). And then I will see where that takes me.

Thursday, July 31, 2008

crontab to english translator

A couple of years ago I wrote this script, which takes crontab entries from standard input, parses them and prints an English translation. It is definitely not perfect and will bail on a lot of valid crontab entries, but for what it is worth, here it is.

#!/usr/bin/python

import re
import sys

class CronJob:
    """A class describing a scheduled job."""
    def __init__(self, str):
        """
        Generate a new object from a crontab line. We should differentiate between the following types of crontabs:
        1. something = something (raise exception)
        2. (classic cron schedule)
        3. [!&]word(arg)[,word(arg)...] (fcron style schedule)
        4. #somestuff (comment, raise exception)
        5. (empty line, raise exception)
        """

        if re.compile(r"^\s*$").search(str):
            raise NotACronJobError("EMPTY")
        elif re.compile(r"^\s*#").search(str):
            m = re.compile(r"^\s*#(.*)").search(str)
            raise NotACronJobError("COMMENT", m.group(1))
        elif re.compile(r"^\s*\S+\s*=.+").search(str):
            m = re.compile(r"^\s*(\S+?)\s*=\s*(.+)").search(str)
            raise NotACronJobError("VARIABLE", m.group(1), m.group(2))
        elif re.compile(r"^(\*|\d+)").search(str) or re.compile(r"^[!&]\w+").search(str):
            if re.compile(r"^!.+?\)\s*$").search(str): raise NotACronJobError("GARBAGE", str)
            self._parseLine(str)
            return
        else:
            raise NotACronJobError("GARBAGE", str)

    def _parseLine(self, str):
        if re.compile(r"^[!&]\w+").search(str):
            self.type = "fcron"
            m = re.compile(r"^\S+\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(.+)").search(str)
        else:
            self.type = "vixie"
            m = re.compile(r"^(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(.+)").search(str)
        if m is None:
            # fewer than five schedule fields plus a command
            raise NotACronJobError("GARBAGE", str)
        self.min = self._parseDateTime(m.group(1), "min")
        self.hr = self._parseDateTime(m.group(2), "hr")
        self.dom = self._parseDateTime(m.group(3), "dom")
        self.mon = self._parseDateTime(m.group(4), "mon")
        self.dow = self._parseDateTime(m.group(5), "dow")
        self.cmd = self._parseCmd(m.group(6))

    def _parseDateTime(self, dt, type):
        # valid values for each field; range() excludes its upper bound
        full = {"min": range(0, 60), "hr": range(0, 24), "dom": range(1, 32),
                "mon": range(1, 13), "dow": range(0, 7)}
        if dt == "*":
            return None
        elif re.compile(r"^\d+$").search(dt):
            return [int(dt)]
        elif re.compile(",").search(dt):
            res = []
            for x in dt.split(","):
                res.extend(self._parseDateTime(x, type))
            return res
        elif re.compile("/").search(dt):
            m = re.compile("(.+?)/(.+)").search(dt)
            r = m.group(1)
            st = m.group(2)
            if r == "*":
                r = full[type]
            else:
                (x, y) = r.split("-")
                r = range(int(x), int(y) + 1)
            return list(range(r[0], r[-1] + 1, int(st)))
        elif re.compile("-").search(dt):
            m = re.compile(r"(\d+)-(\d+)").search(dt)
            # cron ranges include both ends
            return list(range(int(m.group(1)), int(m.group(2)) + 1))
        else:
            raise NotACronJobError("GARBAGE", dt)

    def _parseCmd(self, cmd):
        # system crontabs carry a user name before the command; drop "root"
        if re.compile(r"^\s*root\s*").search(cmd):
            cmd = re.compile(r"^\s*root\s*").sub("", cmd)
        return cmd

    def __str__(self):
        s = "Run %s" % self.cmd
        if self.mon != None:
            months = ("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
            # crontab months are numbered from 1
            s = s + " in " + ",".join([months[x - 1] for x in self.mon])
        if self.dom != None:
            tmp = ",".join(["%sth" % x for x in self.dom])
            tmp = tmp.replace("1th", "1st")
            tmp = tmp.replace("2th", "2nd")
            tmp = tmp.replace("3th", "3rd")
            # 11, 12 and 13 keep the "th" suffix
            for wrong, right in (("11st", "11th"), ("12nd", "12th"), ("13rd", "13th")):
                tmp = tmp.replace(wrong, right)
            s = s + " on " + tmp + " day"
            if self.mon == None:
                s = s + " of every month"
        if self.dow != None:
            week = ("sunday", "monday", "tuesday", "wednesday", "thursday", "friday", "saturday")
            # both 0 and 7 mean sunday
            s = s + " on " + ",".join([week[x % 7] for x in self.dow])
        if self.hr != None:
            if self.min != None and len(self.hr) == 1 and len(self.min) == 1:
                s = s + " at %02d:%02d" % (self.hr[0], self.min[0])
            else:
                s = s + " at " + ",".join([str(x) for x in self.hr])
            if self.dow == None and self.dom == None:
                s = s + " every day"
        elif self.min != None:
            s = s + " at %s minutes" % ",".join([str(x) for x in self.min]) + " of every hour "
        else:
            # both the minute and the hour fields are "*"
            s = s + " every minute"
        return s


class NotACronJobError(Exception):
    """An exception raised by CronJob to indicate that the line in question doesn't contain valid cron schedule information."""
    def __str__(self):
        if self.args[0] == "EMPTY":
            return "Empty Line"
        elif self.args[0] == "COMMENT":
            return "A comment: %s" % self.args[1]
        elif self.args[0] == "VARIABLE":
            return "An environment variable: %s = %s" % (self.args[1], self.args[2])
        elif self.args[0] == "GARBAGE":
            return "Uncronish thingamabob: %s" % self.args[1]
        else:
            return "If you don't know how to play with me, go to the other sandbox!"

if __name__ == "__main__":
    for line in sys.stdin:
        try:
            print(CronJob(line))
        except NotACronJobError as err:
            print(err)
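
To make the field syntax a bit more concrete, here is a minimal standalone sketch of how a single crontab field expands into the values it matches. It is not part of the script; expand_field and its lo/hi parameters are names made up for this illustration, it is written for Python 3, and it assumes only the plain *, a, a-b, a,b and /step forms (no month or weekday names).

```python
def expand_field(field, lo, hi):
    """Expand one crontab field into a sorted list of matching values.

    lo and hi are the inclusive bounds of the field (hypothetical
    parameters for this sketch), e.g. 0 and 59 for minutes.
    """
    values = set()
    for part in field.split(","):
        # An optional "/step" suffix thins out the range before it.
        rng, _, step = part.partition("/")
        step = int(step) if step else 1
        if rng == "*":
            start, end = lo, hi
        elif "-" in rng:
            start, end = (int(x) for x in rng.split("-"))
        else:
            start = end = int(rng)
        # Cron ranges are inclusive, Python ranges are not, hence end + 1.
        values.update(range(start, end + 1, step))
    return sorted(values)


print(expand_field("*/15", 0, 59))     # minutes field
print(expand_field("1-5", 0, 7))       # day-of-week field
print(expand_field("1,10-12", 1, 12))  # month field
```

The end + 1 is the detail that is easiest to get wrong: a field like 1-5 has to include the 5.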

Friday, May 16, 2008

Restoring MySQL databases CLI trick

It is very easy to dump and restore a database using mysql and mysqldump CLI utilities, just

# backup
mysqldump --single-transaction mydb > dump.sql
#restore
mysql mydb < dump.sql

and you are all set. Unfortunately, if your database is several gigabytes and takes a long time to restore, you might want some sort of output to indicate where in the process your backup or restore is. For backup, you just add the -v flag to your mysqldump command and it will print some information about which table it is backing up. What about restore? While it is definitely possible to just go and check which table is being restored (mysqldump dumps tables in alphabetical order), I came up with a clever little trick to make the restore progress obvious and similar to mysqldump. Just add perl.

cat dump.sql | perl -ne '/Table structure for table \`(.*?)\`/ && do {chomp($t=`date`); print STDERR $t . " loading $1\n";}; print' | mysql mydb
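
If you would rather not use perl, the same filter can be sketched in Python. This is only an illustration of the idea, not something I have run against a large dump; echo_with_progress is a made-up name, and it relies on the same "Table structure for table" comment that the one-liner matches.

```python
import re
import time

# mysqldump writes a comment like this before each table definition.
TABLE_HEADER = re.compile(r"Table structure for table `(.*?)`")


def echo_with_progress(lines, out, err):
    """Copy every line to out unchanged; announce each new table on err."""
    for line in lines:
        match = TABLE_HEADER.search(line)
        if match:
            # Timestamp plus table name, like the perl version prints.
            err.write("%s loading %s\n" % (time.ctime(), match.group(1)))
        out.write(line)
```

To use it as a filter, save it to a file (say progress.py, a hypothetical name), add import sys and a final call to echo_with_progress(sys.stdin, sys.stdout, sys.stderr), and run python3 progress.py < dump.sql | mysql mydb.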