Thursday, May 6, 2010

Namespace trickery in Clojure

As you might have guessed from my last post, I have been playing around with web site scraping lately. This posed an interesting problem unrelated to HTML parsing. Each site needs its own function (or, after refactoring, a bunch of functions) to scrape the data. And generally you want to run these functions on a schedule, so you want one function that runs all the scrapers. And personally, I like magic, so I wanted to just add scraper functions and have the aggregator function call them without me doing anything else. At first, I kept all my scrapers in a single scrapers.clj file, so I came up with the following solution.
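;; define a private function whose var carries :scraper metadata
(defmacro defscraper 
  [name & decls]
  ;; name is passed to defn- once, with :scraper added to its metadata
  (list* 'defn- (with-meta name 
                  (assoc (meta name) :scraper true)) decls))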

;; compile a list of defined scrapers
(defn- *collect-scrapers* []
  (filter 
   (fn [func] (get (meta (val func)) :scraper false))
   (ns-interns 'com.wombat.web.scrapers)))

;; run all defined scrapers
(defn *run-all-scrapers* []
  (let [scrapers (*collect-scrapers*)
        threads (doall 
                 (for [[name scraper] scrapers] 
                   (future (store-site (scraper)))))]
    (doseq [t threads] (deref t))))
Then I could just use defscraper instead of defn and, voilà, any function defined with defscraper would be run in parallel by (*run-all-scrapers*).
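
To see why the metadata trick works, here is a minimal sketch that stands apart from the code above. It assumes a throwaway namespace scraper-demo.meta, a made-up scraper called example-site and a made-up helper function; none of these come from the real scraper code. The point is that defn copies the metadata of the name symbol onto the var it creates, and that var is what the filter over ns-interns inspects.

```clojure
(ns scraper-demo.meta)                    ; hypothetical namespace for this demo

;; same idea as the macro above: tag the name, then hand off to defn-
(defmacro defscraper
  [name & decls]
  (list* 'defn- (with-meta name (assoc (meta name) :scraper true)) decls))

;; a made-up scraper and an ordinary private helper
(defscraper example-site [] {:site "example.com" :items []})
(defn- helper [] :not-a-scraper)

;; the symbol's metadata ends up on the var
(:scraper (meta #'example-site))          ;; => true
(:private (meta #'example-site))          ;; => true, because of defn-
(:scraper (meta #'helper))                ;; => nil

;; names of all vars in this namespace that are tagged as scrapers
(def scraper-names
  (->> (ns-interns 'scraper-demo.meta)
       (filter (fn [[_ v]] (:scraper (meta v))))
       (map key)
       set))
;; scraper-names => #{example-site}
```

Because the tag lives on the var, the helper is skipped even though it sits in the same namespace.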

But after a while, several other issues came up. The scrapers file was getting long. I needed to define other functions to work with scrapers, like individual functions that would store data from a scraper into a database or return information about the web site. So I split the scrapers file and put each scraper into its own file and its own namespace. At first, I wanted to just refer all the scraper namespaces into the main scrapers namespace, but then I had an idea. What if, instead of polluting the main namespace with all the scraper functions, I could keep them in their individual namespaces and find them by a standard name? So I deleted the defscraper macro, changed all scraper function definitions to defn, and called them all scraper. Then I changed *collect-scrapers* and *run-all-scrapers* to look like this.
;; compile a list of defined scrapers
(defn- *collect-scrapers* []
  (map 
   #(get (ns-publics %1) 'scraper) 
   (filter #(contains? (ns-publics %1) 'scraper) (all-ns))))

;; run all defined scrapers
(defn *run-all-scrapers* []
  (let [scrapers (*collect-scrapers*)
        threads (doall (for [scraper scrapers] 
                         (future (store-site (scraper)))))]
    (doseq [t threads] (deref t))))
And that is that.
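
For illustration, here is a small self-contained sketch of the lookup by standard name. The namespaces scraper-demo.sites.alpha, scraper-demo.sites.beta, scraper-demo.util and scraper-demo.main are all hypothetical, as are the maps the scrapers return. To keep the focus on discovery, the sketch calls the scrapers one after another instead of using futures, and it uses keep as a shorter way to write the map over filter shown above.

```clojure
;; two hypothetical site namespaces, each with a public function named scraper
(ns scraper-demo.sites.alpha)
(defn scraper [] {:site "alpha.example"})

(ns scraper-demo.sites.beta)
(defn scraper [] {:site "beta.example"})

;; a namespace without a scraper function; it should be ignored
(ns scraper-demo.util)
(defn helper [] :ignored)

(ns scraper-demo.main)

;; every public var named scraper in any loaded namespace
(defn collect-scrapers []
  (keep #(get (ns-publics %) 'scraper) (all-ns)))

;; call each discovered scraper and gather the results in a set
(def results
  (set (map #(%) (collect-scrapers))))
```

One thing to keep in mind with this approach: all-ns only returns namespaces that are already loaded, so each scraper namespace still has to be required somewhere before the collector runs.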
