Thursday, May 6, 2010

Namespace trickery in Clojure

As you might have guessed from my last post, I have been playing around with web site scraping lately. This posed an interesting problem unrelated to HTML parsing. Each site needs its own function (or, after refactoring, a bunch of functions) to scrape the data. Generally you want to run these functions on a schedule, so you also want one function that runs all the scrapers. And personally, I like magic, so I wanted to just add scraper functions and have the aggregator function call them without me doing anything else. At first, I kept all my scrapers in a single scrapers.clj file, so I came up with the following solution.
;; add scraper metadata: expands to defn- with :scraper true on the name
(defmacro defscraper 
  [name & decls]
  (list* 'defn- (with-meta name 
                  (assoc (meta name) :scraper true)) decls))

;; compile a list of defined scrapers
(defn- *collect-scrapers* []
  (filter 
   (fn [func] (get (meta (val func)) :scraper false))
   (ns-interns 'com.wombat.web.scrapers)))

;; run all defined scrapers
(defn *run-all-scrapers* []
  (let [scrapers (*collect-scrapers*)
        threads (doall 
                 (for [[name scraper] scrapers] 
                   (future (store-site (scraper)))))]
    (doseq [t threads] (deref t))))
Then I could just use defscraper instead of defn and, voilà, any function defined using defscraper would be run in parallel by (*run-all-scrapers*).
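To see the tagging in isolation, here is a minimal sketch. The namespace example.scrapers, the two stand-in scrapers, and the helper functions are all hypothetical names made up for illustration. The macro is restated so the snippet runs on its own, passing the tagged name to defn- once. Scrapers are called sequentially here just to keep the sketch small.

```clojure
(ns example.scrapers) ; hypothetical namespace, for illustration only

;; Expand to defn- and mark the function name with :scraper metadata.
(defmacro defscraper
  [name & decls]
  (list* 'defn- (with-meta name (assoc (meta name) :scraper true)) decls))

;; Two stand-in scrapers; real ones would fetch and parse a page.
(defscraper scrape-site-a [] {:site "a" :items 3})
(defscraper scrape-site-b [] {:site "b" :items 5})

;; An ordinary private function, which must not be picked up.
(defn- helper [] :not-a-scraper)

;; Vars interned in this namespace whose metadata carries :scraper.
(defn collect-scrapers []
  (filter #(:scraper (meta %)) (vals (ns-interns 'example.scrapers))))

;; Names of the collected vars, sorted for a stable order.
(defn scraper-names []
  (sort (map #(:name (meta %)) (collect-scrapers))))

;; Call every collected scraper; a var can be invoked like the fn it holds.
(defn run-all []
  (set (map #(%) (collect-scrapers))))
```

Calling (scraper-names) lists only the two functions defined with defscraper. The helper is skipped because its var has no :scraper entry in its metadata.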

But after a while, several other issues came up. The scrapers file was getting long. I needed to define other functions to work with scrapers, like individual functions that would store data from a scraper into a database or return information about the web site. So I split the scrapers file and put each scraper into its own file and its own namespace. At first, I wanted to just refer all the scraper namespaces into the main scrapers namespace, but then I had an idea. What if, instead of polluting the main namespace with all the scraper functions, I could keep them in their individual namespaces and find them by a standard name? So I deleted the defscraper macro, changed all scraper function definitions to defn, and named each of them scraper. Then I changed *collect-scrapers* and *run-all-scrapers* to look like this.
;; compile a list of defined scrapers: the public var named scraper
;; from every loaded namespace that has one (keep drops the nils)
(defn- *collect-scrapers* []
  (keep #(get (ns-publics %) 'scraper) (all-ns)))

;; run all defined scrapers
(defn *run-all-scrapers* []
  (let [scrapers (*collect-scrapers*)
        threads (doall (for [scraper scrapers] 
                         (future (store-site (scraper)))))]
    (doseq [t threads] (deref t))))
And that is that.
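For a self-contained picture of the lookup-by-name approach, here is a small sketch. The namespaces example.sites.alpha, example.sites.beta, example.sites.util and example.main are hypothetical, and the scrapers just return maps instead of scraping anything. In a real project each namespace would live in its own file.

```clojure
;; Two hypothetical site namespaces, each exposing a public fn named scraper.
(ns example.sites.alpha)
(defn scraper [] {:site "alpha"})

(ns example.sites.beta)
(defn scraper [] {:site "beta"})

;; A namespace without a scraper, which should be skipped.
(ns example.sites.util)
(defn helper [] :ignored)

(ns example.main)

;; Look up the public var named scraper in every loaded namespace;
;; keep discards the namespaces where the lookup returns nil.
(defn collect-scrapers []
  (keep #(get (ns-publics %) 'scraper) (all-ns)))

;; Call every scraper that was found (sequentially, to keep the sketch small).
(defn run-all []
  (set (map #(%) (collect-scrapers))))
```

One thing to keep in mind: all-ns only returns namespaces that have already been loaded, so each scraper file still has to be required somewhere before the collector can find it.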
