Wednesday, May 5, 2010

How to scrape websites in Clojure for fun and profit

Let's say you are hunting for a good deal on a hard drive and want to monitor prices on newegg.com. You want an internal hard drive of, say, over 1TB. And you are too lazy to open a browser, so you want to do this in your favorite functional programming language. Well, maybe this is not very plausible, but this is a short primer on parsing web pages using Clojure, so there.

You could use a Java-based HTML parser, such as HtmlCleaner. There was recently an excellent article about it. But let's say that you would prefer to do it in a more functional style. This is where Enlive comes in.

I will assume that you have Emacs, SLIME, swank-clojure and Leiningen all sorted out, so let's start with the meat of the process. The project.clj should be something like this:
(defproject newegg "1.0.0-SNAPSHOT"
  :description "newegg scraping"
  :dev-dependencies [[leiningen/lein-swank "1.2.0-SNAPSHOT"]]
  :dependencies [
                 [org.clojure/clojure "1.1.0"]
                 [org.clojure/clojure-contrib "1.1.0"]
                 [enlive "1.0.0-SNAPSHOT"]])
Now we can start coding. We are going to define selectors for the HTML elements we are interested in and then return the data they contain as a list. In this instance, I am aiming to get the short description, the price and the rating of each item.
(ns newegg
  (:require [clojure.contrib.str-utils2 :as str2]
            [net.cgrand.enlive-html :as html]))

(def *base-url* (str 
                 "http://www.newegg.com/"
                 "Product/ProductList.aspx"
                 "?Submit=ENE&"
                 "N=2010150014%20103530090%201035915133&"
                 "bop=And&"
                 "ShowDeactivatedMark=False&"
                 "Order=RATING&"
                 "Pagesize=100"))

;; pick all div elements of class itemCell
(def *item-list-selector* [:div.itemCell])
;; pick spans of class itemDescription
(def *item-description-selector* [:span.itemDescription])
;; pick hidden inputs; their value attribute is read as the price
(def *item-price-selector* [[:input (html/attr= :type "hidden")]])
;; pick anchors of class itemRating
(def *item-rating-selector* [:a.itemRating])

(defn html-data []
  (html/html-resource (java.net.URL. *base-url*)))

(defn item-list [] 
  (html/select (html-data) *item-list-selector*))

(defn item-properties [item]
  (list      
   (first 
    (:content 
     (first 
      (html/select item *item-description-selector*))))
   (:value (:attrs (first
                    (html/select item *item-price-selector*))))
   (if (empty? (html/select item *item-rating-selector*))
     ""
     (re-find #"\d+$" 
              (:title 
               (:attrs 
                (first
                 (html/select item *item-rating-selector*))))))))

(defn scrape-and-print []
  (doseq [item (item-list)]
    (println (str2/join " " (item-properties item)))))
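
If you want to poke at the selectors without hitting newegg.com on every run, something like the following may help. It is only a sketch: the HTML fragment is made up (it merely reuses the class names the selectors target, and the real page will differ), the names sample-html, sample-items, item->map and cheapest-first are mine, and it assumes that html-resource accepts a java.io.Reader and that the hidden input holds a plain decimal price.

```clojure
(ns newegg.sketch
  (:require [net.cgrand.enlive-html :as html]))

;; Hypothetical markup for illustration only; not copied from newegg.com.
(def sample-html
  (str "<div class=\"itemCell\">"
       "<span class=\"itemDescription\">Example 2TB internal drive</span>"
       "<input type=\"hidden\" value=\"99.99\"/>"
       "<a class=\"itemRating\" title=\"Rating + 4\">rating</a>"
       "</div>"
       "<div class=\"itemCell\">"
       "<span class=\"itemDescription\">Unrated 1.5TB drive</span>"
       "<input type=\"hidden\" value=\"79.99\"/>"
       "</div>"))

;; Parse the inline fragment instead of fetching a URL, then pick the item cells.
(defn sample-items []
  (html/select (html/html-resource (java.io.StringReader. sample-html))
               [:div.itemCell]))

;; Same three lookups as in the post, but returned as a map.
;; Enlive nodes are plain maps with :tag, :attrs and :content keys,
;; so keywords can be used to walk into them.
(defn item->map [item]
  {:description (-> (html/select item [:span.itemDescription])
                    first :content first)
   :price       (-> (html/select item [[:input (html/attr= :type "hidden")]])
                    first :attrs :value)
   ;; nil when the item has no rating anchor
   :rating      (when-let [title (-> (html/select item [:a.itemRating])
                                     first :attrs :title)]
                  (re-find #"\d+$" title))})

;; Assumes every :price is a plain decimal string such as "99.99".
(defn cheapest-first [items]
  (sort-by #(Double/parseDouble (:price %)) items))
```

At the REPL, (cheapest-first (map item->map (sample-items))) gives you the sample items sorted by price. A map per item is a bit easier to filter and sort than a positional list, which is the only reason I used it here.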

1 comment:

  1. Hi Tyoska,

    Nice post. One little suggestion: use the '->' or '->>' operator. It makes nested code much easier to read.

    (->> (html/select item *item-rating-selector*) first :attrs :title (re-find #"\d+$"))

    DiG
