Let's say you are hunting for a good deal on a hard drive and you want to monitor prices on newegg.com. You want an internal hard drive of (let's say) over 1TB in size. And you are too lazy to open a browser, so you want to do this in your favorite functional programming language. Well, maybe this is not very plausible, but this is a short primer on parsing web pages using Clojure, so there.
You could use a Java-based HTML parser, such as HtmlCleaner. There was recently an excellent article about it. But let's say that you would prefer to do it in a more functional style. Well, this is where Enlive comes in.
I will assume that you have emacs, slime, swank-clojure and leiningen all sorted out, so let's start with the meat of the process. The project.clj should be something like this:
(defproject newegg "1.0.0-SNAPSHOT"
  :description "newegg scraping"
  :dev-dependencies [[leiningen/lein-swank "1.2.0-SNAPSHOT"]]
  :dependencies [[org.clojure/clojure "1.1.0"]
                 [org.clojure/clojure-contrib "1.1.0"]
                 [enlive "1.0.0-SNAPSHOT"]])
Now we can start coding. We are going to define selectors for the HTML elements we are interested in and then return a list of the data they contain. In this instance, I am aiming to get the short description, price and rating.
(ns newegg
  (:require [clojure.contrib.str-utils2 :as str2]
            [clojure.contrib.json.read :as json]
            [net.cgrand.enlive-html :as html]))

;; Search results page: sorted by rating, 100 items per page.
(def *base-url* (str
                 "http://www.newegg.com/"
                 "Product/ProductList.aspx"
                 "?Submit=ENE&"
                 "N=2010150014%20103530090%201035915133&"
                 "bop=And&"
                 "ShowDeactivatedMark=False&"
                 "Order=RATING&"
                 "Pagesize=100"))

;; Enlive selectors for the parts of the page we care about.
(def *item-list-selector* [:div.itemCell])
(def *item-description-selector* [:span.itemDescription])
(def *item-price-selector* [[:input (html/attr= :type "hidden")]])
(def *item-rating-selector* [:a.itemRating])

;; Fetch the page and parse it into Enlive nodes.
(defn html-data []
  (html/html-resource (java.net.URL. *base-url*)))

;; One node per product on the page.
(defn item-list []
  (html/select (html-data) *item-list-selector*))
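
(defn item-properties [item]
  (list
   ;; short description: first text child of the description span
   (first
    (:content
     (first
      (html/select item *item-description-selector*))))
   ;; price: value attribute of the hidden input
   (:value (:attrs (first
                    (html/select item *item-price-selector*))))
   ;; rating: trailing digits of the rating link's title, or "" if unrated
   (if (empty? (html/select item *item-rating-selector*))
     ""
     (re-find #"\d+$"
              (:title
               (:attrs
                (first
                 (html/select item *item-rating-selector*))))))))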

;; Print one line per product: description, price, rating.
(defn scrape-and-print []
  (doseq [item (item-list)]
    (println (str2/join " " (item-properties item)))))
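To see what the selectors are digging through, it may help to look at the shape of an Enlive node: a plain map with :tag, :attrs and :content keys. The sketch below is not part of the scraper. sample-item is hand-built with made-up values, and child-where, has-class? and item-map are hypothetical helpers that stand in for html/select, so the snippet runs without network access or extra dependencies. It returns the three fields as a map keyed by name instead of a positional list.

```clojure
;; Hand-built stand-in for one node that html/select might return.
;; Tag and class names mirror the selectors above; the values are made up.
(def sample-item
  {:tag :div
   :attrs {:class "itemCell"}
   :content [{:tag :span
              :attrs {:class "itemDescription"}
              :content ["Example 2TB internal hard drive"]}
             {:tag :input
              :attrs {:type "hidden" :value "99.99"}
              :content nil}
             {:tag :a
              :attrs {:class "itemRating" :title "Rating + 4"}
              :content nil}]})

(defn child-where
  "Hypothetical helper: first direct child of node for which pred is true.
  It plays the role of (first (html/select node selector)) in this sketch."
  [node pred]
  (first (filter pred (:content node))))

(defn has-class?
  "Hypothetical helper: predicate matching children whose class attribute
  is exactly class-name (a real selector also handles multiple classes)."
  [class-name]
  (fn [child] (= class-name (:class (:attrs child)))))

(defn item-map
  "Extract description, price and rating from one item node as a map."
  [item]
  (let [description (child-where item (has-class? "itemDescription"))
        price       (child-where item #(= "hidden" (:type (:attrs %))))
        rating      (child-where item (has-class? "itemRating"))]
    {:description (first (:content description))
     :price       (:value (:attrs price))
     :rating      (if rating
                    (re-find #"\d+$" (:title (:attrs rating)))
                    "")}))
```

Evaluating (item-map sample-item) at the REPL gives a map with :description "Example 2TB internal hard drive", :price "99.99" and :rating "4". A map costs a few more keystrokes than a list, but callers no longer need to remember which position holds which field.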
Hi Tyoska,
Nice post. One little suggestion: use the '->' or '->>' operator; it makes nested code much easier to read.
(->> (html/select item *item-rating-selector*) first :attrs :title (re-find #"\d+$"))
DiG
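The threaded form can be tried out on hand-built data without hitting the site. In the sketch below, rated-nodes and unrated-nodes are made-up stand-ins for what html/select might return, and the three function names are hypothetical. One thing to watch: the threaded one-liner on its own drops the empty check from the original, and re-find throws when handed nil, so an unrated item still needs a guard.

```clojure
;; Made-up stand-ins for the result of
;; (html/select item *item-rating-selector*).
(def rated-nodes
  [{:tag :a :attrs {:class "itemRating" :title "Rating + 5"} :content nil}])
(def unrated-nodes [])

;; Nested style, as in the post.
(defn rating-nested [nodes]
  (re-find #"\d+$" (:title (:attrs (first nodes)))))

;; Same logic with ->>, which threads each result in as the LAST argument.
(defn rating-threaded [nodes]
  (->> nodes first :attrs :title (re-find #"\d+$")))

;; Guarded variant: -> digs out the title, if-let supplies the blank
;; default when there is no rating node (or no title) to read.
(defn rating-or-blank [nodes]
  (if-let [title (-> nodes first :attrs :title)]
    (re-find #"\d+$" title)
    ""))
```

Both rating-nested and rating-threaded return "5" for rated-nodes, and rating-or-blank returns "" for unrated-nodes instead of throwing.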