Building A Concurrent Web Scraper With Haskell
updated: April 16, 2012
Contents
- Introduction
- Installation
- Basic Setup
- Image Downloader
- Concurrency
- Results
- Next Steps
- Full Code Listing
Introduction
Let's make a concurrent web scraper! We'll use Haskell, because it makes concurrency easy, and the HXT library to do the scraping. To follow the HXT bits you should be comfortable with Arrows in Haskell; if you're not, take a moment to read up on Arrows first.
If you don't care about the scraping bits, jump straight to the concurrency section.
Installation
Make sure you have the hxt, url, http and maybet packages:
cabal install hxt
cabal install url
cabal install http
cabal install maybet
Basic Setup
First, let's write some basic functions to make life easier for ourselves:
openUrl :: String -> MaybeT IO String
openUrl url = case parseURI url of
    Nothing -> fail ""
    Just u  -> liftIO (getResponseBody =<< simpleHTTP (mkRequest GET u))

css :: ArrowXml a => String -> a XmlTree XmlTree
css tag = multi (hasName tag)

get :: String -> IO (IOSArrow XmlTree (NTree XNode))
get url = do
  contents <- runMaybeT $ openUrl url
  return $ readString [withParseHTML yes, withWarnings no] (fromMaybe "" contents)
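For reference, these snippets assume the imports from the full code listing at the end of the article; the ones doing the work here are roughly:

import Control.Monad.Maybe        -- MaybeT, runMaybeT (from the maybet package)
import Control.Monad.Trans        -- liftIO
import Data.Maybe                 -- fromMaybe
import Data.Tree.NTree.TypeDefs   -- NTree, XNode
import Network.HTTP               -- simpleHTTP, mkRequest, getResponseBody
import Network.URI                -- parseURI
import Text.XML.HXT.Core          -- readString, multi, hasName, withParseHTML, ...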
I say basic because they will be our building blocks, not because they are easy :P Let's see how they work.
openUrl is a function that downloads a web page for us. Its result lives in the MaybeT IO monad transformer, so we use it like
contents <- runMaybeT $ openUrl "http://example.com"
and contents will be a Just if the operation was successful, or Nothing otherwise.
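In other words, after runMaybeT we are holding a plain Maybe String that we can pattern match on. A minimal sketch, assuming the openUrl definition and imports above (the URL is just a placeholder):

main :: IO ()
main = do
  contents <- runMaybeT $ openUrl "http://example.com"
  case contents of
    Just body -> putStrLn ("fetched " ++ show (length body) ++ " characters")
    Nothing   -> putStrLn "fetch failed"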
css lets us select nodes from the downloaded page by tag name. It's a very small stand-in for real CSS selectors: css "img" simply finds every img element anywhere in the tree.
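Because these selectors are just arrows, they compose with >>>. For example, this made-up selector (not used in the article's code) picks out links that live inside tables:

linksInTables :: ArrowXml a => a XmlTree XmlTree
linksInTables = css "table" >>> css "a"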
get is where things get interesting. First, we download a page using
contents <- runMaybeT $ openUrl url
like we talked about. Next, we parse the page using HXT:
readString [withParseHTML yes, withWarnings no] contents
readString takes some options as its first parameter:
- withParseHTML: parse the input as HTML, which makes sure the parser doesn't break on things like the doctype.
- withWarnings: print warnings about malformed HTML when switched on. Since so much of the web is malformed HTML, I switched it off :P
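You can sanity-check all three pieces in ghci against a literal HTML string, without touching the network. The markup here is invented, and you should get back just the two src values:

ghci> let doc = readString [withParseHTML yes, withWarnings no] "<p><img src='a.png'><img src='b.png'></p>"
ghci> runX $ doc >>> css "img" >>> getAttrValue "src"
["a.png","b.png"]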
Now we are ready to start.
Image Downloader
Let's write something that downloads all the images from a given page.
First, let's get a parsed page:
main = do
  page <- get "http://www.reddit.com/r/pics"
page is now an Arrow. We can run this Arrow at any time to get its value by using runX. Let's try it now:
ghci>runX page
[NTree (XTag "/" [NTree (XAttr "transfer-Status") [NTree (XText "200") []],NTree (XAttr "transfer-Message") [NTree (XText "OK") []],NTree (XAttr "transfer-URI") [NTree (XText "string:") []],NTree (XAttr "source") [NTree (XText "\"<!DOCTYPE html PUBLIC \\\"-//W3C//DTD XHTML 1.0 ...\"") []],NTree (XAttr "transfer-Encoding") [NTree (XText "UNICODE") []],NTree (XAttr "doctype-name")
(...many lines skipped...)
Wow, that looks confusing. Let's select only what we want. Get just the images:
ghci>runX $ page >>> css "img"
[NTree (XTag "img" [NTree (XAttr "id") [NTree (XText "header-img") []],NTree (XAttr "src") [NTree (XText "http://f.thumbs.redditmedia.com/nDSO6j0fVaKEV5Hw.png") []]
...
Aha! Much nicer. Now let's get just the src attributes:
ghci>runX $ page >>> css "img" >>> getAttrValue "src"
["http://f.thumbs.redditmedia.com/nDSO6j0fVaKEV5Hw.png","http://pixel.redditmedia.com/pixel/of_defenestration.png"
...
Done! That was easy. Now all we need to do is download these images and save them to disk.
main = do
  url <- parseArgs
  doc <- get url
  imgs <- runX . images $ doc
  sequence_ $ map download imgs
The first three lines of our main function get us a list of image links. The images function is very simple:
images tree = tree >>> css "img" >>> getAttrValue "src"
It gets a list of all the image sources, just like we had talked about.
The fourth line maps the download function over this list to create a list of IO actions. Then we feed that list into sequence_, which runs the actions one at a time and throws away the return values. We could have used sequence instead, which would have collected the return values into a list (we don't need them here).
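The difference is easy to see with toy actions in ghci (nothing to do with the scraper):

ghci> sequence_ [print 1, print 2]                -- runs the actions, discards the results
1
2
ghci> sequence [return 1, return 2] :: IO [Int]   -- runs them and keeps the results
[1,2]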
Here's the download function:
download url = do
  content <- runMaybeT $ openUrl url
  case content of
    Nothing -> putStrLn $ "bad url: " ++ url
    Just _content -> do
      let name = tail . uriPath . fromJust . parseURI $ url
      B.writeFile name (B.pack _content)
We have to write out binary data, so we use the writeFile defined in Data.ByteString.Char8, which operates on ByteStrings. This is why we need to convert our String to a ByteString first using B.pack.
We also get error checking thanks to openUrl running in MaybeT. If we didn't get any content, we just print out "bad url: [url]". Otherwise we save the image.
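One caveat worth flagging: tail . uriPath gives a tidy name for paths like /foo.png, but for a nested path such as /pics/foo.png it yields pics/foo.png, and writeFile would fail unless that directory already exists. A slightly more defensive variation, as an aside rather than part of the script above (imageName is a hypothetical helper; takeFileName comes from System.FilePath):

import System.FilePath (takeFileName)

-- e.g. "http://example.com/pics/foo.png" ==> "foo.png"
imageName :: String -> FilePath
imageName = takeFileName . uriPath . fromJust . parseURI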
Concurrency
After all that work, the concurrent bit seems almost anti-climactic.
First, install the parallel-io package:
cabal install parallel-io
Import it into the script:
import Control.Concurrent.ParallelIO
ParallelIO defines a new function called parallel_ which we can use anywhere we would have used sequence_. The IO actions will then get performed concurrently.
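In isolation it looks like this; a toy sketch, nothing to do with the scraper yet:

import Control.Concurrent.ParallelIO

main :: IO ()
main = do
  -- these three actions may run interleaved, on separate threads
  parallel_ [putStrLn "one", putStrLn "two", putStrLn "three"]
  stopGlobalPool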
Change the end of the script to this:
...
  imgs <- runX . images $ doc
  parallel_ $ map download imgs
  stopGlobalPool
stopGlobalPool needs to be called after the last use of a global parallelism combinator; it cleans up the thread pool before shutdown.
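If you'd rather not rely on the global pool at all, parallel-io also ships Control.Concurrent.ParallelIO.Local, where you create and size the pool yourself. A sketch, assuming four workers and a couple of stand-in actions:

import Control.Concurrent.ParallelIO.Local (withPool, parallel_)

main :: IO ()
main = withPool 4 $ \pool ->
  parallel_ pool [putStrLn "a", putStrLn "b", putStrLn "c"]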
Now build the concurrent version (enabling runtime system options):
$ ghc --make grabber_par.hs -threaded -rtsopts
And run it with +RTS -N[number of threads]
:
$ ./grabber_par [url] +RTS -N4
Results
Here's how the two versions performed on my machine:
Without parallelization:
$ time ./grabber "http://www.reddit.com/r/pics"
real 0m10.341s
user 0m0.203s
sys 0m0.048s
With parallelization (four threads):
$ time ./grabber_par "http://www.reddit.com/r/pics" +RTS -N4
real 0m3.490s
user 0m0.477s
sys 0m0.154s
Almost a third of the time!
The ParallelIO library uses MVars to keep things in sync. Read more about ParallelIO or MVars.
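If MVars are new to you, the smallest useful example is a one-shot "done" signal between two threads (a toy, unrelated to the scraper's internals):

import Control.Concurrent
import Control.Concurrent.MVar

main :: IO ()
main = do
  done <- newEmptyMVar
  _ <- forkIO $ do
         putStrLn "worker running"
         putMVar done ()   -- signal completion
  takeMVar done            -- block until the worker has finished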
Next Steps
Next steps involve writing this as a crawler that visits links on the page up to a depth of N, along with some way to keep track of visited pages. We also want to handle name collisions: if you try to download two images both named "test.jpg", the concurrent version will error out, while the non-concurrent version would just overwrite one image with the other, which isn't any good either. On the crawling side, we should respect robots.txt files and META tag directives to be polite, and ask for gzip'd data to reduce request time.
We could also parallelize more than just the download, but it's a start!
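To make the visited-page bookkeeping concrete, one possible shape is a Data.Set of URLs threaded through a breadth-first loop. This is only a rough sketch under my own assumptions: linksOn is a hypothetical helper that returns the hrefs found on a page, and there is no politeness or error handling here.

import qualified Data.Set as Set

-- hypothetical: fetch a page and return the links on it
linksOn :: String -> IO [String]
linksOn = undefined

crawl :: Int -> Set.Set String -> [String] -> IO ()
crawl 0     _    _        = return ()
crawl depth seen frontier = do
  let fresh = filter (`Set.notMember` seen) frontier
  found <- mapM linksOn fresh                  -- visit each unseen page
  let seen' = foldr Set.insert seen fresh
  crawl (depth - 1) seen' (concat found)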
Full Code Listing
import qualified Data.ByteString.Char8 as B
import Data.Tree.NTree.TypeDefs
import Data.Maybe
import Text.XML.HXT.Core
import Control.Monad
import Control.Monad.Trans
import Control.Monad.Maybe
import Network.HTTP
import Network.URI
import System.Environment
import Control.Concurrent.ParallelIO

-- helper function for getting page content
openUrl :: String -> MaybeT IO String
openUrl url = case parseURI url of
    Nothing -> fail ""
    Just u  -> liftIO (getResponseBody =<< simpleHTTP (mkRequest GET u))

-- find every element with the given tag name, anywhere in the tree
css :: ArrowXml a => String -> a XmlTree XmlTree
css tag = multi (hasName tag)

-- download a url and parse it as (possibly malformed) HTML
get :: String -> IO (IOSArrow XmlTree (NTree XNode))
get url = do
  contents <- runMaybeT $ openUrl url
  return $ readString [withParseHTML yes, withWarnings no] (fromMaybe "" contents)

-- the src attribute of every img element
images tree = tree >>> css "img" >>> getAttrValue "src"

parseArgs = do
  args <- getArgs
  case args of
    (url:[]) -> return url
    _        -> error "usage: grabber [url]"

-- fetch an image and write it to disk, named after the last part of its path
download url = do
  content <- runMaybeT $ openUrl url
  case content of
    Nothing -> putStrLn $ "bad url: " ++ url
    Just _content -> do
      let name = tail . uriPath . fromJust . parseURI $ url
      B.writeFile name (B.pack _content)

main = do
  url <- parseArgs
  doc <- get url
  imgs <- runX . images $ doc
  parallel_ $ map download imgs
  stopGlobalPool