Working With HTML in Haskell
updated: April 27, 2012
Contents
- What is HXT?
- Why HXT?
- Hello World
- Understanding Arrows
- Getting Started
- Parse a String as HTML
- Arrow Interlude #1: HXT Arrows
- Extracting Content
- Pretty-printing
- Selecting Elements
- Arrow Interlude #2
- Children and Descendents
- Working With Text
- Modifying a Node
- Modifying Children
- Conditionals (ifA)
- More Conditionals (when, guards, and filterA)
- Using Functions as Predicates
- Using Haskell Functions
- Working With Lists
- Introducing HandsomeSoup
- Avoiding IO
- Debugging
- Epilogue
This is a complete guide to using HXT for parsing and processing HTML in Haskell.
What is HXT?
HXT is a collection of tools for processing XML with Haskell. It's a complex beast, but HXT is powerful and flexible, and very elegant once you know how to use it.
Why HXT?
Here's how HXT stacks up against some other XML parsers:
HXT vs TagSoup
TagSoup is the crowd favorite for HTML scraping in Haskell, but it's a bit too basic for my needs.
HXT vs HaXml
HXT builds on the ideas of HaXml. The two are very similar, but I think HXT is a little more elegant.
HXT vs hexpat
hexpat is a high-performance XML parser. It might be more appropriate depending on your use case. hexpat lacks a collection of tools for processing the HTML, but you can try Parsec for that bit.
HXT vs xml (Text.XML.Light)
I haven't used Text.XML.Light. If you have used it and liked it, please let me know!
The one thing all these packages have in common is poor documentation.
Hello World
To whet your appetite, here's a simple script that uses HXT to get all links on a page:
import Text.XML.HXT.Core
main = do
  html <- readFile "test.html"
  let doc = readString [withParseHTML yes, withWarnings no] html
  links <- runX $ doc //> hasName "a" >>> getAttrValue "href"
  mapM_ putStrLn links
Understanding Arrows
I don't assume any prior knowledge of Arrows. In fact, one of the goals of this guide is to help you understand Arrows a little better.
The Least You Need to Know About Arrows
Arrows are a way of representing computations that take an input and return an output. All Arrows take a value of type `a` and return a value of type `b`. All Arrow types look like `Arrow a b`:
-- an Arrow that takes an `a` and returns a `b`:
arrow1 :: SomeType a b
-- an Arrow that takes a `b` and returns a `c`:
arrow2 :: SomeType b c
-- an Arrow that takes a `String` and returns an `Int`:
arrow3 :: SomeType String Int
Arrows sound like functions! In fact, functions are arrows.
-- a function that takes an Int and returns a Bool
odd :: Int -> Bool
-- also, an Arrow that takes an Int and returns a Bool
odd :: (->) Int Bool
Don't get confused by the two different type signatures! `Int -> Bool` is just the infix way of writing `(->) Int Bool`.
Arrow Composition
You'll be using `>>>` a lot with HXT, so it's a good idea to understand how it works.
`>>>` composes two arrows into a new arrow.
We could compose `length` and `odd` like so: `odd . length`.
Since functions are Arrows, we could also compose them like so: `length >>> odd` or `odd <<< length`.
They're all exactly the same!
ghci> odd . length $ [1, 2, 3]
True
ghci> length >>> odd $ [1, 2, 3]
True
ghci> odd <<< length $ [1, 2, 3]
True
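The same equivalence can be checked outside ghci. This is a minimal sketch that needs only `Control.Arrow` from base; the names `viaDot`, `viaForward` and `viaBackward` are made up for illustration:

```haskell
import Control.Arrow ((>>>), (<<<))

-- Three spellings of "take the length, then test whether it is odd".
viaDot, viaForward, viaBackward :: [Int] -> Bool
viaDot      = odd . length
viaForward  = length >>> odd
viaBackward = odd <<< length

main :: IO ()
main = do
  let xs = [1, 2, 3]
  -- All three agree on every input.
  print (viaDot xs, viaForward xs, viaBackward xs)  -- (True,True,True)
```

Reading `>>>` left to right ("first `length`, then `odd`") is what makes it convenient for the long HXT pipelines later in this guide.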
A function is the most basic type of arrow, but there are many other types. HXT defines its own Arrows, and we will be working with them a lot.
Let's get started. Don't worry if Arrows still seem unclear. We will be writing a lot of examples, so they should become clear soon enough.
Getting Started
Step 1: Install HXT:
cabal install hxt
Step 2: Install HandsomeSoup:
cabal install HandsomeSoup
HandsomeSoup contains a powerful `css` function that will allow us to access elements using CSS selectors. We will use this function until we can write a basic version of it ourselves, as explained in the Children and Descendents section. For more info about HandsomeSoup, see the Introducing HandsomeSoup section.
Step 3: Here's the HTML we'll be working with:
<html><head><title>The Dormouse's story</title></head>
<body>
<p class='title'><b>The Dormouse's story</b></p>
<p class='story'>Once upon a time there were three little sisters; and their names were
<a href='http://example.com/elsie' class='sister' id='link1'>Elsie</a>,
<a href='http://example.com/lacie' class='sister' id='link2'>Lacie</a> and
<a href='http://example.com/tillie' class='sister' id='link3'>Tillie</a>;
and they lived at the bottom of a well.</p>
<p class='story'>Some text</p>
</body>
</html>
Save it as `test.html`.
Step 4: Import HXT, HandsomeSoup, and the HTML file into `ghci`:
import Text.XML.HXT.Core
import Text.HandsomeSoup
html <- readFile "test.html"
Parse a String as HTML
Use `readString`:
ghci> let doc = readString [withParseHTML yes, withWarnings no] html
`doc` is now a parsed HTML document, ready to be processed!
Now we can build Arrows that do things like getting all links in the document (running them is covered under Extracting Content):
doc >>> css "a"
Arrow Interlude #1: HXT Arrows
You just used your first Arrow! `css` is an Arrow. Here's its type:
ghci> :t css
css :: ArrowXml a => String -> a XmlTree XmlTree
So `css`, once given a selector string, takes an `XmlTree` and returns another `XmlTree`. A lot of Arrows in HXT have this type: they all transform the current tree and return a new tree.
Extracting Content
`doc` is wrapped in an `IOStateArrow`. If you try to see the contents of `doc`, you'll get an error:
<interactive>:1:1:
No instance for (Show (IOSLA (XIOState s0) a0 XmlTree))
arising from a use of `print'
Possible fix:
add an instance declaration for
(Show (IOSLA (XIOState s0) a0 XmlTree))
In a stmt of an interactive GHCi command: print it
Use `runX` to extract the contents.
contents <- runX doc
print contents
Prints out:
[NTree (XTag "/" [NTree (XAttr "transfer-Status")...]
Pretty-printing
I don't want to see ugly Haskell types. Let's use `xshow` to convert our tree to HTML:
res <- runX . xshow $ doc
mapM_ putStrLn res
Prints out:
</ transfer-Status="200" transfer-Message="OK" transfer-URI="string:" source=""<html><head><title>The Dormouse's story</titl..."" transfer-Encoding="UNICODE"><html>
<head>
<title>The Dormouse's story</title>
...
Much better! Now use `indentDoc` to add proper indentation:
res <- runX . xshow $ doc >>> indentDoc
mapM_ putStrLn res
Prints out:
</ transfer-Status="200" transfer-Message="OK" transfer-URI="string:" source=""<html><head><title>The Dormouse's story</titl..."" transfer-Encoding="UNICODE"><html>
<head>
<title>The Dormouse's story</title>
...
Perfect.
Selecting Elements
Note: To keep things simple for now, these examples make use of HandsomeSoup's `css` Arrow.
Get all `a` tags
doc >>> css "a"
Keep only the links that have an `id` attribute
doc >>> css "a" >>> hasAttr "id"
Get all values for an attribute
doc >>> css "a" >>> getAttrValue "href"
See how easy it is to chain transformations together using `>>>`? Notice how using `getAttrValue` gets us the links for all `a` tags, instead of just one:
ghci> runX $ doc >>> css "a" >>> getAttrValue "href"
["http://example.com/elsie","http://example.com/lacie","http://example.com/tillie"]
This is a core idea behind HXT. In HXT, everything you do is a series of transformations on the whole tree. So you can use `getAttrValue` and HXT will automatically apply it to all the elements.
Get all links that have a particular id
doc >>> css "a" >>> hasAttrValue "id" (== "link1")
Get multiple values at once
Use `<+>`:
-- get all p tags as well as all a tags
doc >>> css "p" <+> css "a"
Get the names of all elements that have an id
doc //> hasAttr "id" >>> getElemName
We used the special function `//>` here! It's covered in the Children and Descendents section.
Get all elements where the text contains "mouse"
import Data.List
runX $ doc //> hasText (isInfixOf "mouse")
Get the element's name and the value of id
ghci> runX $ doc //> hasAttr "id" >>> (getElemName &&& getAttrValue "id")
[("a","link1"),("a","link2"),("a","link3")]
Let's talk about the `&&&` function.
Arrow Interlude #2
`&&&` is a function for Arrows. The best way to see how it works is by example:
ghci> length >>> (odd &&& (+1)) $ ["one", "two", "twee"]
(True,4)
`&&&` takes two arrows and creates a new arrow. In the above example, the output of `length` is fed into both `odd` and `(+1)`, and both return values are combined into a tuple `(True, 4)`.
We used `&&&` to get an element's name and its id: `(getElemName &&& getAttrValue "id")`.
Why is this function useful? Suppose we want to get all attributes on links:
runX $ doc >>> css "a" >>> getAttrl >>> getAttrName
Here's where it's nice to have `&&&`. The above line gives you something like this:
["href","class","id","href","class","id","href","class","id"]
The only problem: you have no idea what element each attribute belongs to! Use `&&&` to get a reference to the element as well:
ghci> runX $ doc >>> css "a" >>> (this &&& (getAttrl >>> getAttrName))
[(...some element..., "href"), (...another element..., "class")..etc..]
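The "keep the input next to the result" pattern also works with plain functions, which may make it easier to see. A small sketch using only base, where `id` stands in for the role `this` plays in HXT; the name `withLength` is invented for this example:

```haskell
import Control.Arrow ((&&&))

-- Pair each word with its length, keeping the original word around.
withLength :: String -> (String, Int)
withLength = id &&& length

main :: IO ()
main = print (map withLength ["one", "two", "three"])
-- [("one",3),("two",3),("three",5)]
```

Without `id &&&` we would only get `[3,3,5]` and lose track of which word produced which length, which is the same problem as the list of attribute names above.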
HXT has lots of other arrows for selecting elements. See the docs for more.
Children and Descendents
HXT has a few different functions for working with children, and it can be tricky to decide which one to use.
So far we have been using the `css` function to get elements. Now let's see how we could implement a basic version of it:
css tag = multi (hasName tag)
`css` uses `hasName` to get elements with a given tag. Why don't we just use `hasName` instead of `css`?
ghci> runX $ doc >>> hasName "a"
[]
`hasName` only works on the current node, and ignores its descendents, whereas `css` allows us to look in the entire tree for elements. Here are some arrows for looking in the entire tree:
getChildren and multi
We could use `getChildren` to get the immediate child nodes:
ghci> runX $ doc >>> getChildren >>> getName
["html"]
But what if we want the names of all descendents, not just the immediate child node? Use `multi`:
ghci> runX $ doc >>> multi getName
["/","html","head","title","body","p","b","p","a","a","a","p"]
`multi` recursively applies an Arrow to an entire subtree. `css` uses `multi` to search across the entire tree for nodes.
deep and deepest
These two Arrows are related to `multi`.
`deep` recursively searches a whole tree for subtrees for which a predicate holds. The search is performed top down. When a tree is found, it becomes an element of the result list. The tree found is not examined further for subtrees for which the predicate could also hold:
-- deep successfully got the name of the root element,
-- so it didn't go through the child nodes of that element.
ghci> runX $ doc >>> deep getName
["/"]
-- here, deep will get all p tags but it won't look for
-- nested p tags (multi *will* look for nested p tags)
ghci> runX $ doc >>> deep (hasName "p") >>> getName
["p","p","p"]
`deepest` is similar to `deep` but performs the search from the bottom up:
ghci> runX $ doc >>> deepest getName
["title","b","a","a","a","p"]
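The test document has no nested tags of the same name, so the difference between the three searches is easier to see on a made-up snippet. The sketch below assumes the hxt package from Step 1; the HTML string and the helper `idsWith` are hypothetical, and it uses the pure `runLA` and `hread` (covered under Avoiding IO) so that it runs without a file:

```haskell
import Text.XML.HXT.Core

-- Hypothetical snippet: one <div> nested inside another.
nested :: String
nested = "<div id='outer'><div id='inner'>text</div></div>"

-- Run one of the search strategies and collect the ids of the divs it finds.
idsWith :: (LA XmlTree XmlTree -> LA XmlTree XmlTree) -> [String]
idsWith search =
  runLA (hread >>> search (hasName "div") >>> getAttrValue "id") nested

main :: IO ()
main = do
  print (idsWith multi)    -- every matching node, nested or not
  print (idsWith deep)     -- stops descending once a match is found
  print (idsWith deepest)  -- only matches that contain no further match
```

`multi` reports both divs, `deep` stops at the outer one, and `deepest` reports only the inner one.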
`/>` and `//>`
`/>` looks for a direct child (i.e. what `getChildren` does).
`//>` looks for a node somewhere under this one (i.e. what `deep` does).
So, these two lines are equivalent:
doc /> getText
doc >>> getChildren >>> getText
And these two lines are equivalent:
doc //> getText
doc >>> getChildren >>> (deep getText)
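These equivalences can be checked on a small input. A sketch assuming the hxt package; the snippet and the four names are invented for illustration, and `runLA` with `hread` (see Avoiding IO) is used so the example stays pure:

```haskell
import Text.XML.HXT.Core

-- Hypothetical snippet: a paragraph with direct text and a nested <b>.
snippet :: String
snippet = "<p>Hi <b>there</b></p>"

-- Text of direct children, written both ways.
childText, childText' :: [String]
childText  = runLA (hread /> getText) snippet
childText' = runLA (hread >>> getChildren >>> getText) snippet

-- Text found anywhere below the node, written both ways.
descText, descText' :: [String]
descText  = runLA (hread //> getText) snippet
descText' = runLA (hread >>> getChildren >>> deep getText) snippet

main :: IO ()
main = do
  print (childText == childText')  -- the operator and its long form agree
  print (descText == descText')
  print descText                   -- also includes the text inside <b>
```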
See docs for more.
Working With Text
Get the text in an element
ghci> runX $ doc >>> css "title" /> getText
["The Dormouse's story"]
Remember, this is the same as writing:
runX $ doc >>> multi (hasName "title") >>> getChildren >>> getText
Get the text in an element + all its descendents
doc >>> css "body" //> getText
Try using `/>` instead of `//>`. What do you get?
Get All Links + Their Text
The wrong way:
ghci> runX $ doc >>> css "a" >>> (getAttrValue "href" &&& getText)
[]
This returns `[]` because `doc >>> css "a" >>> getText` returns `[]`: `getText` only succeeds on text nodes, and an `a` element is not a text node.
We need to go deeper! (i.e. use `deep`):
ghci> runX $ doc >>> css "a" >>> (getAttrValue "href" &&& (deep getText))
[("http://example.com/elsie","Elsie"),("http://example.com/lacie","Lacie"),("http://example.com/tillie","Tillie")]
Remove Whitespace
Use `removeAllWhiteSpace`. It removes all nodes containing only whitespace.
runX $ doc >>> css "body" >>> removeAllWhiteSpace //> getText
If you have used BeautifulSoup, this is kinda like the `stripped_strings` method.
Modifying a Node
Modifying text
Use `changeText`. Here's how you uppercase all the text in `p` tags:
import Data.Char
uppercase = map toUpper
runX . xshow $ doc >>> css "p" /> changeText uppercase
Add or change an attribute
Use `addAttr`:
runX . xshow $ doc >>> css "p" >>> addAttr "id" "my-own-id"
Modifying Children
`processChildren` and `processTopDown` allow you to modify the children of an element.
Add an id to the children of the root node
-- adds an id to the <html> tag
runX . xshow $ doc >>> processChildren (addAttr "id" "foo")
Add an id to all descendents of the root node
-- adds an id to all tags
runX . xshow $ doc >>> processTopDown (addAttr "id" "foo")
`processChildren` is similar to `getChildren`, except that instead of returning the children, it modifies them in place and returns the entire tree.
`processTopDown` is similar to `multi`.
`processTopDownUntil` is similar to `deep`.
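To see which nodes each of the two actually touches, we can tag them and then list the tagged elements. A sketch assuming the hxt package; the snippet and the helper `taggedAfter` are made up for this example, and `runLA` with `hread` keeps it free of IO:

```haskell
import Text.XML.HXT.Core

-- Hypothetical snippet: <b> is a child of <p>.
snippet :: String
snippet = "<p><b>x</b></p>"

-- Names of the elements that carry id="foo" after applying `edit`.
taggedAfter :: LA XmlTree XmlTree -> [String]
taggedAfter edit =
  runLA (hread >>> edit >>> multi (hasAttrValue "id" (== "foo")) >>> getName)
        snippet

main :: IO ()
main = do
  -- Only the direct children of <p> are modified.
  print (taggedAfter (processChildren (addAttr "id" "foo")))
  -- <p> itself and everything below it are modified.
  print (taggedAfter (processTopDown (addAttr "id" "foo")))
```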
Conditionals (ifA)
HXT has some useful functions that allow us to apply Arrows based on a predicate.
Using `ifA`:
`ifA` is the if statement for Arrows. It's used as `ifA (predicate Arrow) (do if true) (do if false)`.
Uppercase all the text for `p` tags only:
runX . xshow $ doc >>> processTopDown (ifA (hasName "p") (getChildren >>> changeText uppercase) (this))
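We use the identity arrow `this` here. You can read this as: if the element is a `p` tag, uppercase it, otherwise pass it through unchanged. Note that the true branch returns the uppercased text children in place of the `p` element itself.
`this` has a complementary arrow called `none`. `none` is the zero arrow: it always fails and returns no results. Here's how we can use `none` to remove all `p` tags: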
runX $ doc >>> processTopDown (ifA (hasName "p") (none) (this))
More Conditionals (when, guards, and filterA)
`when` and `guards` can make your `ifA` code easier to read.
Uppercasing text for `p` tags using `when` instead of `ifA`
runX . xshow $ doc >>> processTopDown ((getChildren >>> changeText uppercase) `when` hasName "p")
f `when` g -- when the predicate `g` holds, `f` is applied, else the identity filter `this`.
Deleting all `p` tags using `guards`
runX $ doc >>> processTopDown (neg (hasName "p") `guards` this)
g `guards` f -- when the predicate `g` holds, `f` is applied, else `none`.
Deleting all `p` tags using `filterA`
runX $ doc >>> processTopDown (filterA $ neg (hasName "p"))
filterA f -- a shortcut for f `guards` this
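These combinators are not tied to XML, so their behaviour can be compared on plain numbers. A sketch assuming the hxt package, with a pure `LA` arrow fed by `constL`; `numbers` and `run` are names invented for this example:

```haskell
import Text.XML.HXT.Core

-- An arrow that ignores its input and emits 1 to 6, one result at a time.
numbers :: LA () Int
numbers = constL [1 .. 6]

run :: LA () Int -> [Int]
run a = runLA a ()

main :: IO ()
main = do
  -- ifA: multiply the even numbers by 10, pass the rest through.
  print (run (numbers >>> ifA (isA even) (arr (* 10)) this))  -- [1,20,3,40,5,60]
  -- when: the same thing, with `this` as the implicit else branch.
  print (run (numbers >>> (arr (* 10) `when` isA even)))      -- [1,20,3,40,5,60]
  -- guards: `none` is the implicit else branch, so odd numbers are dropped.
  print (run (numbers >>> (isA even `guards` this)))          -- [2,4,6]
  -- filterA: shorthand for the line above.
  print (run (numbers >>> filterA (isA even)))                -- [2,4,6]
```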
Using Functions as Predicates
How would we get all nodes that have "mouse" in the text? Here's one way:
runX $ doc //> hasText (isInfixOf "mouse")
But if the `hasText` function didn't exist, we could write it ourselves! Here's how:
First, import `Text.XML.HXT.DOM.XmlNode`. It defines several functions that work on Nodes.
import qualified Text.XML.HXT.DOM.XmlNode as XN
(Note the qualified import: this module has a lot of names that conflict with `HXT.Core`.)
Here's a function that returns true if the given node's text contains "mouse":
import Data.Maybe
import Data.List
hasMouse n = "mouse" `isInfixOf` text
  where text = fromMaybe "" (XN.getText n)
`isA` lifts a predicate function to an HXT Arrow. Combined with `isA`, we can use `hasMouse` to filter out all nodes that don't have "mouse" as part of their text:
runX $ doc //> isA hasMouse
We can use `isA` wherever a predicate Arrow is needed: `ifA`, `when`, `guards`, etc.
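`isA` works with any predicate, not only predicates on nodes. A minimal sketch assuming the hxt package; it filters plain strings instead of a parsed document, and the names `mentionsMouse` and `texts` are hypothetical:

```haskell
import Data.List (isInfixOf)
import Text.XML.HXT.Core

-- An ordinary Haskell predicate on strings.
mentionsMouse :: String -> Bool
mentionsMouse = isInfixOf "mouse"

-- An arrow that emits two sample strings, one result at a time.
texts :: LA () String
texts = constL ["The Dormouse's story", "Some text"]

main :: IO ()
main = print (runLA (texts >>> isA mentionsMouse) ())
-- ["The Dormouse's story"]
```

Values for which the predicate is false are dropped from the result list, which is exactly what makes `isA` usable as the condition in `ifA`, `when` and `guards`.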
See the docs for more conditionals for Arrows.
See these docs for more functions you can use to write your own Arrows.
Using Haskell Functions
Suppose we have a list of link texts:
ghci> runX $ doc >>> css "a" //> getText
["Elsie","Lacie","Tillie"]
And we want to get the length of each bit of text. So we need an arrow version of the `length` function.
We can lift the `length` function into an HXT arrow using `arr`:
ghci> runX $ doc >>> css "a" //> getText >>> arr length
[5,5,6]
Note how `length` automatically gets applied to each element without us having to use `map`. This is because an HXT Arrow is applied to every result produced by the Arrow before it, not just to a single node. This behaviour is abstracted away so that you can just write a function that works on one node and have it apply to every node automatically.
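The same effect can be reproduced without a document. In this sketch (assuming the hxt package) `constL` stands in for the part of the pipeline that produces the link texts; the name `linkTexts` is invented for illustration:

```haskell
import Text.XML.HXT.Core

-- Stand-in for `doc >>> css "a" //> getText`: emits three results.
linkTexts :: LA () String
linkTexts = constL ["Elsie", "Lacie", "Tillie"]

main :: IO ()
main = print (runLA (linkTexts >>> arr length) ())
-- [5,5,6]
```

`arr length` is written as if it handled a single `String`, yet it runs once per result of `linkTexts`.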
Working With Lists
This section was written after Ywen asked this question on Reddit.
So far, we have applied arrows to one node at a time. In the previous section, we applied `length` to every node individually. What if we wanted to work with all the nodes at once, to do a `map` or a `foldl` over them?
HXT has some special functions that allow you to work on the entire list of elements, instead of working on just one element.
`>>.` and `>.`
We already know how to get the text for all links:
ghci> runX $ doc >>> css "a" //> getText
["Elsie","Lacie","Tillie"]
How do we get the text with the results reversed? Use `>>.`:
ghci> runX $ (doc >>> css "a" //> getText) >>. reverse
["Tillie","Lacie","Elsie"]
`>>.` takes a function that takes a list, and returns a list, so it allows us to use all our Haskell list functions.
We could sort all the letters in the names:
ghci> import Data.List
ghci> runX $ (doc >>> css "a" //> getText) >>. (map sort)
["Eeils","Lacei","Teiill"]
How do we count the number of links in the doc? Use `>.`:
ghci> runX $ (doc >>> css "a" //> getText) >. length
[3]
`>.` takes a function that takes a list and returns a single value.
Getting the length of the text of all links combined:
ghci> runX $ (doc >>> css "a" //> getText >>. concat) >. length
[16]
The parentheses are important here!
-- Counts the number of links in the doc
ghci> runX $ (doc >>> css "a" //> getText) >. length
[3]
-- Oops! Runs `>. length` on each link individually
ghci> runX $ doc >>> css "a" //> getText >. length
[1,1,1]
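Both operators can be tried without a document as well. A sketch assuming the hxt package, again with `constL` standing in for the arrow that produces the link texts (`names` is a made-up name):

```haskell
import Text.XML.HXT.Core

-- Stand-in for `doc >>> css "a" //> getText`.
names :: LA () String
names = constL ["Elsie", "Lacie", "Tillie"]

main :: IO ()
main = do
  -- >>. hands the whole result list to a list-to-list function.
  print (runLA (names >>. reverse) ())            -- ["Tillie","Lacie","Elsie"]
  -- >. hands the whole result list to a function returning one value.
  print (runLA (names >. length) ())              -- [3]
  -- Flatten the names into characters, then count them.
  print (runLA ((names >>. concat) >. length) ()) -- [16]
```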
Introducing HandsomeSoup
HandsomeSoup is an extension for HXT that provides a complete CSS2 selector implementation, so you can use complicated selectors like:
doc >>> css "h1#title"
doc >>> css "li > a.link:first-child"
doc >>> css "h2[lang|=en]"
...or any other valid CSS2 selector. Here are some other goodies it provides:
Getting Attributes With HandsomeSoup
Use `!` instead of `getAttrValue`:
doc >>> css "a" ! "href"
Scraping Online Pages
Use `fromUrl` to download and parse pages. `fromUrl` returns an Arrow rather than an IO action, so bind it with `let`:
let doc = fromUrl url
links <- runX $ doc >>> css "a" ! "href"
Downloading Content
Use `openUrl`:
import Control.Monad.Trans.Maybe (runMaybeT)
content <- runMaybeT $ openUrl url
case content of
  Nothing -> putStrLn $ "Error: " ++ url
  Just content' -> writeFile "somefile" content'
Parse Strings
Use `parseHtml`. Like `fromUrl`, it returns an Arrow, so bind it with `let`:
contents <- readFile [filename]
let doc = parseHtml contents
Avoiding IO
Look at the type of our HTML tree:
ghci> :t doc
doc :: IOSArrow XmlTree (NTree XNode)
It's in IO! This means that any function that parses the HTML will have to be IO. What if you want a pure function for parsing the HTML?
You can use `hread`:
-- old way:
ghci> let old = runX doc
-- using hread (contents is the raw HTML string):
ghci> let new = runLA hread contents
And here are their types:
ghci> :t old
old :: IO [XmlTree] -- IO!
ghci> :t new
new :: [XmlTree] -- no IO!
An Example: Getting All Links
ghci> runLA (hread >>> css "a" //> getText) contents
["Elsie","Lacie","Tillie"]
So why haven't we been using `hread`? Because `IOSArrow` is much more powerful; it gives you IO + State.
`hread` is also much more stripped down. From the docs:
parse a string as HTML content, substitute all HTML entity refs and canonicalize tree. (substitute char refs, ...). Errors are ignored. This is a simpler version of readFromString without any options.
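Put together, `hread` lets us wrap link extraction in an ordinary pure function. A minimal sketch assuming only the hxt package: it uses `multi (hasName "a")` (the basic `css` from Children and Descendents) instead of HandsomeSoup, and the function name `linksIn` and the sample HTML are invented for illustration:

```haskell
import Text.XML.HXT.Core

-- Pure: String in, list of href values out, no IO in the type.
linksIn :: String -> [String]
linksIn = runLA (hread >>> multi (hasName "a") >>> getAttrValue "href")

main :: IO ()
main =
  mapM_ putStrLn
        (linksIn "<p><a href='/one'>One</a> and <a href='/two'>Two</a></p>")
```

Because `linksIn` has no IO in its type, it can be called from other pure code and tested without touching the file system or the network.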
Debugging
HXT provides arrows to print out the current tree at any time. These arrows are very handy for debugging.
Use `traceTree`:
doc >>> css "h1" >>> withTraceLevel 5 traceTree >>> getAttrValue "id"
`traceTree` needs level >= 4.
Use `traceMsg` for sprinkling printf-like statements:
doc >>> css "h1" >>> traceMsg 1 "got h1 elements" >>> getAttrValue "id"
See the docs for even more trace functions.
Epilogue
I hope you found this guide helpful in your quest to work with HTML using Haskell.
Key Modules For Working With HXT
Arrows for working with nodes (the core stuff).
Arrows for working with children.
Function versions of most Arrows (useful with `arr` or `isA`).