HTML scraping in Python

Sometimes it’s nice to be able to read information from a website programmatically, a practice sometimes referred to as screen-scraping. This is the kind of thing I knock up occasionally for one-off jobs like “find all the line painting companies in Torquay, outputting to a CSV” or “guess how many people have bought a particular title for Xbox Live by looking at the high score tables.”

When doing so, it’s tempting to think you can just use some regular expressions to find the data you need in the HTML source. But often this isn’t realistic, and it’s better to parse the document’s HTML properly. However, this is usually non-trivial, because the HTML you find on websites is often badly broken.
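To make the regular expression problem concrete, here is a minimal sketch. The markup is made up, and it uses the standard library's html.parser purely as a stand-in for a proper parser. Two rows that mean the same thing are written in slightly different styles; the naive regex finds only the one it was written for, while the parser finds both.

```python
import re
from html.parser import HTMLParser

# Hypothetical markup: two "row" divs written in slightly different styles.
SAMPLE = """<div class="row">Alice</div><DIV CLASS='row' id=x>Bob</div>"""

# Naive regex: only matches the exact spelling it was written for.
naive = re.findall(r'<div class="row">(.*?)</div>', SAMPLE)


class RowCollector(HTMLParser):
    """Collects the text inside every <div> whose class is "row"."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._depth = 0  # number of open <div>s since we entered a row

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased.
        if tag != "div":
            return
        if self._depth:
            self._depth += 1  # a nested <div> inside a row
        elif (dict(attrs).get("class") or "").strip() == "row":
            self._depth = 1
            self.rows.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.rows[-1] += data


collector = RowCollector()
collector.feed(SAMPLE)
collector.close()

print(naive)           # rows the regex found
print(collector.rows)  # rows the parser found
```

You could keep patching the regex for each new variation, but every site will find a fresh way to surprise you.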

To the rescue come a couple of really handy libraries: html5lib (hosted on Google Code at the time of writing) and Beautiful Soup. Both can parse broken HTML and turn it into a “best guess” of what the author actually meant. Both build a document tree (html5lib can produce a standard DOM) in which you can fairly easily find the stuff you need.

I’ve not actually used Beautiful Soup, but its interface looks a little easier to use than html5lib’s.

A quick example, taken from a script which has a look through the above-mentioned high-score tables, looking for rows of data:

# html5lib is a third-party package; "source" below is assumed
# to already hold the page's HTML as a string.
import html5lib
from html5lib import treebuilders

# Create an HTML parser which creates DOMs.
parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("dom"))
# Parse the source.
tree = parser.parse(source)
# Normalise the tree.  This merges adjacent text nodes,
# which makes our life easier.
tree.normalize()

# Create a list of all the rows (that is, all <div>
# tags with the class of "row")
rows = [div for div in tree.getElementsByTagName("div")
        if div.getAttribute("class").strip() == "row"]

# Go on to process the rows here...
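
As a rough idea of what processing the rows might look like, here is a sketch that pulls the text out of each row and writes it as CSV. It is not the original script. The layout of one span per cell, the sample names and scores, and the helper names node_text and rows_to_csv are all assumptions made up for illustration. To keep it runnable without html5lib, a small well-formed snippet parsed by the standard library's minidom stands in for the tree above.

```python
import csv
import io
from xml.dom import minidom


def node_text(node):
    """Concatenate all the text found beneath a DOM node."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(node_text(child) for child in node.childNodes)


def rows_to_csv(rows, out):
    """Write one CSV line per row, one field per child element."""
    writer = csv.writer(out)
    for row in rows:
        cells = [node_text(child).strip() for child in row.childNodes
                 if child.nodeType == child.ELEMENT_NODE]
        writer.writerow(cells)


# Stand-in for the html5lib tree: hypothetical, well-formed markup.
tree = minidom.parseString(
    '<body>'
    '<div class="row"><span>Alice</span><span>1200</span></div>'
    '<div class="header"><span>Name</span></div>'
    '<div class="row"><span>Bob</span><span> 950 </span></div>'
    '</body>'
)
rows = [div for div in tree.getElementsByTagName("div")
        if div.getAttribute("class").strip() == "row"]

buffer = io.StringIO()
rows_to_csv(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Writing to an in-memory buffer keeps the sketch self-contained; for a real one-off job you would pass an open file instead.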

About Matt Godbolt