Web Scraping with Beautiful Soup in Python

For the last paper I wrote, I needed to put my data set together from a variety of sources on the internet. I had approximately 2000 observations and 40 variables for each observation. Using conservative estimates, it probably would’ve taken me at least 500 hours to put it all together by hand. I had no intention of spending that much time on it, though, so I learned to use the Beautiful Soup library for Python. In all, I spent maybe 40 hours learning the library and assembling all of my data.

Web scraping is a wonderful tool. Basically, you write a script that goes through a web page’s HTML source code and extracts the data you want to a file. I had around 10,000 player pages on TSN and around 2,500 pages on Capgeek to scrape. For a single page of data, it may be faster to copy it by hand, but when you have that much data to go through, you need a way to automate it. Today I will walk through one of the scripts I wrote.

This script is going to extract all of the attendance data from 2006-2011 from ESPN. Go to this page to see what we’ll be working with. For starters you’ll need Python (I use version 2.7) and Beautiful Soup. Python can be downloaded here and Beautiful Soup can be downloaded here. Once you install Python and BS4, you’re ready to begin. Open your Python GUI and click File -> New Window.

The first thing we need to do is tell Python which libraries we’ll be using.

import urllib2
from bs4 import BeautifulSoup

The first library, urllib2, is built into Python and handles opening web pages: it downloads the raw HTML code. The second library is Beautiful Soup, which takes that raw HTML and parses it into something pretty and searchable.

Next what we want to do is open a file to work with. This is where we will save our extracted data.

f = open('homeattendance.txt', 'w')

f is just a variable name; you can call the file anything. Since I didn’t specify where on the hard drive ‘homeattendance.txt’ should go, it will be created in the same directory as the code file. Next, we want to figure out which pages we want to parse.
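If you’d rather put the file somewhere specific, you can pass a full path to open() instead; the path below is just a hypothetical example.

f = open('C:/hockey/homeattendance.txt', 'w')  # hypothetical path; adjust for your own machine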

for number in range(2006, 2012):

    link = "http://espn.go.com/nhl/attendance/_/year/" + str(number)
    page = urllib2.urlopen(link)
    soup = BeautifulSoup(page)

I want attendance information from 2006 through 2011. The for loop starts at 2006, runs through all the code, then does it again for 2007, and so on for every year up to and including 2011. Python’s range() stops just before the end value you give it, so always make the end value one more than the last number you need. I looked at the ESPN links and saw they’re broken into two parts: http://espn.go.com/nhl/attendance/_/year/ and the year. I create a variable link composed of “http://espn.go.com/nhl/attendance/_/year/” and the number we’re on in the loop. Since URLs are strings, I have to convert the number into a string, which is what the str() function does.
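If you want to convince yourself of what range() produces, you can check it in the interpreter; this line is just for illustration and isn’t part of the script.

print(range(2006, 2012))  # in Python 2 this prints [2006, 2007, 2008, 2009, 2010, 2011]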

Next, I open the page and save the response to a variable called page. urllib2.urlopen() downloads the HTML code of the link in the brackets. Finally, I use a new variable called soup to hold the Beautiful Soup version of the parsed code. Once these commands are done, the parsed HTML is in the soup variable and we just need to search it to find the data we want. This next part will depend on the page you’re working with. You’ll need to look through the HTML source of the page you want to scrape and figure out how it’s coded to get at the data. Looking at the HTML code for ESPN, I can see that the data I want is in the first table on the page.
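If you’re not sure what the parsed page looks like, Beautiful Soup can print an indented copy of the HTML for you; this is just for exploring and isn’t part of the final script.

print(soup.prettify())  # dumps the parsed HTML with indentation so you can find the tags you need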

    table = soup.find('table')

I create a new variable table that will hold the parsed HTML for just the table. The .find() command in Beautiful Soup looks for the first instance of the tag in the brackets. I want the first table on the page, so I call .find(‘table’) on the soup variable to search soup for the first table tag. Now the table I want is in the table variable and I just need to get the data out of it.
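One thing worth knowing: if .find() can’t locate a matching tag, it returns None rather than raising an error, and the script will fail later with a confusing message. A quick sanity check (my addition, not part of the original script) can save some head-scratching:

    if table is None:
        print("No table found on " + link)  # the page layout may have changed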

    for row in table.find_all("tr", {"class": ["oddrow", "evenrow"]}):

        col = row.find_all('td')

These next lines are kind of confusing. I want to find all the rows in the table, but only the rows with data, not the headings. Looking through ESPN’s code I noticed that the data rows all use a class of either oddrow or evenrow. I’m going to use the find_all() function to find all tr (table row) tags within my table. table.find_all(‘tr’) would find every tr tag and save it in a list, but that would also include the heading rows (which would actually break this code). You can further restrict a search by tag attributes: table.find_all(‘tr’, width=”100%”) would find all tr tags with a width of 100%, for example. Since I want to accept multiple values for one attribute, I pass a dictionary whose value is a list: find_all(“tr”, {“class” : [“oddrow”, “evenrow”]}) finds all tr tags within the table whose class attribute is oddrow or evenrow. The variations are shown side by side below.
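Here are those variations next to each other; the width example is hypothetical, just to show the attribute syntax.

rows = table.find_all('tr')                                    # every tr tag, headings included
rows = table.find_all('tr', width='100%')                      # hypothetical: only rows with width="100%"
rows = table.find_all('tr', {'class': ['oddrow', 'evenrow']})  # rows whose class is oddrow or evenrow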

The inner for loop tells the computer to go through each row it finds in the table. For each row, we want to find the columns. This is easy enough: we create a list named col containing all the td tags in the row we’re working with. All the data we actually want is in the col list.

        team = col[1].string
        attendance = col[4].string
        capacity = col[5].string

Looking at the ESPN page, I want the team name, the home attendance, and the percentage of capacity. The team name is in the second column. Python starts counting list positions at 0, so the first column is at position 0 of col, the second at position 1, and so on. We want the data held at positions 1, 4, and 5 of col (which correspond to the 2nd, 5th, and 6th columns). For the team name, we set a variable equal to col[1].string, which extracts whatever string is inside that td tag. Now we have the 3 pieces of data we want from each row, so we just need to write them to the file we opened earlier.
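Before writing anything, it can help to print what you’ve grabbed to make sure the column numbers are right; this debugging line is my addition and can be deleted once things look sane.

        print(team + ' | ' + attendance + ' | ' + capacity)  # quick check for an off-by-one in the column indexes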

        f.write(team + '\t' + str(number) + '\t' + attendance + '\t' + capacity + '\n')

We want to write the team name, the year, the home attendance, and the capacity percentage to our text file. The ‘\t’ puts a tab between each value, which helps Excel or whatever statistical package you use tell where one variable ends and the next begins. The ‘\n’ at the end of the line starts a new line. We only want one row per line, so we need to tell Python when to start a new one.
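Each line of homeattendance.txt ends up looking something like this (the values here are made up, just to show the layout; the gaps are real tab characters):

Example Team	2006	17,500	98.2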

Now we’re basically done. The computer will go row by row for each year and write all the data to our file. There’s only one last command to include.

f.close()

This closes the file, which makes sure everything we’ve written is actually saved to disk.
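As an aside, a more modern Python idiom is the with statement, which closes the file for you automatically even if the script crashes partway through. A minimal sketch of how the opening and closing would change:

with open('homeattendance.txt', 'w') as f:
    # ... all of the scraping and f.write() calls go here, indented one extra level ...
    pass  # no f.close() needed; the file closes when the block ends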

All of this probably seems like a lot of work to get the data from 1 table on 6 pages but when you need to get 4 tables from 10,000 pages, this will save you a tremendous amount of time.

The tabs are very important for getting this program to run properly so I’ve reproduced the code in full below.

import urllib2
from bs4 import BeautifulSoup

# Open the output file in the same directory as this script
f = open('homeattendance.txt', 'w')

# Loop over the seasons 2006 through 2011
for number in range(2006, 2012):

    link = "http://espn.go.com/nhl/attendance/_/year/" + str(number)
    page = urllib2.urlopen(link)   # download the page
    soup = BeautifulSoup(page)     # parse the HTML

    table = soup.find('table')     # the first table holds the attendance data

    # Only the data rows use the oddrow/evenrow classes; this skips the headings
    for row in table.find_all("tr", {"class": ["oddrow", "evenrow"]}):

        col = row.find_all('td')

        team = col[1].string        # 2nd column: team name
        attendance = col[4].string  # 5th column: home attendance
        capacity = col[5].string    # 6th column: percentage of capacity

        # Tab-separated line: team, year, attendance, capacity
        f.write(team + '\t' + str(number) + '\t' + attendance + '\t' + capacity + '\n')

f.close()
