Drilling Down With Beautiful Soup

# Import required modules
import requests
from bs4 import BeautifulSoup
import pandas as pd

Download the HTML and create a Beautiful Soup object

# Create a variable with the URL of the page to scrape
url = 'http://en.wikipedia.org/wiki/List_of_A_Song_of_Ice_and_Fire_characters'

# Scrape the HTML at the url
r = requests.get(url)

# Turn the HTML into a Beautiful Soup object
soup = BeautifulSoup(r.text, "lxml")
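
The request above assumes the download succeeds. A slightly more defensive version might look like the sketch below. The helper names `fetch_html` and `make_soup` and the 10-second timeout are illustrative assumptions, not part of the original tutorial, and Python's built-in `html.parser` is swapped in for `lxml` so the example needs no extra parser.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and return its HTML, raising on HTTP errors."""
    # The timeout stops the call from hanging forever on a dead connection
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.text


def make_soup(html):
    """Parse an HTML string with Python's built-in parser."""
    return BeautifulSoup(html, "html.parser")


# Parsing works the same on any HTML string, downloaded or not
soup_demo = make_soup(
    "<ul><li class='toclevel-2'><span class='toctext'>Hodor</span></li></ul>"
)
print(soup_demo.find("span", {"class": "toctext"}).string)
```

With these helpers, `make_soup(fetch_html(url))` would stand in for the two steps above, and a failed download would raise an error instead of silently producing an empty soup.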

If we looked at the soup object, we'd see that the names we want are in a hierarchical list. In pseudo-code, it looks like:

  • class=toclevel-1 span=toctext
    • class=toclevel-2 span=toctext CHARACTER NAMES
    • class=toclevel-2 span=toctext CHARACTER NAMES
    • class=toclevel-2 span=toctext CHARACTER NAMES
    • class=toclevel-2 span=toctext CHARACTER NAMES
    • class=toclevel-2 span=toctext CHARACTER NAMES

To get the CHARACTER NAMES, we need to drill down into each toclevel-2 item and grab its toctext span.

Setting up where to put the results

# Create a variable to store the scraped data in
character_name = []

Drilling down with a for loop

# for each item in all the toclevel-2 li items
# (except the last three because they are not character names),
for item in soup.find_all('li',{'class':'toclevel-2'})[:-3]:
    # find each span with class=toctext,
    for post in item.find_all('span',{'class':'toctext'}):
        # add the stripped string of each to character_name, one by one
        character_name.append(post.string.strip())
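
The same drill-down can also be written as a single CSS selector with `select()`. The sketch below runs on a small hand-written HTML snippet (an assumption that imitates the nested list structure described above, not the live page), so it works offline. It uses `get_text(strip=True)` rather than `.string.strip()` because `.string` is `None` when a tag contains more than one child.

```python
from bs4 import BeautifulSoup

# Hand-written sample that imitates the nested table-of-contents markup
sample_html = """
<ul>
  <li class="toclevel-1"><span class="toctext">House Stark</span>
    <ul>
      <li class="toclevel-2"><span class="toctext"> Eddard Stark </span></li>
      <li class="toclevel-2"><span class="toctext">Catelyn Tully</span></li>
    </ul>
  </li>
  <li class="toclevel-1"><span class="toctext">References</span></li>
</ul>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# "li.toclevel-2 span.toctext" matches toctext spans nested inside toclevel-2 items
sample_names = [
    span.get_text(strip=True)
    for span in sample_soup.select("li.toclevel-2 span.toctext")
]
print(sample_names)  # ['Eddard Stark', 'Catelyn Tully']
```

The toclevel-1 headings are skipped because their spans are not inside any toclevel-2 item.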

The results

# View all the character names
character_name
['Eddard Stark',
 'Catelyn Tully',
 'Robb Stark',
 'Sansa Stark',
 'Arya Stark',
 'Bran Stark',
 'Rickon Stark',
 'Jon Snow',
 'Benjen Stark',
 'Lyanna Stark',
 'Theon Greyjoy',
 'Roose Bolton',
 'Ramsay Bolton',
 'Hodor',
 'Osha',
 'Jeyne Poole',
 'Jojen and Meera Reed',
 'Jeyne Westerling',
 'Daenerys Targaryen',
 'Viserys Targaryen',
 'Rhaegar Targaryen',
 'Aegon V Targaryen',
 'Aerys II Targaryen',
 'Aegon VI Targaryen',
 'Jon Connington',
 'Jorah Mormont',
 'Brynden Rivers',
 'Missandei',
 'Daario Naharis',
 'Grey Worm',
 'Jon Arryn',
 'Lysa Arryn',
 'Robert Arryn',
 'Yohn Royce',
 'Tywin Lannister',
 'Cersei Lannister',
 'Jaime Lannister',
 'Joffrey Baratheon',
 'Myrcella Baratheon',
 'Tommen Baratheon',
 'Tyrion Lannister',
 'Kevan Lannister',
 'Lancel Lannister',
 'Bronn',
 'Gregor Clegane',
 'Sandor Clegane',
 'Podrick Payne',
 'Robert Baratheon',
 'Stannis Baratheon',
 'Selyse Baratheon',
 'Shireen Baratheon',
 'Melisandre',
 'Davos Seaworth',
 'Renly Baratheon',
 'Brienne of Tarth',
 'Beric Dondarrion',
 'Gendry',
 'Balon Greyjoy',
 'Asha Greyjoy',
 'Euron Greyjoy',
 'Victarion Greyjoy',
 'Aeron Greyjoy',
 'Doran Martell',
 'Arianne Martell',
 'Quentyn Martell',
 'Trystane Martell',
 'Elia Martell',
 'Oberyn Martell',
 'Ellaria Sand',
 'The Sand Snakes',
 'Areo Hotah',
 'Hoster Tully',
 'Edmure Tully',
 'Brynden Tully',
 'Walder Frey',
 'Mace Tyrell',
 'Loras Tyrell',
 'Margaery Tyrell',
 'Olenna Tyrell',
 'Randyll Tarly',
 'Jeor Mormont',
 'Maester Aemon',
 'Yoren',
 'Samwell Tarly',
 'Janos Slynt',
 'Alliser Thorne',
 'Mance Rayder',
 'Ygritte',
 'Craster',
 'Gilly',
 'Val',
 'Lord of Bones',
 'Bowen Marsh',
 'Eddison Tollett',
 'Tormund Giantsbane',
 'Petyr Baelish',
 'Varys',
 'Pycelle',
 'Barristan Selmy',
 'Arys Oakheart',
 'Ilyn Payne',
 'Qyburn',
 'The High Sparrow',
 'Khal Drogo',
 'Syrio Forel',
 "Jaqen H'ghar",
 'Illyrio Mopatis',
 'Thoros of Myr',
 'Ser Duncan the Tall',
 'Hizdahr zo Loraq',
 'Yezzan zo Qaggaz',
 'Tycho Nestoris',
 'The Waif',
 'Septa Unella']

Quick analysis: Which house has the most main characters?

# Create a list to store the for loop results
houses = []
# For each element in the character_name list,
for name in character_name:
    # split the name on blank spaces and select the last element;
    # this gives the house (family) name for characters who have one,
    # and the only name for characters who go by a single name.
    # Then append each result to the houses list
    houses.append(name.split(' ')[-1])
# Convert houses into a pandas Series (so we can use value_counts())
houses = pd.Series(houses)

# Count the number of times each name/house name appears
houses.value_counts()
Baratheon     8
Stark         8
Targaryen     6
Greyjoy       6
Lannister     6
Martell       6
Tyrell        4
Tully         4
Arryn         3
Clegane       2
Bolton        2
Mormont       2
Payne         2
Tarly         2
Melisandre    1
Giantsbane    1
Ygritte       1
Bronn         1
Westerling    1
Sand          1
Osha          1
Gendry        1
Sparrow       1
Drogo         1
Qyburn        1
Gilly         1
Pycelle       1
Craster       1
H'ghar        1
Oakheart      1
             ..
Rivers        1
Seaworth      1
Marsh         1
Connington    1
Hodor         1
Val           1
Unella        1
Aemon         1
Myr           1
Slynt         1
Dondarrion    1
Baelish       1
Qaggaz        1
Yoren         1
Mopatis       1
Worm          1
Varys         1
Royce         1
Nestoris      1
Tarth         1
Naharis       1
Snakes        1
Reed          1
Bones         1
Tollett       1
Rayder        1
Tall          1
Selmy         1
Hotah         1
Snow          1
dtype: int64
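
The last-word rule is only a heuristic: it files 'Jon Snow' under 'Snow' and 'Brienne of Tarth' under 'Tarth', so the counts are approximate. As a small sketch of the same counting step without pandas, the standard library's `collections.Counter` can be used. The five-name list here is a hand-picked subset of the results above, chosen only for illustration.

```python
from collections import Counter

# A hand-picked subset of the scraped names, for illustration only
subset_names = ['Eddard Stark', 'Robb Stark', 'Jon Snow', 'Hodor', 'Brienne of Tarth']

# Take the last space-separated word of each name, as in the loop above
last_words = [name.split(' ')[-1] for name in subset_names]

# Counter tallies how many times each word appears
house_counts = Counter(last_words)
print(house_counts.most_common(1))  # [('Stark', 2)]
```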