Linguist 278: Programming for Linguists
Stanford Linguistics, Fall 2021
Christopher Potts

Basic web scraping

This is an introduction to web scraping using Requests and Beautiful Soup. It's meant to get you started, so that you can use the documentation for these libraries to do more advanced work.

For lightweight page traversal, you might be able to get by with just Requests plus simple string or regex searches over the page text, without Beautiful Soup.
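For example, if all you need is the raw 'href' values on a page, a quick regex search over the downloaded text can be enough. Here's a rough sketch (the URL is the sample page used later in this handout; a real parser is more robust than a regex):

import re
import requests

# Rough sketch: pull href values straight out of the page text, no HTML parsing.
page = requests.get("http://web.stanford.edu/class/linguist278/data/scrapeme.html").text
hrefs = re.findall(r'href=[\'"]([^\'"]+)[\'"]', page, flags=re.IGNORECASE)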

If you plan to build up a lasting scraping infrastructure for your project, then Scrapy is worth considering.

Set-up

In [1]:
from bs4 import BeautifulSoup as Soup
import os
import re
import requests
In [2]:
# The next cell will ensure that Jupyter prints all output without creating a
# scrolling cell. Feel free to remove this cell if you prefer to have scrolling
# for long output.
In [3]:
%%javascript
IPython.OutputArea.auto_scroll_threshold = 9999;

Basic downloading

The Requests library also provides intuitive methods for dealing with cookies, authentication tokens, log-ins, different page encodings, and so forth. Here, we'll use it just to get the content of web pages on the open Web.
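As a small aside (a sketch, not run here), those options are passed as keyword arguments to requests.get; the URL, query parameter, and header value below are just placeholders:

# Sketch only: placeholder URL, query parameter, and header value.
resp = requests.get(
    "http://example.com/search",
    params={"q": "linguistics"},               # appended to the URL as ?q=linguistics
    headers={"User-Agent": "ling278-notebook"},
    timeout=10)                                 # give up after 10 seconds
resp.status_code                                # 200 means the request succeeded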

In [4]:
def downloader(link):
    """
    Download `link` and set the encoding to UTF8.
    
    Parameters
    ----------
    link : str
        The location of the page to download
        
    Returns
    -------
    str        
    """
    req = requests.get(link)
    req.encoding = "utf8"
    return req.text
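If you want the downloader to fail loudly when a page is missing or a server is slow, one variant (a sketch, not part of the original function) adds a timeout and a status check:

def careful_downloader(link):
    # Like `downloader`, but raises an exception for 4xx/5xx responses
    # and gives up if the server takes more than 10 seconds to respond.
    req = requests.get(link, timeout=10)
    req.raise_for_status()
    req.encoding = "utf8"
    return req.text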
In [5]:
contents = downloader("http://web.stanford.edu/class/linguist278/data/scrapeme.html")
In [6]:
print(contents)
<!DOCTYPE html>
<html lang="en">

<head>
  <meta charset="utf-8">
  <title>Scrape this page!</title>
</head>

<body>

<h1>Scrape this page!</h1>

<p class="intro">Lorem ipsum dolor sit amet, consectetuer adipiscing elit.</p>

<p class="main-text">Phasellus viverra nulla ut metus varius laoreet.</p>

<p class="main-text">Vestibulum purus quam, scelerisque ut, mollis sed, nonummy id, metus.
<A HREF="http://linguistics.stanford.edu/documents/Final-1213-PhD-Handbook-8.pdf">Nullam
accumsan lorem in dui</A>. Cras ultricies mi eu turpis hendrerit
fringilla.</p>

<p id="conclusion">Sed lectus. Donec mollis hendrerit
risus.<A HREF='http://www.python.org/doc/essays/list2str.html'>Donec
posuere vulputate arcu</A>. Phasellus accumsan cursus velit.</p>

</body>
</html>

Beautiful Soup objects

BeautifulSoup objects are structured representations of webpages. They have some properties of nested dicts, but you can also perform different kinds of traversal of the structure.
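As a small illustration (a sketch using a made-up snippet of HTML), tags can be reached by chaining attribute lookups, a tag's attributes behave like dict entries, and you can walk back up the tree with parent:

# Sketch: dict-like and attribute-like access on a tiny parsed document.
demo = Soup("<p class='x'><a href='http://example.com'>hi</a></p>", "lxml")
demo.p["class"]          # ['x'] -- attribute values are exposed dict-style
demo.p.a["href"]         # 'http://example.com'
demo.p.a.parent.name     # 'p'  -- traversal back up the tree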

In [7]:
soup = Soup(contents, "lxml")

The prettify method gives a version of the string in which the structure is made more apparent by neat indenting. This can be very helpful for large, complex, machine-generated webpages like the one you'll work with on the homework.

In [8]:
print(soup.prettify())
<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Scrape this page!
  </title>
 </head>
 <body>
  <h1>
   Scrape this page!
  </h1>
  <p class="intro">
   Lorem ipsum dolor sit amet, consectetuer adipiscing elit.
  </p>
  <p class="main-text">
   Phasellus viverra nulla ut metus varius laoreet.
  </p>
  <p class="main-text">
   Vestibulum purus quam, scelerisque ut, mollis sed, nonummy id, metus.
   <a href="http://linguistics.stanford.edu/documents/Final-1213-PhD-Handbook-8.pdf">
    Nullam
accumsan lorem in dui
   </a>
   . Cras ultricies mi eu turpis hendrerit
fringilla.
  </p>
  <p id="conclusion">
   Sed lectus. Donec mollis hendrerit
risus.
   <a href="http://www.python.org/doc/essays/list2str.html">
    Donec
posuere vulputate arcu
   </a>
   . Phasellus accumsan cursus velit.
  </p>
 </body>
</html>

Accessing elements

In [9]:
soup.h1
Out[9]:
<h1>Scrape this page!</h1>
In [10]:
soup.h1.name
Out[10]:
'h1'
In [11]:
soup.head
Out[11]:
<head>
<meta charset="utf-8"/>
<title>Scrape this page!</title>
</head>
In [12]:
soup.head.meta
Out[12]:
<meta charset="utf-8"/>
In [13]:
soup.head.meta['charset']
Out[13]:
'utf-8'
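Indexing a tag for an attribute it doesn't have raises a KeyError. If you'd rather get a default value back, tags also support get, which works like dict.get (a small aside, not part of the original handout):

soup.head.meta.get("charset")         # 'utf-8'
soup.head.meta.get("lang", "none")    # 'none' -- attribute is absent, so the default is returned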

Finding things

The most common method for finding things is with find_all:

In [14]:
# Gets all the <p> elements:

paragraphs = soup.find_all("p")
In [15]:
# Gets all the p elements with a "class" attribute with value "intro":
#
# <p class="intro">

soup.find_all("p", attrs={"class": "intro"})
Out[15]:
[<p class="intro">Lorem ipsum dolor sit amet, consectetuer adipiscing elit.</p>]
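If you prefer CSS selectors, select expresses the same kind of query (an aside; the rest of this handout sticks with find_all):

soup.select("p.intro")                # same result as the find_all call above
soup.select('p[class="main-text"]')   # attribute-value selector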

You can use regexes too!

In [16]:
soup.find_all(re.compile("^(p|a)$"))[: 3]
Out[16]:
[<p class="intro">Lorem ipsum dolor sit amet, consectetuer adipiscing elit.</p>,
 <p class="main-text">Phasellus viverra nulla ut metus varius laoreet.</p>,
 <p class="main-text">Vestibulum purus quam, scelerisque ut, mollis sed, nonummy id, metus.
 <a href="http://linguistics.stanford.edu/documents/Final-1213-PhD-Handbook-8.pdf">Nullam
 accumsan lorem in dui</a>. Cras ultricies mi eu turpis hendrerit
 fringilla.</p>]
In [17]:
soup.find_all("p", attrs={"class": re.compile(r"(intro|main-text)")})
Out[17]:
[<p class="intro">Lorem ipsum dolor sit amet, consectetuer adipiscing elit.</p>,
 <p class="main-text">Phasellus viverra nulla ut metus varius laoreet.</p>,
 <p class="main-text">Vestibulum purus quam, scelerisque ut, mollis sed, nonummy id, metus.
 <a href="http://linguistics.stanford.edu/documents/Final-1213-PhD-Handbook-8.pdf">Nullam
 accumsan lorem in dui</a>. Cras ultricies mi eu turpis hendrerit
 fringilla.</p>]

See also children and descendants.
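As a quick sketch of the difference: children iterates over a tag's immediate sub-elements only, while descendants recurses all the way down.

body = soup.body
# Immediate sub-elements of <body>: the <h1> and the <p> tags.
[c.name for c in body.children if getattr(c, "name", None)]
# Everything nested anywhere inside <body>, including the <a> tags inside the <p>s.
[d.name for d in body.descendants if getattr(d, "name", None)]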

Getting the string from elements

If an element doesn't contain any other HTML tags, then the string attribute will give you the intuitive string content:

In [18]:
soup.h1.string
Out[18]:
'Scrape this page!'

The contents attribute is similar but always returns a list:

In [19]:
soup.h1.contents
Out[19]:
['Scrape this page!']

If the element contains any tags, then string will return None:

In [20]:
paragraphs[2]
Out[20]:
<p class="main-text">Vestibulum purus quam, scelerisque ut, mollis sed, nonummy id, metus.
<a href="http://linguistics.stanford.edu/documents/Final-1213-PhD-Handbook-8.pdf">Nullam
accumsan lorem in dui</a>. Cras ultricies mi eu turpis hendrerit
fringilla.</p>
In [21]:
paragraphs[2].string

However, contents will return a list as before, mixing different kinds of elements:

In [22]:
para2 = paragraphs[2].contents
In [23]:
para2
Out[23]:
['Vestibulum purus quam, scelerisque ut, mollis sed, nonummy id, metus.\n',
 <a href="http://linguistics.stanford.edu/documents/Final-1213-PhD-Handbook-8.pdf">Nullam
 accumsan lorem in dui</a>,
 '. Cras ultricies mi eu turpis hendrerit\nfringilla.']
In [24]:
para2[1]
Out[24]:
<a href="http://linguistics.stanford.edu/documents/Final-1213-PhD-Handbook-8.pdf">Nullam
accumsan lorem in dui</a>
In [25]:
para2[1]['href']
Out[25]:
'http://linguistics.stanford.edu/documents/Final-1213-PhD-Handbook-8.pdf'
In [26]:
para2[1].string
Out[26]:
'Nullam\naccumsan lorem in dui'

You can also use stripped_strings, which is a generator over all the strings (tags removed) inside the element; this is a fast way to extract the raw text, with all the tag soup strained off:

In [27]:
for s in paragraphs[2].stripped_strings:
    print("="*50)
    print(s)
==================================================
Vestibulum purus quam, scelerisque ut, mollis sed, nonummy id, metus.
==================================================
Nullam
accumsan lorem in dui
==================================================
. Cras ultricies mi eu turpis hendrerit
fringilla.
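A related convenience (an aside, not used elsewhere in this handout) is get_text, which joins all of the strings inside an element into a single string:

# One string: the stripped pieces joined with a space
# (newlines inside an individual piece are kept).
paragraphs[2].get_text(" ", strip=True)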

Exercises

Using scrapeme.html

Walk down from the head to the title and print its string contents

In [28]:
print(contents[: 100])
<!DOCTYPE html>
<html lang="en">

<head>
  <meta charset="utf-8">
  <title>Scrape this page!</title>
In [29]:
#<--
soup.head.title.string
#-->
Out[29]:
'Scrape this page!'

Get all of the paragraphs with class="main-text"

In [30]:
#<--
soup.find_all("p", attrs={"class": "main-text"})
#-->
Out[30]:
[<p class="main-text">Phasellus viverra nulla ut metus varius laoreet.</p>,
 <p class="main-text">Vestibulum purus quam, scelerisque ut, mollis sed, nonummy id, metus.
 <a href="http://linguistics.stanford.edu/documents/Final-1213-PhD-Handbook-8.pdf">Nullam
 accumsan lorem in dui</a>. Cras ultricies mi eu turpis hendrerit
 fringilla.</p>]

Create a list of the 'href' values of all the links on the page

Links look like this:

<a href="http://linguistics.stanford.edu/documents/Final-1213-PhD-Handbook-8.pdf">text</a>

We want the 'href' value, not the text of the link.

In [31]:
#<--
links = []
for a in soup.find_all("a"):
    links.append(a['href'])
    
links    
#-->
Out[31]:
['http://linguistics.stanford.edu/documents/Final-1213-PhD-Handbook-8.pdf',
 'http://www.python.org/doc/essays/list2str.html']

Using toyhtml.html

This file is at

http://web.stanford.edu/class/linguist278/data/toyhtml.html

Read the page contents into a string

In [32]:
#<--
toy_contents = downloader("http://web.stanford.edu/class/linguist278/data/toyhtml.html")
#-->

Create a Soup object from the page contents

In [33]:
#<--
toy_soup = Soup(toy_contents, "lxml")
#-->

Pretty print the contents of the Soup

In [34]:
#<--
print(toy_soup.prettify())
#-->
<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Some lists and a table
  </title>
  <meta content="lists of stuff" name="description"/>
  <meta content="groceries,articles,html" name="keywords"/>
  <meta content="noindex,nofollow" name="robots"/>
 </head>
 <body>
  <h1>
   Some lists and a table
  </h1>
  <p>
   Dive into the
   <a href="http://www.crummy.com/software/BeautifulSoup/" id="library">
    Soup
   </a>
   !
  </p>
  <h2>
   Groceries
  </h2>
  <ul class="groceries">
   <li>
    Milk
   </li>
   <li>
    Coffee
   </li>
   <li>
    <span style="color:blue; font-weight:bold;">
     Choco-Leibniz
    </span>
   </li>
  </ul>
  <h2>
   Quotes
  </h2>
  <ul class="quotes">
   <li>
    <em>
     He was a linguist, and therefore he had pushed the bounds of obstinacy well beyond anything that is conceivable to other men.
    </em>
    —Helen DeWitt
   </li>
   <li>
    <em>
     In mathematics you don't understand things. You just get used to them.
    </em>
    —Jon von Neumann
   </li>
   <li>
    <em>
     Never underestimate the joy people derive from hearing something they already know.
    </em>
    —Enrico Fermi
   </li>
  </ul>
  <h2>
   Books
  </h2>
  <ul class="books">
   <li>
    <a href="http://www.amazon.com/Convention-Philosophical-Study-David-Lewis/dp/0631232575">
     <em>
      Convention
     </em>
    </a>
    by
    <a href="http://en.wikipedia.org/wiki/David_Lewis_(philosopher)">
     David Lewis
    </a>
   </li>
   <li>
    Linguistics gossip:
    <ul>
     <li>
      <a href="http://www.amazon.com/The-Linguistics-Wars-ebook/dp/B004UP9AHK/">
       <em>
        The Linguistics Wars
       </em>
      </a>
      by Randy Allen Harris
     </li>
     <li>
      <a href="http://www.amazon.com/Western-Linguistics-An-Historical-Introduction/dp/0631208917/">
       <em>
        Western Linguistics
       </em>
      </a>
      by Pieter A. M. Seuren
     </li>
    </ul>
   </li>
  </ul>
  <h2>
   Movies in tabular format
  </h2>
  <table border="1" class="movies">
   <tr>
    <th>
     Title
    </th>
    <th>
     Director
    </th>
    <th>
     Year
    </th>
   </tr>
   <tr>
    <td>
     Lost in Translation
    </td>
    <td>
     Sofia Coppola
    </td>
    <td>
     2003
    </td>
   </tr>
   <tr>
    <td>
     The Royal Tenenbaums
    </td>
    <td>
     Wes Anderson
    </td>
    <td>
     2001
    </td>
   </tr>
   <tr>
    <td>
     There Will be Blood
    </td>
    <td>
     Paul Thomas Anderson
    </td>
    <td>
     2007
    </td>
   </tr>
  </table>
 </body>
</html>

Find all the h2 elements and print out their string contents

In [35]:
#<--
for elem in toy_soup.find_all("h2"):
    print(elem.string)
#-->
Groceries
Quotes
Books
Movies in tabular format

Use a regex to find all the header elements, and print their tag names

These are h1, h2, h3, h4, ...

In [36]:
for elem in toy_soup.find_all(re.compile(r"^h\d+$")):
    print(elem.name)
h1
h2
h2
h2
h2

Create a list of the string contents of each item in the 'groceries' list

In [37]:
#<--
groceries = []

for elem in toy_soup.find_all("ul", attrs={"class": "groceries"}):
    for li in elem.find_all("li"):
        groceries.append(" ".join(li.stripped_strings))

groceries
#-->
Out[37]:
['Milk', 'Coffee', 'Choco-Leibniz']

Movie data

Extract the movie table from the page and create a dict mapping movie titles to pairs (dir, year) where dir is the movie's director and year is an int giving its release year.

In [38]:
#<--
movie_data = {}

tbl = toy_soup.find("table", attrs={"class": 'movies'})

for tr in tbl.find_all("tr"):
    tds = tr.find_all("td")
    if tds:
        movie_data[tds[0].string] = (tds[1].text, int(tds[2].text))

movie_data
#-->
Out[38]:
{'Lost in Translation': ('Sofia Coppola', 2003),
 'The Royal Tenenbaums': ('Wes Anderson', 2001),
 'There Will be Blood': ('Paul Thomas Anderson', 2007)}
In [39]:
import time

def pdf_crawler(
        base_link="http://www.stanford.edu/class/linguist278/data/crawler/start.html", 
        output_dirname=".", 
        depth=0):
    """Crawls webpages looking for PDFs to download. The function branches out 
    from the original to the specified depth, and saves the files locally in 
    `output_dirname`. For each HTML page H linked from link, it follows H and 
    calls `pdf_crawler` on H. It continues this crawling to `depth`, supplied 
    by the user. `depth=0` just  does the user's page, `depth=1` follows the 
    links at the user's page,  downloads those files, but doesn't go further, 
    and so forth.

    The function is respectful in that it uses the time module to rest between 
    downloads, so that no one's server gets slammed:

    https://docs.python.org/3.6/library/time.html

    It avoids downloading files it already has; see `_download_pdf`, which is
    completed for you (but could be improved).
    
    The page search aspects of this can be done purely with Requests, but I used 
    BeautifulSoup.

    Parameters
    ----------
    base_link : str
        Page to start crawling from
    output_dirname : str
        Local directory in which to save the downloaded PDFs
    depth : int
        How many pages out from the source to go    
    """
    #<--
    # Make sure we get sensible depth values:
    if depth < 0:
        return None  
    page = downloader(base_link)
    soup = Soup(page, "lxml")
    base = os.path.split(base_link)[0]
    # Get the PDF links and download each:    
    for pdf in soup.find_all("a", attrs={"href": re.compile(r"\.pdf$")}):        
        _download_pdf(base, pdf['href'], output_dirname)
        time.sleep(1)
    # Get the HTML links:
    if depth > 0:
        for html in soup.find_all("a", attrs={"href": re.compile(r"\.html$")}):
            link = "{}/{}".format(base, html['href'])
            pdf_crawler(link, output_dirname, depth-1)
    #-->

        
def _download_pdf(base, filename, output_dirname):
    """Handles the PDF downloading part for `pdf_crawler`. This function checks
    to see whether it already has the file and avoids redownloading if it does,
    and it also checks that the file status code is 200, to avoid trying to
    download files that don't exist (status 404) or that are otherwise 
    unavailable.
    
    Parameters
    ----------
    base : str
        The base of the link (the Web directory that contains the file)
    filename : str
        The basename of the file
    output_dirname : str
        Local directory in which to save the downloaded PDF    
    """
    output_filename = os.path.join(output_dirname, filename)
    # Don't redownload a file we already have:
    if os.path.exists(output_filename):
        return None
    # PDF link:
    link = "{}/{}".format(base, filename)        
    req = requests.get(link)
    # Does the file exist?
    if req.status_code != 200:
        print("Couldn't download {}: status code {}".format(link, req.status_code))
    else:
        # Download if the file is present:
        with open(output_filename, 'wb') as f:
            f.write(req.content)        
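One practical note (an addition, not in the original): the open call in `_download_pdf` assumes that `output_dirname` already exists, so it can help to create it before crawling:

os.makedirs("tmp", exist_ok=True)   # create the output directory if it isn't there yet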
In [40]:
pdf_crawler(output_dirname="tmp", depth=2)
Couldn't download http://www.stanford.edu/class/linguist278/data/crawler/broken.pdf: status code 404