Linguist 278: Programming for Linguists
Stanford Linguistics, Fall 2021
Christopher Potts
This is an introduction to web scraping using Requests and Beautiful Soup. It's meant to get you started, so that you can use the documentation for these libraries to do more advanced work.
For lightweight page traversal, you might be able to get by with just Requests, working with the raw page text directly (string methods or regular expressions) rather than parsing it with Beautiful Soup; see the short sketch below.
If you plan to build up a lasting scraping infrastructure for your project, then Scrapy is worth considering.
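For example, here is a minimal Requests-only sketch, using the same course page that we download below. It pulls link targets out of the raw page text with a regular expression rather than a parser; this is brittle compared to Beautiful Soup, but often enough for quick checks:
# A Requests-only sketch: fetch a page and grab href values with a regular
# expression instead of parsing the HTML. Brittle, but often good enough
# for quick, lightweight tasks.
import re
import requests
resp = requests.get("http://web.stanford.edu/class/linguist278/data/scrapeme.html")
resp.encoding = "utf8"
re.findall(r'href="([^"]+)"', resp.text)[:3]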
from bs4 import BeautifulSoup as Soup
import os
import re
import requests
# The next cell will ensure that Jupyter prints all output without creating a
# scrolling cell. Feel free to remove this cell if you prefer to have scrolling
# for long output.
%%javascript
IPython.OutputArea.auto_scroll_threshold = 9999;
The Requests library also provides intuitive methods for dealing with cookies, authentication tokens, log-ins, different page encodings, and so forth. Here, we'll use it just to get the content of web pages on the open Web.
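To give a flavor of those features, here is a small illustration using the same course page (the basic-auth line at the end is commented out since it needs real credentials and a real protected URL):
# A few Requests conveniences beyond plain page fetching:
resp = requests.get(
    "http://web.stanford.edu/class/linguist278/data/scrapeme.html",
    timeout=10)  # give up rather than hang if the server is slow
print(resp.status_code)  # 200 means the request succeeded
print(resp.encoding)     # the encoding Requests inferred from the headers
print(resp.cookies)      # any cookies the server set (likely empty here)
# Pages behind HTTP basic authentication take credentials like this:
# requests.get("https://example.com/private", auth=("user", "password"))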
def downloader(link):
"""
Download `link` and set the encoding to UTF8.
Parameters
----------
link : str
The location of the page to download
Returns
-------
str
"""
req = requests.get(link)
req.encoding = "utf8"
return req.text
contents = downloader("http://web.stanford.edu/class/linguist278/data/scrapeme.html")
print(contents)
BeautifulSoup objects are structured representations of webpages. They have some properties of nested dicts, but you can also perform different kinds of traversal of the structure.
soup = Soup(contents, "lxml")
The prettify method gives a version of the string where the structure is made more apparent by neat indenting. This can be very helpful for large, complex, machine-generated webpages like the one you'll work with on the homework.
print(soup.prettify())
soup.h1
soup.h1.name
soup.head
soup.head.meta
soup.head.meta['charset']
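Square-bracket access raises a KeyError if the attribute isn't present. As with dicts, the get method is a safe alternative (a small aside, using the soup object defined above):
# Safe attribute access: returns None instead of raising a KeyError.
print(soup.head.meta.get('charset'))
print(soup.head.meta.get('not-a-real-attribute'))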
The most common method for finding things is find_all:
# Gets all the <p> elements:
paragraphs = soup.find_all("p")
# Gets all the p elements with a "class" attribute with value "intro":
#
# <p class="intro">
soup.find_all("p", attrs={"class": "intro"})
You can use regexes too!
soup.find_all(re.compile("^(p|a)$"))[: 3]
soup.find_all("p", attrs={"class": re.compile(r"(intro|main-text)")})
See also children and descendants.
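Here's a quick sketch of the difference, using the soup object from above (the exact output depends on the page's structure): children iterates over direct children only, while descendants recurses through the whole subtree.
# Text nodes are included in both iterators, so we keep only real tags:
direct = [c.name for c in soup.body.children if getattr(c, "name", None)]
nested = [t.name for t in soup.body.descendants if getattr(t, "name", None)]
print(direct)
print(len(nested))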
If an element doesn't contain any other HTML tags, then the string
attribute will give you the intuitive string content:
soup.h1.string
The contents attribute is similar but always returns a list:
soup.h1.contents
If the element contains any tags, then string will return None:
paragraphs[2]
paragraphs[2].string
However, contents will return a list as before, mixing different kinds of elements:
para2 = paragraphs[2].contents
para2
para2[1]
para2[1]['href']
para2[1].string
You can also use stripped_strings, which is a generator over all the strings (tags removed) inside the element; this is a fast way to extract the raw texts, with all tag soup strained off:
for s in paragraphs[2].stripped_strings:
print("="*50)
print(s)
print(contents[: 100])
soup.head.title.string
soup.find_all("p", attrs={"class": "main-text"})
Links look like this:
<a href="http://linguistics.stanford.edu/documents/Final-1213-PhD-Handbook-8.pdf">text</a>
We want the 'href' value, not the text of the link.
links = []
for a in soup.find_all("a"):
links.append(a['href'])
links
toy_contents = downloader("http://web.stanford.edu/class/linguist278/data/toyhtml.html")
toy_soup = Soup(toy_contents, "lxml")
print(toy_soup.prettify())
for elem in toy_soup.find_all("h2"):
print(elem.string)
The heading elements are h1, h2, h3, h4, and so on; a single regex matches them all:
for elem in toy_soup.find_all(re.compile(r"^h\d+$")):
print(elem.name)
groceries = []
for elem in toy_soup.find_all("ul", attrs={"class": "groceries"}):
for li in elem.find_all("li"):
groceries.append(" ".join(li.stripped_strings))
groceries
Extract the movie table from the page and create a dict mapping movie titles to pairs (dir, year), where dir is the movie's director and year is an int giving its release year.
movie_data = {}
tbl = toy_soup.find("table", attrs={"class": 'movies'})
for tr in tbl.find_all("tr"):
tds = tr.find_all("td")
if tds:
movie_data[tds[0].string] = (tds[1].text, int(tds[2].text))
movie_data
Complete pdf_crawler according to the specification defined by its docstring.
For experimentation, please use the link supplied by default. This takes you to a directory in our class space. You can see the contents of this directory here:
http://www.stanford.edu/class/linguist278/data/crawler/
You can also start at one of these other pages if you like:
import time
def pdf_crawler(
base_link="http://www.stanford.edu/class/linguist278/data/crawler/start.html",
output_dirname=".",
depth=0):
"""Crawls webpages looking for PDFs to download. The function branches out
from the original to the specified depth, and saves the files locally in
`output_dirname`. For each HTML page H linked from link, it follows H and
calls `pdf_crawler` on H. It continues this crawling to `depth`, supplied
by the user. `depth=0` just does the user's page, `depth=1` follows the
links at the user's page, downloads those files, but doesn't go further,
and so forth.
The function is respectful in that it uses the time module to rest between
downloads, so that no one's server gets slammed:
https://docs.python.org/3.6/library/time.html
It avoids downloading files it already has; see `_download_pdf`, which is
completed for you (but could be improved).
The page search aspects of this can be done purely with Requests, but I used
BeautifulSoup.
Parameters
----------
base_link : str
Page to start crawling from
output_dirname : str
Local directory in which to save the downloaded PDFs
depth : int
How many pages out from the source to go
"""
# Make sure we get sensible depth values:
if depth < 0:
return None
page = downloader(base_link)
soup = Soup(page, "lxml")
base = os.path.split(base_link)[0]
# Get the PDF links and download each:
for pdf in soup.find_all("a", attrs={"href": re.compile(r"\.pdf$")}):
_download_pdf(base, pdf['href'], output_dirname)
time.sleep(1)
# Get the HTML links:
if depth > 0:
for html in soup.find_all("a", attrs={"href": re.compile(r"\.html$")}):
link = "{}/{}".format(base, html['href'])
pdf_crawler(link, output_dirname, depth-1)
def _download_pdf(base, filename, output_dirname):
"""Handles the PDF downloading part for `pdf_crawler`. This function checks
to see whether it already has the file and avoids redownloading if it does,
and it also checks that the file status code is 200, to avoid trying to
download files that don't exist (status 404) or that are otherwise
unavailable.
Parameters
----------
base : str
The base of the link (the Web directory that contains the file)
filename : str
The basename of the file
output_dirname : str
Local directory in which to save the downloaded PDF
"""
output_filename = os.path.join(output_dirname, filename)
# Don't redownload a file we already have:
if os.path.exists(output_filename):
return None
# PDF link:
link = "{}/{}".format(base, filename)
req = requests.get(link)
# Does the file exist?
if req.status_code != 200:
print("Couldn't download {}: status code {}".format(link, req.status_code))
else:
# Download if the file is present:
with open(output_filename, 'wb') as f:
f.write(req.content)
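The docstring above notes that _download_pdf could be improved. One possible refinement, sketched below and not required for the exercise, is to stream the response in chunks rather than holding the whole PDF in memory, and to sanity-check the Content-Type header before saving (the name _download_pdf_streaming is just for this illustration):
def _download_pdf_streaming(base, filename, output_dirname):
    """A hypothetical variant of `_download_pdf` that streams the response
    and checks the Content-Type header before writing to disk."""
    output_filename = os.path.join(output_dirname, filename)
    # Don't redownload a file we already have:
    if os.path.exists(output_filename):
        return None
    link = "{}/{}".format(base, filename)
    req = requests.get(link, stream=True)
    if req.status_code != 200:
        print("Couldn't download {}: status code {}".format(link, req.status_code))
    elif "pdf" not in req.headers.get("Content-Type", "").lower():
        print("Skipping {}: not served as a PDF".format(link))
    else:
        # Write the file in chunks to keep memory usage small:
        with open(output_filename, 'wb') as f:
            for chunk in req.iter_content(chunk_size=8192):
                f.write(chunk)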
pdf_crawler(output_dirname="tmp", depth=2)