Summer Workshop in Computational Social Science: Gathering, Cleaning and Processing Internet Data

Sean Westwood
September 20, 2011

Goals

  1. Learn methods to gather data
  2. Go over methods to clean downloaded content
  3. Go over the basics of requesting data through APIs

First steps

We will work on Stanford's servers

Open SSH (or PuTTY) and connect to corn.stanford.edu

ssh sunetid@corn.stanford.edu        

Gathering internet data (without a browser)

wget

GNU Wget is a free software package for retrieving files using HTTP, HTTPS and FTP

Useful for simple downloads

and for recursive downloads (getting an entire website)

Working with Wget

wget http://www.google.com

Working with Wget

‘-r’

Turn on recursive retrieving.

‘-l depth’

Depth to download

wget -r -l 1 http://www.google.com        

This downloads the page at google.com plus every file it links to directly (recursion depth 1)

Wget mirroring (getting everything)

‘-m’

Turn on options suitable for mirroring. This option turns on recursion and time-stamping, sets infinite recursion depth and keeps ftp directory listings.

cURL

Used to interactively request information from servers

Send variables to request specific data

Send cookies (authentication, etc.)

'Fake' form data

cURL

curl http://google.com

Basic request

curl -o example.html http://google.com

Write output to a file instead of stdout

curl -O http://www.stanford.edu/index.html

Save a remote file and preserve the remote name

Exercise 1

Requesting data from a non-static source

Two general methods for sending data to a server over the HTTP protocol

GET requests

embed information needed to respond to the request in the URL

https://www.google.com/search?q=iriss+stanford
https://www.google.com/search?q=iriss+stanford&hl=de 

Order doesn't matter, but the first variable must be preceded by a '?' and all others by an '&'
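
The rule above can be checked with a small sketch that splits a URL's query string back into its variables. The class name QueryStringExample and its helper methods are made up for this illustration; only the standard Java library is used.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

class QueryStringExample {

    // Split the part of a URL after '?' into decoded name/value pairs.
    static Map<String, String> parseQuery(String url) {
        Map<String, String> params = new LinkedHashMap<>();
        String query = URI.create(url).getRawQuery(); // null when the URL has no '?'
        if (query == null) {
            return params;
        }
        for (String pair : query.split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            String[] parts = pair.split("=", 2);
            String value = parts.length > 1 ? parts[1] : "";
            params.put(decode(parts[0]), decode(value));
        }
        return params;
    }

    // '+' decodes to a space and %XX sequences decode to the original characters.
    static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Same variables in a different order: the parsed maps compare equal.
        Map<String, String> a = parseQuery("https://www.google.com/search?q=iriss+stanford&hl=de");
        Map<String, String> b = parseQuery("https://www.google.com/search?hl=de&q=iriss+stanford");
        System.out.println(a);
        System.out.println(a.equals(b));
    }
}
```

No request is sent here; the URLs are only treated as strings.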

POST requests

Can represent any kind of data of any length

Data are encoded the same way as a GET query string, but are sent in the request body instead of the URL

A form collecting your "Name" and "Location" would encode to:

Name=Sean+Westwood&Location=Palo+Alto

example: http://www.htmlcodetutorial.com/forms/_FORM_METHOD_POST.html
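
As a rough sketch of the encoding described above, the following builds the body for the "Name" and "Location" form and, optionally, sends it. The class FormPostExample and its methods are hypothetical names for this illustration, and the target URL is whatever you pass on the command line; nothing is sent unless you supply one.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

class FormPostExample {

    // Join the fields as name=value pairs separated by '&'.
    static String encodeForm(Map<String, String> fields) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            if (body.length() > 0) {
                body.append('&');
            }
            body.append(encode(field.getKey())).append('=').append(encode(field.getValue()));
        }
        return body.toString();
    }

    // Spaces become '+', other reserved characters become %XX.
    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    // Send the encoded body as a POST request and return the HTTP status code.
    static int post(String url, String body) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true); // we are going to write a request body
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }
        return conn.getResponseCode();
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("Name", "Sean Westwood");
        fields.put("Location", "Palo Alto");
        String body = encodeForm(fields);
        System.out.println(body);
        if (args.length > 0) {
            // Only contacts a server when a URL is given explicitly.
            System.out.println(post(args[0], body));
        }
    }
}
```

Run without arguments, this only prints the encoded body.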

headers

HTTP headers contain information about all HTTP requests and responses. They include:

Request headers

  • Information about the browser (User-Agent, language, etc.)
  • Cookies (created in the past)
  • get

Response headers

  • Information about the request (status)
  • Content information (compression, mime-type, etc.)
  • Cookies (that will be created as a result of the request)

Request headers

Accept:text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Charset:ISO-8859-1,utf-8;q=0.7,*;q=0.3
Accept-Encoding:gzip,deflate,sdch
Accept-Language:en-US,en;q=0.8
Cache-Control:max-age=0
Connection:keep-alive
Cookie:RMID=19de3a0356b15005c00dbe25;...
Host:www.nytimes.com
Referer:http://www.nytimes.com/
User-Agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/537.8 (KHTML, like Gecko) Chrome/23.0.1255.0 Safari/537.8

Response headers

HTTP/1.1 200 OK
Date: Wed, 19 Sep 2012 21:30:14 GMT
Server: Apache
expires: Thu, 01 Dec 1994 16:00:00 GMT
cache-control: no-cache
pragma: no-cache
Set-cookie: adxcl=l*2ea2d=5107574f:1|li=5107574f:1; expires=Thursday, 19-Sep-2013 21:30:14 GMT; path=/; domain=.nytimes.com
Content-Type: text/html; charset=UTF-8
Content-Encoding: gzip
Transfer-Encoding: chunked
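
To make the structure of a response concrete, here is a minimal sketch that splits a raw header block into its status code and name/value pairs. ResponseHeaderExample is a made-up class, the sample is a shortened copy of the response above, and the parser is deliberately naive: a repeated header such as Set-Cookie would overwrite the earlier value.

```java
import java.util.Map;
import java.util.TreeMap;

class ResponseHeaderExample {

    // A shortened copy of the response headers shown above.
    static final String SAMPLE =
        "HTTP/1.1 200 OK\r\n"
        + "Date: Wed, 19 Sep 2012 21:30:14 GMT\r\n"
        + "Server: Apache\r\n"
        + "Content-Type: text/html; charset=UTF-8\r\n"
        + "Content-Encoding: gzip\r\n";

    // The first line is the status line: protocol, status code, reason phrase.
    static int statusCode(String raw) {
        String statusLine = raw.split("\\r?\\n", 2)[0];
        return Integer.parseInt(statusLine.split(" ")[1]);
    }

    // Every later line is "Name: value"; header names are case-insensitive.
    static Map<String, String> parseHeaders(String raw) {
        Map<String, String> headers = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        String[] lines = raw.split("\\r?\\n");
        for (int i = 1; i < lines.length; i++) {
            int colon = lines[i].indexOf(':'); // split on the FIRST colon only
            if (colon <= 0) {
                continue; // skip blank or malformed lines
            }
            String name = lines[i].substring(0, colon).trim();
            String value = lines[i].substring(colon + 1).trim();
            headers.put(name, value); // a repeated header replaces the earlier one
        }
        return headers;
    }

    public static void main(String[] args) {
        System.out.println(statusCode(SAMPLE));
        System.out.println(parseHeaders(SAMPLE).get("content-type"));
    }
}
```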
        

Exercise 2

Exercise 3

Using cURL to get a story behind the NYTimes Paywall

cURL Example

Working with requested data (XML, HTML and the DOM)

Simple XHTML document

   
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
<head>
<title>An example page</title>
</head>
<body>
	<h1>Heading text</h1>
	<p>A paragraph</p>
</body>
</html>

Node structure

Clear hierarchy of nodes

Simple XHTML DOM

Source: w3schools

Important note on XML, XHTML and HTML

XHTML is XML, but HTML is not XML.

Methods for working with XML such as XPATH will not always work with HTML. This largely depends on the language and the implementation of XPATH.

Crap markup will often break XPATH and other tools.

XPATH

<body>
	<h1 id="introHeader">Heading text</h1>
	<p class="articleText">A paragraph</p>
	<p class="articleText">A second paragraph</p>
</body>
        

XPath is used to navigate through elements and attributes in an XML/HTML document.

XPATH

Select the first paragraph

 /html/body/p[1]

Select all paragraphs in body

/html/body/p

Select just the text in the paragraphs

/html/body/p/text()

XPATH

<body>
	<h1 id="introHeader">Heading text</h1>
	<p class="firstParagraph">A paragraph</p>
	<p class="articleText">A second paragraph</p>
</body>

Conditional selections (only the paragraph with class "firstParagraph")

//p[@class='firstParagraph']

"/" Selects from the root node
"//" Selects nodes in the document from the current node that match the selection no matter where they are
"@" Selects attributes

Exercise 4

Try the previous XPATH examples with http://www.bit-101.com/xpath/ and the following markup:

<html>
<head>
<title>An example page</title>
</head>
<body>
	<h1 id="introHeader">Heading text</h1>
	<p class="firstParagraph">A paragraph</p>
	<p class="articleText">A second paragraph</p>
</body>
</html>
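
The same expressions can also be evaluated in code. The sketch below is one possible way to do it with Java's built-in XML parser and XPath engine, assuming the markup is well-formed enough to be parsed as XML (true for this example, often not for real web pages). XPathExample and its select helper are made-up names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathExample {

    // The exercise markup, written as one string.
    static final String MARKUP =
        "<html><head><title>An example page</title></head><body>"
        + "<h1 id=\"introHeader\">Heading text</h1>"
        + "<p class=\"firstParagraph\">A paragraph</p>"
        + "<p class=\"articleText\">A second paragraph</p>"
        + "</body></html>";

    // Evaluate an XPath expression and return the text of every matching node.
    static List<String> select(String markup, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(markup)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent());
            }
            return values;
        } catch (Exception e) {
            // Malformed markup or an invalid expression ends up here.
            throw new IllegalArgumentException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(select(MARKUP, "/html/body/p[1]"));
        System.out.println(select(MARKUP, "/html/body/p/text()"));
        System.out.println(select(MARKUP, "//p[@class='firstParagraph']"));
        System.out.println(select(MARKUP, "//h1/@id"));
    }
}
```

Feeding this parser markup with unclosed tags throws an exception, which is the XML-versus-HTML problem noted earlier.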

Exercise 5

Using the following markup

  <html>
  <head>
  <title>An example page</title>
  </head>
  <body>
	<div id="main">
  		<h1 id="introHeader">Heading text</h1>
  		<p class="firstParagraph">A paragraph</p>
  		<p class="articleText">A second paragraph with a 
        	<a href="http://www.google.com">link to google</a>.</p>
	</div>
  </body>
  </html>

for help see: http://www.w3schools.com/xpath/xpath_syntax.asp

Exercise 5 - Possible solutions

Combining cURL and XPATH: getting event names from Stanford.edu

<?php
// Fetch the page, following redirects and identifying as a desktop browser
$curl = curl_init('http://www.stanford.edu');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.10 (KHTML, like Gecko) Chrome/8.0.552.224 Safari/534.10');
$html = curl_exec($curl);
curl_close($curl);

// Parse the HTML; @ suppresses the warnings that malformed markup triggers
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
// Select every span whose class attribute is exactly "event_title"
$result = $xpath->query('//span[@class="event_title"]');

foreach ($result as $linkText) {
	echo $linkText->nodeValue."<br>";
}

Running php scripts on a server

php -f script.php

To create a file use nano (or vim)

nano filename.php

Exercise 4

Using the code on the previous slide as a starting point, capture the headlines from http://www.nytimes.com/pages/politics/index.html

Bonus if you can get all the articles, including the featured article

Exercise 4 - Possible solution

<?php
$curl = curl_init('http://www.nytimes.com/pages/politics/index.html');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.10 (KHTML, like Gecko) Chrome/8.0.552.224 Safari/534.10');
$html = curl_exec($curl);
curl_close($curl);

$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$result = $xpath->query('//div[@class="storyHeader"]/h1/a/text()');

foreach ($result as $linkText) {
	echo $linkText->nodeValue."<br>";
}

$result = $xpath->query('//div[@class="story"]/h3/a/text()');

foreach ($result as $linkText) {
	echo $linkText->nodeValue."<br>";
}

Exercise 5

Using the code from the last example, use XPATH and cURL to download all the articles from the politics section of the New York Times

Name each file as n.html, where n is the current index of the article in the list of articles

Exercise 5 - Possible solution

<?php
$curl = curl_init('http://www.nytimes.com/pages/politics/index.html');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.10 (KHTML, like Gecko) Chrome/8.0.552.224 Safari/534.10');
$html = curl_exec($curl);
curl_close($curl);

$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$result = $xpath->query('//div[@class="story"]/h3/a/@href');

$i=0;
foreach ($result as $linkText) {
	$curl = curl_init($linkText->nodeValue);
	curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
	curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
	curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.10 (KHTML, like Gecko) Chrome/8.0.552.224 Safari/534.10');
	$html = curl_exec($curl);
	curl_close($curl);
	file_put_contents($i.".html", $html); // saved as 0.html, 1.html, ...
	$i++;
}

Another way to deal with HTML data: class and ID Selectors

<body>
	<div id="main">
		<h1 id="introHeader">Heading text</h1>
		<p class="firstParagraph">A paragraph</p>
		<p class="articleText">A second paragraph with a <a href="http://www.google.com">link to google</a>.</p>
	</div>
</body>

A class selector is a name preceded by a period (.) and an ID selector is a name preceded by a hash character (#).

An ID represents ONE element, whereas a class can represent any number of elements

jQuery-style selectors

#introHeader selects the h1 element

.articleText selects every element with the class articleText (here, only the second p element)

Additional information such as element names or additional classes come after the initial selector

To select the link in the second paragraph:

.articleText a

http://www.w3schools.com/jquery/jquery_ref_selectors.asp
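
For readers who know XPATH, it can help to see how these selectors map onto it. The sketch below translates the three simplest forms (#id, .class and a bare element name, separated by spaces) into XPath and runs them with Java's standard library. The toXPath helper is a made-up, deliberately tiny translator; real selector engines such as jQuery or Jsoup support far more syntax than this.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class SelectorToXPathExample {

    // The markup from this slide, written as one string.
    static final String MARKUP =
        "<body><div id=\"main\">"
        + "<h1 id=\"introHeader\">Heading text</h1>"
        + "<p class=\"firstParagraph\">A paragraph</p>"
        + "<p class=\"articleText\">A second paragraph with a "
        + "<a href=\"http://www.google.com\">link to google</a>.</p>"
        + "</div></body>";

    // Translate "#id", ".class" and "tag" parts; a space means "any descendant".
    static String toXPath(String selector) {
        StringBuilder xpath = new StringBuilder();
        for (String part : selector.trim().split("\\s+")) {
            if (part.startsWith("#")) {
                xpath.append("//*[@id='").append(part.substring(1)).append("']");
            } else if (part.startsWith(".")) {
                // Match the class as one whole word within the class attribute.
                xpath.append("//*[contains(concat(' ', normalize-space(@class), ' '), ' ")
                     .append(part.substring(1)).append(" ')]");
            } else {
                xpath.append("//").append(part);
            }
        }
        return xpath.toString();
    }

    // Return the text content of every node the selector matches.
    static List<String> select(String markup, String selector) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(markup)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(toXPath(selector), doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not evaluate " + selector, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toXPath("#introHeader"));
        System.out.println(select(MARKUP, "#introHeader"));
        System.out.println(select(MARKUP, ".articleText a"));
    }
}
```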

jQuery-style selectors 2

Generally select a single node or a list of nodes

To access attributes (e.g., the href of an 'a' tag) or content, you must then read them from the selected node

Java

String URL = "http://www.spiegel.de/spiegel/print/index-" + year + "-" + issue + ".html";

Document document = Jsoup.connect(URL).timeout(12000).get();
Elements links = document.select("#spHeftInhalt a");
Integer articleNumber = 0;

for (Element link : links) {
	String linkHref = "http://www.spiegel.de" + link.attr("href");
	processFile(linkHref, year, issue, articleNumber);
	articleNumber++;
}

Java class (also an example of how to create an XML document with Java)

One more way: elements by name

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://feeds.nytimes.com/nyt/rss/Politics");
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

$newsScrape = curl_exec($ch);
$docNews = new DOMDocument();
$docNews->loadXML($newsScrape); 

$nodesArticles = $docNews->getElementsByTagName("item");
foreach($nodesArticles as $nodesArticle){
	$link = $nodesArticle->getElementsByTagName("link")->item(0)->nodeValue;
}

A smarter way to clean HTML

http://boilerpipe-web.appspot.com/

Java

URL url = new URL("http://www.example.com/some-location/index.html");
String text = ArticleExtractor.INSTANCE.getText(url);

Data formats: XML and JSON

XML is a full markup language

JSON is an object notation

Comparing XML and JSON

<?xml version="1.0" encoding="UTF-8"?>
<articles>
	<article>
		<source>The Political Methodologist</source>
		<title> Data from the Web into R</title>
		<author> Simon Jackman</author>
		<date> Fall 2006</date>
	</article>
</articles>

 

XML

[{
	"article": 
	[{
		"source":"The Political Methodologist",
		"title":" Data from the Web into R",
		"author":" Simon Jackman",
		"date":" Fall 2006"
	}]
}]

 

JSON

Exercise 6

Convert the following CSV data to XML and JSON

Email,Name,Department
aarefeva@stanford.edu,"Arefeva, Alina",Econ
gallego@stanford.edu,"Gallego, Aina",Poli Sci
kgleich@gmail.com,"Gleichauf, Karla",Eng
akarama1@stanford.edu,"Karamalla, Ayman",MS&E

XML Validator

JSON Validator

A web-based converter

http://shancarter.com/data_converter/ converts CSV or tab data to XML and JSON

Programmatic conversion from JSON or XML to CSV requires custom code for each document/schema, because nested structures have no single flat equivalent
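
To illustrate what such custom code might look like, here is a minimal sketch that flattens the articles XML from the earlier comparison slide into CSV rows. The column list is hard-coded for that one schema, which is exactly why the code does not carry over to other documents. XmlToCsvExample and its methods are made-up names; only the standard Java library is used.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XmlToCsvExample {

    // Schema-specific knowledge: one row per <article>, one column per child element.
    static final String[] COLUMNS = {"source", "title", "author", "date"};

    static final String SAMPLE =
        "<articles><article>"
        + "<source>The Political Methodologist</source>"
        + "<title> Data from the Web into R</title>"
        + "<author> Simon Jackman</author>"
        + "<date> Fall 2006</date>"
        + "</article></articles>";

    // Return a header row followed by one CSV row per <article> element.
    static List<String> toCsv(String xml) {
        List<String> rows = new ArrayList<>();
        rows.add(String.join(",", COLUMNS));
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList articles = doc.getElementsByTagName("article");
            for (int i = 0; i < articles.getLength(); i++) {
                Element article = (Element) articles.item(i);
                List<String> cells = new ArrayList<>();
                for (String column : COLUMNS) {
                    NodeList matches = article.getElementsByTagName(column);
                    // A missing element becomes an empty cell; stray whitespace is trimmed.
                    String value = matches.getLength() > 0
                        ? matches.item(0).getTextContent().trim() : "";
                    cells.add(quote(value));
                }
                rows.add(String.join(",", cells));
            }
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return rows;
    }

    // Quote every cell and double any embedded quotes so commas in values are safe.
    static String quote(String value) {
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) {
        for (String row : toCsv(SAMPLE)) {
            System.out.println(row);
        }
    }
}
```

Repeated child elements (several authors, say) would need an extra decision, such as joining them into one cell or emitting one row per value.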

Exercise 7

Convert the following XML to a flat file

<items>
	<item id="0001" type="donut">
		<name>Cake</name>
		<batters>
			<batter>Regular</batter>
			<batter>Chocolate</batter>
			<batter>Blueberry</batter>
		</batters>
		<topping>None</topping>
		<topping>Glazed</topping>
		<topping>Sugar</topping>
		<topping>Sprinkles</topping>
		<topping>Chocolate</topping>
		<topping>Maple</topping>
	</item>
</items>