CS 276 / LING 286: Information Retrieval and Web Search

Course Information

Description: Basic and advanced techniques for text-based information systems: efficient text indexing; Boolean and vector space retrieval models; evaluation and interface issues; Web search including crawling, link-based algorithms, and Web metadata; text/Web clustering, classification; text mining.

Lectures: 3 units, Tu/Th 4:30-5:50 at NVIDIA Auditorium (available online through SCPD)

Course policies: details here

Prerequisites: CS 107, CS 109, CS 161. Ideally the whole CS Major Core.
This year's programming assignments will be in Java. The material here may be helpful to refresh your Java knowledge.

Online Resources

Coursera: CS 276 will be making use of online videos and quizzes as well as live class lectures, presentations, and labs. Visit Coursera to find video chunks. To sign up, visit this signup link, note that you need to sign up with your @stanford.edu email address.

Piazza: We strongly recommend that you post any questions about assignments, lectures or general course material on Piazza. It facilitates lively discussion among students, and allows others to benefit from the discussion too. Even if you think your question is specific to your implementation, you can use the 'private question' feature to address a question to the staff specifically. We will be using Piazza to post course announcements, so be sure to signup on Piazza soon too. Here is a quick introduction video.

Gradescope: All students (SCPD and non-SCPD) submit their assignments via Gradescope. To register for CS276 on Gradescope, first create an account on the website if don't have one already. Join CS276 using the entry code 9WNEZM.

Staff Email: If you really have a question about your situation in specific, that you do not think is not appropriate for the class forum (not even a private question), please email the staff mailing list at cs276-spr1516-staff@lists.stanford.edu

Announcements: If you are not registered in the course but wish to receive course announcements, you may subscribe to the guest mailing list cs276-spr1516-guests.

CS 276 Staff Information

Professors:
Chris Manning, Office Hours: Tuesday 3:00 pm - 4:00 pm, Gates 248
Pandu Nayak, Office Hours: After lecture, by appointment

TAs: Panupong Pasupat (Head TA), Pujun Bhatnagar, Jade Huang, Jiang Han, Shubham Gupta, Nihit Desai

TA Office Hours

Daytime Office Hours

Evening Code Sessions

For the most part, the weekly schedule below is fixed. However, occasionally there might be changes to schedule given above. Please check this calendar before visiting office hours. We will try to avoid last minute changes and will make a Piazza announcement if necessary. For SCPD students, we will be available via Google hangout. Hangout link will be posted on Piazza.


Syllabus

Details of the schedule, videos, slides and reading lists will be updated as the quarter progresses.

Week Date In Class Assignments
(Due 11:59pm)
Who Materials
Week 1 3/29 Introduction to the course PN Coursera:
Boolean Retrieval (IIR Ch. 1)
Notes:
[powerpoint] [PDF/6] [PDF/1]
Readings:
IIR Ch. 1
MG 3.2
MIR 8.2
Shakespeare plays
3/31 Lab: Merge algorithm for proximity queries using a positional index
Starter code
PA1 Out CM Coursera:
Term Vocabulary and Postings Lists (IIR Ch. 2)
Notes:
[powerpoint] [PDF/6] [PDF/1]
Readings
IIR Ch. 2
MG Ch. 3.6 4.3
MIR 7.2
Porter's stemmer (MIR), Porter stemming algorithm (Official)
A skip list cookbook (Pugh 1990)
Fast phrase querying with combined indexes (Williams, Zobel, Bahle 2004)
Efficient phrase querying with an auxiliary index (Bahle, Williams, Zobel 2002)
Week 2 4/05 Lab: Algorithms for postings list compression
Starter code
CM Coursera:
Index Compression (IIR Ch. 5)
Notes:
[powerpoint] [PDF/6] [PDF/1]
Readings:
IIR Ch. 5
MG 3.3, 3.4
Compression of inverted indexes for fast query evaluation (Scholer et al. 2002)
Inverted index compression using word-aligned binary codes (Anh and Moffat 2005)
4/07 Spelling correction
[PDF/6] [PDF/1]
PS1 Out
Query Quiz
PN Coursera:
Dictionaries and Tolerant Retrieval (IIR Ch. 3)
[powerpoint] [PDF/6] [PDF/1]
Readings:
How to write a spelling corrector (Peter Norvig)
IIR Ch. 3
MG 4.2
Techniques for automatically correcting words in text (Kukich 1992)
Finding approximate matches in large lexicons (Zobel and Dart 1995)
Efficient Generation and Ranking of Spelling Error Corrections (Tillenius)
Week 3 4/12 Guest speaker: Udi Manber
(Attendance required for on-campus students)
Udi
4/14 Probabilistic IR: Binary Independence Model
[powerpoint] [PDF/6] [PDF/1]
PA1 Due
PA2 Out
CM Coursera:
Vector Space Model (IIR Ch. 6)
[powerpoint] [PDF/6] [PDF/1]
Readings:
IIR Ch. 6
IIR Ch. 11
Week 4 4/19 Computing Scores and BM25F
[powerpoint] [PDF/6] [PDF/1]
CM Coursera:
Computing Scores (IIR Ch. 7)
Notes:
[powerpoint] [PDF/6] [PDF/1]
Readings:
IIR Ch. 11
IIR Ch. 7
4/21 Recent evaluation, NDCG, using clickthrough; rate queries & results
[powerpoint] [PDF/6] [PDF/1]
PS1 Due
PS1 Solutions
Ranking Quiz
PN Coursera:
Result Summaries (IIR Ch. 8)
Notes:
[powerpoint] [PDF/6] [PDF/1]
Readings:
IIR Ch. 8
MG 4.5
MIR Ch. 3
Week 5 4/26 Systems issues in efficient retrieval and scoring
[powerpoint] [PDF/6] [PDF/1]
PN Readings:
IIR Ch. 7
IIR Ch. 6
Efficient Query Evaluation using a Two-Level Retrieval Process (Broder et al 2003)
4/28 Class lab: Mapreduce with Java
[Instructions on installing Hadoop and running mapreduces]
In class exercise
Solutions
PA2 Due
PA3 Out
PN Notes:
[Jeff Dean's MapReduce Slides]
Readings:
IIR 4.4
Week 6 5/03 CLASSIFICATION 1 + 2: Naive Bayes, kNN, decision boundaries
[powerpoint] [PDF/6] [PDF/1]
PN Coursera:
Naive Bayes (IIR Ch. 13)
Readings:
IIR Ch. 13 and IIR Ch. 14
Machine learning in automated text categorization (Sebastiani 2002)
A re-examination of text categorization methods (Yang et al. 1999)
A Comparison of event models for naive Bayes text classification (McCallum et al. 1998)
Tom Mitchell. Machine Learning. McGraw-Hill, 1997.
Open Calais
Weka
Reuters-21578
Tackling the poor assumptions of Naive Bayes classifier (Rennie et al. 2003)
Machine learning in automated text categorization (Sebastiani 2002)
Tom Mitchell. Machine Learning.McGraw-Hill, 1997.
A re-examination of text categorization methods (Yang et al. 1999)

Evaluating and optimizing autonomous text classification systems (Lewis 1995)
Trevor Hastie, Robert Tibshirani, Jerome Friedman. Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, New York, 2001.
5/05 CLASSIFICATION 3. Support vector machines
[powerpoint] [PDF/6] [PDF/1]
PS2 Out CM Readings:
IIR Ch. 15
A tutorial on support vector machines for pattern recognition (Burges 1998)
Using SVMs for text categorization (Dumais1998)
Inductive learning algorithms and representations for text categorization (Dumaiset a. 1998)
A Re-examination of text categorization methods (Yang et al. 1999)
Text categorization based on regularized linear classification methods (Zhang et al. 2001)

Trevor Hastie, Robert Tibshirani, Jerome Friedman. Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag,New York, 2001.
Reuters-21578
Thorsten Joachims. Learning to Classify Text using Support Vector Machines. Kluwer, 2002.
A loss function analysis for classification methods in text categorization (Li et al. 2003)
Week 7 5/10 Learning to rank.
[powerpoint] [PDF/6] [PDF/1]
CM Readings:
IIR 6.1.2-3, IIR 15.4
Discriminative models for information retrieval (Nallapati 2004)
Adapting ranking SVM to document retrieval (Cao et al. 2006)
A support vector method for optimizing average precision (Yue et al. 2007)
LETOR benchmark datasets
5/12 Link analysis
[powerpoint] [PDF/6] [PDF/1]
PA3 Due
PA4 Out
PN Readings:
IIR Ch. 21
Ranking the web frontier (Eiron et al. 2004)
The WebGraph framework I: Compression techniques (Boldi et al. 2004)
Extrapolation methods for accelerating PageRank computations (Kamvar et al. 2003)
Searching the workplace web (Fagin et al. 2003
Week 8 5/17 Crawling, near-dups
[powerpoint] [PDF/6] [PDF/1]
PN Notes:
[powerpoint] [PDF/6] [PDF/1]
Readings:
IIR Ch. 19 and IIR Ch. 20
Mercator: A scalable, extensible web crawler (Heydon et al. 1999)
A standard for robot exclusion
5/19 Guest speaker: Jeff Dean
(Attendance required for on-campus students)
PS2 Due
PS2 Solutions
Jeff
Week 9 5/24 Distributed word representations for IR
[powerpoint] [PDF/6] [PDF/1]
CM Readings:
Distributed Representations of Words and Phrases and their Compositionality (Mikolov et al, 2013)
GloVe: Global Vectors for Word Representation (Pennington et all, 2014)
5/26 Personalization
[powerpoint] [PDF/6] [PDF/1]
PA4 Due PN Readings:
J. Teevan, S. Dumais, E. Horvitz. Potential for personalization. 2010
J. Pitkow et al. Personalized search. 2002
J. Teevan, S. Dumais, E. Horvitz. Personalizing search via automated analysis of interests and activities. 2005
P. Bennett et al. Inferring and using location metadata to personalize Web search. 2011
T. Haveliwala. Topic-sensitive pagerank. 2002.
G. Jeh and J. Widom. Scaling personalized Web search. 2003
M. Curtiss et al. Unicorn: A system for searching the social graph. 2013
Week 10 5/31 Question Answering
[powerpoint] [PDF/6] [PDF/1]
CM
Week 10 6/04 Alternative Final Exam
Time: 3:30-6:30pm
Location: CEMEX Auditorium (in business school)
Week 11 6/08 Final Exam
Practice Final
Solutions
Time: 3:30-6:30pm
Location: NVIDIA Auditorium (our usual classroom)

Required Textbook:

IIR = Introduction to Information Retrieval, by C. Manning, P. Raghavan, and H. Schütze. Cambridge University Press, 2008.

This book is available from the Stanford bookstore (or your favorite book purveyor). You can also download and print chapters at the book website. (We’d appreciate any reports of typos or of higher-level problems for the third printing. Thanks.)

Other Good IR Books:

MG = Managing Gigabytes, by I. Witten, A. Moffat, and T. Bell.
IRAH = Information Retrieval: Algorithms and Heuristics by D. Grossman and O. Frieder.
MIR = Modern Information Retrieval, by R. Baeza-Yates and B. Ribeiro-Neto.
FSNLP = Foundations of Statistical Natural Language Processing, by C. Manning and H. Schütze.
SE = Search Engines: Information Retrieval in Practice, by Bruce Croft, Donald Metzler and Trevor Strohman.
IRIE = Information Retrieval: Implementing and Evaluating Search Engines, by Stefan Buettcher, Charles L. A. Clarke and Gordon V. Cormack.

Lecturers:

CM = Chris Manning
PN = Pandu Nayak