CS 276 / LING 286: Information Retrieval and Web Search

Lecture: 3 units, Tu/Th 4:15-5:30 at NVIDIA Auditorium (available online through SCPD)
Staff e-mail: cs276-spr1314-staff@lists.stanford.edu

Course Description

Basic and advanced techniques for text-based information systems: efficient text indexing; Boolean and vector space retrieval models; evaluation and interface issues; Web search including crawling, link-based algorithms, and Web metadata; text/Web clustering, classification; text mining.

Policies Information (grading, etc.)

Prerequisites: CS 107, CS 109, CS 161. Ideally the whole CS Major Core.
This year's programming assignments will be in Java. The material here may be helpful to refresh your Java knowledge.

Online Resources

Coursera: CS 276 will be making use of online videos and quizzes as well as live class lectures, presentations, and labs. Visit Coursera to find video chunks and online exercises. We will be posting the required programming assignments and problem sets through Coursera, so signup soon (with your Stanford ID)!

Piazza: We strongly recommend that you post any questions about assignments, lectures or general course material on Piazza. It facilitates lively discussion among students, and allows others to benefit from the discussion too. Even if you think your question is specific to your implementation, you can use the 'private question' feature to address a question to the staff specifically. We will be using Piazza to post course announcements, so be sure to signup on Piazza soon too. Here is a quick introduction video.

Staff Email: If you really have a question about your situation in specific, that you do not think is not appropriate for the class forum (not even a private question), please email the staff mailing list at cs276-spr1314-staff@lists.stanford.edu.

Announcements:
If you are not registered in the course but wish to receive course announcements, you may subscribe to the guest mailing list cs276-spr1314-guests.

CS 276 Staff Information

Professors:
Chris Manning, Office Hours: Tuesday Gates 158 2:00 pm - 3:00 pm
Pandu Nayak, Office Hours: after class with prior arrangement

TAs: Thang Luong (Head TA), Alan Newman, Haoran Li, Wenxiao Du, Kyle Anderson, Jiayuan Ma

TA Office Hours

For a detailed schedule, check out the Google calendar using the link here.

Daytime Office Hours

Evening Code Sessions

For SCPD students, we will be available via Google hangout during office hours, links of which will be posted on the Piazza post here.


Syllabus

Details of the schedule, videos, slides and reading lists will be updated as the quarter progresses.

Week Date In Class Assignments Who Materials
Week 1 Tue 1 Apr Introduction to course: Discussion of issues in search plus Web 1: Web search, ads, SEO PN Coursera:
Boolean Retrieval (IIR Ch. 1)
Notes:
[powerpoint]
[PDF/6]
[PDF/1]
Readings:
IIR Ch. 1
MG 3.2
MIR 8.2
Shakespeare plays
Thu 3 Apr Class lab: Writing a merge algorithm for proximity queries using a positional index
Shared gdoc: http://bit.ly/HgCAdP
Intersection skeleton
Positional intersection skeleton

[answers requires SUNet ID]
Intersection answer
Positional intersection answer
PS1 Out
PA1 Out
CM Coursera:
Term Vocabulary and Postings Lists (IIR Ch. 2)
Notes:
[powerpoint]
[PDF/6]
[PDF/1]
Readings
IIR Ch. 2
MG Ch. 3.6 4.3
MIR 7.2
Porter's stemmer (MIR), Porter stemming algorithm (Official)
A skip list cookbook (Pugh 1990)
Fast phrase querying with combined indexes (Williams, Zobel, Bahle 2004)
Efficient phrase querying with an auxiliary index (Bahle, Williams, Zobel 2002)
Week 2 Tue 8 Apr Class lab: Algorithms for postings list compression
IndexCompression.java
Shared gdoc
Results gdoc spreadsheet
Jeff Dean's compression slides
Wikipedia: Golumb/Rice, Huffman, Elias gamma code, unary code
PN Coursera:
Index Compression (IIR Ch. 5)
Notes:
[powerpoint]
[PDF/6]
[PDF/1]
Readings:
IIR Ch. 5
MG 3.3, 3.4
Compression of inverted indexes for fast query evaluation (Scholer et al. 2002)
Inverted index compression using word-aligned binary codes (Anh and Moffat 2005)
Thu 10 Apr Guest speaker: Jeff Dean on the evolution of Google's search and retrieval system
Jeff Dean's slides
PS1 Due Jeff Coursera:
Index Construction (IIR Ch. 4)
Notes:
[powerpoint]
[PDF/6]
[PDF/1]
Readings:
IIR Ch. 4
Efficient single-pass index construction for text databases (Heinz and Zobel 2003) MapReduce: simplified data processing on large clusters (Dean and Ghemawat 2004)
Week 3 Tue 15 Apr Spelling correction
[powerpoint]
[PDF/6]
[PDF/1]
CM Coursera:
Dictionaries and Tolerant Retrieval (IIR Ch. 3)
Notes:
[powerpoint]
[PDF/6]
[PDF/1]
Readings:
How to write a spelling corrector (Peter Norvig)
IIR Ch. 3
MG 4.2
Techniques for automatically correcting words in text (Kukich 1992)
Finding approximate matches in large lexicons (Zobel and Dart 1995)
Efficient Generation and Ranking of Spelling Error Corrections (Tillenius)
Thu 17 Apr Probabilistic IR: Binary Independence Model
[powerpoint]
[PDF/6]
[PDF/1]
PA1 Due
PS2 Out
PA2 Out
CM Coursera:
Vector Space Model (IIR Ch. 6)
[powerpoint]
[PDF/6]
[PDF/1]
Notes:
[powerpoint]
[PDF/6]
[PDF/1]
Readings:
IIR Ch. 6
IIR Ch. 11
Week 4 Tue 22 Apr Computing Scores and BM25F
[powerpoint]
[PDF/6]
[PDF/1]
CM Coursera:
Computing Scores (IIR Ch. 7)
Notes:
[powerpoint]
[PDF/6]
[PDF/1]
Readings:
IIR Ch. 11
IIR Ch. 7
Thu 24 Apr Evaluation. Precision/recall, NDCG, using clickthrough rates
[powerpoint]
[PDF/6]
[PDF/1]
PS2 Due PN Notes:
[powerpoint]
[PDF/6]
[PDF/1]
Readings:
IIR Ch. 8
MG 4.5
MIR Ch. 3
Week 5 Tue 29 Apr Class lab: Mapreduce with Java
[Instructions on installing Hadoop and running mapreduces]
PN Notes:
[Jeff Dean's MapReduce Slides]
Readings:
IIR 4.4
Thu 1 May Systems issues in efficient retrieval and scoring
[powerpoint]
[PDF/6]
[PDF/1]
PA2 Due
PS3 Out
PA3 Out
PN Readings:
IIR Ch. 7
IIR Ch. 6
Week 6 Tue 6 May Lucene Tutorial
[powerpoint]
[PDF/6]
[PDF/1]
CM
Thu 8 May CLASSIFICATION 1 + 2: Naive Bayes, kNN, decision boundaries
[powerpoint]
[PDF/6]
[PDF/1]
PS3 Due CM Readings:
IIR Ch. 13 and 14
Machine learning in automated text categorization (Sebastiani 2002)
A re-examination of text categorization methods (Yang et al. 1999)
A Comparison of event models for naive Bayes text classification (McCallum et al. 1998)
Tom Mitchell. Machine Learning. McGraw-Hill, 1997.
Open Calais
Weka
Reuters-21578
Tackling the poor assumptions of Naive Bayes classifier (Rennie et al. 2003)
IIR Ch. 14
Machine learning in automated text categorization (Sebastiani 2002)
Tom Mitchell. Machine Learning.McGraw-Hill, 1997.
A re-examination of text categorization methods (Yang et al. 1999)

Evaluating and optimizing autonomous text classification systems (Lewis 1995)
Trevor Hastie, Robert Tibshirani, Jerome Friedman. Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, New York, 2001.
Week 7 Tue 13 May CLASSIFICATION 3. Support vector machines
[powerpoint]
[PDF/6]
[PDF/1]
CM Readings:
IIR Ch. 15
A tutorial on support vector machines for pattern recognition (Burges 1998)
Using SVMs for text categorization (Dumais1998)
Inductive learning algorithms and representations for text categorization (Dumaiset a. 1998)
A Re-examination of text categorization methods (Yang et al. 1999)
Text categorization based on regularized linear classification methods (Zhang et al. 2001)

Trevor Hastie, Robert Tibshirani, Jerome Friedman. Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag,New York, 2001.
Reuters-21578
Thorsten Joachims. Learning to Classify Text using Support Vector Machines. Kluwer, 2002.
A loss function analysis for classification methods in text categorization (Li et al. 2003)
Thu 15 May Guest speaker: Doug Cutting PA3 Due
PA4 Out
Doug Cutting
Week 8 Tue 20 May Web 2: Learning to rank.
[powerpoint]
[PDF/6]
[PDF/1]
CM Readings:
IIR 6.1.2-3, IIR 15.4
Discriminative models for information retrieval (Nallapati 2004)
Adapting ranking SVM to document retrieval (Cao et al. 2006)
A support vector method for optimizing average precision (Yue et al. 2007)
LETOR benchmark datasets
Thu 22 May CLUSTERING: k-means, HAC
[powerpoint]
[PDF/6]
[PDF/1]
PS4 Out PN Readings:
IIR Ch. 16, IIR 17.1-4
Week 9 Tue 27 May Web 3: Link analysis
[powerpoint]
[PDF/6]
[PDF/1]
PN Readings:
IIR Ch. 21
Ranking the web frontier (Eiron et al. 2004)
The WebGraph framework I: Compression techniques (Boldi et al. 2004)
Extrapolation methods for accelerating PageRank computations (Kamvar et al. 2003)
Searching the workplace web (Fagin et al. 2003
Thu 29 May Web 4: Crawling, near-dups
[powerpoint]
[PDF/6]
[PDF/1]
PS4 Due PN Notes:
[powerpoint]
[PDF/6]
[PDF/1]
Readings:
IIR Ch. 20
Mercator: A scalable, extensible web crawler (Heydon et al. 1999)
A standard for robot exclusion
Sun 1 June PA4 Due
Week 10 Tue 3 Jun Question Answering
[powerpoint]
[PDF/6]
[PDF/1]
CM
Thu 5 Jun No Class
Week 11 Monday 9 June Final Exam (12:15pm-3:15pm) Cubberley Auditorium
Directions

Required Textbook:

IIR = Introduction to Information Retrieval, by C. Manning, P. Raghavan, and H. Schütze. Cambridge University Press, 2008.

This book is available from the Stanford bookstore (or your favorite book purveyor). You can also download and print chapters at the book website. (We’d appreciate any reports of typos or of higher-level problems for the third printing. Thanks.)

Other Good IR Books:

MG = Managing Gigabytes, by I. Witten, A. Moffat, and T. Bell.
IRAH = Information Retrieval: Algorithms and Heuristics by D. Grossman and O. Frieder.
MIR = Modern Information Retrieval, by R. Baeza-Yates and B. Ribeiro-Neto.
FSNLP = Foundations of Statistical Natural Language Processing, by C. Manning and H. Schütze.
SE = Search Engines: Information Retrieval in Practice, by Bruce Croft, Donald Metzler and Trevor Strohman.
IRIE = Information Retrieval: Implementing and Evaluating Search Engines, by Stefan Buettcher, Charles L. A. Clarke and Gordon V. Cormack.

Lecturers:

CM = Chris Manning
PN = Pandu Nayak