CS 276 / LING 286: Information Retrieval and Web Search

Lecture: 3 units, Tu/Th 4:15-5:30 at NVIDIA Auditorium (available online through SCPD)
Staff e-mail: cs276-spr1415-staff@lists.stanford.edu

Course Description

Basic and advanced techniques for text-based information systems: efficient text indexing; Boolean and vector space retrieval models; evaluation and interface issues; Web search including crawling, link-based algorithms, and Web metadata; text/Web clustering, classification; text mining.

Policies Information (grading, etc.)

Prerequisites: CS 107, CS 109, CS 161. Ideally the whole CS Major Core.
This year's programming assignments will be in Java. The material here may be helpful to refresh your Java knowledge.

Online Resources

Coursera: CS 276 will be making use of online videos and quizzes as well as live class lectures, presentations, and labs. Visit Coursera to find video chunks and online exercises. We will be posting the required programming assignments and problem sets through Coursera, so signup soon (with your Stanford ID)!

Piazza: We strongly recommend that you post any questions about assignments, lectures or general course material on Piazza. It facilitates lively discussion among students, and allows others to benefit from the discussion too. Even if you think your question is specific to your implementation, you can use the 'private question' feature to address a question to the staff specifically. We will be using Piazza to post course announcements, so be sure to signup on Piazza soon too. Here is a quick introduction video.

Staff Email: If you really have a question about your situation in specific, that you do not think is not appropriate for the class forum (not even a private question), please email the staff mailing list at cs276-spr1415-staff@lists.stanford.edu.

Announcements:
If you are not registered in the course but wish to receive course announcements, you may subscribe to the guest mailing list cs276-spr1415-guests.

CS 276 Staff Information

Professors:
Chris Manning, Office Hours: Mondays Gates 248 2:00-3:00 pm Pandu Nayak, Office Hours: after class with prior arrangement

TAs: Angel Chang (Head TA), Amritha Raghunath, Ankur Bapna, Yuhao Zhang, Billy Jun, Jingrui Zhang

TA Office Hours

Daytime Office Hours

Evening Code Sessions

For SCPD students, we will be available via Google hangout during office hours, links of which will be posted on the Piazza post here.


Syllabus

Details of the schedule, videos, slides and reading lists will be updated as the quarter progresses.

Week Date In Class Assignments Who Materials
Week 1 Tue 31 March Introduction to course: Discussion of issues in search plus Web 1: Web search, ads, SEO PN Coursera:
Boolean Retrieval (IIR Ch. 1)
Notes:
[powerpoint]
[PDF/6]
[PDF/1]
Readings:
IIR Ch. 1
MG 3.2
MIR 8.2
Shakespeare plays
Thu 2 Apr Class lab: Writing a merge algorithm for proximity queries using a positional index
In class exercise
PS1 Out
PA1 Out
CM Coursera:
Term Vocabulary and Postings Lists (IIR Ch. 2)
Notes:
[powerpoint]
[PDF/6]
[PDF/1]
Readings
IIR Ch. 2
MG Ch. 3.6 4.3
MIR 7.2
Porter's stemmer (MIR), Porter stemming algorithm (Official)
A skip list cookbook (Pugh 1990)
Fast phrase querying with combined indexes (Williams, Zobel, Bahle 2004)
Efficient phrase querying with an auxiliary index (Bahle, Williams, Zobel 2002)
Week 2 Tue 7 Apr Class lab: Algorithms for postings list compression
In class exercise
CM Coursera:
Index Compression (IIR Ch. 5)
Notes:
[powerpoint]
[PDF/6]
[PDF/1]
Readings:
IIR Ch. 5
MG 3.3, 3.4
Compression of inverted indexes for fast query evaluation (Scholer et al. 2002)
Inverted index compression using word-aligned binary codes (Anh and Moffat 2005)
Thu 9 Apr Guest speaker: Jeff Dean on the evolution of Google's search and retrieval system
Jeff Dean's slides
PS1 Due Jeff Coursera:
Index Construction (IIR Ch. 4)
Notes:
[powerpoint]
[PDF/6]
[PDF/1]
Readings:
IIR Ch. 4
Efficient single-pass index construction for text databases (Heinz and Zobel 2003) MapReduce: simplified data processing on large clusters (Dean and Ghemawat 2004)
Week 3 Tue 14 Apr Spelling correction
[powerpoint]
[PDF/6]
[PDF/1]
CM Coursera:
Dictionaries and Tolerant Retrieval (IIR Ch. 3)
Notes:
[powerpoint]
[PDF/6]
[PDF/1]
Readings:
How to write a spelling corrector (Peter Norvig)
IIR Ch. 3
MG 4.2
Techniques for automatically correcting words in text (Kukich 1992)
Finding approximate matches in large lexicons (Zobel and Dart 1995)
Efficient Generation and Ranking of Spelling Error Corrections (Tillenius)
Thu 16 Apr Probabilistic IR: Binary Independence Model
[powerpoint]
[PDF/6]
[PDF/1]
PA1 Due
PS2 Out
PA2 Out
CM Coursera:
Vector Space Model (IIR Ch. 6)
[powerpoint]
[PDF/6]
[PDF/1]
Notes:
[powerpoint]
[PDF/6]
[PDF/1]
Readings:
IIR Ch. 6
IIR Ch. 11
Week 4 Tue 21 Apr Computing Scores and BM25F
[powerpoint]
[PDF/6]
[PDF/1]
CM Coursera:
Computing Scores (IIR Ch. 7)
Notes:
[powerpoint]
[PDF/6]
[PDF/1]
Readings:
IIR Ch. 11
IIR Ch. 7
Thu 23 Apr Class lab: Mapreduce with Java
[Instructions on installing Hadoop and running mapreduces]
In class exercise
PS2 Solutions
PS2 Due
PN Notes:
[Jeff Dean's MapReduce Slides]
Readings:
IIR 4.4
Week 5 Tue 28 Apr Recent evaluation, NDCG, using clickthrough; rate queries & results
[powerpoint]
[PDF/6]
[PDF/1]
PN Coursera:
Result Summaries (IIR Ch. 8)
Notes:
[powerpoint]
[PDF/6]
[PDF/1]
Readings:
IIR Ch. 8
MG 4.5
MIR Ch. 3
Thu 30 Apr Guest speaker: Hugh Williams
[powerpoint]
[PDF]
PA2 Due
PS3 Out
PA3 Out
Hugh
Week 6 Tue 5 May Systems issues in efficient retrieval and scoring
[powerpoint]
[PDF/6]
[PDF/1]
PN Readings:
IIR Ch. 7
IIR Ch. 6
Efficient Query Evaluation using a Two-Level Retrieval Process (Broder et al 2003)
Thu 7 May CLASSIFICATION 1 + 2: Naive Bayes, kNN, decision boundaries
[powerpoint]
[PDF/6]
[PDF/1]
PS3 Due PN Coursera:
Naive Bayes (IIR Ch. 13)
Readings:
IIR Ch. 13 and IIR Ch. 14
Machine learning in automated text categorization (Sebastiani 2002)
A re-examination of text categorization methods (Yang et al. 1999)
A Comparison of event models for naive Bayes text classification (McCallum et al. 1998)
Tom Mitchell. Machine Learning. McGraw-Hill, 1997.
Open Calais
Weka
Reuters-21578
Tackling the poor assumptions of Naive Bayes classifier (Rennie et al. 2003)
Machine learning in automated text categorization (Sebastiani 2002)
Tom Mitchell. Machine Learning.McGraw-Hill, 1997.
A re-examination of text categorization methods (Yang et al. 1999)

Evaluating and optimizing autonomous text classification systems (Lewis 1995)
Trevor Hastie, Robert Tibshirani, Jerome Friedman. Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, New York, 2001.
Week 7 Tue 12 May CLASSIFICATION 3. Support vector machines
[powerpoint]
[PDF/6]
[PDF/1]
CM Readings:
IIR Ch. 15
A tutorial on support vector machines for pattern recognition (Burges 1998)
Using SVMs for text categorization (Dumais1998)
Inductive learning algorithms and representations for text categorization (Dumaiset a. 1998)
A Re-examination of text categorization methods (Yang et al. 1999)
Text categorization based on regularized linear classification methods (Zhang et al. 2001)

Trevor Hastie, Robert Tibshirani, Jerome Friedman. Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag,New York, 2001.
Reuters-21578
Thorsten Joachims. Learning to Classify Text using Support Vector Machines. Kluwer, 2002.
A loss function analysis for classification methods in text categorization (Li et al. 2003)
Thu 14 May Web 2: Learning to rank.
[powerpoint]
[PDF/6]
[PDF/1]
PA3 Due
PA4 Out
CM Readings:
IIR 6.1.2-3, IIR 15.4
Discriminative models for information retrieval (Nallapati 2004)
Adapting ranking SVM to document retrieval (Cao et al. 2006)
A support vector method for optimizing average precision (Yue et al. 2007)
LETOR benchmark datasets
Week 8 Tue 19 May Distributed word representations for IR
[powerpoint]
[PDF/6]
[PDF/1]
CM Readings:
Distributed Representations of Words and Phrases and their Compositionality (Mikolov et al, 2013)
GloVe: Global Vectors for Word Representation (Pennington et all, 2014)
Thu 21 May Lucene Tutorial
[powerpoint]
[PDF/6]
[PDF/1]
PS4 Out PN
Week 9 Tue 26 May Web 3: Link analysis
[powerpoint]
[PDF/6]
[PDF/1]
PN Readings:
IIR Ch. 21
Ranking the web frontier (Eiron et al. 2004)
The WebGraph framework I: Compression techniques (Boldi et al. 2004)
Extrapolation methods for accelerating PageRank computations (Kamvar et al. 2003)
Searching the workplace web (Fagin et al. 2003
Thu 28 May Web 4: Crawling, near-dups
[powerpoint]
[PDF/6]
[PDF/1]
PS4 Solutions
PS4 Due
PA4 Due
PN Notes:
[powerpoint]
[PDF/6]
[PDF/1]
Readings:
IIR Ch. 20
Mercator: A scalable, extensible web crawler (Heydon et al. 1999)
A standard for robot exclusion
Week 10 Tue 2 Jun Question Answering
[powerpoint]
[PDF/6]
[PDF/1]
CM
Thu 4 Jun No Class
Week 11 Tue 9 Jun Final Exam
12:15-3:15pm Hewlett200
TBD
Directions

Required Textbook:

IIR = Introduction to Information Retrieval, by C. Manning, P. Raghavan, and H. Schütze. Cambridge University Press, 2008.

This book is available from the Stanford bookstore (or your favorite book purveyor). You can also download and print chapters at the book website. (We’d appreciate any reports of typos or of higher-level problems for the third printing. Thanks.)

Other Good IR Books:

MG = Managing Gigabytes, by I. Witten, A. Moffat, and T. Bell.
IRAH = Information Retrieval: Algorithms and Heuristics by D. Grossman and O. Frieder.
MIR = Modern Information Retrieval, by R. Baeza-Yates and B. Ribeiro-Neto.
FSNLP = Foundations of Statistical Natural Language Processing, by C. Manning and H. Schütze.
SE = Search Engines: Information Retrieval in Practice, by Bruce Croft, Donald Metzler and Trevor Strohman.
IRIE = Information Retrieval: Implementing and Evaluating Search Engines, by Stefan Buettcher, Charles L. A. Clarke and Gordon V. Cormack.

Lecturers:

CM = Chris Manning
PN = Pandu Nayak