CS 276A / LING 239I 
MG = Managing Gigabytes, by I. Witten, A. Moffat, and T. Bell.
MIR = Modern Information Retrieval, by R. BaezaYates and
B. RibeiroNeto.
FOA = Finding Out About, by R. Belew.
FSNLP = Foundations of Statistical Natural Language Processing,
by C. Manning and H. Schütze.
Information Retrieval: Algorithms and Heuristics by D. Grossman and
O. Frieder
These books all have useful information on topics that we cover and are recommended as references. None of them are required textbooks or cover all of the material in this course. MG is particularly good for technical IR in the first half of the course.
CM = Christopher Manning
PR = Prabhakar Raghavan
LE = Louis Eisenberg
DG = Daniel Gindikin
All lectures, exams and review sessions are in Gates B01.
Lectures are Tuesdays and Thursdays from 4:15 to 5:30.
Review sessions, when scheduled, are Friday from 3:15 to 4:05.
Details of the schedule, slides and reading lists will be updated as
the quarter progresses. Assignment dates are subject to change.
We will schedule review sessions for assignments, probably on Friday afternoons.
Date  Topics  Notes  Who  Readings  Assignments 

Tue 28 Sep  IR 1. Introduction to Information Retrieval. Inverted indices and boolean queries. Query optimization. The nature of unstructured and semistructured text. Course administrivia. 
[powerpoint] [PDF] 
CM 
MG Sec. 3.03.2; MIR Sec. 8.18.2 Shakespeare plays 

Thu 30 Sep  IR 2. Text encoding: tokenization, stemming, lemmatization, stop words, phrases. Further optimizing indices for query processing. Proximity and phrase queries. Positional indices. 
[powerpoint] [PDF] 
PR  MG Ch. 3.7, 4.24.3; MIR Ch. 7.2 (8.5.7) Porter's stemmer Porter from the author Lovins stemmer Fast Phrase Querying with Combined Indexes 

Tue 5 Oct  IR 3. Index compression: lexicon compression and postings lists compression. Gap encoding, gamma codes, Zipf's Law. Blocking. Extreme compression. 
[powerpoint] [PDF] 
PR  MG 3.3, 3.4 Compression of inverted indexes for fast query evaluation Explanation of Elias gamma codes (note that in our class unary codes are flipped from what's on this page, i.e. ones followed by a zero instead of zeroes followed by a one) 
PS1 out [doc] [PDF] 
Thu 7 Oct  IR 4. Query expansion: spelling correction and synonyms. Wildcard queries, permuterm indices, ngram indices. Edit distance, soundex, language detection. 
[powerpoint] [PDF] 
PR  MG Ch. 5; J. Zobel and P. Dart. Finding approximate matches in large lexicons. Software  practice and experience 25(3), March 1995. 

Fri 8 Oct  Review session for PS1 
[doc] [PDF] 
LE  
Tue 12 Oct  IR 5. Index construction. Postings size estimation, merge sort, dynamic indexing, positional indexes, ngram indexes, realworld issues. 
[powerpoint] [PDF] 
PR 
MG Ch. 4.44.6; MIR 2.5, 2.7.2 
PE1 out [doc] [PDF] 
Thu 14 Oct  IR 6. Parametric or fielded search. Document zones. The vector space retrieval model. tf.idf weighting. Scoring documents. 
[powerpoint] [PDF] 
PR  PS1 due  
Fri 15 Oct  Review session for PE1 
[doc] [PDF] 
LE  
Tue 19 Oct  IR 7. Vector space scoring. The cosine measure. Efficiency considerations. Nearest neighbor techniques, reduced dimensionality approximations, random projection. 
[powerpoint] [PDF] 
PR 
MG 4.5, MIR Ch. 3 

Thu 21 Oct  IR 8. Results summaries: static and dynamic. Evaluating search engines. User happiness, precision, recall, Fmeasure. Creating test collections: kappa measure, interjudge agreement. Relevance, approximate vector retrieval. 
[powerpoint] [PDF] 
CM 
Automatic
Word Sense Discrimination 
PE1 due 
Fri 22 Oct  Review session for midterm 
[doc] [PDF] 
LE  
Tue 26 Oct  Midterm to be held inclass 
[doc] [PDF] 

Thu 28 Oct  IR 9. Relevance feedback. Pseudo relevance feedback. Query expansion. Automatic thesaurus generation. Sensebased retrieval. Experimental results of performance effectiveness. 
[powerpoint] [PDF] 
CM 
MG Ch 4.7; MIR Ch 5.25.4 Learning Routing Queries in a Query Zone Concept Based Query Expansion New Retrieval Approaches Using SMART: TREC 4 Gerard Salton and Chris Buckley. Improving Retrieval Performance by Relevance Feedback. Journal of the American Society for Information Science, 41(4):288297, 1990. Jinxi Xu and W. Bruce Croft: Query Expansion Using Local and Global Document Analysis. SIGIR 1996: 411 

Tue 2 Nov  PROBABILITIES 1. Probabilistic models for text problems. Classical probabilistic IR. Language models. 
[powerpoint] [PDF] 
CM 
MIR 2.5.4, 2.8 C. J. van Rijsbergen, Information Retrieval, ch. 6 Probabilistic Models in Information Retrieval (Fuhr 92) E. Charniak, Bayesian Networks without Tears H. Turtle's dissertation 

Thu 4 Nov  PROBABILITIES 2. Introduction to text classification. Naive Bayes models. Spam filtering. 
[powerpoint] [PDF] 
CM 
Machine
Learning in Automated Text Categorization A Comparison of Event Models for Naive Bayes Text Classification Tom Mitchell. Machine Learning. McGrawHill, 1997. A Reexamination of Text Categorization Methods A Plan for Spam (Paul Graham). Better Bayesian Filtering (Paul Graham). 2003 Spam Conference proceedings 

Tue 9 Nov  PROBABILITIES 3. Probabilistic language models for IR. Bayesian nets for IR. 
[powerpoint] [PDF] 
CM 
J.M. Ponte and
W.B. Croft. 1998 A. Berger and J. Lafferty. 1999 Workshop on Language Modeling and Information Retrieval, CMU, 2001 The Lemur Toolkit for Language Modeling and Information Retrieval 
PS2 out [doc] [PDF] 
Thu 11 Nov  CLUSTERING 1. Introduction to the problem. Partitioning methods. kmeans clustering. Mixture of gaussians model. Clustering versus classification. 
[powerpoint] [PDF] 
PR 
Scatter/Gather Data Clustering Review Initialization of iterative refinement clusting algorithms Scaling Clustering Algorithms to Large Databases 

Fri 12 Nov  Review session for PS2 
[doc] [PDF] 
DG  
Tue 16 Nov  CLUSTERING 2. Hierarchical agglomerative clustering. Clustering terms using documents. Labelling clusters. Evaluating clustering. Textspecific issues. 
[powerpoint] [PDF] 
PR  FSNLP Ch. 13 SingleLink and CompleteLink Clustering  PE2 out [doc] [PDF] 
Thu 18 Nov  CLUSTERING 3. Reduced dimensionality/spectral methods. Latent semantic indexing (LSI). Applications to clustering and to information retrieval. 
[powerpoint] [PDF] 
PR 
FSNLP 15.4 Projections for Efficient Document Clustering Spectral Analysis of Data Latent semantic indexing Random projection theorem Faster random projection 
PS2 due 
Fri 19 Nov  Review session for PE2 
[doc] [PDF] 
LE, DG  
Tue 23 Nov  CLASSIFICATION 1. Vector space classification using hyperplanes; centroids; k Nearest Neighbors. 
[powerpoint] [PDF] 
CM  FSNLP Ch. 16 A Comparative Study on Feature Selection in Text Categorization Evaluating and Optimizing Autonomous Text Classification Systems Dumais, Platt, Heckerman, and Sahami. 1998. Inductive learning algorithms and representations for text categorization. CIKM 1998. Trevor Hastie, Robert Tibshirani, Jerome Friedman. Elements of Statistical Learning: Data Mining, Inference, and Prediction. SpringerVerlag, New York, 2001. Reuters dataset 

Thu 25 Nov  Thanksgiving (no class)  
Tue 30 Nov  CLASSIFICATION 2. Support vector machine classifiers. Kernel functions. 
[powerpoint] [PDF] 
CM  FSNLP Ch. 16 A Tutorial on Support Vector Machines for Pattern Recognition Conceptual Information Retrieval Using RUBRIC Dumais. Using SVMs for Text Categorization. IEEE Intelligent Systems 13(4) JulAug 1998. Text Categorization Based on Regularized Linear Classification Methods  PE2 due 
Thu 2 Dec  CLASSIFICATION 3. Text classification. Exploiting textspecific features. Feature selection. Evaluation of classification. Micro and macroaveraging. Comparative results. 
[powerpoint] [PDF] 
CM 
Machine
Learning in Automated Text Categorization A Reexamination of Text Categorization Methods A Comparative Study on Feature Selection in Text Categorization 

Fri 3 Dec  Review session for final 
[doc] [PDF] 
LE, DG  
Thu 9 Dec  Final Exam 12:15pm3:15pm 