Stanford University

CS 276A / LING 239I
Text Retrieval and Mining
Autumn 2004

Course Syllabus


MG = Managing Gigabytes, by I. Witten, A. Moffat, and T. Bell.
MIR = Modern Information Retrieval, by R. Baeza-Yates and B. Ribeiro-Neto.
FOA = Finding Out About, by R. Belew.
FSNLP = Foundations of Statistical Natural Language Processing, by C. Manning and H. Schütze.
Information Retrieval: Algorithms and Heuristics by D. Grossman and O. Frieder

These books all have useful information on topics that we cover and are recommended as references. None of them are required textbooks or cover all of the material in this course. MG is particularly good for technical IR in the first half of the course.


CM = Christopher Manning
PR = Prabhakar Raghavan
LE = Louis Eisenberg
DG = Daniel Gindikin

Locations and times:

All lectures, exams and review sessions are in Gates B01.
Lectures are Tuesdays and Thursdays from 4:15 to 5:30.
Review sessions, when scheduled, are Friday from 3:15 to 4:05.

Tentative Schedule:

Details of the schedule, slides and reading lists will be updated as the quarter progresses. Assignment dates are subject to change.
We will schedule review sessions for assignments, probably on Friday afternoons.

Date Topics Notes Who Readings Assignments
Tue 28 Sep IR 1. Introduction to Information Retrieval. Inverted indices and boolean queries. Query optimization. The nature of unstructured and semi-structured text. Course administrivia. [powerpoint]
CM MG Sec. 3.0-3.2; MIR Sec. 8.1-8.2
Shakespeare plays
Thu 30 Sep IR 2. Text encoding: tokenization, stemming, lemmatization, stop words, phrases. Further optimizing indices for query processing. Proximity and phrase queries. Positional indices. [powerpoint]
PR MG Ch. 3.7, 4.2-4.3; MIR Ch. 7.2 (8.5.7)
Porter's stemmer
Porter from the author
Lovins stemmer
Fast Phrase Querying with Combined Indexes
Tue 5 Oct IR 3. Index compression: lexicon compression and postings lists compression. Gap encoding, gamma codes, Zipf's Law. Blocking. Extreme compression. [powerpoint]
PR MG 3.3, 3.4
Compression of inverted indexes for fast query evaluation
Explanation of Elias gamma codes (note that in our class unary codes are flipped from what's on this page, i.e. ones followed by a zero instead of zeroes followed by a one)
PS1 out
Thu 7 Oct IR 4. Query expansion: spelling correction and synonyms. Wild-card queries, permuterm indices, n-gram indices. Edit distance, soundex, language detection. [powerpoint]
PR MG Ch. 5;
J. Zobel and P. Dart. Finding approximate matches in large lexicons. Software - practice and experience 25(3), March 1995.
Fri 8 Oct Review session for PS1 [doc]
Tue 12 Oct IR 5. Index construction. Postings size estimation, merge sort, dynamic indexing, positional indexes, n-gram indexes, real-world issues. [powerpoint]
PR MG Ch. 4.4-4.6; MIR 2.5, 2.7.2
PE1 out
Thu 14 Oct IR 6. Parametric or fielded search. Document zones. The vector space retrieval model. tf.idf weighting. Scoring documents. [powerpoint]
PR   PS1 due
Fri 15 Oct Review session for PE1 [doc]
Tue 19 Oct IR 7. Vector space scoring. The cosine measure. Efficiency considerations. Nearest neighbor techniques, reduced dimensionality approximations, random projection. [powerpoint]
PR MG 4.5, MIR Ch. 3
Thu 21 Oct IR 8. Results summaries: static and dynamic. Evaluating search engines. User happiness, precision, recall, F-measure. Creating test collections: kappa measure, interjudge agreement. Relevance, approximate vector retrieval. [powerpoint]
CM Automatic Word Sense Discrimination
PE1 due
Fri 22 Oct Review session for midterm [doc]
Tue 26 Oct Midterm to be held in-class [doc]
Thu 28 Oct IR 9. Relevance feedback. Pseudo relevance feedback. Query expansion. Automatic thesaurus generation. Sense-based retrieval. Experimental results of performance effectiveness. [powerpoint]
CM MG Ch 4.7; MIR Ch 5.2-5.4
Learning Routing Queries in a Query Zone
Concept Based Query Expansion
New Retrieval Approaches Using SMART: TREC 4
Gerard Salton and Chris Buckley. Improving Retrieval Performance by Relevance Feedback. Journal of the American Society for Information Science, 41(4):288-297, 1990.
Jinxi Xu and W. Bruce Croft: Query Expansion Using Local and Global Document Analysis. SIGIR 1996: 4-11
Tue 2 Nov PROBABILITIES 1. Probabilistic models for text problems. Classical probabilistic IR. Language models. [powerpoint]
CM MIR 2.5.4, 2.8
C. J. van Rijsbergen, Information Retrieval, ch. 6
Probabilistic Models in Information Retrieval (Fuhr 92)
E. Charniak, Bayesian Networks without Tears
H. Turtle's dissertation
Thu 4 Nov PROBABILITIES 2. Introduction to text classification. Naive Bayes models. Spam filtering. [powerpoint]
CM Machine Learning in Automated Text Categorization
A Comparison of Event Models for Naive Bayes Text Classification
Tom Mitchell. Machine Learning. McGraw-Hill, 1997.
A Re-examination of Text Categorization Methods
A Plan for Spam (Paul Graham).
Better Bayesian Filtering (Paul Graham).
2003 Spam Conference proceedings
Tue 9 Nov PROBABILITIES 3. Probabilistic language models for IR. Bayesian nets for IR. [powerpoint]
CM J.M. Ponte and W.B. Croft. 1998
A. Berger and J. Lafferty. 1999
Workshop on Language Modeling and Information Retrieval, CMU, 2001
The Lemur Toolkit for Language Modeling and Information Retrieval
PS2 out
Thu 11 Nov CLUSTERING 1. Introduction to the problem. Partitioning methods. k-means clustering. Mixture of gaussians model. Clustering versus classification. [powerpoint]
PR Scatter/Gather
Data Clustering Review
Initialization of iterative refinement clusting algorithms
Scaling Clustering Algorithms to Large Databases
Fri 12 Nov Review session for PS2 [doc]
Tue 16 Nov CLUSTERING 2. Hierarchical agglomerative clustering. Clustering terms using documents. Labelling clusters. Evaluating clustering. Text-specific issues. [powerpoint]
PR FSNLP Ch. 13 Single-Link and Complete-Link Clustering PE2 out
Thu 18 Nov CLUSTERING 3. Reduced dimensionality/spectral methods. Latent semantic indexing (LSI). Applications to clustering and to information retrieval. [powerpoint]
Projections for Efficient Document Clustering
Spectral Analysis of Data
Latent semantic indexing
Random projection theorem
Faster random projection
PS2 due
Fri 19 Nov Review session for PE2 [doc]
Tue 23 Nov CLASSIFICATION 1. Vector space classification using hyperplanes; centroids; k Nearest Neighbors. [powerpoint]
A Comparative Study on Feature Selection in Text Categorization
Evaluating and Optimizing Autonomous Text Classification Systems
Dumais, Platt, Heckerman, and Sahami. 1998. Inductive learning algorithms and representations for text categorization. CIKM 1998.
Trevor Hastie, Robert Tibshirani, Jerome Friedman. Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, New York, 2001. Reuters dataset
Thu 25 Nov Thanksgiving (no class)

Tue 30 Nov CLASSIFICATION 2. Support vector machine classifiers. Kernel functions. [powerpoint]
A Tutorial on Support Vector Machines for Pattern Recognition
Conceptual Information Retrieval Using RUBRIC
Dumais. Using SVMs for Text Categorization. IEEE Intelligent Systems 13(4) Jul-Aug 1998.
Text Categorization Based on Regularized Linear Classification Methods
PE2 due
Thu 2 Dec CLASSIFICATION 3. Text classification. Exploiting text-specific features. Feature selection. Evaluation of classification. Micro- and macro-averaging. Comparative results. [powerpoint]
CM Machine Learning in Automated Text Categorization
A Re-examination of Text Categorization Methods
A Comparative Study on Feature Selection in Text Categorization
Fri 3 Dec Review session for final [doc]
Thu 9 Dec Final Exam 12:15pm-3:15pm

Back to the CS276A home page