CS276A / LING 239I
Note: this is the Fall 2004 course website; the current Fall 2005 CS276 website is at: http://cs276.stanford.edu
Lectures are also available online and on television through SCPD/SITN.
Text information retrieval systems; efficient text indexing; Boolean, vector space, and probabilistic retrieval models; ranking and rank aggregation; evaluating IR systems. Text clustering and classification methods: latent semantic indexing, taxonomy induction, cluster labeling; classification algorithms and their evaluation; text filtering and routing.
A note on structure: This year, we're teaching a two-quarter sequence (CS276A/B) on information retrieval, text, and web page mining, somewhat as in 2002-03; in 2003-04, there was a compressed one-quarter course (CS276). The organization this year is a little different, however: the first course will focus on information retrieval and on the text mining problems of text clustering and classification. This course will have homeworks, practical exercises, and exams, but no large project. The second course will focus on areas like the web and XML, and will be a large project course.
CS276A has no official textbook (no single book gives good coverage of all and only the topics we'll discuss), but the books listed remain good references. Managing Gigabytes is particularly good for technical IR in the first part of the course, but doesn't cover the topics in the second half.
CS 103B and CS 107, and any one of CS 121, CS 145, or CS 161, or equivalent background.
Programming experience will be necessary for the two practical exercises.
Problem set #1
Practical exercise #1
Problem set #2
Practical exercise #2