Computer Systems Laboratory Colloquium

4:15PM, Wednesday, February 03, 1999
NEC Auditorium, Gates Computer Science Building B03

Less is more
Eliminating index terms from subordinate clauses

Simon Corston-Oliver
Microsoft Research

About the talk:

Conventional approaches to using Natural Language Processing (NLP) in Information Retrieval (IR) have focused on creating additional, complex linguistic terms for indexing and matching, with mixed results. We take the opposite approach: we perform a linguistic analysis of documents during indexing and eliminate index terms that occur only in subordinate clauses. Index size is reduced by approximately 30% without adversely affecting precision or recall. These results hold for two corpora: a sample of the World Wide Web and an electronic encyclopedia. I give a brief overview of the theoretical basis for the approach and of the application of the Microsoft English Grammar, then quantify the impact on index size, precision, and recall when index terms are eliminated from various linguistic contexts.
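
The core idea can be illustrated with a minimal sketch. This is not the speaker's implementation. It assumes that each document has already been split into clauses labeled "main" or "sub" (in the talk, that analysis comes from the Microsoft English Grammar, which is not reproduced here), that a term is dropped for a document when it appears in none of that document's main clauses, and that terms are found by simple whitespace splitting. The names build_index, posting_count, and DOCS are hypothetical and exist only for illustration.

```python
from collections import defaultdict


def build_index(docs, drop_subordinate_only=True):
    """Build an inverted index mapping term -> set of document ids.

    docs maps a document id to a list of (clause_type, text) pairs,
    where clause_type is "main" or "sub". The clause labels are assumed
    to come from an earlier parsing step that is not shown here.
    """
    index = defaultdict(set)
    for doc_id, clauses in docs.items():
        main_terms, sub_terms = set(), set()
        for clause_type, text in clauses:
            terms = set(text.lower().split())  # naive tokenization (assumption)
            if clause_type == "main":
                main_terms |= terms
            else:
                sub_terms |= terms
        # A term seen in any main clause is kept; a term seen only in
        # subordinate clauses is dropped when filtering is enabled.
        kept = main_terms if drop_subordinate_only else main_terms | sub_terms
        for term in kept:
            index[term].add(doc_id)
    return dict(index)


def posting_count(index):
    """Total number of (term, document) postings, a rough proxy for index size."""
    return sum(len(doc_ids) for doc_ids in index.values())


# Toy, hand-labeled documents (invented for illustration only).
DOCS = {
    "d1": [("main", "the committee approved the budget"),
           ("sub", "although several members objected")],
    "d2": [("main", "members objected to the budget"),
           ("sub", "because the committee met late")],
}

full = build_index(DOCS, drop_subordinate_only=False)
reduced = build_index(DOCS)

if __name__ == "__main__":
    print("postings in full index:   ", posting_count(full))
    print("postings in reduced index:", posting_count(reduced))
```

In this toy data, "committee" is still indexed for d1, where it occurs in a main clause, but not for d2, where it occurs only in a subordinate clause. The size reduction on such a tiny example says nothing about the roughly 30% figure reported for the real corpora.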

About the speaker:

Simon Corston-Oliver received his Ph.D. from the University of California at Santa Barbara in 1998. His research interests center on the issue of "aboutness": modeling the discourse structure of written texts, measuring how entities are introduced and tracked, and looking for structural regularities that correspond to our sense of a document's global topic. He has a web page at http://research.microsoft.com/~simonco.

Contact information:

Simon Corston-Oliver
One Microsoft Way
Redmond
WA 98052
1-425-703-7371
1-425-936-7329
simonco@microsoft.com