CS276B
Text Information Retrieval, Mining, and Exploitation
Winter 2003

Christopher Manning, Prabhakar Raghavan, and Hinrich Schütze

Project Information

This page contains information about the design of the class project. Please direct all questions about this information to the TA for the class, Teg Grenager.

Project Overview

The inspiration for this quarter's class project comes particularly from the CiteSeer Scientific Literature Library at the NEC Research Institute. CiteSeer is a publicly available digital library which indexes the academic research papers that are available on the World Wide Web. However, there are also other interesting online academic literature repositories, such as the local Highwire Press, an offshoot of the Stanford libraries. Earlier predecessors which implemented similar kinds of functionality were Andrew Ng's ML Papers, and Cora.

While CiteSeer is a very useful research tool, we believe that it leaves a lot of opportunities for improvement, and one goal of the class project is to collectively produce a system which performs better than CiteSeer, in at least some respects. For lack of a better name, we will start off by calling our new and improved system CiteUnseen. We're looking for suggestions for a better name from you!

The other, and more important, goal of this project is that this problem provides an application wherein one can explore most of the important technologies discussed in the course: text clustering, classification, information extraction, link-based analysis, collaborative filtering, and various forms of text mining. We hope that the project can be a good instance of problem-based learning: you will get to acquire skills and see their importance in a real context. We will attempt to provide appropriate resources and pointers about what you should be doing, but an important part of this is that you should ask questions if there are things that you feel that you need to know more about. We can hopefully at least point you in useful directions.

Because the scope of the project is relatively large, we do not expect small groups of students to implement their own version of CiteUnseen. Instead, we have developed APIs and data structures that (hopefully) will allow individual teams to work on separate subproblems, while enforcing overall interoperability. Each small team will have the opportunity to work on more than one project over the course of the quarter, and will do a combination of vital infrastructure and research into a problem of interest. We're going to write the system in Java. We've tried to outline components and database structures for the system below. However, we're not perfect (and haven't actually built a complete system before the class began, though we did play with components), and we may well have missed some things. So, if, when you start to build things, some things seem missing, or something seems wrong in our suggested organization or approach, do let us know. You might well be right.

When complete, CiteUnseen should provide users with the following functionality (and perhaps some other good ideas we come up with):

Search papers by title, author, journal, date, or content
Find citing papers
Find cited papers
Get a canonical bibliographic citation for a paper (e.g., as a BibTeX entry)
Find similar papers
Rank papers by reputability (similar to Google’s PageRank)
Rank researchers or conferences/journals by impact.

CiteSeer Successess and Shortcomings

CiteSeer has been a very successful and widely used tool that is valued by academic researchers. Apparently crucial to its success has been the fact that it finds and indexes content (like a web search engine crawler), rather than requiring researchers to deposit materials in an institutional or other online paper server. Beyond that, it offers facilities such as locating citations of papers and measurements of the impact of authors which were previously only available through laborious library research (using traditional citation services like Science Citation Index). On the other hand, there are many shortcomings and opportunities to improve on the kind of service it provides. Problems include (in no particular order):

CiteSeer does not accurately match citations to corresponding papers, so it maintains separate citations and documents databases, which are searchable separately. This gives an interface which is somewhat awkward and confusing.
CiteSeer does not accurately parse the individual fields within citations (e.g., author, title, year, journal, etc.), so it has difficulty identifying duplicate citations as being citations of the same document -- the database often has duplicate versions of the same paper with variant citation details. It is also unable to give a complete canonical version of a reference as structured bibliographic data (such as used by programs like BibTeX).
CiteSeer does only limited link analysis of the citations of its papers. It could be extended in various ways such as ranking conferences.
The Boolean text search model used is quite ineffective for looking up papers by author (one of the most common uses), because names are often abbreviated to initials or omitted. The result is that people have to write awkward Boolean queries (m jordan or michael jordan or m i jordan or michael i jordan, which even when done carefully cause you to lose on either precision (there are several m jordan's or recall (if you omit this form).
They don't appear to get bibliographic reference information from HTML files, although that information would commonly be more easily parsable and more detailed than the information contained in the papers.
Their coverage is basically only computer science papers (and other things that leak in, due to appearing on the same pages as computer science papers). If one aimed to cover a broader research space (and perhaps even if one didn't), it would be useful to be able to navigate the repository by subject areas.
etc.

Project Structure

The project will be divided into two major parts, corresponding to the first and the second half of the course. In the first part we will focus on developing the core infrastructure required to have a functioning system. We would like people to work in pairs on subprojects relevant to this goal. This part will be divided into two stages: 1A and 1B.

In stage 1A we will assign a specific system component to each group of students (we will try to take student preferences into account, but cannot guarantee that all students will get their most preferred choice). Students will have approximately 2 weeks to complete the projects in stage 1A. After all the groups submit their code for 1A, we will attempt to run the system as a whole, noting where it is broken or has performance problems.

In stage 1B we will assign new projects to the student groups, with the objective of adding functionality that was missing from 1A, as well as fixing problems and performance issues of 1A. There will be a week and a half to complete projects for 1B. At the conclusion of 1B we hope to have a functioning system that implements the main components of a paper and citation indexing and retrieval system. We would like to work with you to help achieve it. If there are things that you don't know how to do, please fell free to contact the TA and professors for any ideas they may have.

In part 2 of the project, you will have a chance to pick your own topic for study and development. The topic should be something that fits into and improves or extends upon the basic citation index system that was built in part 1. You will have five weeks to complete part 2, and at the end will present their projects (perhaps to a celebrity audience!). This is your opportunity to do a more detailed piece of research, which should be focussed on research results and performance analysis, rather than just getting things working. For example, in stage 1, we may have used fairly crude regular expression matching information extraction techniques to parse citations. This would be an opportunity to investigate alternative approaches to information extraction, and to find a better performing method. You will need to submit a write-up describing your experiments and findings, as a research paper.

We do want projects in part 2 be interoperable with the existing citation index system, and to facilitate this, we will have a checkpoint for integrating an early version of your new work, so that we have some time to evaluate and solve any interoperability concerns. We will also be pleased if people implement any miscellaneous bug fixes or performance enhancements to the general system that come up in the course of their project. We hope that at the end of the course we will have a working system that includes innovative new ideas for paper and citation finding, browsing, etc.

Throughout the project, we expect students to work in pairs (if there is an odd number of students, we will make accomodations). Students may change their groups between part 1 and part 2 of the project so as to find other partners who share their research interest. However, you will need to negotiate with other groups to come up with some rearrangement of people so one person isn't left stranded and unhappy.

In part 1, some projects require more work in IR/IE/NLP, and others require more work in systems and software engineering. Because both types of effort are critical to the success of this project, we will value equally these two types of contributions. For part 2, your project should definitely aim to pick up on some of the IR/IE/NLP topics discussed in the course.

General references

You can find papers about CiteSeer at: Steve Lawrence's NEC publications page
You can find papers on Cora, a similar effort at: Kamal Nigam's page.

Data Flow and Project 1A Responsibilities

The CiteUnseen application begins with virtually no internal data, and gradually builds up data structures that allow it to provide the functionality described above. In order to make the subproblems modular, we have designed a workflow of several stages, with an associated sequence of data structures, that we want programs to follow. The data structures have been designed to store more data than strictly necessary, to allow some flexibility in the actual algorithms used in the implementation. We define the workflow as follows:

Objective	Project(s)	Input Data	Output Data	Issues/Challenges	Team
A. Find pages containing links to academic papers (hub pages)	Build a (robust, scalable, good citizen) web crawler Enable "focused crawling" by estimating the relevance of unexplored links.	A simple data file of seed URLs/queries	Hub HTML files on disk Records in the `PageInstance` table	Maximize links explored given resource constraints Accurately estimate relevance of links Use the Google API and other tricks to find good start pages Get good URLs/search terms from subsequent phases	Omar Seyalo: seyalo@stanford.edu Steve Branson: sbranson@stanford.edu
B. Extract links from hub pages, download and classify them	Extract link and link context from hub pages Download PS/PDF/DOC/HTML of paper and convert to a standard marked-up format that preserves important formatting *Classify papers as academic or not using classifier (perhaps learned)	Hub HTML files on disk Records in the `PageInstance` table Records in the `PaperInstance` table	Records in the `PaperInstance` table Records in the `CitationInstance` table PS/PDF/DOC of papers on disk	Finding citation boundaries in the hub page Understanding PS/PDF/DOC formats Preserving the right level of formatting	Anton Ushakov: antonu@stanford.edu
C. Parse paper for coarse grained fields	Extract from paper coarse-grained fields such as title, authors, abstract, citations Separate citations section into individual citations	Records in the `PaperInstance` table	Records in the `PaperInstance` table Records in the `CitationInstance` table	Finding the right models for different field extraction tasks	Yang Huang: huangy@stanford.edu Fang Wei: zwei@stanford.edu
D. Process citations	Extract from citations the tag, author, date, title, journal, etc. Extract context for each citation from the paper body text	Records in the `CitationInstance` table Marked-up text version of papers on disk	Records in the `CitationInstance` table Records in the `CitationContextInstance` table	Getting them correct! Links to context in paper important	Haoyi Want: haoyiw@stanford.edu Zhen Yin: zhenyin@stanford.edu
E. Normalize and remove duplications, creating final data structures	* Perform author, journal, and institution normalization (map variant forms to canonical form) * Normalize citations to produce unique set of citations * Map citations to the papers to which they refer	Records in the `CitationInstance` and `PaperInstance` tables	Records in the `Author`, `Journal`, `Paper`, and `Citation` tables	Developing normalization techniques for this domain	Joseph Smarr: jsmarr@stanford.edu Tim Grow: grow@stanford.edu
F. Build front end	Build inverted index of citations and papers (using Lucene) Build user interface to Lucene and SQL database	Records in the `Author`, `Journal`, `Paper`, and `Citation` tables	Inverted index	Understanding Lucene Optimizing information retrieval system for needed functionality (e.g., searching withing fields) Working with J2EE, including servlets and jsp	Qi Su: qisu@stanford.edu Steve Ngai: sngai@stanford.edu

Data Structures

Unfortunately, it seems to be necessary to use three different types of data structures, all of them residing on disk: files, relational database tables, and an inverted index. We discuss each in turn below:

Files

These should be pretty self-explanatory. We choose to store both the postscript and the text file of the paper so that we can redo the conversion if we obtain better algorithms. All files are stored in a formal directory structure so that they can be found deterministically.

Web Pages, including Hub Pages (HTML)

These live in the base directory /afs/ir/class/cs276b/data/webPages. A file with name ABCDEFGH lives at path AB/CD/EF/GH/ABCDEFGH. Some of these web pages are also hub pages, and this is indicated not by the file names but by the records in the database.

Raw Papers (PS/PDF/DOC)

These live in the base directory /afs/ir/class/cs276b/data/rawPapers. A file with name ABCDEFGH lives at path AB/CD/EF/GH/ABCDEFGH.

Text Papers (text)

These live in the base directory /afs/ir/class/cs276b/data/textPapers. A file with name ABCDEFGH lives at path AB/CD/EF/GH/ABCDEFGH.

Relational Database Tables

The first five tables are used to store the raw data taken from the documents themselves, before removing duplicates.

PageInstance(id, url, filename, status, score, isHub)

id is a unique ID for this PageInstance
url is the URL where the page was found
filename is the local name of the file on disk
status = 0 if the page is not downloaded, 1 if the page is in the process of being downloaded, 2 if the page is downloaded but not classified, -2 if the page could not be downloaded, 3 if the page is classified but not processed, -3 if the page could not be classified, 4 if the page is fully processed, -4 if the page could not be processed.
score is the estimated relevance of the link (used by the focused crawler)
isHub is an integer representing the hub status of the PageInstance: 0 is not a hub, 1 is not a hub, without pdf/ps links, 2 is a hub with pdf/ps links, 3 is not a hub (hand-classified) and 4 is a hub (hand-classified).

PaperInstance(id, url, rawFilename, textFilename, status, author, title, abstract, citations, selfCitation, citationInstanceID, authorBegin, authorEnd, titleBegin, titleEnd, abstractBegin, abstractEnd, citationsBegin, citationsEnd, paperID)

id is a unique ID for this PaperInstance
url is the URL where this PaperInstance was found
rawFilename is the local filename where the original version of this paper is stored
textFilename is the local filename where the text version of this paper is stored
status = 0 if the paper is not downloaded, 1 if the paper is in the process of being downloaded, 2 if the paper is downloaded but not converted, -2 if the paper could not be downloaded, 3 if the paper is converted to text but not extracted, -3 if the paper could not be converted, 4 if the paper is extracted, -4 if the paper could not be extracted.
author is the extracted author field from this PaperInstance
title is the extracted title field from this PaperInstance
citations is the extracted citations section
selfCitation is the extracted field of citation information about this paper (if it exists)
citationInstanceID is the id of the citationInstance that contains the extracted information from the self citation (if it exists)
authorBegin is the index of the char in the text version of the paper where the author field begins
authorEnd is the index of the char in the text version of the paper where the author field ends
titleBegin
titleEnd
abstractBegin
abstractEnd
citationsBegin
citationsEnd
paperID is the id of the Paper record that this PaperInstance was resolved to
filetype is the type of text file generated: 0 is ps2ascii and 1 is pstotext for now

CitationInstance(id, fromPaperInstanceID, fromHubInstanceID, toPaperInstanceID, citationText, citationTag, author, title, date, publication, volume, pages, editor, publisher, citationID, paperID, status)

id is a unique ID for this PageInstance
fromPaperInstanceID is the PaperInstance that this citationInstance comes from, if it comes from a paper
fromHubInstanceID is the HubInstance that this citationInstance comes from, if it comes from a hub
toPaperInstanceID is the PaperInstance that this citationInstance points to, if known
citationText is the complete text of this citation
citationTag is the extracted tag of this citation, the representation used to refer to the citation from the body text
author is the extracted author field of this citation (containing possibly more than one author)
title is the extracted title of this citation
date is the extracted date
publication is the extracted publication field of this citation (may be a journal, conference proceedings, book, etc.)
volume is the extracted volume field
pages is the extracted pages field
editor is the extracted editor field
publisher is the extracted publisher field
citationID is the id of the Citation record that this CitationInstance was resolved to
paperID is the id of the Paper record that this CitationInstance was resolved to
publicationID is the id of the Publication that this CitationInstance was resolved to
status is the status of this PaperInstance: 0 if created but not processed, 1 if processed.

CitationContextInstance(citationInstanceID, paperInstanceID, contextBegin, contextEnd, context)

citationInstanceID is the citation that this context refers to
paperInstanceID is the paper that this context was found in
contextBegin is the index of the first character of the citation context
contextEnd is the index of the last character of the citation context
context is the extracted context of the citation reference

AuthorInstance(id, authorText, first, middle, last, suffix, citationInstanceID, paperInstanceID, authorID)

id is the unique ID for this AuthorInstance. It stands for one author occurring in one paper or citation
authorText is the complete text of this AuthorInstance (the full, unparsed name)
first is the extracted first name of this author
middle is the extracted middle name of this author
last is the extracted last name of this author
suffix is the extracted suffix of this author
citationInstanceID is the citationInstance that this author was found in, if it was found in a citation
paperInstanceID is the paperInstance that this AuthorInstance was found in, if it was found in a paper
authorID is the id of the Author that this AuthorInstance was resolved to

The latter seven tables are the final data structure from which duplicates have been removed and which is used to respond to user queries.

Paper(id, citationInstanceID, paperInstanceID)

id is the unique ID for this paper
citationInstanceID is the ID of the canonical citationInstance for this paper
paperInstanceID is the ID of the canonical paperInstance for this paper (if this is NULL, then the paper is NOT in the database)

Author(id, first, middle, last, suffix, email, affiliation)

id is the unique ID for this author
first is the canonical first name for this author
middle is the canonical middle name for this author
last is the canonical last name for this author
suffix
email
affiliation
canonicalName

Authorship(paperID, authorID)

paperID is the ID of the paper that was written
authorID is the ID of the author who wrote it

Name(id, altName, isCanonical)

id is the unique ID of an author first/middle name cluster
altName is a variation of the name in that cluster
isCanonical is true if this altName is the canonical form of this name

Publication(id, canonicalName)

id is the unique ID of this publication (may include journal, conference proceedings, book, etc.)
canonicalName is the canonical name of this publication

PublicationName(publicationID, altName)

publicationID is the ID of this publication
altName is an alternative name for this publication

Citation(fromPaperID, toPaperID, citationInstanceID)

fromPaperID is the CITING paper
toPaperID is the CITED paper
citationInstanceID is the id of the CitationInstance that is the canonical representation of this citation; this CitationInstance should have as its fromPaperInstanceID the same PaperInstance that is the canonical representation of the Paper given by fromPaperID in this record

We will use MySQL as our database management system. It manages the tables, and queries over the tables.

Inverted Index

We will build an inverted index over the text versions of the papers, where the papers have been divided up into fields including author, title, abstract, introduction, and references. We will use Lucene, an open-source Java-based indexing system developed by Apache to build and query the inverted index.

Part 2 Project Ideas

Part 2 of the project is your opportunity to do a more detailed piece of research, which should be focussed on novel research, interesting results and performance analysis, rather than just getting things working. For example, in stage 1, we may have used fairly crude regular expression matching information extraction techniques to parse citations. This would be an opportunity to investigate alternative approaches to information extraction, and to find a better performing method. For part 2, results of your experiments should be submitted as a (maximum 8 page) research paper, as in a conference proceedings. Any topic related to the course and the project is fine. The topic should be something that fits into and improves or extends upon the basic citation index system that was built in part 1.

There are many possibilities for the second part of the project. We list a few here, but our experience shows that students can often be much more creative than we are.

Investigate machine learning methods for better information extraction on the citations
Design an algorithm that exploits the fact that all citations in a single document should have (roughly) the same structure
Use the context information in the hub document as evidence for a documents bibliographic information
Look at parsing up people's CVs or publications pages, or wrapper induction for bibliographic resources such as DBLP in order to create a more comprehensive database of publications.
Create an interactive visualization of citation data (and/or of textual similarity), perhaps using ideas such as the hyperbolic tree browser
Run a version of Google's PageRank algorithm to compute the reputability of different papers, authors, conferences, etc.
Track the movement of people over time (based on affiliation).
Explore hierarchical text classification or clustering methods to build a topic hiearchy over the papers in the repository.
Build a collaborative filtering system where rankings of papers are explicitly or implicitly collected and used to recommend papers to other users.
Explore using reinforcement learning techniques for doing better focussed crawling of webpages.
etc.

Part 1A Subprojects: Detailed Descriptions

Our general aim is to build a well-engineered and scalable enough system that we can download, and do information extraction, retrieval, text clustering, etc. on a database of the order of a million research papers. This means that algorithms and methods will have to be chosen so that they scale sensible (e.g., an n³ clustering algorithm would not be a good choice). We're going to keep our CVS repository on the Leland systems, and in Part 1A of the project will develop and run code there. You should check code into CVS as soon as possible (as soon as it compiles...), since this will make it easier for other people to see what you're doing, and how things might work together. For reasons of disk storage alone, all tests at this stage will have to be small. We then plan to deploy and test the system on a Linux machine with plenty of disk space. (And so you should do anything that you think will make the system hard to port to Linux....)

A. Paper Web Crawler

A central requirement for this project is an efficient and robust web crawler which can initially find and download what we are calling "hub pages" -- here pages that contain one or more research papers linked off them. There are several important issues:

The biggest efficiency issue in this phase is downloading pages and papers at a sufficient rate. There are high network latencies in initiating connections and receiving data, and even though Stanford has a big fat pipe running into, many other machines are on slow networks. An efficient, high-performance system needs to be able to download multiple pages in parallel (by adopting a multi-threaded design in Java).
A web crawler has to be a good net citizen. See in particular http://www.robotstxt.org/. Your crawl must observe the robots.txt protocol, and only download things that the webserver owner wants robots to download. The robot should supply contact information. You must also limit the amount of stuff that you download from one site over a short period of time. This requires the project to use a round robin mode where it bounces around a list of sites it is downloading stuff from, rather than pulling everything off one site first in a "depth-first" manner. Finally, there must also be opportunities to control the overall rate of data downloading, so that Stanford doesn't complain at us either. For example, one should be able to throttle back the downloading rate to 1 megabyte a second or whatever.
Crawlers tend to die or go awry sometimes. It's useful if the crawler maintains external state in a database or files, so that one can restart the crawler without having to do too much repeated work.
Requirements of system load can change, so it'd be useful if one can change things such as the downloading rate while the crawler is running. This could be done by it receiving a signal, or periodically checking and reloading a config file.
At least in the first instance we want to build up a paper repository independently from the current Citeseer database. So, in particular, you should not crawl content from Citeseer. Note that they have several URLs (http://www.researchindex.org, http://citeseer.nj.nec.com/, http://citeseer.com). Some other well-known paper repositories, such as http://www.arxiv.org/ also disallow robot crawling (well, to be more precise, they allow Google some places but noone else...). In general, aiming to get papers from individual researcher's home pages seems the best way to get the widest and most up-to-date selection of papers.
You should look to send your robot on some small test crawls, downloading a hundred papers or two, but you should not prior to the checkpoint attempt to do large scale crawling. While debugging the crawler, you should be on-hand to make sure that it is not going haywire, and have it print adequate logging information so one can ensure that it is acting sensibly.

The first aim of the robot is to find and download HTML pages that contain citation information on academic works, including in particular "hub" pages which contain links to papers (in Postscript, PDF, or other formats). The central research challenge is to effectively find appropriate pages. One needs to start somewhere, and we imagine beginning with a small seed file, which could contains some suggested search engine queries and starting page URLs. For example, such a file might have:

Query: conference workshop papers pdf technical report
URL: http://nlp.stanford.edu/~manning/papers/

A central research question is how to do intelligent focussed crawling (or resource discovery) rather than simply blindly downloading linked HTML pages. A good performance metric might be the ratio of useful pages (ones with citation information, including hubs) to all pages downloaded. See the paper by Chakrabarti et al. on focused crawling.

The robot might want to adopt a variety of specialized crawling methods. For example, descending through the faculty and student pages of departments or universities, looking for papers on faculty and student pages or on a publications page linked off of them seems a good strategy. Another good strategy for resource discovery seems to be to take a paper title that you know, send it to a search engine, and to find other sites which store that paper. They may in turn have many other papers stored at them. There can thus be a feedback look between later stages of project processing (getting paper titles) feeding back into new things for the crawler to look for. A central need is that the system be able to discover over time new sites where academic papers are located, rather than simply finding things from a static collection.

A second task of the robot is to download the actual papers. These will be identified by another group, but it is the crawler's job to download them (while again observing constraints imposed by load limitations, robots.txt, etc.

Resources

There has been a considerable amount of work on writing web crawlers, and in particular some previous work using Java (some of it old versions of Java...). However, we're not aware of a publically available Java crawler. This work has focussed on standard web page crawling applications. Most such work maintains a list (representing a kind of breadth-first search) or URLs it knows about but hasn't downloaded, and then multiple crawler agents select new URLs to download based on some metric (such as the estimated pagerank of the pages). There are opportunities here to use cleverer metrics. Some work to be aware of includes:

Mercator Project. A Java web crawler. At what used to be the DEC, and then the Compaq, and now I guess the HP Systems Research Center. Related to Altavista.
UbiCrawler. A more modern Java web crawler, from Italy.
Junghoo Cho's publications. Junghoo was until recently a Stanford grad student, and worked on the WebBase project, in particular he focussed on the crawler used in that project.
Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery by Chakrabarti, van den Berg and Dom. WWW8. And also see his project page.

B. Extract, Convert, Classify

The first part of this project is to take putative hub pages (ones that are meant to contain research papers on them) identified by the crawler, and to do further processing on these HTML pages, and then subsequently on the actually downloaded pages.

The first task is to extract links and link context from hub pages. Finding the links is straightforward HTML parsing ("A" elements), but there is ample room for creativity and cleverness in finding how much context to extract as representing citation information about the paper. Commonly the anchor text (stuff inside the "A" element) of a paper will just be the title, or maybe just PDF, and one will want to determine how much surrounding text provides citation information. There are a number of heuristics one might use (looking for certain HTML elements such as P, BR, LI as separators (but not others such as I, B, ü), and one could also hope to sanity check the content of the proposed citation: it should not be too big or too small, and it should look like a paper citation. This could be evaluated by a simple sequence model, such as a n-gram language model trained on a featural decomposition of words (number, capitalized, etc.).
A second task is to decide whether a link is to a paper. As a first cut, taking all PS and PDF seems hopeful, but there are gotchas in both directions: some research papers appear in HTML, RTF, DOC, or other formats, and many other documents such as product manuals and advertising copy appear in PS or PDF. Classification of something as a research paper or not can be done based just on the citation context, based on looking at the paper content, on considering both jointly, or twice considering each in turn. There are important performance advantages to at least tentatively deciding the answer based on the link and context: many PDF and PS files are very large, and it would be best not to download ones that are going to turn out to be 10 meg product brochures.
This process will put records in the CitationInstance table, which do not yet have local downloaded files associated with them (in the PaperInstance relation). It will then be the crawler's job to occasionally download some of those files.
Once there are downloaded papers, the next job is to convert files from PS and PDF (and perhaps other formats) to more textual format from which we can do tasks such as text classification, clustering, and information extraction. There are a variety of tools for converting these documents to text. See below.
Once papers are in a text format, one may again wish to run a classifier over them to confirm that they are indeed research papers. This could be done as a straightforward text classifier voting for research paper or not (such as a Naive Bayes text classifier). An issue to be aware of is that research papers over different domains are themselves a quite heterogeneous group, as are non-research papers, and so this may make classification into the two classes somewhat difficult. Citeseer papers describe using just a very simple heuristic whereby anything with a block of references at the end counts as an academic paper!

Resources

Many people have written tools for converting PS files to text. Most of them are wrappers around the ghostscript ps2ascii.ps, which is actually a Postscript program that gets text out of another Postscript file. (All Postscript files are essentially programs, written in a stack-based language that resembles Forth, for those of you that have run across that before.) A couple of people have written their own such postscript programs. We hope to essentially be able to use these tools. We've installed under /afs/ir/class/cs276b/software various such tools. (We've also spent a fair bit of time looking at them: you should come talk to Teg or Chris for a bit of a brain-dump about what we currently know, though there are some more details below.) The main practical issue is whether to just use the plain text format (which gives less information but is easier to look at) or the richer representation which includes font change information, etc. This should make subsequent information extraction easier. We may want to use both for different purposes. This should be negotiated with the next group. It's probably best to start with just plain text

For PDF files, there is similarly a pdftotext program, available on the Leland systems. It produces very plain text files. We're not currently aware of a free program that produces a text version of PDF files with some more font and markup information. But it'd be nice if there were one and you could find it!

More on what we know
You can find programs that we've looked at in either the software or software/chris directories. Most of them we have compiled and added some simple setup shell scripts to. Extracting text from a PS program is a messy heuristic business, and the frustrating thing is that all the programs seem to work better for some bits of the problem and worse in other places. For the first phase, we should probably stick to using one that seems best, or perhaps doing a meta-chooser based on the output of several. A first task is to evaluate which one seems most reliable in general. We've only done that for a very few files. A few comparisons on a few files follows:

Name	OK on 7.ps	OK on 8.ps	OK on tense.ps	OK on gi.ps	OK on gi.pdf	OK on alg.pdf	Ligatures	Notes
gs-8.0 ps2ascii	OK-ish	OK-ish	OK	OK	OK	OK	Yes (except 7)	Generally stable. Only 7 bit ASCII. Can decode all charsets. Doesn't put in line breaks as well as pstotext, e.g., on algthatlearns.pdf. Has modes where more detailed info about font changes etc. can be output. Would need decoding.
gs-6.0 ps2ascii	OK-ish	OK-ish	OK	OK	Yes	Yes	Yes (except 7)	version installed on leland. Seems same as 8.0.
gs-4.03-ang ps2ascii	No!	No!	Yes	Yes	Yes	Yes	Bad for TeX OT1; okay on Type 1	Andrew once upon a time did some work to put extra info on font changes and line breaks into text, but it would seem that one would need to port the useful parts to a more modern version of ghostscript for good coverage. He also made it abort processing after the first 3 pages, but this part could be turned off.
pstotext	Gibberish	OK-ish	OK	OK	Yes	Yes	Yes (TeX OT1, Type 1)	Seems fairly robust. Gives page breaks, but little else. Puts section headings and title lines on a line by themselves more robustly than gs8 (see tense.ps or algthatlearns.pdf).
prescript 0.1	No!	No!	Yes	Yes	No!	No!	Some (TeX OT1, not Type 1)	A bit above 2.2. Has html mode which puts in paragraph marks, but really no different to double blank lines in text mode.
prescript 2.2	No!	No!	No!	Yes	No!	No!	No	Doesn't seem to live up to the webpage hype for the few pages I tried. Has html mode which puts in paragraph marks, but really no different to double blank lines in text mode.
pdftotext	No!	No!	No!	No!	Yes	Yes	Yes	Installed on Leland machines. Only for pdf, but for those, it seems to do a slightly better job with headings, section titles etc. than anything else. Just plain text.

(Other thinks I looked at (the Crete ps2html, old dmjones code, old JHU ps2html) doesn't seem worth exploring further.)

C. Parsing and information extraction of paper info

The goal here is to do what is sometimes called text zoning: working out the status of larger blocks of text. The main blocks initially of interest are:

Header (title) material
Abstract
Block of citations

As a more detailed information extraction task, part of this project is to do actual information extraction of paper titles and authors.

There are various ways that one can approach this task. For identifying names, knowing names is very useful. However, there are many names, especially as the range of nationalities grows, and so one wants to be able to overall identify something that looks like a name versus things that look like paper titles. N-gram language model methods (and sensitivity to features like capitalization) can be very effective here.

Resources

There are large lists of names available. One from the Census is at: /afs/ir/class/cs276b/data/census1990names/.
This problem was particularly studied (using machine learning methods) by the Cora project. You can find a tagged data set from Kristie Seymore's page.

D. Information Extraction of Citations

A central need is getting accurate information about other papers cited in a particular paper. This is a reasonably structured information extraction task. The task is first a segmentation task of separating out individual citations (line break information is useful here, but not sufficient by itself), and then separating the citation text into fields of author, year, title, etc. Many of these fields are fairly obvious from their content, but sequence information is also an important indicator. This extracted information will then be put in the database for each citation. A later phase will try to collapse (usually variant) citations of the same work in different papers.

We'd like to be able to give snippets of text of the context in which a work is cited in the text. This crucially involves finding the key used to cite the citation in the references, and then locating it in the text. The key may be a number with symbols like [1], an alphanumeric key, like [Man99], or something constructed from author and year, like Manning (1999) which may be cited in text in several different ways: Manning (1999) or (Manning 1999) or (Manning 1995, 1999).

Resources

Some of it is a little old, but a useful general survey of information extraction is one by Appelt and Israel. It's more focused on sentences, and NLP-based approaches. Much of what we're doing is semi-structured data, and work on wrapper generation or wrapper induction for web sources is also of some relevance (putting either of those terms into a web search engine brings up relevant results).
There are papers that directly address our problem of getting information out of reference lists.
- There is general discussion of it in the Citeseer papers. They appear to use a fairly heuristic approach and regular expression matching. Regular expressions can be very effective for getting information out of fairly structure text like reference lists.
- This problem was particularly studied (using machine learning methods) by the Cora project. You can find a tagged data set from Kristie Seymore's page. You can also find on that page a paper describing her work on using HMMs for such problems. the CMU Web->KB project page. We've put a copy of these datasets at: /afs/ir/class/cs276b/data/Kristie-Seymore-IE.
- Borkar, Deshmukh and Sarawagi, SIGMOD 2001 (which you can find here also discuss HMM-based approaches, and look among other things at parsing citations, and note limitations of the Citeseer approach.
We can make available Java code for HMMs that should be usable for this project. Just ask!
There are also various other approaches to information extraction. See: Ion Muslea: "Extraction Patterns for Information Extraction Tasks: A Survey", AAAI-99 Workshop on Machine Learning for Information Extraction.

E. Normalize and Remove Duplicates

A central problem in this domain is the of identifying instances of the same paper, person, or journal, despite the fact that they are described in variant ways. This problem is an instance of the data association problem, that of associating many different observations of an object to the object itself. In this case, the objects are papers, people, and journals, and the observations are the instances of the papers and citations that we find on the Internet.

This same problem turns up in many other domains where a lot of other places where one has to deal with messy, real-world observations including:

Removal of duplicates when cleaning databases, such as mailing lists.
Coreference detection in NLP systems: where muliple terms in a document or discourse refer to the same semantic object.
Tracking vehicles on a road using images from different cameras. (See http://citeseer.nj.nec.com/129175.html for more information.)

The task in this project is to create a database that contains unique papers, with associated authors and journals as first order objects. The starting point is the database produced by the preceding projects, which includes relations like PaperInstance and CitationInstance. Some of the specific challenges in this part of the project will be:

Normalization. Deciding that "Manning", "C. Manning", "Chris Manning", "Christopher D. Manning" all refer to the same person. Likewise, deciding that "ACL '95" and "33rd Annual Meeting of the Association for Computational Linguistics" are the same! This may involve some supervised learning.
Clustering. Once the names and journals have been normalized, deciding which citations actually point to the same paper. This may involve some data clustering. For example we would like the references
```
Rosenblatt F. (1961).  Principles of Neurodynamics: Perceptrons and
the Theory of Brain Mechanisms. Spartan Books, Washington, D.C.

[97] Rosenblatt, F. (1962). Principles of Neurodynamics. Washington, DC:Spartan
```
to appear in the same cluster.
Deciding which (if any) of the papers we have actually collected is pointed to by this cluster of citations!

Steve Lawrence (developer of CiteSeer) describes the approaches he tried in his paper on the subject, "Autonomous Citation Matching", located at http://www.neci.nec.com/~lawrence/pub-ri.html:

Edit distance measures
Word frequency or word occurance measures (such as tfidf)
Subfields or structure of the data

It is important to note that he is beginning from "unfielded" data, or citation "blobs" with no internal structure, while this group can count on the existence of fields that have been extracted by the other groups mentioned above.

F. Front End Interface

This system needs a good UI to be effective ... and arguably this is one of the places where Citeseer is more lacking. Certainly, other sites, such as Hirewire, are trying to do rather more with the user interface, and there are a number of other ideas one might try. But, first things first, we need some UI. Indeed, we've adopted a slighly large definition of what is the front end: this includes the text indexing of papers. Key components are:

Build an inverted index of citations and papers (using Lucene). This is the task we spent most time on last quarter, and we spent a little time looking at Lucene. Lucene provides straigthtforward ways to build an index, but there are a number of issues that can be investigated for optimum performance, including regular IR issues (stopwords? lowercasing?) and issues particular to the structure of these documents (they have fields like abstracts: Lucene supports searching over this sort of fielded text, and we should take advantage of it).
A lot of searches are for titles or authors and we should take advantage of the database where possible to do this more accurately. This is an example of the kind of parametric search project which we discussed last quarter.
The project should have a web search interface. This is probably most naturally done using JSP/Tomcat (talk to Teg about this!). You should aim for an interface which is natural, and easily supports the kinds of questions that people would commonly want to ask.

Resources

Most of the discussion of user interfaces and parametric search in cs276A is relevant.

Back to the CS276B homepage
Last modified: Wed Jan 8 11:00:59 PST 2003

CS276B Text Information Retrieval, Mining, and Exploitation Winter 2003

Christopher Manning, Prabhakar Raghavan, and Hinrich Schütze

Project Information

Contents:

Project Overview

CiteSeer Successess and Shortcomings

Project Structure

General references

Data Flow and Project 1A Responsibilities

Data Structures

Files

Web Pages, including Hub Pages (HTML)

Raw Papers (PS/PDF/DOC)

Text Papers (text)

Relational Database Tables

Inverted Index

Part 2 Project Ideas

Part 1A Subprojects: Detailed Descriptions

A. Paper Web Crawler

Resources

B. Extract, Convert, Classify

Resources

C. Parsing and information extraction of paper info

Resources

D. Information Extraction of Citations

Resources

E. Normalize and Remove Duplicates

F. Front End Interface

Resources

CS276B
Text Information Retrieval, Mining, and Exploitation
Winter 2003