Stanford University

CS276B
Text Information Retrieval, Mining, and Exploitation
Winter 2003

Christopher Manning, Prabhakar Raghavan, and Hinrich Schütze


Project Information

This page contains information about the design of the class project. Please direct all questions about this information to the TA for the class, Teg Grenager.



Project Overview

The inspiration for this quarter's class project comes particularly from the CiteSeer Scientific Literature Library at the NEC Research Institute. CiteSeer is a publicly available digital library which indexes the academic research papers that are available on the World Wide Web. However, there are also other interesting online academic literature repositories, such as the local Highwire Press, an offshoot of the Stanford libraries. Earlier systems that implemented similar kinds of functionality were Andrew Ng's ML Papers and Cora.

While CiteSeer is a very useful research tool, we believe that it leaves a lot of opportunities for improvement, and one goal of the class project is to collectively produce a system which performs better than CiteSeer, in at least some respects. For lack of a better name, we will start off by calling our new and improved system CiteUnseen. We're looking for suggestions for a better name from you!

The other, and more important, goal of this project is that this problem provides an application wherein one can explore most of the important technologies discussed in the course: text clustering, classification, information extraction, link-based analysis, collaborative filtering, and various forms of text mining. We hope that the project can be a good instance of problem-based learning: you will get to acquire skills and see their importance in a real context. We will attempt to provide appropriate resources and pointers about what you should be doing, but an important part of this is that you should ask questions if there are things that you feel that you need to know more about. We can hopefully at least point you in useful directions.

Because the scope of the project is relatively large, we do not expect small groups of students to implement their own version of CiteUnseen. Instead, we have developed APIs and data structures that (hopefully) will allow individual teams to work on separate subproblems, while enforcing overall interoperability. Each small team will have the opportunity to work on more than one project over the course of the quarter, and will do a combination of vital infrastructure and research into a problem of interest. We're going to write the system in Java. We've tried to outline components and database structures for the system below. However, we're not perfect (and haven't actually built a complete system before the class began, though we did play with components), and we may well have missed some things. So, if, when you start to build things, some things seem missing, or something seems wrong in our suggested organization or approach, do let us know. You might well be right.

When complete, CiteUnseen should provide users with the following functionality (and perhaps some other good ideas we come up with):


CiteSeer Successes and Shortcomings

CiteSeer has been a very successful and widely used tool that is valued by academic researchers. Apparently crucial to its success has been the fact that it finds and indexes content (like a web search engine crawler), rather than requiring researchers to deposit materials in an institutional or other online paper server. Beyond that, it offers facilities such as locating citations of papers and measurements of the impact of authors which were previously only available through laborious library research (using traditional citation services like Science Citation Index). On the other hand, there are many shortcomings and opportunities to improve on the kind of service it provides. Problems include (in no particular order):


Project Structure

The project will be divided into two major parts, corresponding to the first and the second half of the course. In the first part we will focus on developing the core infrastructure required to have a functioning system. We would like people to work in pairs on subprojects relevant to this goal. This part will be divided into two stages: 1A and 1B.

In stage 1A we will assign a specific system component to each group of students (we will try to take student preferences into account, but cannot guarantee that all students will get their most preferred choice). Students will have approximately 2 weeks to complete the projects in stage 1A. After all the groups submit their code for 1A, we will attempt to run the system as a whole, noting where it is broken or has performance problems.

In stage 1B we will assign new projects to the student groups, with the objective of adding functionality that was missing from 1A, as well as fixing the problems and performance issues found in 1A. There will be a week and a half to complete projects for 1B. At the conclusion of 1B we hope to have a functioning system that implements the main components of a paper and citation indexing and retrieval system. We would like to work with you to help achieve it. If there are things that you don't know how to do, please feel free to contact the TA and professors for any ideas they may have.

In part 2 of the project, you will have a chance to pick your own topic for study and development. The topic should be something that fits into and improves or extends upon the basic citation index system that was built in part 1. You will have five weeks to complete part 2, and at the end you will present your projects (perhaps to a celebrity audience!). This is your opportunity to do a more detailed piece of research, which should be focussed on research results and performance analysis, rather than just getting things working. For example, in stage 1, we may have used fairly crude regular expression matching information extraction techniques to parse citations. This would be an opportunity to investigate alternative approaches to information extraction, and to find a better performing method. You will need to submit a write-up describing your experiments and findings, as a research paper.

We do want projects in part 2 to be interoperable with the existing citation index system, and to facilitate this, we will have a checkpoint for integrating an early version of your new work, so that we have some time to evaluate and solve any interoperability concerns. We will also be pleased if people implement any miscellaneous bug fixes or performance enhancements to the general system that come up in the course of their project. We hope that at the end of the course we will have a working system that includes innovative new ideas for paper and citation finding, browsing, etc.

Throughout the project, we expect students to work in pairs (if there is an odd number of students, we will make accommodations). Students may change their groups between part 1 and part 2 of the project so as to find other partners who share their research interest. However, you will need to negotiate with other groups to come up with some rearrangement of people so one person isn't left stranded and unhappy.

In part 1, some projects require more work in IR/IE/NLP, and others require more work in systems and software engineering. Because both types of effort are critical to the success of this project, we will value these two types of contributions equally. For part 2, your project should definitely aim to pick up on some of the IR/IE/NLP topics discussed in the course.


General references


Data Flow and Project 1A Responsibilities

The CiteUnseen application begins with virtually no internal data, and gradually builds up data structures that allow it to provide the functionality described above. In order to make the subproblems modular, we have designed a workflow of several stages, with an associated sequence of data structures, that we want programs to follow. The data structures have been designed to store more data than strictly necessary, to allow some flexibility in the actual algorithms used in the implementation. We define the workflow as follows:

For each objective below, we list the subproject tasks, the input and output data, the main issues/challenges, and the team assigned to it:

A. Find pages containing links to academic papers (hub pages)

  1. Build a (robust, scalable, good citizen) web crawler
  2. Enable "focused crawling" by estimating the relevance of unexplored links.
  • A simple data file of seed URLs/queries
  • Hub HTML files on disk
  • Records in the PageInstance table
  • Maximize links explored given resource constraints
  • Accurately estimate relevance of links
  • Use the Google API and other tricks to find good start pages
  • Get good URLs/search terms from subsequent phases
Omar Seyalo: seyalo@stanford.edu
Steve Branson: sbranson@stanford.edu

B. Extract links from hub pages, download and classify them

  1. Extract link and link context from hub pages
  2. Download PS/PDF/DOC/HTML of paper and convert to a standard marked-up format that preserves important formatting
  3. *Classify papers as academic or not using classifier (perhaps learned)
  • Hub HTML files on disk
  • Records in the PageInstance table
  • Records in the PaperInstance table
  • Records in the PaperInstance table
  • Records in the CitationInstance table
  • PS/PDF/DOC of papers on disk
  • Finding citation boundaries in the hub page
  • Understanding PS/PDF/DOC formats
  • Preserving the right level of formatting
Anton Ushakov: antonu@stanford.edu

C. Parse paper for coarse grained fields

  1. Extract from paper coarse-grained fields such as title, authors, abstract, citations
  2. Separate citations section into individual citations
  • Records in the PaperInstance table
  • Records in the PaperInstance table
  • Records in the CitationInstance table
  • Finding the right models for different field extraction tasks
Yang Huang: huangy@stanford.edu
Fang Wei: zwei@stanford.edu

D. Process citations

  1. Extract from citations the tag, author, date, title, journal, etc.
  2. Extract context for each citation from the paper body text
  • Records in the CitationInstance table
  • Marked-up text version of papers on disk
  • Records in the CitationInstance table
  • Records in the CitationContextInstance table
  • Getting them correct!
  • Links to context in paper important
Haoyi Want: haoyiw@stanford.edu
Zhen Yin: zhenyin@stanford.edu

E. Normalize and remove duplications, creating final data structures

  1. * Perform author, journal, and institution normalization (map variant forms to canonical form)
  2. * Normalize citations to produce unique set of citations
  3. * Map citations to the papers to which they refer
  • Records in the CitationInstance and PaperInstance tables
  • Records in the Author, Journal, Paper, and Citation tables
  • Developing normalization techniques for this domain
Joseph Smarr: jsmarr@stanford.edu
Tim Grow: grow@stanford.edu

F. Build front end

  1. Build inverted index of citations and papers (using Lucene)
  2. Build user interface to Lucene and SQL database
  • Records in the Author, Journal, Paper, and Citation tables
  • Inverted index
  • Understanding Lucene
  • Optimizing the information retrieval system for the needed functionality (e.g., searching within fields)
  • Working with J2EE, including servlets and JSP
Qi Su: qisu@stanford.edu
Steve Ngai: sngai@stanford.edu




Data Structures

Unfortunately, it seems to be necessary to use three different types of data structures, all of them residing on disk: files, relational database tables, and an inverted index. We discuss each in turn below:

Files

These should be pretty self-explanatory. We choose to store both the PostScript and the text version of each paper so that we can redo the conversion if we obtain better algorithms. All files are stored in a fixed directory structure so that they can be found deterministically.

Web Pages, including Hub Pages (HTML)

These live in the base directory /afs/ir/class/cs276b/data/webPages. A file with name ABCDEFGH lives at path AB/CD/EF/GH/ABCDEFGH. Some of these web pages are also hub pages, and this is indicated not by the file names but by the records in the database.

Raw Papers (PS/PDF/DOC)

These live in the base directory /afs/ir/class/cs276b/data/rawPapers. A file with name ABCDEFGH lives at path AB/CD/EF/GH/ABCDEFGH.

Text Papers (text)

These live in the base directory /afs/ir/class/cs276b/data/textPapers. A file with name ABCDEFGH lives at path AB/CD/EF/GH/ABCDEFGH.
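
To make the directory scheme concrete, here is a minimal Java sketch of mapping an 8-character file name to its path under one of the base directories above. The class and method names are our own invention for illustration, not part of any agreed API.

import java.io.File;

public class PaperStore {
    private final String baseDir;  // e.g. /afs/ir/class/cs276b/data/textPapers

    public PaperStore(String baseDir) {
        this.baseDir = baseDir;
    }

    /** Returns the full path for a file whose name is an 8-character id. */
    public File pathFor(String name) {
        if (name.length() != 8) {
            throw new IllegalArgumentException("expected an 8-character file name: " + name);
        }
        // Split the name into 2-character segments: AB/CD/EF/GH/ABCDEFGH
        String dirs = name.substring(0, 2) + File.separator
                    + name.substring(2, 4) + File.separator
                    + name.substring(4, 6) + File.separator
                    + name.substring(6, 8);
        return new File(baseDir + File.separator + dirs, name);
    }

    public static void main(String[] args) {
        PaperStore store = new PaperStore("/afs/ir/class/cs276b/data/textPapers");
        System.out.println(store.pathFor("ABCDEFGH"));
        // prints .../textPapers/AB/CD/EF/GH/ABCDEFGH
    }
}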

Relational Database Tables

The first five tables are used to store the raw data taken from the documents themselves, before removing duplicates.

PageInstance(id, url, filename, status, score, isHub)

PaperInstance(id, url, rawFilename, textFilename, status, author, title, abstract, citations, selfCitation, citationInstanceID, authorBegin, authorEnd, titleBegin, titleEnd, abstractBegin, abstractEnd, citationsBegin, citationsEnd, paperID)

CitationInstance(id, fromPaperInstanceID, fromHubInstanceID, toPaperInstanceID, citationText, citationTag, author, title, date, publication, volume, pages, editor, publisher, citationID, paperID, status)

CitationContextInstance(citationInstanceID, paperInstanceID, contextBegin, contextEnd, context)

AuthorInstance(id, authorText, first, middle, last, suffix, citationInstanceID, paperInstanceID, authorID)

The remaining seven tables form the final data structures, from which duplicates have been removed and which are used to respond to user queries.

Paper(id, citationInstanceID, paperInstanceID)

Author(id, first, middle, last, suffix, email, affiliation)

Authorship(paperID, authorID)

Name(id, altName, isCanonical)

Publication(id, canonicalName)

PublicationName(publicationID, altName)

Citation(fromPaperID, toPaperID, citationInstanceID)

We will use MySQL as our database management system. It will manage the tables and handle queries over them.
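
As a hedged illustration of how components can read these tables from Java, the sketch below queries CitationInstance over JDBC. The table and column names come from the schema above; the driver class, JDBC URL, and credentials are placeholders that will depend on how the class database is actually set up.

import java.sql.*;

public class CitationDump {
    public static void main(String[] args) throws Exception {
        Class.forName("com.mysql.jdbc.Driver");  // assumed MySQL JDBC driver name
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/citeunseen", "user", "password");  // placeholders
        try {
            PreparedStatement stmt = conn.prepareStatement(
                    "SELECT id, author, title FROM CitationInstance WHERE fromPaperInstanceID = ?");
            stmt.setInt(1, 42);  // hypothetical paper instance id
            ResultSet rs = stmt.executeQuery();
            while (rs.next()) {
                System.out.println(rs.getInt("id") + "\t" + rs.getString("author")
                        + "\t" + rs.getString("title"));
            }
            rs.close();
            stmt.close();
        } finally {
            conn.close();
        }
    }
}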

Inverted Index

We will build an inverted index over the text versions of the papers, where the papers have been divided up into fields including author, title, abstract, introduction, and references. We will use Lucene, an open-source Java-based indexing system from Apache, to build and query the inverted index.
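
As a rough illustration, the sketch below builds a small fielded index using the Lucene 1.x API that is current as we write this (Field.Text, Field.Keyword, etc.). The index path and field names are merely example assumptions.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class PaperIndexer {
    public static void main(String[] args) throws Exception {
        // true = create a new index at this (assumed) path
        IndexWriter writer = new IndexWriter("/afs/ir/class/cs276b/data/index",
                                             new StandardAnalyzer(), true);
        Document doc = new Document();
        doc.add(Field.Keyword("paperID", "12345"));               // stored, not tokenized
        doc.add(Field.Text("title", "Autonomous Citation Matching"));
        doc.add(Field.Text("authors", "Steve Lawrence"));
        doc.add(Field.Text("abstract", "..."));
        doc.add(Field.UnStored("body", "full text of the paper goes here"));
        writer.addDocument(doc);
        writer.optimize();
        writer.close();
    }
}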


Part 2 Project Ideas

Part 2 of the project is your opportunity to do a more detailed piece of research, which should be focussed on novel research, interesting results and performance analysis, rather than just getting things working. For example, in stage 1, we may have used fairly crude regular expression matching information extraction techniques to parse citations. This would be an opportunity to investigate alternative approaches to information extraction, and to find a better performing method. For part 2, results of your experiments should be submitted as a (maximum 8 page) research paper, as in a conference proceedings. Any topic related to the course and the project is fine. The topic should be something that fits into and improves or extends upon the basic citation index system that was built in part 1.

There are many possibilities for the second part of the project. We list a few here, but our experience shows that students can often be much more creative than we are.


Part 1A Subprojects: Detailed Descriptions

Our general aim is to build a system that is well-engineered and scalable enough that we can download, and do information extraction, retrieval, text clustering, etc. on, a database on the order of a million research papers. This means that algorithms and methods will have to be chosen so that they scale sensibly (e.g., an O(n³) clustering algorithm would not be a good choice). We're going to keep our CVS repository on the Leland systems, and in Part 1A of the project will develop and run code there. You should check code into CVS as soon as possible (as soon as it compiles...), since this will make it easier for other people to see what you're doing, and how things might work together. For reasons of disk storage alone, all tests at this stage will have to be small. We then plan to deploy and test the system on a Linux machine with plenty of disk space. (So you should not do anything that you think will make the system hard to port to Linux.)

A. Paper Web Crawler

A central requirement for this project is an efficient and robust web crawler which can initially find and download what we are calling "hub pages" -- here, pages that have one or more research papers linked off them. There are several important issues:

  1. The biggest efficiency issue in this phase is downloading pages and papers at a sufficient rate. There are high network latencies in initiating connections and receiving data, and even though Stanford has a big fat pipe running into it, many other machines are on slow networks. An efficient, high-performance system needs to be able to download multiple pages in parallel (by adopting a multi-threaded design in Java).
  2. A web crawler has to be a good net citizen. See in particular http://www.robotstxt.org/. Your crawl must observe the robots.txt protocol, and only download things that the webserver owner wants robots to download. The robot should supply contact information. You must also limit the amount of stuff that you download from one site over a short period of time. This requires the project to use a round-robin mode where it bounces around a list of sites it is downloading stuff from, rather than pulling everything off one site first in a "depth-first" manner (a sketch of such a frontier appears after this list). Finally, there must also be a way to control the overall rate of data downloading, so that Stanford doesn't complain at us either. For example, one should be able to throttle back the downloading rate to 1 megabyte a second or whatever.
  3. Crawlers tend to die or go awry sometimes. It's useful if the crawler maintains external state in a database or files, so that one can restart the crawler without having to do too much repeated work.
  4. Requirements of system load can change, so it'd be useful if one can change things such as the downloading rate while the crawler is running. This could be done by it receiving a signal, or periodically checking and reloading a config file.
  5. At least in the first instance we want to build up a paper repository independently from the current Citeseer database. So, in particular, you should not crawl content from Citeseer. Note that they have several URLs (http://www.researchindex.org, http://citeseer.nj.nec.com/, http://citeseer.com). Some other well-known paper repositories, such as http://www.arxiv.org/, also disallow robot crawling (well, to be more precise, they allow Google in some places but no one else...). In general, aiming to get papers from individual researchers' home pages seems the best way to get the widest and most up-to-date selection of papers.
  6. You should plan to send your robot on some small test crawls, downloading a hundred or two papers, but you should not attempt large-scale crawling prior to the checkpoint. While debugging the crawler, you should be on hand to make sure that it is not going haywire, and have it print adequate logging information so one can ensure that it is acting sensibly.
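
Here is a hedged sketch of the per-site round-robin idea from point 2 above: the frontier keeps one queue per host and only hands out a URL for a host whose politeness delay has expired. All names are hypothetical and the delay value is an assumption, not an agreed number.

import java.net.URL;
import java.util.*;

public class PoliteFrontier {
    private static final long MIN_DELAY_MS = 5000;  // assumed per-host politeness delay

    private final Map hostQueues = new HashMap();   // host -> LinkedList of URLs
    private final Map lastFetch = new HashMap();    // host -> Long (last fetch time)

    public synchronized void add(URL url) {
        String host = url.getHost();
        LinkedList queue = (LinkedList) hostQueues.get(host);
        if (queue == null) {
            queue = new LinkedList();
            hostQueues.put(host, queue);
        }
        queue.add(url);
    }

    /** Returns a URL whose host has not been hit recently, or null if none is ready. */
    public synchronized URL next() {
        long now = System.currentTimeMillis();
        for (Iterator it = hostQueues.entrySet().iterator(); it.hasNext(); ) {
            Map.Entry entry = (Map.Entry) it.next();
            String host = (String) entry.getKey();
            LinkedList queue = (LinkedList) entry.getValue();
            Long last = (Long) lastFetch.get(host);
            boolean ready = (last == null) || (now - last.longValue() >= MIN_DELAY_MS);
            if (ready && !queue.isEmpty()) {
                lastFetch.put(host, new Long(now));
                return (URL) queue.removeFirst();
            }
        }
        return null;  // caller threads should sleep briefly and retry
    }
}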

The first aim of the robot is to find and download HTML pages that contain citation information on academic works, including in particular "hub" pages which contain links to papers (in Postscript, PDF, or other formats). The central research challenge is to effectively find appropriate pages. One needs to start somewhere, and we imagine beginning with a small seed file, which could contain some suggested search engine queries and starting page URLs. For example, such a file might have:

Query: conference workshop papers pdf technical report
URL: http://nlp.stanford.edu/~manning/papers/

A central research question is how to do intelligent focussed crawling (or resource discovery) rather than simply blindly downloading linked HTML pages. A good performance metric might be the ratio of useful pages (ones with citation information, including hubs) to all pages downloaded. See the paper by Chakrabarti et al. on focused crawling.

The robot might want to adopt a variety of specialized crawling methods. For example, descending through the faculty and student pages of departments or universities, looking for papers on faculty and student pages or on a publications page linked off of them, seems a good strategy. Another good strategy for resource discovery seems to be to take a paper title that you know, send it to a search engine, and find other sites which store that paper. They may in turn have many other papers stored at them. There can thus be a feedback loop in which later stages of project processing (which extract paper titles) feed new things back to the crawler to look for. A central need is that the system be able to discover over time new sites where academic papers are located, rather than simply finding things from a static collection.
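
As a toy illustration of estimating link relevance, the following scores a link by how many paper-related cue words occur in its anchor text and surrounding context. The cue-word list and the scoring rule are made up for illustration; a real focused crawler would use a learned classifier along the lines of the Chakrabarti et al. work cited below.

public class LinkScorer {
    private static final String[] PAPER_WORDS = {
        "paper", "papers", "publications", "proceedings", "abstract",
        "postscript", "pdf", "technical report", "conference", "workshop"
    };

    /** Fraction of cue words that appear in the anchor text plus surrounding context. */
    public static double score(String anchorText, String context) {
        String text = (anchorText + " " + context).toLowerCase();
        int hits = 0;
        for (int i = 0; i < PAPER_WORDS.length; i++) {
            if (text.indexOf(PAPER_WORDS[i]) >= 0) {
                hits++;
            }
        }
        return (double) hits / PAPER_WORDS.length;
    }

    public static void main(String[] args) {
        System.out.println(score("My publications", "Conference and workshop papers (PDF)"));
        System.out.println(score("Contact me", "Office hours and directions"));
    }
}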

A second task of the robot is to download the actual papers. These will be identified by another group, but it is the crawler's job to download them (while again observing constraints imposed by load limitations, robots.txt, etc.).

Resources

There has been a considerable amount of work on writing web crawlers, and in particular some previous work using Java (some of it using old versions of Java...). However, we're not aware of a publicly available Java crawler. This work has focussed on standard web page crawling applications. Most such work maintains a list (representing a kind of breadth-first search) of URLs it knows about but hasn't downloaded, and then multiple crawler agents select new URLs to download based on some metric (such as the estimated PageRank of the pages). There are opportunities here to use cleverer metrics. Some work to be aware of includes:

  1. Mercator Project. A Java web crawler. At what used to be the DEC, and then the Compaq, and now I guess the HP Systems Research Center. Related to Altavista.
  2. UbiCrawler. A more modern Java web crawler, from Italy.
  3. Junghoo Cho's publications. Junghoo was until recently a Stanford grad student, and worked on the WebBase project; in particular, he focussed on the crawler used in that project.
  4. Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery by Chakrabarti, van den Berg and Dom. WWW8. And also see his project page.

B. Extract, Convert, Classify

The first part of this project is to take putative hub pages (ones that are believed to have research papers linked off them) identified by the crawler, and to do further processing on these HTML pages, and then subsequently on the downloaded papers themselves.

  1. The first task is to extract links and link context from hub pages. Finding the links is straightforward HTML parsing ("A" elements), but there is ample room for creativity and cleverness in deciding how much context to extract as representing citation information about the paper. Commonly the anchor text (the stuff inside the "A" element) of a paper will just be the title, or maybe just "PDF", and one will want to determine how much surrounding text provides citation information. There are a number of heuristics one might use (looking for certain HTML elements such as P, BR, or LI as separators, but not others such as I, B, or U), and one could also hope to sanity-check the content of the proposed citation: it should not be too big or too small, and it should look like a paper citation. This could be evaluated by a simple sequence model, such as an n-gram language model trained on a featural decomposition of words (number, capitalized, etc.). A small extraction sketch appears after this list.
  2. A second task is to decide whether a link is to a paper. As a first cut, taking all PS and PDF links seems promising, but there are gotchas in both directions: some research papers appear in HTML, RTF, DOC, or other formats, and many other documents such as product manuals and advertising copy appear in PS or PDF. Classification of something as a research paper or not can be done based just on the citation context, based on looking at the paper content, by considering both jointly, or by considering each in turn. There are important performance advantages to at least tentatively deciding the answer based on the link and context: many PDF and PS files are very large, and it would be best not to download ones that are going to turn out to be 10 meg product brochures.
  3. This process will put records in the CitationInstance table, which do not yet have local downloaded files associated with them (in the PaperInstance relation). It will then be the crawler's job to occasionally download some of those files.
  4. Once there are downloaded papers, the next job is to convert files from PS and PDF (and perhaps other formats) to a more textual format on which we can do tasks such as text classification, clustering, and information extraction. There are a variety of tools for converting these documents to text. See below.
  5. Once papers are in a text format, one may again wish to run a classifier over them to confirm that they are indeed research papers. This could be done as a straightforward text classifier voting for research paper or not (such as a Naive Bayes text classifier). An issue to be aware of is that research papers over different domains are themselves a quite heterogeneous group, as are non-research papers, and so this may make classification into the two classes somewhat difficult. Citeseer papers describe using just a very simple heuristic whereby anything with a block of references at the end counts as an academic paper!
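
As promised in item 1, here is a rough sketch of pulling out "A" elements and some surrounding context from a hub page using java.util.regex. Real hub pages are messier than this regex assumes, so treat it as a starting point rather than the agreed extraction method; the context window size is an arbitrary choice.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HubLinkExtractor {
    // href value and anchor text of an <a ...>...</a> element (case-insensitive, dot matches newlines)
    private static final Pattern LINK = Pattern.compile(
            "<a\\s+[^>]*href\\s*=\\s*[\"']?([^\"'\\s>]+)[^>]*>(.*?)</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    private static final int CONTEXT_CHARS = 200;  // arbitrary window around the link

    public static void extract(String html) {
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            String href = m.group(1);
            String anchorText = m.group(2).replaceAll("<[^>]+>", " ").trim();
            int start = Math.max(0, m.start() - CONTEXT_CHARS);
            int end = Math.min(html.length(), m.end() + CONTEXT_CHARS);
            String context = html.substring(start, end).replaceAll("<[^>]+>", " ");
            System.out.println(href + "\t" + anchorText + "\t[" + context.trim() + "]");
        }
    }

    public static void main(String[] args) {
        extract("<p>C. Manning. Foundations of Statistical NLP. "
                + "<a href=\"fsnlp.pdf\">PDF</a> (with H. Schutze)</p>");
    }
}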

Resources

Many people have written tools for converting PS files to text. Most of them are wrappers around the ghostscript ps2ascii.ps, which is actually a Postscript program that gets text out of another Postscript file. (All Postscript files are essentially programs, written in a stack-based language that resembles Forth, for those of you that have run across that before.) A couple of people have written their own such postscript programs. We hope to essentially be able to use these tools. We've installed under /afs/ir/class/cs276b/software various such tools. (We've also spent a fair bit of time looking at them: you should come talk to Teg or Chris for a bit of a brain-dump about what we currently know, though there are some more details below.) The main practical issue is whether to just use the plain text format (which gives less information but is easier to look at) or the richer representation which includes font change information, etc.; the latter should make subsequent information extraction easier. We may want to use both for different purposes. This should be negotiated with the next group. It's probably best to start with just plain text.

For PDF files, there is similarly a pdftotext program, available on the Leland systems. It produces very plain text files. We're not currently aware of a free program that produces a text version of PDF files with some more font and markup information. But it'd be nice if there were one and you could find it!
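
To show where this conversion step fits into the Java system, here is a minimal sketch that shells out to pdftotext. It assumes pdftotext is on the PATH and accepts the usual "pdftotext input.pdf output.txt" arguments; error handling is deliberately minimal.

import java.io.IOException;

public class PdfConverter {
    public static boolean toText(String pdfPath, String textPath)
            throws IOException, InterruptedException {
        String[] cmd = { "pdftotext", pdfPath, textPath };
        Process p = Runtime.getRuntime().exec(cmd);
        int status = p.waitFor();   // 0 means the conversion succeeded
        return status == 0;
    }

    public static void main(String[] args) throws Exception {
        boolean ok = toText("/tmp/example.pdf", "/tmp/example.txt");  // hypothetical paths
        System.out.println(ok ? "converted" : "pdftotext failed");
    }
}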

More on what we know
You can find programs that we've looked at in either the software or software/chris directories. Most of them we have compiled and added some simple setup shell scripts to. Extracting text from a PS file is a messy, heuristic business, and the frustrating thing is that all the programs seem to work better on some bits of the problem and worse in other places. For the first phase, we should probably stick to using the one that seems best, or perhaps build a meta-chooser based on the output of several. A first task is to evaluate which one seems most reliable in general. We've only done that for a very few files. A few comparisons on a few files follow:

For each tool we note how it did on a few test files (7.ps, 8.ps, tense.ps, gi.ps, gi.pdf, alg.pdf), whether it handles ligatures, and general notes:

gs-8.0 ps2ascii -- 7.ps: OK-ish; 8.ps: OK-ish; tense.ps: OK; gi.ps: OK; gi.pdf: OK; alg.pdf: OK. Ligatures: Yes (except 7). Generally stable. Only 7-bit ASCII. Can decode all charsets. Doesn't put in line breaks as well as pstotext, e.g., on algthatlearns.pdf. Has modes where more detailed info about font changes etc. can be output. Would need decoding.

gs-6.0 ps2ascii -- 7.ps: OK-ish; 8.ps: OK-ish; tense.ps: OK; gi.ps: OK; gi.pdf: Yes; alg.pdf: Yes. Ligatures: Yes (except 7). Version installed on Leland. Seems same as 8.0.

gs-4.03-ang ps2ascii -- 7.ps: No!; 8.ps: No!; tense.ps: Yes; gi.ps: Yes; gi.pdf: Yes; alg.pdf: Yes. Ligatures: bad for TeX OT1; okay on Type 1. Andrew once upon a time did some work to put extra info on font changes and line breaks into the text, but it would seem that one would need to port the useful parts to a more modern version of ghostscript for good coverage. He also made it abort processing after the first 3 pages, but this part could be turned off.

pstotext -- 7.ps: gibberish; 8.ps: OK-ish; tense.ps: OK; gi.ps: OK; gi.pdf: Yes; alg.pdf: Yes. Ligatures: Yes (TeX OT1, Type 1). Seems fairly robust. Gives page breaks, but little else. Puts section headings and title lines on a line by themselves more robustly than gs8 (see tense.ps or algthatlearns.pdf).

prescript 0.1 -- 7.ps: No!; 8.ps: No!; tense.ps: Yes; gi.ps: Yes; gi.pdf: No!; alg.pdf: No!. Ligatures: some (TeX OT1, not Type 1). A bit above 2.2. Has an html mode which puts in paragraph marks, but really no different to double blank lines in text mode.

prescript 2.2 -- 7.ps: No!; 8.ps: No!; tense.ps: No!; gi.ps: Yes; gi.pdf: No!; alg.pdf: No!. Ligatures: no. Doesn't seem to live up to the webpage hype for the few pages I tried. Has an html mode which puts in paragraph marks, but really no different to double blank lines in text mode.

pdftotext -- 7.ps: No!; 8.ps: No!; tense.ps: No!; gi.ps: No!; gi.pdf: Yes; alg.pdf: Yes. Ligatures: Yes. Installed on Leland machines. Only for PDF, but for those, it seems to do a slightly better job with headings, section titles, etc. than anything else. Just plain text.

(Other things I looked at (the Crete ps2html, old dmjones code, old JHU ps2html) don't seem worth exploring further.)

C. Parsing and information extraction of paper info

The goal here is to do what is sometimes called text zoning: working out the status of larger blocks of text. The main blocks initially of interest are:

  1. Header (title) material
  2. Abstract
  3. Block of citations

As a more detailed information extraction task, part of this project is to do actual information extraction of paper titles and authors.

There are various ways that one can approach this task. For identifying names, knowing lists of names is very useful. However, there are many names, especially as the range of nationalities grows, and so one wants to be able to identify, in general, something that looks like a name versus something that looks like a paper title. N-gram language model methods (and sensitivity to features like capitalization) can be very effective here.
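
To make the "featural decomposition of words" idea concrete, here is a small sketch that maps tokens to coarse shape features; an n-gram model over these shapes can then help distinguish name-like strings from title-like ones. The feature inventory here is just one plausible choice, and the model itself is left to the group.

public class WordShape {
    public static String shape(String token) {
        if (token.matches("\\d{4}")) return "YEAR";
        if (token.matches("\\d+")) return "NUMBER";
        if (token.matches("[A-Z]\\.")) return "INITIAL";       // e.g. "C."
        if (token.matches("[A-Z][a-z]+")) return "CAPITALIZED";
        if (token.matches("[A-Z]+")) return "ALLCAPS";
        if (token.matches("[a-z]+")) return "LOWER";
        return "OTHER";
    }

    public static void main(String[] args) {
        String[] tokens = "C. D. Manning and H. Schutze".split("\\s+");
        StringBuffer shapes = new StringBuffer();
        for (int i = 0; i < tokens.length; i++) {
            shapes.append(shape(tokens[i])).append(' ');
        }
        System.out.println(shapes);  // INITIAL INITIAL CAPITALIZED LOWER INITIAL CAPITALIZED
    }
}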

Resources

D. Information Extraction of Citations

A central need is getting accurate information about other papers cited in a particular paper. This is a reasonably structured information extraction task. The task is first a segmentation task of separating out individual citations (line break information is useful here, but not sufficient by itself), and then separating the citation text into fields of author, year, title, etc. Many of these fields are fairly obvious from their content, but sequence information is also an important indicator. This extracted information will then be put in the database for each citation. A later phase will try to collapse (usually variant) citations of the same work in different papers.

We'd like to be able to give snippets of text of the context in which a work is cited in the text. This crucially involves finding the key under which the work appears in the references, and then locating that key in the body text. The key may be a number in brackets like [1], an alphanumeric key like [Man99], or something constructed from author and year, like Manning (1999), which may be cited in the text in several different ways: Manning (1999) or (Manning 1999) or (Manning 1995, 1999).
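
For concreteness, here is a rough sketch of regular expressions for the citation key styles just mentioned. The patterns are illustrative rather than exhaustive; real reference sections will need more robust handling.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CitationKeyFinder {
    private static final Pattern NUMERIC     = Pattern.compile("\\[(\\d+(?:\\s*,\\s*\\d+)*)\\]");
    private static final Pattern ALPHA_KEY   = Pattern.compile("\\[([A-Za-z]+\\d{2,4}[a-z]?)\\]");
    private static final Pattern AUTHOR_YEAR = Pattern.compile(
            "\\(?\\b([A-Z][a-z]+)\\s*\\(?((?:19|20)\\d{2}(?:\\s*,\\s*(?:19|20)\\d{2})*)\\)");

    public static void find(String text) {
        Matcher m = NUMERIC.matcher(text);
        while (m.find()) System.out.println("numeric key: " + m.group(1));
        m = ALPHA_KEY.matcher(text);
        while (m.find()) System.out.println("alphanumeric key: " + m.group(1));
        m = AUTHOR_YEAR.matcher(text);
        while (m.find()) System.out.println("author-year: " + m.group(1) + " " + m.group(2));
    }

    public static void main(String[] args) {
        find("As shown in [3] and [Man99], this extends Manning (1999) and (Manning 1995, 1999).");
    }
}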

Resources

E. Normalize and Remove Duplicates

A central problem in this domain is that of identifying instances of the same paper, person, or journal, despite the fact that they are described in variant ways. This problem is an instance of the data association problem, that of associating many different observations of an object with the object itself. In this case, the objects are papers, people, and journals, and the observations are the instances of the papers and citations that we find on the Internet.

This same problem turns up in a lot of other places where one has to deal with messy, real-world observations, including:

The task in this project is to create a database that contains unique papers, with associated authors and journals as first order objects. The starting point is the database produced by the preceding projects, which includes relations like PaperInstance and CitationInstance. Some of the specific challenges in this part of the project will be:

Steve Lawrence (developer of CiteSeer) describes the approaches he tried in his paper on the subject, "Autonomous Citation Matching", located at http://www.neci.nec.com/~lawrence/pub-ri.html:

It is important to note that he is beginning from "unfielded" data, or citation "blobs" with no internal structure, while this group can count on the existence of fields that have been extracted by the other groups mentioned above.
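
As a toy baseline (not the method Lawrence describes), here is a sketch of comparing two extracted title fields by normalized word overlap (Jaccard similarity). It is only meant to show where a similarity function plugs into the duplicate-removal step; a real matcher would combine evidence from several fields.

import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

public class TitleMatcher {
    /** Lowercase, strip punctuation, and split into a set of words. */
    private static Set words(String title) {
        String[] tokens = title.toLowerCase().replaceAll("[^a-z0-9 ]", " ").split("\\s+");
        Set set = new HashSet();
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].length() > 0) set.add(tokens[i]);
        }
        return set;
    }

    /** Jaccard similarity of the word sets of two titles. */
    public static double similarity(String a, String b) {
        Set wa = words(a);
        Set wb = words(b);
        Set union = new HashSet(wa);
        union.addAll(wb);
        int common = 0;
        for (Iterator it = wa.iterator(); it.hasNext(); ) {
            if (wb.contains(it.next())) common++;
        }
        return union.isEmpty() ? 0.0 : (double) common / union.size();
    }

    public static void main(String[] args) {
        System.out.println(similarity("Autonomous citation matching",
                                      "Autonomous Citation Matching."));                      // 1.0
        System.out.println(similarity("Autonomous citation matching",
                                      "Digital libraries and autonomous citation indexing")); // lower
    }
}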

F. Front End Interface

This system needs a good UI to be effective ... and arguably this is one of the places where Citeseer is most lacking. Certainly, other sites, such as Highwire, are trying to do rather more with the user interface, and there are a number of other ideas one might try. But, first things first, we need some UI. Indeed, we've adopted a slightly expansive definition of what counts as the front end: this includes the text indexing of papers. Key components are:

  1. Build an inverted index of citations and papers (using Lucene). This is the task we spent most time on last quarter, and we spent a little time looking at Lucene. Lucene provides straightforward ways to build an index, but there are a number of issues that can be investigated for optimum performance, including regular IR issues (stopwords? lowercasing?) and issues particular to the structure of these documents (they have fields like abstracts: Lucene supports searching over this sort of fielded text, and we should take advantage of it). A query-side sketch appears after this list.
  2. A lot of searches are for titles or authors and we should take advantage of the database where possible to do this more accurately. This is an example of the kind of parametric search project which we discussed last quarter.
  3. The project should have a web search interface. This is probably most naturally done using JSP/Tomcat (talk to Teg about this!). You should aim for an interface which is natural, and easily supports the kinds of questions that people would commonly want to ask.
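
As referenced in item 1, here is a query-side companion to the indexing sketch earlier on this page, again using the Lucene 1.x API: search one field of the fielded index and print the stored titles. The index path and field names are the same assumptions as before.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class PaperSearcher {
    public static void main(String[] args) throws Exception {
        IndexSearcher searcher = new IndexSearcher("/afs/ir/class/cs276b/data/index");
        // Restrict the search to the "abstract" field, one of the fields indexed earlier.
        Query query = QueryParser.parse("citation indexing", "abstract", new StandardAnalyzer());
        Hits hits = searcher.search(query);
        for (int i = 0; i < hits.length(); i++) {
            Document doc = hits.doc(i);
            System.out.println(hits.score(i) + "\t" + doc.get("title"));
        }
        searcher.close();
    }
}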

Resources

Most of the discussion of user interfaces and parametric search in CS276A is relevant.


Back to the CS276B homepage
Last modified: Wed Jan 8 11:00:59 PST 2003