Stanford Research Communication Program
  Home   Researchers Professionals  About
Archive by Major Area


Social Science

Natural Science

Archive by Year

Fall 1999 - Spring 2000

Fall 2000 - Summer 2001

Fall 2001 - Spring 2002

Fall 2002 - Summer 2003




Discovering Patterns in Relational Data

Lise Getoor
Computer Science
Stanford University
June 2001

The goal of my research is to discover statistical patterns in relational data. A relational domain consists of objects and the relationships, or links, between them. Examples of such domains include the World Wide Web (Web pages and the links between pages), epidemiological studies (patients with disease strains, contacts between patients, and potential disease transmissions), and business data (companies, company officers, and event information such as mergers and acquisitions). The discovery of patterns in these types of domains has many obvious benefits, from simply gaining valuable new insights into the domain, to allowing us to make more accurate future predictions.

Traditional machine learning and statistical methods can discover certain types of structures, including correlations and independence among domain attributes, but they are limited to working with flat files, e.g. data consisting of fixed-length vectors of attribute values. In order to apply existing approaches to relational domains, the data must first be flattened. While there are a number of problems associated with doing this, a key issue is the loss of the relational structure. This structure itself may be very useful for prediction and should not be disregarded. For example, a Web page that has many pointers to it is likely to be more useful than a Web page that is not referenced; and a patient who has infected many of his/her contacts is more likely to be infected with a highly contagious disease strain than with one that is less contagious.

In our research, we are addressing this problem by developing methods for learning statistical models directly from relational data. The algorithms build on the work in learning Bayesian networks, and provide powerful and flexible learning algorithms. A learned Probabilistic Relational Model (PRM) provides a statistical model that can uncover many interesting probabilistic dependencies held in a domain. Unlike a set of (probabilistic) rules for classification, PRMs specify a joint distribution over a relational domain and thus can be used for answering queries about any aspect of the domain, given any set of observations. Furthermore, rather than trying to predict one particular attribute, the PRM learning algorithm attempts to weed out the most significant direct dependencies in the data. The resulting model provides a high-level, qualitative picture of the structure of the domain, in addition to the quantitative information provided by the probability distribution. PRMs are ideally suited to exploratory analysis of a domain and relational data mining.

We have applied these methods in a number of domains. One for which the results have been particularly exciting and rewarding has been in the study of tuberculosis epidemiology. In collaboration with Peter Small, M.D. and his research group at the Stanford Medical Center, we analyzed clinical and epidemiological data from the San Francisco tuberculosis clinic for over 2,000 tuberculosis patients and their contacts. The database contains patient demographic attributes as well as clinical attributes such as age at diagnosis, ethnicity, homelessness, HIV status, x-ray results, disease site (e.g. pulmonary or extra-pulmonary) and whether the patient was treated privately or at the TB clinic. Additionally, sputum samples were obtained for each patient and analyzed to determine the strain of tuberculosis causing the disease in the patient. A contact investigation was performed for each patient to identify persons with whom the patient was in contact during his/her infectious period. Data for each contact include the relationship of the contact to the case (e.g., family member, co-worker, etc.), the contact's age, and whether the contact is a member of the patient's household. Tuberculin skin test results and any contact disease treatment were also available.

The structure of the PRM we learn in this domain has a rich dependency structure both within the patient, contact and strain classes and between attributes in different classes. We showed this model to our collaborators and they found the model quite interesting. They thought many of the dependencies to be quite reasonable, such as the dependence of age at diagnosis on HIV status (typically, HIV-positive patients are younger and are infected with TB as a result of AIDS), and the dependence of the contact's age on the type of contact (contacts who are co-workers are likely to be younger than contacts who are parents and older than those who are school friends).

There are also some correlations that are clearly relational and which would have been difficult to detect using a non-relational learning algorithm. For example, there is a dependency between the patient's HIV result and whether he/she transmits the disease to a contact HIV-positive patients are much more likely to transmit the disease. Another interesting relational dependency is the correlation between the ethnicity of the patient and the number of patients infected by the strain patients who are Asian are more likely to be infected with a strain which is unique in the population, whereas other ethnicities are more likely to have strains that recur in several patients. The reason being that Asian patients are more often immigrants who come to the U.S. with a new strain of TB, whereas other ethnicities are often infected locally. There are also dependencies that seem to indicate a bias in the contact investigation procedure or in the treatment of TB. For example, contacts who were screened at the TB clinic were much more likely to be diagnosed with TB and receive treatment than contacts who were screened by their private medical doctor. Our domain experts were quite interested in identifying these patterns and using them to help develop better investigation guidelines.

This is just one of the domains that we have explored; others include gene expression data, U.S. Census data, and Security and Exchange Commission data on company mergers and acquisitions. There are a number of other potentially interesting domains to explore, such as market segmentation and e-commerce transactions. In conclusion, our learning algorithms for PRMs provide a powerful technique for discovering the statistical structure in relational domains. These methods are applicable in a wide variety of settings, and unlike traditional methods, do not require the data to be reformatted in a manner that destroys the information contained in the relational structure among objects in the domain.