Ontologies for data integration:

Ontology-driven indexing of public datasets

The volume of publicly available genomic-scale data is increasing. Genomic datasets in public repositories are annotated with free-text fields that are not mapped to concepts in any ontology, which makes it difficult to integrate these datasets across repositories. In this project we have built a prototype system for ontology-based annotation and indexing of biomedical data. The system processes the text metadata of diverse resource elements, such as gene expression data sets, descriptions of radiology images, clinical-trial reports, and PubMed article abstracts, to annotate and index them with concepts from appropriate ontologies. Its key function is to let users search for biomedical data resources related to particular ontology concepts. The prototype system can be accessed at http://bioportal.bioontology.org/all_resources
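The annotate-then-index idea described above can be illustrated with a minimal sketch. It is not the prototype's implementation: the toy ontology, the concept ids, the resource ids, and the function names are all hypothetical, and matching is reduced to whole-word lookup of concept labels. Each resource is indexed under the concepts found in its text and under their ancestors, so a search for a general concept also returns resources annotated with more specific ones.

```python
import re
from collections import defaultdict

# Hypothetical toy ontology: concept id -> (label, parent id or None).
ONTOLOGY = {
    "C1": ("neoplasm", None),
    "C2": ("carcinoma", "C1"),
    "C3": ("melanoma", "C1"),
}

def ancestors(concept_id):
    """Yield every ancestor of a concept by following parent links."""
    parent = ONTOLOGY[concept_id][1]
    while parent is not None:
        yield parent
        parent = ONTOLOGY[parent][1]

def annotate(text):
    """Return ids of concepts whose label occurs as a whole word in text."""
    lowered = text.lower()
    return {
        cid
        for cid, (label, _) in ONTOLOGY.items()
        if re.search(r"\b" + re.escape(label) + r"\b", lowered)
    }

def build_index(resources):
    """Map each concept id to the resources annotated with it or a descendant."""
    index = defaultdict(set)
    for rid, text in resources.items():
        for cid in annotate(text):
            index[cid].add(rid)
            for anc in ancestors(cid):
                index[anc].add(rid)
    return index

# Hypothetical resource metadata (free text).
RESOURCES = {
    "dataset-1": "Expression profiling of melanoma cell lines",
    "trial-7": "Phase II trial in renal carcinoma",
    "image-3": "Chest radiograph, no abnormality",
}
INDEX = build_index(RESOURCES)
```

In this sketch, looking up the general concept `C1` in `INDEX` returns both `dataset-1` and `trial-7`, even though neither text mentions "neoplasm" directly.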

The NCBO Annotator

Building on the methods developed during the annotation of tissue microarray data and on the largest dictionary of biomedical terms (compiled by the National Center for Biomedical Ontology), we developed a web service that allows users to annotate any text using terms from any ontology from the Open Biomedical Ontologies set or the UMLS. The Annotator can be accessed at http://bioportal.bioontology.org/annotate

Annotation of Tissue Microarrays using NCI-Thesaurus

The Stanford Tissue Microarray Database (TMAD) is a repository of data amassed by a consortium of pathologists and biomedical researchers. The TMAD data are annotated with text fields that specify the pathological diagnoses for each tissue sample. We developed methods to map these annotations to the NCI Thesaurus and the SNOMED-CT ontologies. Using these two ontologies, we can represent about 80% of the annotations in a structured manner. This mapping makes ontology-driven querying of the TMAD data possible.

Pathway Knowledge Base

Pathway Knowledge Base is a proof-of-concept resource demonstrating that multiple pathway data sources can be integrated using BioPAX.

UMLS Query

The Metathesaurus from the Unified Medical Language System (UMLS) is a widely used ontology resource, most often used in relational database form for terminology research, mapping, and information indexing. UMLS-Query is a Perl module that provides functions for retrieving concept identifiers, mapping text phrases to Metathesaurus concepts, and traversing the graph of a Metathesaurus stored in a MySQL database. UMLS-Query can be used to build applications for semi-automated sample annotation, terminology-based browsers for tissue sample databases, and tools for terminology research.
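The two kinds of query mentioned above (phrase-to-concept lookup and graph traversal over a relational store) can be sketched as follows. This is only an illustration of the pattern, not UMLS-Query's API: it is written in Python rather than Perl, uses an in-memory SQLite database in place of MySQL, and the table names, column names, concept identifiers, and function names are all invented for the example. The real Metathesaurus schema is considerably richer.

```python
import sqlite3

# In-memory stand-in for the relational store (assumption: the real
# system uses MySQL and the full Metathesaurus tables).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept_names (cui TEXT, str TEXT);
CREATE TABLE concept_parents (child_cui TEXT, parent_cui TEXT);
INSERT INTO concept_names VALUES
  ('X0000001', 'Neoplasm'),
  ('X0000002', 'Carcinoma'),
  ('X0000003', 'Adenocarcinoma'),
  ('X0000003', 'Adenocarcinoma, NOS');
INSERT INTO concept_parents VALUES
  ('X0000002', 'X0000001'),
  ('X0000003', 'X0000002');
""")

def get_cuis(phrase):
    """Return concept ids having a name equal to phrase (case-insensitive)."""
    rows = conn.execute(
        "SELECT DISTINCT cui FROM concept_names "
        "WHERE LOWER(str) = LOWER(?) ORDER BY cui",
        (phrase,),
    ).fetchall()
    return [row[0] for row in rows]

def get_ancestors(cui):
    """Walk parent links upward and return ancestors in discovery order."""
    found, frontier = [], [cui]
    while frontier:
        current = frontier.pop()
        rows = conn.execute(
            "SELECT parent_cui FROM concept_parents WHERE child_cui = ?",
            (current,),
        ).fetchall()
        for (parent,) in rows:
            if parent not in found:
                found.append(parent)
                frontier.append(parent)
    return found
```

With the toy rows above, `get_cuis("adenocarcinoma")` resolves the phrase to the fictitious identifier `X0000003`, and `get_ancestors("X0000003")` walks up to the two broader concepts.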

Mapping the OBO format to OWL

Many formats have been defined for ontologies; two of the most significant in the biomedical domain are the Open Biomedical Ontologies Format (OBOF) and the Web Ontology Language (OWL). The goal of this project is to develop a mapping between the OBOF and OWL formats, as well as software to convert between them.
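To make the idea of a format mapping concrete, here is a minimal sketch that reads `[Term]` stanzas and emits OWL axioms in functional-style syntax. It is not the project's mapping or its conversion software: it handles only the `id`, `name`, and `is_a` tags, the `EX:` identifiers are fictitious, and the base IRI and function names are assumptions made for the example.

```python
# Hypothetical base IRI; a real conversion would follow the agreed mapping.
BASE_IRI = "http://example.org/obo/"

OBO_TEXT = """\
[Term]
id: EX:0000001
name: example child term
is_a: EX:0000002 ! example parent term

[Term]
id: EX:0000002
name: example parent term
"""

def parse_terms(obo_text):
    """Collect id, name and is_a values from each [Term] stanza."""
    terms, current = [], None
    for raw in obo_text.splitlines():
        line = raw.strip()
        if line == "[Term]":
            current = {"is_a": []}
            terms.append(current)
        elif line.startswith("["):
            current = None  # other stanza types are skipped in this sketch
        elif current is not None and ": " in line:
            tag, value = line.split(": ", 1)
            value = value.split(" ! ", 1)[0].strip()  # drop trailing comment
            if tag == "is_a":
                current["is_a"].append(value)
            elif tag in ("id", "name"):
                current[tag] = value
    return terms

def iri(obo_id):
    """Turn an id such as EX:0000001 into a bracketed IRI."""
    return "<" + BASE_IRI + obo_id.replace(":", "_") + ">"

def to_owl(terms):
    """Emit class declarations, labels and subclass axioms for the terms."""
    axioms = []
    for term in terms:
        axioms.append(f"Declaration(Class({iri(term['id'])}))")
        if "name" in term:
            axioms.append(
                f'AnnotationAssertion(rdfs:label {iri(term["id"])} "{term["name"]}")'
            )
        for parent in term["is_a"]:
            axioms.append(f"SubClassOf({iri(term['id'])} {iri(parent)})")
    return axioms

AXIOMS = to_owl(parse_terms(OBO_TEXT))
```

The sketch treats each `is_a` line as a subclass axiom between two named classes; tags that carry richer semantics (relationships, intersections, synonyms) are where a full mapping needs the most care and are deliberately left out here.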

Knowledge representation and Inference:

The rapidly increasing volume and diversity of biological data make it challenging to integrate all the information coherently enough to help the experimentalist design further experiments. Doing so requires a common language (and a knowledge model) for representing biological objects and processes, as well as methods for expressing alternative hypotheses (or models) and 'biological inference rules' that evaluate these hypotheses against what is already known.


HyQue (for Hypothesis-based Querying of pathway models) will take as input working hypotheses about pathway models expressed in a knowledge-based formalism, evaluate their consistency using existing data in a knowledge base, and provide as output contradictory evidence and suggestions for improving the hypotheses. HyQue will incorporate formal knowledge representations based upon Semantic Web standards and an ontology to represent biological objects and relationships. The heart of this project is the development and prototyping of a new paradigm for the query and integration of diverse biological data.

HyQue is based on the prior work on HyBrow.


HyBrow (Hypothesis Browser) is a tool for the representation, manipulation, and integration of diverse biological data (such as gene expression, protein interactions, and annotations) with prior biological knowledge for the purpose of evaluating alternative hypotheses. The prototype system is available at hybrow.org. HyBrow's purpose is to evaluate and rank hypotheses based on user-defined 'rules' and on their consistency with all the information available to it.
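The notion of evaluating and ranking hypotheses against known facts can be illustrated with a toy example. This is a sketch under strong simplifying assumptions, not HyBrow's or HyQue's actual formalism: a hypothesis is a list of (subject, relation, object) events, the single "rule" is that 'activates' and 'represses' contradict each other, the score is supported events minus contradicted events, and the gene names are placeholders.

```python
# Hypothetical rule: these two relations contradict each other.
OPPOSITE = {"activates": "represses", "represses": "activates"}

# Hypothetical knowledge base of accepted events.
KNOWN = {
    ("geneA", "activates", "geneB"),
    ("geneC", "represses", "geneA"),
}

def evaluate(hypothesis):
    """Sort each event into supported, contradicted or unknown, and score."""
    support, conflict, unknown = [], [], []
    for subj, rel, obj in hypothesis:
        if (subj, rel, obj) in KNOWN:
            support.append((subj, rel, obj))
        elif (subj, OPPOSITE.get(rel), obj) in KNOWN:
            conflict.append((subj, rel, obj))
        else:
            unknown.append((subj, rel, obj))
    return {
        "score": len(support) - len(conflict),
        "support": support,
        "conflict": conflict,
        "unknown": unknown,
    }

def rank(hypotheses):
    """Return hypothesis names ordered from best to worst score."""
    return sorted(
        hypotheses,
        key=lambda name: evaluate(hypotheses[name])["score"],
        reverse=True,
    )

HYPOTHESES = {
    "H1": [("geneA", "activates", "geneB"), ("geneC", "represses", "geneA")],
    "H2": [("geneA", "represses", "geneB"), ("geneC", "represses", "geneA")],
    "H3": [("geneA", "activates", "geneB"), ("geneD", "activates", "geneA")],
}
RANKING = rank(HYPOTHESES)
```

Returning the contradicted and unknown events alongside the score reflects the goal of reporting contradictory evidence rather than only a ranking; a richer rule set would replace the single `OPPOSITE` table.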

Graduate projects

Molecular Profiling of cancer

With the advent of genome-level expression profiling, cancer staging and prognosis have undergone a revolution. Traditional staging and predictive factors rely on a few specific cell-surface, histological, or gross pathologic features. Gene expression profiles have the potential to supplement these with many thousands of features, increasing the accuracy of staging and prognosis. One can envision an 'expression atlas', similar to a histological atlas, for the diagnosis and accurate classification of tumors.

I worked on a project to identify characteristic 'signatures', or patterns of gene expression, for well-known cancer-related signaling pathways (e.g. the p53 network) in 12 tumor types. This work was done with the Functional Genomics & Systems Biology group at the IBM T. J. Watson Research Center in the summer of 2002.

Microarray data analysis and Promoter sequence analysis

Filtering and normalization of microarray data, especially for a large number of files, is a tedious task. I wrote scripts to facilitate this process and make it less painful. I also developed scripts for further processing (centering of data, clustering based on predefined profiles, etc.). Searching for binding sites of known transcription factors in the promoters of a set of genes of interest is a very common analysis after microarray experiments. I wrote programs to compile relevant promoter datasets, search for known motifs in the promoters of Arabidopsis thaliana genes, and finally draw an image of the occurrence of the various motifs in the promoters.
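The motif-search step can be sketched in a few lines. This is not the original scripts: the promoter sequences are made up, the function names are hypothetical, only the forward strand is scanned, and the two consensus strings (CACGTG for the G-box and TTGACY for the W-box) serve purely as illustrative inputs. Consensus strings written with IUPAC codes are translated to regular expressions, and a lookahead is used so that overlapping occurrences are also reported.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "W": "[AT]", "S": "[CG]",
    "K": "[GT]", "M": "[AC]", "N": "[ACGT]",
}

# Illustrative motif consensus strings.
MOTIFS = {"G-box": "CACGTG", "W-box": "TTGACY"}

def motif_regex(consensus):
    """Compile a consensus string into a regex that finds overlapping hits."""
    body = "".join(IUPAC[base] for base in consensus.upper())
    return re.compile("(?=(" + body + "))")

def scan(promoters, motifs):
    """Return {promoter: {motif: [0-based start positions]}}, forward strand only."""
    compiled = {name: motif_regex(cons) for name, cons in motifs.items()}
    return {
        pid: {
            name: [m.start() for m in rx.finditer(seq.upper())]
            for name, rx in compiled.items()
        }
        for pid, seq in promoters.items()
    }

# Made-up promoter fragments for the example.
PROMOTERS = {
    "prom1": "AACACGTGTTTGACCAA",
    "prom2": "GGGTTGACTAAA",
}
HITS = scan(PROMOTERS, MOTIFS)
```

The resulting table of positions per promoter is the kind of data from which a motif-occurrence image could then be drawn.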

Gene Ontology based analysis of Microarray data

After microarray data analysis we usually end up with a list of genes deemed 'significantly changed'. In order to attach meaning to such lists, a common analysis is to identify what functional categories are enriched in the list of significant genes (or in a cluster of interest). I have developed a program (called CLENCH, from Cluster Enrichment) that allows such analysis for Arabidopsis thaliana.
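One common way to test whether a category is enriched in a gene list is a one-sided hypergeometric test. The sketch below shows that calculation only; whether CLENCH uses this particular statistic is not stated here, and the gene ids, the category name, and the function names are hypothetical.

```python
from math import comb

def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) when n genes are drawn from N, of which K are in the category."""
    upper = min(n, K)
    total = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return total / comb(N, n)

def enrichment(selected, categories, background):
    """Return {category: p-value} for over-representation in the selected genes."""
    selected = set(selected) & set(background)
    results = {}
    for name, members in categories.items():
        members = set(members) & set(background)
        overlap = len(selected & members)
        results[name] = hypergeom_pvalue(
            overlap, len(selected), len(members), len(background)
        )
    return results

# Hypothetical data: 20 background genes, 5 of them in one category.
BACKGROUND = [f"g{i}" for i in range(20)]
CATEGORIES = {"catX": {"g0", "g1", "g2", "g3", "g4"}}
SELECTED = ["g0", "g1", "g2", "g10"]
PVALUES = enrichment(SELECTED, CATEGORIES, BACKGROUND)
```

When many categories are tested at once, the resulting p-values would normally need a multiple-testing correction, which this sketch leaves out.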
