Learning Links Between Low Level and Preferred Medical Terminology with GNNs

Relational databases like MedDRA can be valuable tools for learning technical medical language. In this project, I leverage the tree structure of the MedDRA ontology to learn node embeddings for the natural language phrases contained at each level of the hierarchy. I begin by inferring a graph structure from the hierarchical relations defined between phrases in MedDRA. I then use a Graph Neural Network (GNN) to learn node embeddings that can be used to predict relationships between "low level terminology" - the kind of phrasing doctors use when talking to patients - and "preferred terminology" - standardized technical medical jargon.

I explored the effects of modeling the ontology with different graph structures, both homogeneous and heterogeneous, and the effectiveness of pairing the GNN with various BERT models. My clearest finding is that a heterogeneous GNN significantly outperforms a standard GNN in all experimental settings: on average, the heterogeneous GNNs achieve approximately 70% accuracy in predicting the links between low level terminology and the appropriate preferred terminology. Additionally, and somewhat surprisingly, I found that using pretrained BERT models to initialize node embeddings - whether specialized medical models like BlueBERT or the standard base BERT model - did not noticeably outperform random node feature initialization.
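The pipeline above can be sketched in miniature. The following is an illustrative plain-Python sketch, not the project's actual implementation: the term pairs are hypothetical examples (not real MedDRA content), and the dot-product scorer stands in for the trained GNN's link predictor. It shows the two structural ideas the paragraph describes - building a typed (heterogeneous) edge list from low-level-term-to-preferred-term relations, and scoring candidate links from node embeddings, here randomly initialized as in the random-initialization baseline.

```python
import random

# Hypothetical LLT -> PT pairs for illustration only (not actual MedDRA data).
llt_to_pt = {
    "feeling sick": "Nausea",
    "queasy stomach": "Nausea",
    "head pounding": "Headache",
    "ringing in ears": "Tinnitus",
}

def build_hetero_graph(pairs):
    """Assign integer ids per node type and collect typed edges.

    Keeping LLT and PT nodes in separate id spaces mirrors how a
    heterogeneous GNN treats ('llt', 'maps_to', 'pt') as its own
    relation type rather than folding everything into one node set.
    """
    llt_ids = {t: i for i, t in enumerate(sorted(pairs))}
    pt_ids = {t: i for i, t in enumerate(sorted(set(pairs.values())))}
    edges = [(llt_ids[l], pt_ids[p]) for l, p in pairs.items()]
    return llt_ids, pt_ids, edges

def random_embeddings(n, dim, seed=0):
    """Random node feature initialization (the baseline that matched
    BERT-initialized features in the experiments)."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]

def score(u, v):
    """Dot-product link score between an LLT and a PT embedding."""
    return sum(a * b for a, b in zip(u, v))

llt_ids, pt_ids, edges = build_hetero_graph(llt_to_pt)
llt_emb = random_embeddings(len(llt_ids), 16, seed=1)
pt_emb = random_embeddings(len(pt_ids), 16, seed=2)

# Rank every PT candidate for one LLT query; a trained GNN would refine
# these embeddings so that the correct PT ranks first.
query = llt_ids["feeling sick"]
ranking = sorted(pt_ids, key=lambda p: -score(llt_emb[query], pt_emb[pt_ids[p]]))
```

In the full project this scoring step sits on top of GNN-refined embeddings, and the heterogeneous variant keeps separate message-passing weights per edge type, which is what drove the performance gap reported above.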