MedDRA2Vec: Training Medical Graph Embeddings for Clinical NLP

Creating word embeddings for biomedical terminology is a fast-growing area of research. While vast amounts of data are available, encoded in texts, medical codes, ontologies, and patient records, researchers have struggled to create medical term embeddings that are widely used, reproduced, and studied, since many primary sources of clinical notes and patient data are access-restricted.

In response to these challenges, we derive embeddings from an open-access resource: MedDRA, the Medical Dictionary for Regulatory Activities. We apply two embedding methods, Poincaré and Node2Vec, and evaluate the resulting embeddings on two tasks: cosine concept similarity and predicting the diagnoses of a patient visit given the diagnoses of the previous visit.

On both tasks, the MedDRA embeddings are comparable to, and in some cases better than, BioBERT embeddings, which are derived from a BERT model additionally pretrained on large biomedical corpora (PubMed articles). On the patient diagnosis task, the MedDRA embeddings performed close to, though slightly below, Snomed2Vec embeddings, which are trained on SNOMED-CT, one of the largest open-access biomedical ontologies. These results show that ontologies other than SNOMED-CT can be used to derive competitive medical term embeddings. We also release our full code base and the resulting embeddings to support reproduction and further research in biomedical NLP.
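As a rough illustration of the two methods, the sketch below trains Poincaré embeddings (via gensim) and Node2Vec embeddings (via the `node2vec` package) on a toy hierarchy shaped like MedDRA's SOC → HLGT → HLT → PT levels, then compares two concepts. This is a minimal sketch under assumed dependencies (`gensim`, `node2vec`, `networkx`, `numpy`), not our actual pipeline; all term names and hyperparameters are illustrative.

```python
import networkx as nx
import numpy as np
from gensim.models.poincare import PoincareModel
from node2vec import Node2Vec

# Toy parent -> child edges standing in for MedDRA's hierarchy levels
# (System Organ Class -> ... -> Preferred Term); names are illustrative.
edges = [
    ("cardiac_disorders", "cardiac_arrhythmias"),
    ("cardiac_arrhythmias", "atrial_fibrillation"),
    ("cardiac_arrhythmias", "ventricular_tachycardia"),
    ("cardiac_disorders", "heart_failures"),
    ("heart_failures", "congestive_heart_failure"),
]

# Poincaré embeddings (gensim) consume the relation list directly and
# embed the hierarchy in hyperbolic space, which suits tree structure.
poincare = PoincareModel(edges, size=50, negative=2)
poincare.train(epochs=100)

# Node2Vec samples biased random walks over the graph and trains a
# skip-gram word2vec model on them, yielding Euclidean vectors.
graph = nx.Graph(edges)
n2v = Node2Vec(graph, dimensions=50, walk_length=10, num_walks=50, workers=1)
n2v_model = n2v.fit(window=5, min_count=1)

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a, b = "atrial_fibrillation", "ventricular_tachycardia"
# Cosine similarity applies to the Euclidean Node2Vec vectors; for the
# hyperbolic Poincaré vectors we report Poincaré distance instead.
print("node2vec cosine :", cosine(n2v_model.wv[a], n2v_model.wv[b]))
print("poincare distance:", poincare.kv.distance(a, b))
```

In our experiments the same recipe is applied to the full MedDRA graph rather than this toy example, and the resulting vectors feed the concept similarity and diagnosis prediction evaluations described above.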