BioXtract: Learning Biomedical Knowledge From General and Random Data

The privacy of medical documents and protected healthcare information often limits access to accurate biomedical natural language processing models. Distillation can be used to transfer knowledge from these models, but it typically relies on having data related to the teacher's training domain to serve as the transfer set. In this work, we investigate the distillation of BERT-based biomedical models using transfer datasets from varying domains, including general data, randomized general data, and biomedical data. We find that general data can be used to learn task-specific biomedical knowledge, especially when the student model can be initialized with weights similar to the teacher's. We observe that randomized general data can also be used to transfer knowledge, but it is less effective than general data. We hope that our findings bring attention to both the benefits and the potential dangers of the widespread use of mixed-domain pretraining in NLP, particularly for models that continue their pretraining on private data.
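For readers unfamiliar with distillation, the objective can be illustrated with a minimal sketch. It assumes the common soft-label formulation, in which the student is trained to match the teacher's temperature-softened output distribution on each transfer example; the abstract does not specify the exact loss used in this work, and the function names and the temperature value below are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw logits to probabilities, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence KL(teacher || student) between softened distributions.

    Assumption: this is the generic soft-label objective, not necessarily
    the loss used by the authors. No labels are needed, only the teacher's
    outputs on whatever transfer data is available.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# The loss is zero when the student reproduces the teacher exactly,
# and positive when their output distributions differ.
same = distillation_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0])
different = distillation_loss([2.0, 0.5, -1.0], [0.0, 0.0, 0.0])
```

Because the loss depends only on the teacher's outputs, the transfer examples do not need to come from the teacher's original domain, which is what makes distillation on general or randomized data possible in principle.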