Improved Robustness in Question-Answering via Multiple Techniques
Question-answering models are among the most promising research areas in NLP. There has already been much study of how to accurately find the correct answers to questions when those questions fall within the training domain. The use of pretrained language models, such as ELMo, GPT, and BERT, enables knowledge gained from pretraining to be transferred to a new model. Although some models can exceed human performance on in-domain data, they perform poorly on out-of-domain data. For real-world applications, we want a QA model that both covers various domains and generalizes well to out-of-domain data; this is the motivation for domain generalization. Most current QA models require additional data to learn new domains and tend to overfit to specific domains. Since it is impossible for a QA model to train on all domains, it is crucial to apply techniques that build domain-agnostic QA models, which learn domain-invariant features rather than domain-specific ones when training data is limited.
In this project, we are given three in-domain datasets: SQuAD, Natural Questions, and NewsQA, and three out-of-domain datasets: DuoRC, RACE, and RelationExtraction. From the literature, we identified several techniques that might help with domain generalization, such as adversarial training, data augmentation, and task-adaptive pretraining. We first apply each technique individually to the given baseline model, DistilBERT, compare their F1 and EM scores, and analyze their performance. We then apply several combinations of these techniques and explore how the combinations perform. Ultimately, the task-adaptive pretraining model gives the best result: an increase of 2.46 in F1 score and 3.92 in EM score over the baseline.
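The F1 and EM scores used throughout this comparison follow the standard SQuAD-style evaluation: both the predicted and gold answer spans are normalized (lowercased, punctuation and articles stripped), then EM checks for an exact string match while F1 measures token overlap. A minimal sketch of this evaluation, assuming single gold answers (the official script takes the max over multiple references):

```python
import re
import string
from collections import Counter

def normalize(text):
    """SQuAD-style normalization: lowercase, drop punctuation and
    articles (a/an/the), and collapse extra whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred, gold):
    """EM: 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(pred) == normalize(gold))

def f1_score(pred, gold):
    """Token-level F1 between the normalized prediction and gold answer."""
    pred_toks = normalize(pred).split()
    gold_toks = normalize(gold).split()
    common = Counter(pred_toks) & Counter(gold_toks)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_toks)
    recall = num_same / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

# Example: "the" is stripped as an article before comparison.
print(exact_match("The Eiffel Tower", "eiffel tower"))  # 1.0
print(round(f1_score("the cat sat", "cat sat down"), 2))  # 0.8
```

Corpus-level scores are then the averages of these per-example values; the gains reported above (2.46 F1, 3.92 EM) are differences in those averages against the DistilBERT baseline.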