Robust Question Answering with Task Adaptive Pretraining and Data Augmentation

Existing research suggests that task-adaptive pretraining (TAPT) with data augmentation can improve classification accuracy on a wide array of natural language processing (NLP) tasks. This project evaluates whether TAPT also improves the performance of a robust question answering (QA) system. Our baseline model, which fine-tunes DistilBERT on the SQuAD, NewsQA, and Natural Questions datasets, achieves an EM score of 33.25 and an F1 score of 48.43 when validated on the out-of-domain DuoRC, RACE, and RelationExtraction datasets. When we apply TAPT to the unlabeled out-of-domain training data using masked language modeling (MLM) without data augmentation, we observe no improvement in either metric. However, without TAPT, performance improves when we use back-translation to augment even a small portion of the fine-tuning data, reaching an EM of 36.91 and an F1 of 50.16 on the out-of-domain validation set, and an EM of 41.628 and an F1 of 58.91 on the out-of-domain test set. These results suggest that data augmentation alone, even applied to a limited extent, may account for the observed improvements in model performance.
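
For readers unfamiliar with the TAPT step, the sketch below shows one plausible way to continue pretraining DistilBERT with MLM on unlabeled out-of-domain passages using Hugging Face Transformers. This is a minimal illustration, not the project's exact setup: the masking rate, sequence length, epoch count, and batch size are assumed hyperparameters, and the placeholder passages stand in for the real DuoRC/RACE/RelationExtraction contexts.

```python
# Minimal TAPT-via-MLM sketch (illustrative hyperparameters, not the
# values used in this project).
from datasets import Dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("distilbert-base-uncased")

# Placeholder: unlabeled passages drawn from the out-of-domain training sets.
passages = ["Replace with unlabeled DuoRC/RACE/RelationExtraction contexts."]
dataset = Dataset.from_dict({"text": passages}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=384),
    batched=True,
    remove_columns=["text"],
)

# Randomly mask 15% of input tokens; the model learns to reconstruct them.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="tapt-mlm",
        num_train_epochs=3,
        per_device_train_batch_size=16,
    ),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()
# The adapted encoder would then be fine-tuned on the downstream QA objective.
```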
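
The back-translation augmentation can likewise be sketched in a few lines. The snippet below round-trips English text through a pivot language to produce paraphrases; the MarianMT checkpoints and the choice of French as the pivot are assumptions for illustration, since the abstract does not specify the translation models used.

```python
# Back-translation sketch: English -> French -> English paraphrases.
# Model checkpoints and pivot language are illustrative assumptions.
from transformers import MarianMTModel, MarianTokenizer

def load(name):
    return MarianTokenizer.from_pretrained(name), MarianMTModel.from_pretrained(name)

en_fr_tok, en_fr = load("Helsinki-NLP/opus-mt-en-fr")
fr_en_tok, fr_en = load("Helsinki-NLP/opus-mt-fr-en")

def translate(texts, tok, model):
    batch = tok(texts, return_tensors="pt", padding=True, truncation=True)
    outputs = model.generate(**batch)
    return [tok.decode(t, skip_special_tokens=True) for t in outputs]

def backtranslate(texts):
    # The round trip yields paraphrases used to augment the fine-tuning data.
    return translate(translate(texts, en_fr_tok, en_fr), fr_en_tok, fr_en)

print(backtranslate(["Where was the treaty signed?"]))
```

In a setup like this, only a small fraction of the training questions or contexts would be paraphrased and appended to the fine-tuning set, consistent with the limited augmentation described above.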