Gaining More from Less Data in Out-of-Domain Question Answering Models

(Figure: overview of the QSR, SIBA, SIAA, CCS, and CD augmentation strategies.)
We propose text augmentation techniques for the question answering (QA) task in NLP that apply stochastic synonym substitution to out-of-domain datasets (DuoRC, RACE, and RelationExtraction) that are roughly 400 times smaller than the in-domain datasets (SQuAD, NewsQA, NaturalQuestions). The QSR, SIBA, SIAA, CCS, and CD augmentation strategies illustrated above help a large pre-trained model (DistilBERT, a BERT variant) extract more generalized information from scarce out-of-domain data, benefiting QA applications across domains.

We find that augmenting scarce QA datasets in this way improves generalization, but not all augmentation strategies are equally effective: a combination of 3x QSR, 3x SIBA, 3x SIAA, and 3x CCS performed best (as illustrated above), while CD was excluded because it hurt scores.

We also define a new metric, EM+ (exact match plus): a binary measure that is 1 if the prediction is a superset of the answer and 0 otherwise, giving a less overfit-prone view of performance than EM. From the analysis in the paper, we conjecture that increasing the number of unique words in the out-of-domain data that are absent from the in-domain data improves performance.
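As a rough sketch of the kind of stochastic synonym substitution described above: each eligible word is swapped for a synonym with some probability. The toy lexicon and the function name below are illustrative stand-ins, not the project's actual implementation (the real QSR/SIBA/SIAA strategies differ in where and how they replace or insert words, and would draw on a full thesaurus resource).

```python
import random

# Toy synonym lexicon (a stand-in for a real thesaurus such as WordNet,
# avoided here to keep the sketch self-contained).
SYNONYMS = {
    "big": ["large", "huge"],
    "fast": ["quick", "rapid"],
    "city": ["town", "metropolis"],
}

def synonym_replace(text, p=0.3, seed=None):
    """Stochastically replace words that have known synonyms:
    each eligible word is swapped with probability p."""
    rng = random.Random(seed)
    out = []
    for word in text.split():
        key = word.lower()
        if key in SYNONYMS and rng.random() < p:
            out.append(rng.choice(SYNONYMS[key]))
        else:
            out.append(word)
    return " ".join(out)
```

Applying this several times with different random draws (e.g. the 3x variants above) yields multiple augmented copies of each training example.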
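The EM+ metric can be sketched as follows. We read "the prediction is a superset of the answer" as token-set containment after simple answer normalization; both the normalization and the token-set reading are assumptions for illustration, since the exact procedure is specified in the paper rather than here.

```python
import re

def normalize(text):
    """Lowercase, strip punctuation, collapse whitespace (typical EM-style
    answer normalization; assumed, not the paper's exact recipe)."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())

def em(prediction, answer):
    """Standard exact match: 1 iff normalized strings are identical."""
    return int(normalize(prediction) == normalize(answer))

def em_plus(prediction, answer):
    """EM+: 1 iff the prediction's tokens are a superset of the answer's
    tokens, so extra surrounding words are not penalized."""
    pred_tokens = set(normalize(prediction).split())
    ans_tokens = set(normalize(answer).split())
    return int(ans_tokens.issubset(pred_tokens))
```

For example, a prediction of "the Eiffel Tower" against the gold answer "Eiffel Tower" scores 0 under EM but 1 under EM+, which is why EM+ gives a less overfit-prone view of span extraction quality.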