Data Augmentation: Can BERT Do The Work For You?

Data augmentation has proven effective both for analyzing a neural model's robustness and for improving it through re-training on augmented data. Because text data has a discrete feature space, most data augmentation techniques require querying multiple systems for language knowledge and meticulous design of augmentation rules by researchers. This paper explores the effectiveness of an automatic, black-box data augmentation method based on language models, the BERT context rewriter, and compares it with another augmentation algorithm, the token reorderer, which uses the Universal Sentence Encoder's semantic knowledge. Given a baseline question answering model, we employ the DistilBERT masked language model (MLM) to rewrite masked context data and evaluate whether re-training with the augmented data improves the robustness of the baseline model. This augmentation relies on the language knowledge already learned by the DistilBERT MLM and uses no additional hand-crafted rules. We also explore how different configurations, including the percentage of masked tokens and additional MLM fine-tuning, affect our method's effectiveness. Preliminary experiments show that both methods improve performance on the out-of-domain dev set over the baseline and reduce the performance gap between in-domain and out-of-domain datasets. However, the token reorderer consistently outperforms the BERT context rewriter, both in out-of-domain evaluation (+2.9 F1/+2.9 EM versus +1.9 F1/+1.6 EM) and in reducing the in-domain/out-of-domain gap (-5.3 F1/-4.8 EM versus -1.7 F1/-2.5 EM), and is therefore more effective at improving the baseline model's robustness.
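To make the masking-and-rewriting step concrete, the following is a minimal sketch, not the implementation used in the experiments. It masks a configurable fraction of context tokens and fills each masked position with a replacement. The names `mask_tokens`, `rewrite_context`, and `fill_fn` are hypothetical, and `fill_fn` is a stand-in for a masked language model such as DistilBERT; whitespace tokenization and uniform random selection of positions are simplifying assumptions as well.

```python
import random
from typing import Callable, List, Tuple

MASK = "[MASK]"  # placeholder string; assumed to match the MLM's mask token


def mask_tokens(
    tokens: List[str], mask_fraction: float, rng: random.Random
) -> Tuple[List[str], List[int]]:
    """Replace a random fraction of tokens with MASK.

    Returns the masked copy and the sorted list of masked positions.
    At least one token is masked whenever the input is non-empty.
    """
    if not tokens:
        return [], []
    n_mask = min(len(tokens), max(1, round(len(tokens) * mask_fraction)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    for i in positions:
        masked[i] = MASK
    return masked, positions


def rewrite_context(
    tokens: List[str],
    mask_fraction: float,
    fill_fn: Callable[[List[str], int], str],
    seed: int = 0,
) -> List[str]:
    """Mask part of the context, then fill the masks one at a time.

    fill_fn(tokens, i) is a hypothetical callback that returns a
    replacement for the masked token at index i given the current
    sequence; in practice it would query a masked language model.
    """
    rng = random.Random(seed)
    rewritten, positions = mask_tokens(tokens, mask_fraction, rng)
    for i in positions:
        rewritten[i] = fill_fn(rewritten, i)
    return rewritten


if __name__ == "__main__":
    # Toy stand-in for the MLM: always proposes the same word.
    def toy_fill(tokens: List[str], index: int) -> str:
        return "something"

    context = "the cat sat on the mat near the red door".split()
    print(" ".join(rewrite_context(context, 0.2, toy_fill, seed=1)))
```

Keeping the predictor behind a callback lets the masked-token percentage be varied independently of the model, which mirrors the configuration study mentioned above. In a question answering setting one would presumably also need to keep the answer span intact or re-align it after rewriting; this sketch does not address that.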