An Unsupervised Pretraining Task for the BiDAF Model

Over the past few years, and particularly since "Attention Is All You Need" was published, the NLP community has moved away from LSTM-based architectures because of the benefits seen from attention-only networks with extensive unsupervised pretraining. This project demonstrates that EM, F1, and AvNA scores on a BiDAF model can be improved simply by pretraining on a task similar to those used in the original BERT paper. Where BERT used a Masked Language Model (MLM) and Next Sentence Prediction (NSP), this work introduces a variant of MLM, termed the Obscured Replacement Language Model (ORLM), that fits the strict input-output mapping of a BiDAF model so it can learn from an unsupervised task. Specifically, we show that ORLM pretraining yields performance gains over the baseline BiDAF model, as judged by EM and F1 scores. Furthermore, pretraining the BiDAF model with this method reduces the amount of training on the SQuAD 2.0 training set needed to reach comparable performance, while boosting task-specific metrics such as AvNA. As the community moves decisively away from LSTM-based architectures, it is worth asking whether the true top-end performance of those architectures was ever fully explored, even if they continue to fall short of the state of the art.
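
To make the idea concrete, the sketch below shows one plausible way an MLM-style objective could be cast into BiDAF's (context, query, answer-span) format. The exact ORLM recipe is not specified above; the `[OBSCURED]` mask token, the random span-selection heuristic, the query window size, and the idea of obscuring the span inside a local window to form the query are all illustrative assumptions, not the project's actual implementation.

```python
"""Hypothetical sketch: building an unsupervised ORLM-style example for BiDAF.

A span of raw text is obscured in a short query window; the target is the
start/end position of the original span in the untouched context, matching
BiDAF's span-prediction output. All names and heuristics here are assumptions.
"""
import random

MASK_TOKEN = "[OBSCURED]"  # hypothetical placeholder token


def make_orlm_example(context_tokens, max_span_len=3, window=10, rng=random):
    """Build one (context, query, answer-span) triple from unlabeled text."""
    # Choose a random span of the context to obscure.
    span_len = rng.randint(1, max_span_len)
    start = rng.randint(0, len(context_tokens) - span_len)
    end = start + span_len  # exclusive

    # Build the query: a local window around the span, with the span masked out.
    lo, hi = max(0, start - window), min(len(context_tokens), end + window)
    query_tokens = context_tokens[lo:start] + [MASK_TOKEN] + context_tokens[end:hi]

    # BiDAF-style target: start/end indices of the span within the context.
    return {
        "context": context_tokens,
        "query": query_tokens,
        "answer_start": start,
        "answer_end": end - 1,  # inclusive end index, SQuAD-style
    }


if __name__ == "__main__":
    text = ("the bidirectional attention flow model reads a context and a "
            "query and predicts an answer span in the context").split()
    example = make_orlm_example(text, rng=random.Random(0))
    print(example["query"])
    print(example["context"][example["answer_start"]:example["answer_end"] + 1])
```

Because every field mirrors a SQuAD-style example, such triples could in principle be fed to an unmodified BiDAF training loop, which is the property the abstract relies on when it says ORLM respects the model's strict input-output mapping.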