Transformers for Textual Reasoning and Question Answering
Transformers are the predominant architecture for current neural reasoning tasks, owing both to their strong performance and to the convenience of pretraining large models such as BERT and GPT and then fine-tuning them on downstream tasks. However, canonical transformer models have been found to learn and exploit shallow heuristics when trained on simple datasets such as SQuAD or RuleTaker, which require only local phrase matching or shallow textual reasoning. In particular, the high performance transformers achieve on these tasks does not demonstrate an ability to learn long-range relations or a holistic understanding of the text. We propose sparsifying the transformer's attention mechanism to encourage generalized learning, and generating training and testing examples from a semi-synthetic dataset to encourage robustness and inference ability. Our results show that these changes yield improvements on difficult reasoning tasks in performance, generalizability, and learning efficiency.
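The sparsification idea can be illustrated with a minimal sketch. The abstract does not specify the exact mechanism, so the following assumes one common variant: top-k attention, where each query position keeps only its k highest-scoring keys and masks out the rest, forcing the model to commit to a few salient relations instead of diffusing attention everywhere.

```python
import numpy as np

def topk_sparse_attention(Q, K, V, k=2):
    """Scaled dot-product attention where each query attends only to its
    top-k highest-scoring keys. Illustrative only; the paper's actual
    sparsification method may differ."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                      # (n_q, n_k) logits
    # Threshold at each row's k-th largest score; mask everything below it
    kth = np.sort(scores, axis=-1)[:, -k][:, None]
    masked = np.where(scores >= kth, scores, -np.inf)
    # Softmax over the surviving entries (masked entries get weight 0)
    weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 4))
K = rng.standard_normal((5, 4))
V = rng.standard_normal((5, 4))
out, w = topk_sparse_attention(Q, K, V, k=2)
```

Here each row of `w` has exactly `k` nonzero weights summing to 1, so every output vector is a mixture of only two value vectors rather than all five.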