Faster Attention for Question Answering

In this project (a default final project on the IID track), I built a question-answering system for SQuAD 2.0 by exploring two directions: modifications to the default BiDAF baseline, and a from-scratch implementation of QANet, a self-attention-based question-answering architecture. Adding character embeddings to BiDAF produced a small but significant improvement over the baseline on the test set, while the QANet models only came close to matching the scores of the BiDAF model with character embeddings. Curiously, my QANet not only under-performed the baseline in accuracy; it was also significantly slower both to train and to run at inference time on GPUs.

Through profiling, I found that the QANet model is indeed faster on CPUs, but significantly under-performs the baseline BiDAF model on GPUs: the BiDAF model's slowest component, the RNN, is implemented as a highly optimized CuDNN routine on GPUs, an advantage the custom QANet encoder block did not share. This profiling also suggests that faster attention mechanisms, as explored in the literature, are unlikely to improve performance on this particular SQuAD 2.0 workload, since additional instruction overhead would likely wash out any gains absent better operator compilation for GPUs or a custom GPU kernel.
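
To make the profiling argument concrete, the sketch below shows one way to compare a CuDNN-backed BiLSTM (as used in the BiDAF encoder) against a QANet-style convolution-plus-self-attention block with `torch.profiler`. This is not the project's actual code: the module shapes, hyperparameters, and batch sizes are illustrative assumptions, and the QANet stand-in is a loose approximation of one encoder layer.

```python
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

# Hypothetical stand-ins, not the project's modules: a CuDNN-backed BiLSTM
# (as in the BiDAF encoder) and a depthwise-separable conv + self-attention
# block loosely shaped like one QANet encoder layer.
hidden = 128
device = "cuda" if torch.cuda.is_available() else "cpu"

bidaf_rnn = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True).to(device)

qanet_convs = nn.Sequential(
    nn.Conv1d(hidden, hidden, kernel_size=7, padding=3, groups=hidden),  # depthwise conv
    nn.Conv1d(hidden, hidden, kernel_size=1),                            # pointwise conv
).to(device)
self_attn = nn.MultiheadAttention(hidden, num_heads=8, batch_first=True).to(device)

x = torch.randn(32, 400, hidden, device=device)  # (batch, context length, hidden)

def run_qanet_block(x):
    # Conv1d expects (batch, channels, length), so transpose around the convs.
    y = qanet_convs(x.transpose(1, 2)).transpose(1, 2)
    y, _ = self_attn(y, y, y)
    return y

activities = [ProfilerActivity.CPU] + ([ProfilerActivity.CUDA] if device == "cuda" else [])
with profile(activities=activities) as prof:
    for _ in range(10):
        bidaf_rnn(x)
        run_qanet_block(x)

# On a GPU, the LSTM's time is concentrated in a few fused CuDNN kernels, while
# the QANet-style block is spread across many small conv/attention/elementwise
# kernels, each paying launch and memory-traffic overhead.
sort_key = "cuda_time_total" if device == "cuda" else "cpu_time_total"
print(prof.key_averages().table(sort_by=sort_key, row_limit=15))
```

In a trace like this, the gap typically shows up not in raw FLOPs but in the number of distinct kernels launched per layer, which is consistent with the finding that a faster attention mechanism alone would not close the gap without operator fusion or a custom kernel.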