Exploration of Attention and Transformers for Question Answering
This project was originally intended as an exploration of the ConvBERT model without pretraining,
but after training a base BERT model (an encoder-only Transformer) and
achieving very low performance, the objective shifted toward understanding
Transformers and attention for question answering. Experiments on both
hyperparameters and network architecture were performed on the BERT model, with
the conclusion that this model either overfits or fails to converge. We hypothesize
that, without large-corpus pretraining, simple self-attention over a concatenated
context and question has significant deficiencies compared to explicit cross-attention for learning
SQuAD. A QANet model was also trained for comparison and analysis.
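To make the hypothesized distinction concrete, the sketch below (not taken from the original experiments; all dimensions, tensor names, and the shared attention module are illustrative assumptions) contrasts BERT-style self-attention over a concatenated question-context sequence with an explicit cross-attention step in which context tokens attend to the question, as in QANet's context-query attention.

```python
import torch
import torch.nn as nn

# Hypothetical toy dimensions, chosen only for illustration.
d_model, n_heads = 64, 4
batch, ctx_len, q_len = 2, 120, 15

context = torch.randn(batch, ctx_len, d_model)   # encoded context tokens
question = torch.randn(batch, q_len, d_model)    # encoded question tokens

attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

# BERT-style: self-attention over the concatenated [question; context] sequence.
# Question-context interaction must be discovered implicitly among all token pairs.
joint = torch.cat([question, context], dim=1)
self_out, _ = attn(joint, joint, joint)          # (batch, q_len + ctx_len, d_model)

# QANet-style: explicit cross-attention, with context tokens querying the question.
# The interaction is built into the architecture rather than learned from scratch.
cross_out, _ = attn(context, question, question) # (batch, ctx_len, d_model)
```

Under this reading, the hypothesis is that without pretraining the implicit route (self-attention over the joint sequence) is much harder to learn from SQuAD alone than the architecturally enforced cross-attention route.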