Using the F1 score to evaluate question answering systems places a higher penalty on a bad answer prediction than on a bad `no answer' prediction. A predicted answer for an answerable question may only partially match the reference answer, often yielding an F1 score below 100, whereas a correctly predicted `no answer' always scores 100. Exploiting this imbalance by biasing a model to prefer `no answer' can therefore improve measured performance. The bias can be applied to the output of the softmax, as shown in the image, or by augmenting the training set to over-represent unanswerable examples. While both of these methods change the trained model, a third option is to modify how the probability distributions are discretized into predictions. All three forms of biasing successfully exploit the imbalance in expected score. Combined, they improve the F1 score by 4.13 over an unbiased model; the biased model obtains F1 and EM scores of 68.28 and 65.47 on the development set and 65.25 and 62.42 on the test set.
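The softmax-output form of biasing can be sketched as follows. This is a minimal illustration, not the implementation behind the reported scores: the convention that index 0 holds the `no answer' option, and the particular bias values, are assumptions made for the example.

```python
import numpy as np

def predict_with_null_bias(logits, null_bias=0.0):
    """Pick an answer option or 'no answer' after adding a bias to the
    no-answer logit. A positive null_bias makes the model prefer
    'no answer', exploiting the F1 asymmetry described above.

    Assumption for this sketch: index 0 of `logits` is 'no answer'.
    """
    biased = np.asarray(logits, dtype=float).copy()
    biased[0] += null_bias                  # bias applied on the output side
    probs = np.exp(biased - biased.max())   # numerically stable softmax
    probs /= probs.sum()
    pred = int(np.argmax(probs))
    return "no answer" if pred == 0 else f"span {pred}"

# On a borderline example, the bias flips the prediction to 'no answer'.
print(predict_with_null_bias([1.0, 1.5, 0.2], null_bias=0.0))  # span 1
print(predict_with_null_bias([1.0, 1.5, 0.2], null_bias=1.0))  # no answer
```

Because the bias only shifts one logit before the argmax, it changes predictions only for examples where the model was already nearly undecided, which is exactly where hedging toward `no answer' raises expected F1.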