Effect of Character- and Subword-Embeddings on BiDAF Performance

Systems trained end-to-end have achieved promising results in question answering over the past couple of years. Many deep-learning-based question answering systems are trained and evaluated on the Stanford Question Answering Dataset (SQuAD), where every question is either unanswerable or answered by a segment of text from the corresponding reading passage [4]. In this work, we investigate the effectiveness of different embeddings in improving the performance of the baseline Bi-Directional Attention Flow model on SQuAD 2.0. The first model improves upon the baseline with character-level embeddings; the second model improves with subword-level embeddings; the third improves with both character-level and subword-level embeddings. Our best model, which incorporates word-level and subword-level embeddings, achieves an EM score of 57.70 and F1 score of 61.26 on the test set.
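For readers unfamiliar with the two metrics, the following is a minimal sketch of how EM and token-level F1 are commonly computed for a single prediction. It assumes a simplified normalization (lowercasing and whitespace splitting only); the official SQuAD evaluation also strips punctuation and articles and takes the maximum over several gold answers, which is omitted here. The function names are illustrative.

```python
from collections import Counter

def tokens(text):
    """Simplified normalization: lowercase and split on whitespace."""
    return text.lower().split()

def exact_match(prediction, answer):
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(tokens(prediction) == tokens(answer))

def f1_score(prediction, answer):
    """Harmonic mean of token-level precision and recall."""
    pred, gold = tokens(prediction), tokens(answer)
    if not pred or not gold:
        # An empty string stands for "no answer"; both must be empty to match.
        return float(pred == gold)
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```

A prediction that contains the gold answer plus extra words scores 0 on EM but receives partial credit on F1, which is why the F1 figures reported in these abstracts are always at least as high as the EM figures.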

Building a QA system (IID SQuAD track)

Question answering is an interesting machine learning task that shows how well a machine can understand the meaning of words and the relationships between them. Many existing models have been built to solve this task. This paper draws inspiration from the paper Bidirectional Attention Flow for Machine Comprehension and dives deeper into the effect of character-level embeddings on the performance of the model. By experimenting with different CNN models for character-level embedding, we conclude that a more complex CNN model does not yield better performance metrics. However, by manually evaluating the model's predictions, we find that a more complex model does perform better in certain cases.

Improving QA System Out of Domain Performance Using Data Augmentation

In recent years, question answering (QA) systems have become widely used in many modern technology applications, such as search engine querying and virtual assistants. However, despite recent advances in QA modeling, these systems still struggle to generalize to a specific domain without specialized training data and information about that domain's distribution. For this reason, we investigated the effectiveness of different data augmentation and sampling techniques to improve the robustness of the pre-trained DistilBERT QA system on out-of-domain data. We trained the DistilBERT model on the in-domain data and then experimented with fine-tuning using augmented versions of the out-of-domain data. To generate the additional data points, we performed random word deletion, synonym replacement, and random swapping. We found that all the fine-tuned models performed better than the baseline model. Additionally, we found that our optimal synonym replacement model performed the best on the out-of-domain test set, and that the combination model of synonym replacement and deletion also led to increased performance over the baseline. Overall, we conclude that data augmentation does increase the ability of our question answering system to generalize to out-of-domain data, and we suggest that future work could look further into applying combinations of these data augmentation techniques.
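The three augmentation operations named above can be sketched on a token list as follows. This is an illustration under stated assumptions, not the project's implementation: the `SYNONYMS` table is a hypothetical stand-in for a real thesaurus, and the function names and parameters are invented for the example. In extractive QA, editing the context can move or destroy the answer span; this sketch ignores that problem.

```python
import random

# Hypothetical toy synonym table; a real pipeline would use a proper thesaurus.
SYNONYMS = {"quick": ["fast", "rapid"], "answer": ["reply", "response"]}

def random_deletion(tokens, p, rng):
    """Drop each token with probability p, always keeping at least one token."""
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]

def synonym_replacement(tokens, n, rng):
    """Replace up to n tokens that have an entry in SYNONYMS."""
    out = list(tokens)
    candidates = [i for i, t in enumerate(out) if t in SYNONYMS]
    rng.shuffle(candidates)
    for i in candidates[:n]:
        out[i] = rng.choice(SYNONYMS[out[i]])
    return out

def random_swap(tokens, n, rng):
    """Swap two randomly chosen positions, n times."""
    out = list(tokens)
    if len(out) < 2:
        return out
    for _ in range(n):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out
```

Passing an explicit `random.Random(seed)` instance as `rng` keeps the augmented data reproducible across runs.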

RobustQA

Project summaries unavailable

Building a QA system (IID SQuAD track)

In this project, we build a question answering system that is expected to perform well on SQuAD. Our approaches include retraining the baseline model, improving the embedding (BiDAF), modifying the attention (Dynamic Coattention Model), replacing the LSTM with a GRU, and applying a transformer (QANet). In our experiments, both BiDAF and QANet outperform the baseline model, with QANet being our best model. It takes advantage of features from the other modifications mentioned above, and it consists of four layers: (1) Embedding layer, where a combination of character-level and word-level embeddings uses the pre-trained word embedding model to map the input into vector space. (2) Contextual embedding layer, where the encoder block uses contextual cues from surrounding words to refine the embedding of each word. (3) Attention flow layer, where the coattention-like implementation produces a set of query-aware feature vectors for each word in the context. (4) Modeling and output layer, where a stack of encoder blocks with fully-connected layers is used to scan the context and provide an answer to the query. Submitting our best model to the test leaderboard, we obtained satisfying results, with an F1 of 66.43 and an EM of 62.45.

Fine Grained Gating on SQUAD

The purpose of this project is to implement an embedding mechanism on top of the BiDAF model that serves as a compromise between word-level and character-level embeddings and that can compete with a simple concatenation of the two. In particular, the mechanism is a fine-grained gating method in which, given a character-level embedding $c$ and a word-level embedding $w$, a parameter $g$ is learned such that the final embedding of a given word is $g \odot c + (1-g) \odot w$, where $\odot$ represents termwise multiplication. After experiments varying the method by which the parameter $g$ is learned, results ultimately show that none of the fine-grained gating methods performs better than mere concatenation of the word and character embeddings.
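The gating formula above can be sketched in a few lines of plain Python. The combination step follows the formula directly; the `gate` function is only one hypothetical way of producing $g$ (a sigmoid over a linear map of the word embedding), since the abstract states that several ways of learning $g$ were tried without specifying them.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gate(w, weights, bias):
    """Hypothetical gate: g = sigmoid(W w + b), one value per embedding dimension."""
    return [sigmoid(sum(wij * wj for wij, wj in zip(row, w)) + b)
            for row, b in zip(weights, bias)]

def fine_grained_combine(c, w, g):
    """Return g * c + (1 - g) * w, computed element by element."""
    return [gi * ci + (1.0 - gi) * wi for gi, ci, wi in zip(g, c, w)]
```

Because each dimension has its own gate value in [0, 1], the model can take some dimensions mostly from the character embedding and others mostly from the word embedding, and the output keeps the same dimensionality as either input (unlike concatenation, which doubles it).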

Domain Adversarial Training for QA Systems

In our CS224N project, we examine a QA model trained on SQuAD, NewsQA, and Natural Questions and augment it to improve its ability to generalize to data from other domains. We apply a method known as domain adversarial training (as seen in a research paper we reviewed by Seanie Lee and associates) which involves an adversarial neural network attempting to detect domain-specific model behavior and discouraging this to produce a more general model. We explore the efficacy of this technique as well as the scope of what can be considered a "domain" and how the choice of domains affects the performance of the trained model. We find that, in our setting, using a clustering algorithm to sort training data into categories yields a performance benefit for out-of-domain data. We compare the partitioning method used by Lee et al. and our own unsupervised clustering method of partitioning and demonstrate a substantial improvement.

Robust Question Answering with Task Adaptive Pretraining and Data Augmentation

Existing research suggests that task adaptive pretraining (TAPT) with data augmentation can enhance classification accuracy on a wide array of natural language processing (NLP) tasks. This project aims to evaluate whether TAPT improves performance on a robust question answering (QA) system. The baseline model, which finetunes DistilBERT on SQuAD, NewsQA, and Natural Questions datasets, achieves an EM score of 33.25 and F1 score of 48.43 when validated on the out-of-sample DuoRC, RACE, and RelationExtraction datasets. Applying TAPT to the out-of-domain unlabeled training datasets using masked language modeling (MLM) without data augmentation, we do not observe an increase in either metric of performance. However, not using TAPT, our model performance is enhanced when we use backtranslations to augment only a small portion of the training data for finetuning, achieving an EM of 36.91 and F1 score of 50.16 on the out of domain validation set. This model also achieves an EM of 41.628 and F1 of 58.91 on the out of domain test set. These results thus suggest that data augmentation alone, even to a highly limited extent, may account for the improvements in model performance.

Task-Adaptive Pretraining, Domain Sampling, and Data Augmentation Improve Generalized Question Answering

To create a deep-learning question answering (QA) system that generalizes to unseen domains, we investigate the use of three techniques: task-adaptive pretraining (TAPT), domain sampling, and data augmentation. We train a single DistilBERT model in three phases (shown in the flowchart). First, during TAPT, we pretrain with masked-language modeling (MLM) on our QA datasets. Second, we fine-tune on our QA data. We employ domain sampling during both pretraining and fine-tuning, which preferentially samples data that lead to better downstream performance. Finally, for our data augmentations, we use synonym replacement and random deletion to increase the size and variety of our out-domain data, before additionally fine-tuning on these augmented data. During evaluation, we found significant EM/F1 performance improvements by fine-tuning on augmented out-domain data. We found modest, yet non-trivial, performance improvements with TAPT and domain sampling. Using these three techniques, our model achieved EM/F1 scores of 37.44/51.37 on the development set and 40.12/58.05 on the test set.

Robust Question Answering using Domain Adversarial Training

While recent developments in deep learning and natural language understanding have produced models that perform very well on question answering tasks, they often learn superficial correlations specific to their training data and fail to generalize to unseen domains. We aim to create a more robust, generalized model by forcing it to create domain-invariant representations of the input using an adversarial discriminator system that attempts to classify the outputs of the QA model by domain. Our results show improvements over the baseline on average, although the model exhibited worse performance on certain datasets. We hypothesize that this is caused by differences in the kind of reasoning required for those datasets, differences which actually end up being erased by the discriminator.

Question Answering with Self-Attention

Question Answering (QA) is an increasingly important topic in NLP with the proliferation of chatbots and virtual assistants. In this project, a QA system is built by exploring two end-to-end models. First, the baseline BiDAF model was improved by adding a character embedding layer with multiple convolutional layers, an extra embeddings attention layer which captures the "summary" of the embedding vectors, a context-to-context self-attention layer, gated recurrent units (GRU) and Swish activation. Second, the QANet model was re-implemented from scratch, and some hyperparameter tuning was explored to improve its performance. The improved BiDAF model incorporating self-attention (SA-BiDAF++) achieved 65.3 EM / 68.8 F1 on the SQuAD 2.0 test set. This is a clear indication that architectural refinements and optimizations can significantly improve the performance of non-PCE models (models without pre-trained contextual embeddings).

An Unsupervised Pretraining Task for the BiDAF Model

Over the past few years, and particularly since "Attention is All You Need" was published, the NLP community has moved away from LSTM-based architectures because of the benefits seen by attention-only networks with extensive unsupervised pre-training. This project demonstrated that EM, F1 and AvNA scores can be improved on a BiDAF model simply by pretraining on a task similar to that used in the original BERT paper. While the BERT paper used a Masked Language Model (MLM) and Next Sentence Prediction (NSP), this paper utilizes a novel variant of MLM, termed Obscured Replacement Language Model (ORLM), to enable the strict input-output mappings of a BiDAF model to learn from an unsupervised task. Specifically, this paper shows that performance gains over the baseline BiDAF model can be achieved using ORLM, as judged by the EM and F1 scores. Furthermore, pretraining the BiDAF model with this method decreases the amount of training required on the SQuAD 2.0 training dataset to achieve similar performance, while boosting task-specific metrics such as the AvNA score. As the community concretely moves away from LSTM-based architectures, there is room to ask whether the true top-end performance of those architectures was explored, even if they continue to fall short of state-of-the-art.

Building a Robust QA system using an Adversarially Trained Ensemble

Despite monumental progress in natural language understanding, QA systems trained on giant datasets are still vulnerable to domain transfer. Evidence shows that language models pick up on domain-specific features, which hinders them from generalizing to other domains. In this project, we employ adversarial networks to regularize the fine-tuning process, which encourages the generator model to learn more meaningful representations of context and questions. We then construct an ensemble of these models based on each model's performance on specific subgroups of questions.

QANet for SQuAD 2.0

The QANet model was one of the state-of-the-art models for SQuAD 1.1. Does its top-notch performance transfer to the more challenging SQuAD 2.0 dataset containing unanswerable questions? How does the model size affect performance? Is the bi-directional attention layer really necessary in a transformer-style architecture? These are the questions I tried to answer in this project. Compared to the three baselines derived from the BiDAF model, QANet achieved substantially higher F1 and EM scores of 67.54 and 63.99 respectively. However, these scores are significantly lower than those of the current state-of-the-art models, mainly because the model couldn't correctly handle unanswerable questions. Next, experiments with model size showed no performance degradation with smaller-sized QANet variants. In fact, these variants slightly outperformed the base QANet. Lastly, a new model built entirely using QANet's building blocks (without an explicit bi-directional attention layer) outperformed all of the baseline models even without finetuning. Its performance is still below the base QANet model, most likely because the model started overfitting roughly midway through training. I believe adding more regularization and further finetuning would bring its performance close to that of the base QANet model.

Extending QANet with Transformer-XL

This project tackles the machine reading comprehension (RC) problem on the SQuAD 2.0 dataset. It involves inputting a context paragraph and a question into a model and outputting the span of the answer from the context paragraph. This project aims to extend QANet so that it can effectively perform RC on SQuAD 2.0. The segment-level recurrence with state reuse from Transformer-XL is integrated into QANet to improve its ability to handle long context paragraphs (the resulting model is referred to as QANet-XL). In addition, character embeddings and a fusion layer after context-query attention are used to extend BiDAF. Experiments show that QANet-XL underperforms the vanilla QANet and outperforms the extended BiDAF. The segment-level recurrence mechanism from Transformer-XL does not prove to be a suitable improvement for QANet on the SQuAD 2.0 dataset, since segmenting the context paragraph is somewhat harmful. On the dev set, the extended BiDAF achieved EM/F1 = 62.16/65.98, the vanilla QANet achieved EM/F1 = 66.81/70.38, and QANet-XL achieved EM/F1 = 63.12/66.67. A majority-voting ensemble of the previously mentioned models achieved EM/F1 = 66.85/69.97 on the test set.
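A majority-voting ensemble over answer strings can be sketched as below. The abstract does not say how ties are broken or how "no answer" is represented, so this sketch assumes that each model emits one answer string per question (the empty string meaning "no answer"), that the list is non-empty, and that ties go to the model listed first; `majority_vote` is an illustrative name.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the answer predicted most often; ties go to the earliest model.

    `predictions` holds one answer string per model for a single question.
    """
    counts = Counter(predictions)
    best = max(counts.values())
    for p in predictions:
        if counts[p] == best:
            return p
```

With only three models, listing the strongest model first means its answer is used whenever all three disagree, which is one plausible reason such an ensemble lands close to its best member.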

An Analysis on the Effect of Domain Representations in Question Answering Models

Studies of robust reading comprehension models have included both learning domain-specific representations and learning domain-invariant representations. This project analyzes the effectiveness of each of these approaches using Mixture-of-Experts (MoE) and adversarial models. In the domain-specific approach, MoEs form a single expert model for each input domain (Guo et al., 2018, Takahashi et al., 2019). In contrast, domain-invariant models learn a generalized hidden representation that cannot distinguish the domain of the input (Ma et al., 2019, Lee et al., 2019). Additionally, models are assessed to determine whether they understand natural language or merely learn simple linguistic bias heuristics.

Improving QA Robustness through Modified Adversarial Training

We improved the domain generalizability of a DistilBERT Question Answering (QA) model by implementing adversarial training. By putting a conventional QA model in competition with a discriminator, we were able to generate domain-invariant features that improved the QA model's robustness. We augmented this strategy by retraining our model on all of our available datasets to gain the best performance. Our model performed better than the baseline on unseen out-of-domain datasets.

Reformed QANet - Optimizing the Spatial Complexity of QANet

The feed-forward QANet architecture replaced the bidirectional LSTMs of traditional question answering models by using encoder components with convolution + self-attention to increase the speed of the model without sacrificing accuracy. We achieved scores of 64.5 EM/67.9 F1 on the dev set and 61.64 EM/65.30 F1 on the test set. While the parallel nature of QANet's CNN architecture allows for a significant speed boost, it means that minimizing GPU memory usage is crucial to attain these benefits. In this report we perform an exhaustive study investigating changes to spatial complexity, speed, and performance on the QANet architecture by replacing components in the encoder block with memory-efficient alternatives such as LSH Self Attention, reversible residual networks, and reformer blocks. The image above depicts the QANet encoder block where the self-attention and feed-forward layer are replaced with a reformer, a stack of reversible LSH Self Attention and feed-forward layers. We found that implementing LSH attention successfully decreased memory usage on long sequences while maintaining reasonable performance. While the other modifications did not quite maintain the original QANet model's EM and F1 scores, they significantly decreased GPU memory usage. Additionally, we used data augmentation to enrich training data through back translation and found slight improvements on our larger model.

DistilBERT Augmented with Mixture of Local and Global Experts

Few-shot systems are valuable because they enable precise predictions using small amounts of expensive training data, making them particularly cost-efficient. In this paper, we explore a technique to improve the few-shot question answering capabilities of a pre-trained language model. We adjust a pre-trained DistilBERT model such that it leverages datasets with large amounts of training data to achieve higher question-answering performance on datasets with very small amounts of available training data using a novel inner- and outer-layer Mixture of Experts (MoE) approach. Practically, we first connect pre-trained DistilBERT models and an MoE layer in sequence (inner-layer) and train them on all high-availability data and on a single dataset with low data availability. Then we use several of these DistilBERT-MoE models in parallel to predict observations from multiple datasets with low data availability (outer-layer). We find that the noise reduction achieved by training designated DistilBERT-MoE models for different datasets with low data availability yields greater prediction benefits than the (possibly) increased transfer learning effects achieved by training a single DistilBERT-MoE model on all high- and low-availability datasets together. Both our inner-outer-MoE method and a single DistilBERT-MoE model outperform the baseline provided by a finetuned DistilBERT model, suggesting that the mixture of experts approach is a fruitful avenue for enabling robust predictions in contexts with few training examples.

Domain-agnostic DistilBERT for robust QA

In this project, we worked on improving the robustness of DistilBERT to out-of-distribution data in a question answering task by employing multi-phase continued pre-training and data augmentation. The in-domain datasets included SQuAD, NewsQA, and Natural Questions, while the out-of-domain datasets included DuoRC, RACE, and RelationExtraction. For multi-phase pre-training, we first analyzed the domain similarity between the in-domain and out-of-domain datasets and found NewsQA to be the most similar dataset to the downstream task of question answering based on examples from the DuoRC, RACE, and RelationExtraction datasets. We then trained the model on the in-domain datasets and called this the second-phase continued pre-training. After using NewsQA for third-phase continued pre-training, we used data augmented with synonym and antonym replacement to perform the fourth-phase pre-training. The best model achieved performance, as evaluated by EM/F1 score, of 35.60/51.23 on validation datasets and 40.39/59.42 on test datasets, in comparison to the baseline of 29.06/46.14 on validation datasets.

DAM-Net: Robust QA System with Data Augmentation and Multitask Learning

If a machine can comprehend a passage and answer questions based on the context, how can we upgrade a QA system to generalize to unseen domains outside the training data? In this project, we propose DAM-Net, a robust QA model that can achieve strong performance even on test examples drawn from beyond its training distribution. Specifically, we perform data augmentation on our training data, expand training with an auxiliary task (i.e. fill-in-the-blank), and utilize multi-domain training with additional fine-tuning. DAM-Net has shown strong performance on the robust QA benchmark, and sometimes it even outperforms humans in terms of the comprehensiveness and accuracy of the answers!

Building a QA system (IID SQuAD track)

The goal of the project is to build a question answering system that works well on the SQuAD dataset. The system should be able to read a paragraph and correctly answer a question related to the paragraph. This is an interesting task because it measures how well the system can interpret text. Reading comprehension is an important field, and developing systems that can interpret text at a human level could lead us to the next revolution in Artificial Intelligence. The input to the system is a paragraph and a question related to the paragraph, and the output is the answer to the question based on the text in the paragraph. We have developed a system implementing character-level embedding using 1D convolutions on top of the provided baseline code to mimic the BiDAF (Bidirectional Attention Flow) model. Adding the character-level embedding to the baseline starter code substantially improved the EM and F1 scores. After running many experiments, we found the best-performing model to be the one using an Adam optimizer with one char CNN embedding layer with batch normalization, a learning rate of 0.0003 and dropout of 0.13. The scores received on the test leaderboard are as follows: F1 - 66.174 and EM - 63.077.

Examining the Effectiveness of a Mixture of Experts Model with Static Fine-tuned Experts on QA Robustness

While much progress has been made in recent years on modeling and solving natural language understanding problems, these models still struggle to understand certain aspects of human language. One of the most difficult areas for current models is generalization. While humans can easily generalize beyond a training data set, computers often have difficulty developing non-superficial correlations beyond the provided data. In this project, we tackled this concept of computer generalization through the development of a robust question answering (QA) system that is able to generalize answers to questions from out-of-domain (OOD) input. Here, we applied a modified Mixture of Experts (MoE) model, where gating and expert training are handled separately, over the 6 datasets in order to create robustness through specialization of the various expert models. We also applied few-sample fine-tuning to large and small components of the model to try to better account for and generalize to cases where there is little data. Ultimately, from the results of the model, we observed that this modified MoE architecture has several limitations stemming from its expert and training method and was unable to improve significantly on the baseline of the model. In addition, we also observed that the few-sample fine-tuning techniques greatly improved the performance of the small, out-of-domain expert but barely improved, and sometimes harmed, models with a larger dataset. As a whole, this paper illustrates the potential limitations of applying a simple MoE model and few-sample fine-tuning to the complex task of generalization and may suggest that the implementation of more advanced structures and techniques is necessary for strong performance.

BiDAF Question Answering with Character Embedding, Self-Attention, and Weighted Loss

Project summaries unavailable

Investigation of BiDAF and implementation of QANet for Question Answering

In this project, I build two question answering systems that achieve relatively good performance on the SQuAD 2.0 dataset. The baseline model is Bi-Directional Attention Flow (BiDAF), which achieved 59.21 F1, 55.92 EM and 65.85 AvNA on the dev set. First, I add a CNN-based character embedding to it, which achieved 60.192 EM and 63.480 F1 on the dev set. Then I re-implement QANet in PyTorch, essentially as proposed in the original paper. It achieved 59.973 EM and 63.403 F1 on the dev set, which is lower than the first model. Ultimately, I obtained 59.307 EM and 62.761 F1 on the test set.

Gaining More from Less Data in out-of-domain Question Answering Models

We propose text augmentation techniques for the question answering task in NLP that use synonyms with stochasticity on out-of-domain datasets (DuoRC, RACE, and RelationExtraction) that are set to be 400 times smaller than the in-domain datasets (SQuAD, NewsQA, NaturalQuestions). We illustrate the QSR, SIBA, SIAA, CCS and CD augmentation strategies above; these help a large pre-trained model (DistilBERT, a BERT variant) extract more generalized information from out-of-domain or less available datasets, which benefits QA applications across domains. We find that augmenting less available QA datasets in the way described improves generalization, but not all augmentation strategies are equally good: a combination of 3x QSR, 3x SIBA, 3x SIAA, and 3x CCS performed the best (as illustrated above), with CD excluded because it negatively impacted scores. We also define a metric, EM+ (exact match plus), a binary measure that equals 1 if the prediction is a superset of the answer and 0 otherwise; it provides a less overfit perspective on performance than EM. From the analysis in the paper, we conjecture that increasing the number of unique words in the OOD data that are not present in the ID data helps improve performance.
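The EM+ definition can be made concrete with a short sketch. The abstract does not say whether "superset" is meant at the token level or as a substring relation; the version below assumes token-set containment after lowercasing and whitespace splitting, and the function names are illustrative.

```python
def normalize(text):
    """Lowercase and split on whitespace (a simplified normalization)."""
    return text.lower().split()

def exact_match(prediction, answer):
    """Standard EM: 1 if the normalized token sequences are identical."""
    return int(normalize(prediction) == normalize(answer))

def em_plus(prediction, answer):
    """Assumed reading of EM+: 1 if every answer token appears in the prediction."""
    return int(set(normalize(prediction)) >= set(normalize(answer)))
```

For example, the prediction "in Paris France" against the gold answer "Paris" scores 0 on EM but 1 on EM+. Under this reading an empty gold answer is matched by any prediction, so unanswerable questions would need separate handling.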

BiDAF with Dependency Parse Tree for Question Answering in SQUAD 2

One of the key areas of interest in Natural Language Processing is building systems capable of answering questions in our native language. The task is called Question Answering (QA) and is the focus of this paper, where we explore our idea to enhance an existing solution called BiDAF (Seo et al., 2016). Our intuition is that language understanding involves at least two broad capabilities. First, one has to understand what words individually mean. Second, based on the structure of the sentences, one has to make sense of the complete sentence. Individual words are usually represented by word embeddings in most solutions, but the second piece is where different approaches diverge greatly. To address this part, we were interested to see if syntactic information can help. Specifically, we explored the idea of using dependency parse trees (DPT) to enrich the embedding of individual words. A DPT provides a representation of the syntactic relationships between words in a sentence. We defined the relationship between words as the path between them in the dependency tree. We hypothesized that even though grammatical structure doesn't enable a system to do many things, such as reasoning, the best a model could do with a limited dataset is to learn the patterns between the syntax of questions and that of the answer phrases. This inspired us to augment the input word embeddings to the model with dependency parse tree based information. Our model not only scored significantly higher (+7% on F1 & EM) compared to the baseline, it also learned almost twice as fast even with the extra preprocessing time. DPTs are produced by a deep learning model, so end to end there is no manual feature engineering. We find this idea particularly interesting as it could potentially be added to other QA models with minimal adaptation.

Exploring First Order Gradient Approximation Meta Learning for Robust QA Systems

Reptile is a meta learning approach that searches for initial model parameters that allow a model to be fine-tuned with a small dataset. However, when fine-tuning a language model on a small set of tasks with a low learning rate, Reptile may still over-fit on training batches. RandEPTILE adds additional noise to initial model parameters to efficiently search for areas of lower validation loss in the parameter domain. This project explored the effects of RandEPTILE with a DistilBERT pre-trained model for question answering using small fine-tuning datasets. While the improvement on final test accuracy was inconclusive, adding additional noise to model parameters could be worth exploring in future meta learning techniques.
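The Reptile update, with an optional noise term, can be sketched on a toy scalar problem. The Reptile part follows the standard first-order rule (adapt on a task with a few SGD steps, then move the initial parameters a fraction of the way toward the adapted ones). How RandEPTILE injects its noise is not specified in the abstract, so the Gaussian perturbation of the starting point below is an assumption, as are all names, the quadratic toy loss, and the hyperparameters.

```python
import random

def inner_sgd(theta, target, lr, steps):
    """A few SGD steps on the toy per-task loss (theta - target)^2."""
    for _ in range(steps):
        grad = 2.0 * (theta - target)
        theta -= lr * grad
    return theta

def reptile_step(theta, targets, inner_lr, inner_steps, meta_lr, noise_std, rng):
    """One meta-update over a batch of tasks; noise_std = 0 gives plain Reptile."""
    deltas = []
    for target in targets:
        # Assumed noise injection: perturb the starting point before adapting.
        start = theta + rng.gauss(0.0, noise_std)
        adapted = inner_sgd(start, target, inner_lr, inner_steps)
        deltas.append(adapted - theta)
    # Move the initialization toward the average adapted parameters.
    return theta + meta_lr * sum(deltas) / len(deltas)
```

Repeated calls pull `theta` toward a point from which every task can be reached in few inner steps; the noise makes each inner loop start from a slightly different place, which is the exploration effect the abstract describes.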

Self-Attention in Question Answering

For the default final project, our task was to build a model that performs question answering over the Stanford Question Answering Dataset (SQuAD). Our goal was to improve on the baseline BiDAF model's F1 and EM scores on the task. To do so, we made two additions to the model: character embeddings and a self-attention layer, both of which were used in R-Net. We found that while these additions improved the F1 and EM scores, they also required significantly more memory and training time.

Faster Attention for Question Answering

In this project (a default final project on the IID track), I built a question-answering system for SQuAD 2.0 by exploring both the BiDAF model, through modifications of the default baseline, and a from-scratch implementation of QANet, a self-attention-based question-answering architecture. The BiDAF modifications, which added character embeddings, achieved a small but significant improvement over the baseline model on the test set. However, the QANet models only nearly matched the scores of the baseline BiDAF with character embeddings. Curiously, not only did my QANet under-perform the baseline in model performance, it also turned out to be significantly slower to train and at inference time on GPUs. Through profiling, I found that the QANet model is indeed faster on CPUs but significantly slower than the baseline BiDAF model on GPUs, because the BiDAF model's slowest component, the RNN, is implemented as a highly optimized cuDNN routine on GPUs that the custom QANet encoder block did not benefit from. Finally, this profiling also shows that faster attention mechanisms, as explored in the literature, are unlikely to improve performance on this particular SQuAD 2.0 workload, as additional instruction overhead would likely wash out any performance gains absent better operation compilation for GPUs or a custom GPU kernel.

Efficiency of Dynamic Coattention with Character Level Embeddings

Question answering has long been a difficult task for computers to perform well at, as it requires a deep understanding of language and nuance. However, recent developments in neural networks have yielded significant strides in how well computers are able to answer abstract questions; concepts like dynamic coattention and character-level embeddings have helped machines with abstract tasks like reading comprehension. Despite these strides, training models that use these techniques remains cumbersome and exceedingly time consuming. We explored a handful of different approaches to improving the SQuAD evaluation score within the context of coattention models. Early on, we noticed that character-level embeddings increase evaluation metrics by a few points and decided to explore coattention models with character-level embeddings. Our coattention models without a dynamic decoder performed significantly worse than the baseline. We noted that removing the modeling layer cut the training time in half while achieving similar performance. We hypothesized that the coattention model did not perform as well because the character-level embeddings introduced unnecessary and irrelevant similarities between the question and context embeddings. Furthermore, we noted that there was some variance across training runs, especially in the F1 score. Potential avenues for future work include removing character-level embeddings, reintroducing a dynamic decoder, and comparing a coattention model with and without a modeling layer to see if there are still improvements in training time. It would also be interesting to further explore the QANet model to understand how its authors intended to improve on training time.

Question Answering on SQuAD2.0

We chose the default project to build a Question Answering system on the SQuAD 2.0 dataset. Our initial approach focused on implementing the default baseline model, which is based on a variant of Bidirectional Attention Flow (BiDAF) with attention. We explored performance after adding character-level embeddings to the baseline, along with various attention mechanisms. We also explored the impact of tuning the hyper-parameters used to train the model. Finally, we studied the effect of using multiple variants of RNN as building blocks in the neural architecture. We improved the model performance on both dev and test sets by at least 4 points. The baseline F1 and EM scores without character embeddings were 60.65 and 57.13, while our best scores, obtained with BiDAF, character embeddings, and self-attention with LSTM, were 65.80 and 62.99 respectively. The scores would likely have been better with pre-trained models; however, these were prohibited for our track. Even though we improved the performance somewhat, question answering remains a challenging problem with a lot of scope for improvement. We also need to make sure that the current model generalizes beyond the SQuAD dataset. This course was our first foray into the field of NLP; we have developed a deeper understanding of the advances and challenges in Natural Language Understanding and Processing and hope to keep improving it with time.

Improving Out-of-Domain Question Answering with Mixture of Experts

Question answering (QA) is an important problem with numerous applications in real life. Sometimes the resources available for a QA task are limited. Our work aims to build a robust QA system that can generalize to novel QA tasks with few examples and gradient steps. We propose a Mixture-of-Experts (MoE) style training framework, where we learn a gating network to construct the embeddings by performing a weighted sum of the base "expert" models with fixed parameters. We find that using the mixture of expert models improves generalization performance and reduces overfitting, especially when using "expert" models trained with data augmentation. We use meta-learning methods, specifically the MAML algorithm, to train the gating network for domain adaptation. Training the gating network with the MAML algorithm and finetuning on out-of-domain tasks improved the out-of-domain QA performance of baseline models on all metrics. We also discovered a correlation between expert-model performance and the weight the MoE framework puts on each expert. Our approach achieves an F1 score of 60.8 and an EM score of 42.2 on the out-of-domain QA testing leaderboard.
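
The weighted sum over fixed experts described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the softmax gate, the function names, and the use of plain Python lists for embeddings are all assumptions, and the gate scores would in practice come from the learned gating network.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_experts(expert_embeddings, gate_scores):
    """Combine per-expert embedding vectors with gate weights.

    expert_embeddings: list of equal-length vectors, one per frozen expert.
    gate_scores: one unnormalized score per expert (hypothetical gate output).
    Returns (weights, mixed_embedding).
    """
    weights = softmax(gate_scores)
    dim = len(expert_embeddings[0])
    mixed = [0.0] * dim
    for w, emb in zip(weights, expert_embeddings):
        for i in range(dim):
            mixed[i] += w * emb[i]
    return weights, mixed
```

Because only the gate scores change during adaptation in this sketch, the experts themselves stay fixed, which matches the "fixed parameters" setup the summary describes.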

Tackling SQuAD 2.0 Using Character Embeddings, Coattention and QANet

Question Answering (QA) systems allow users to retrieve information using natural language queries. In this project, we are training and testing QA models on SQuAD 2.0, a large dataset containing human-labelled question-answer pairings, with the goal of evaluating in-domain performance. Using a Bidirectional Attention Flow (BiDAF) model with word embeddings as a baseline, we identified, implemented and evaluated techniques to improve accuracy on the SQuAD task. Our initial experiments, which added character embeddings and a coattention layer to the baseline model, yielded mixed results. Therefore, we started over with a new model using Transformer-style encoder layers, based on the QANet. This model posed many challenges, particularly in adapting to the unanswerable component of the SQuAD 2.0 dataset, and thus did not come close to achieving the performance of BiDAF-based models.

Improving Out-of-Domain Question Answering with Auxiliary Loss and Sequential Layer Unfreezing

The proliferation of pretrained Language Models such as BERT and T5 has been a key development in Natural Language Processing (NLP) over the past several years. In this work, we adapt a DistilBERT model, pretrained on masked language modeling (MLM), for the task of question answering (QA). We train the DistilBERT model on a set of in-domain data and finetune it on a smaller set of out-of-domain (OOD) data, with the goal of developing a model that generalizes well to new datasets. We significantly alter the baseline model by adding an auxiliary language modeling loss, adding an additional DistilBERT layer, and training with sequential layer unfreezing. We find that adding an additional layer with sequential layer unfreezing offered the most improvement, producing a final model that achieves a 5% improvement over a naive baseline.

Exploring Combinations of Character Embeddings and Coattention

In this project, I attempt to build a model for the Stanford Question Answering Dataset (SQuAD) v. 2.0 [1]. I consider 3 different models: the baseline model, or Bi-directional Attention Flow (BiDAF) without character-level embedding [2]; BiDAF with character-level embedding; and a Dynamic Co-attention Network [3] with character-level embedding. One conclusion drawn from my experiments was that implementing character-level embedding in the BiDAF model significantly improved EM and F1 scores over the baseline. However, even though the Dynamic Co-attention Network with character-level embedding was an improvement over the baseline, it scored lower on both F1 and EM than BiDAF with character-level embedding. On the development set, the BiDAF with character embedding has an F1 score of 63.030 and an EM score of 59.839. The Dynamic Co-attention Network with character embedding has an F1 score of 61.54 and an EM of 57.81. My best result on the SQuAD testing set was the BiDAF with character embeddings, achieving an F1 score of 62.266 and an EM score of 58.952.

Question Answering on SQuAD 2.0 using QANet with Performer FastAttention

Transformers are excellent but scale quadratically with sequence length, resulting in bottlenecks with long sequences. Performers introduce a provably accurate and practical approximation of regular attention, with linear space and time complexity. In this project, we implement the QANet model for the SQuAD 2.0 challenge, then replace self-attention layers in the encoders with Performer Fast Attentions to improve training speed by 18%.

Question Answering by QANet and Transformer-XL

Project summaries unavailable

Extending BiDAF and QANet NLP on SQuAD 2.0

By exploiting self-matching attention in BiDAF and multihead attention in QANet, our project demonstrates that attention helps to cope with long-term interactions in the neural architecture for a question answering system. Our addition of self-matching attention in BiDAF matches the question-aware passage representation against itself. It dynamically collects evidence from the whole passage and encodes the evidence relevant to the current passage word. In QANet, convolution and self-attention are building blocks of encoders that separately encode the query and the context. Our implementation of multihead attention in QANet runs through the attention mechanism several times in parallel. The independent attention outputs are then concatenated and linearly transformed into the expected dimension. Multiple attention heads allow for attending to parts of the sequence differently, so longer-term dependencies are also taken into account, not just shorter-term dependencies. We saw some interesting trends while doing qualitative error analysis of our output. The model was able to answer "who" questions better than "what" questions. When a "what" question was framed differently, as in “Economy, Energy and Tourism is one of the what?”, the model could not predict the answer even though the passage contains it. We also observed wrong predictions in general for questions involving relationships, such as "Who was Kaidu's grandfather?" The passage did not state "Kaidu's grandfather was ..." explicitly; it only had the clue "Ogedei's grandson Kaidu ...", and the model could not infer the correct answer from the passage and instead provided a wrong one. We also noticed that the model could not predict an answer at all for many "which" questions. Further analysis revealed that those "which" questions require a bit more contextual understanding. It was a good learning experience, and the model predictions provided a lot of clues as to how we can improve the model further.

Building a Robust QA System that Knows When it Doesn't Know

Machine Learning models have a hard time knowing when they shouldn't be confident about their output. A robust QnA module should not only do a good job on out-of-context data, but also do a good job of knowing what data it can't handle. The goal of our project is to build a robust QnA model with an architecture that relies on a base of DistilBERT, improve on it through model fine-tuning and better optimization, and then augment the predictions of the model with a confidence score. Our approach for this project was forked in two directions: 1. Focus on fine-tuning the model through approaches like transfer learning, training for more epochs, mixout and re-initializing layers. 2. Augment the model by providing a confidence score to enhance the model's reliability in real-world usage. BERT models use the base weights from pre-training and are then fine-tuned on specific datasets. They are pre-trained on a variety of tasks, which makes it easier to generalize, but they need to be further fine-tuned for a specific task. Also, the fine-tuning process is susceptible to the distribution of data in the smaller datasets. We aim to improve on this by training for more epochs, freezing all but the last layers of the BERT model, re-initializing the pre-trained model weights, using a regularization technique called mixout, using bias correction, and finally adding additional layers to the model. The learnings from the experiments were: 1. Bias correction doesn't have any significant impact on the performance. 2. Freezing the initial layers of DistilBERT doesn't impact the performance, but it does speed up training. 3. Re-initializing the lower layers has a positive impact on the performance of the model. 4. Applying regularization in the form of mixout increases the overall accuracy of the model.

Improving Domain Generalization for Question Answering

Domain generalization remains a major challenge for NLP systems. Our goal in this project is to build a question answering system that can adapt to new domains with very few training data from the target domain. We conduct experiments on three different techniques: 1) data augmentation, 2) task-adaptive pretraining (TAPT), and 3) multi-task finetuning to tackle the problem of producing a QA system that is robust to out-of-domain samples. We found that simply augmenting the in-domain (ID) and out-of-domain (OOD) training samples available to us, specifically using insertions, substitutions, swaps and back-translations, boosted our model performance with just the baseline model architecture significantly. Further pretraining using the masked LM objective on the few OOD training samples also proved to be helpful for improving generalization. We also explored various model architectures in the realm of multi-task learning and found that jointly optimizing the QA loss with MLM loss allowed the model to generalize on the OOD samples significantly, confirming existing literature surrounding multi-task learning. Hoping that these gains from data augmentation, adaptive pretraining, and multi-task learning would be additive, we tried combining the techniques but found that the sum of the techniques performed only slightly better and sometimes worse than the smaller underlying systems alone. Our best model implements data augmentation on both ID and OOD train datasets with the DistilBERT base model and achieved EM/F1 scores of 35.34/51.58 on the OOD dev set and 42.32/60.17 on the held-out test set. We infer that we've comfortably met our goal of beating the baseline model's performance as the baseline model achieved 32.98/48.14 on the OOD dev set.

Improving Out-of-Domain Question Answering Performance with Adversarial Training

In this project, we aim to investigate the effectiveness of adversarial training on improving out-of-domain performance of question answering tasks. We show that finetuning a pretrained transformer with adversarial examples generated with Fast Gradient Method (FGM) using in-domain training data consistently improves the out-of-domain performance of the model. We also analyze the performance difference in terms of computation cost, memory cost and accuracy between a variety of hyperparameter configurations for adversarial training.
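
As a rough illustration of the Fast Gradient Method mentioned above, the perturbation is commonly computed by rescaling the loss gradient to a fixed L2 norm. The sketch below assumes the gradient is taken with respect to an embedding vector and is given as a plain list of floats; the function name and the `epsilon` default are hypothetical and not taken from the project.

```python
import math


def fgm_perturbation(grad, epsilon=1.0):
    """Return epsilon * grad / ||grad||_2, or zeros if the gradient vanishes.

    grad: gradient of the loss w.r.t. an embedding vector (assumed input).
    epsilon: perturbation size, one of the hyperparameters one would tune.
    """
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return [0.0] * len(grad)
    return [epsilon * g / norm for g in grad]


def perturb_embedding(embedding, grad, epsilon=1.0):
    """Adversarial embedding: the original vector plus the FGM perturbation."""
    delta = fgm_perturbation(grad, epsilon)
    return [e + d for e, d in zip(embedding, delta)]
```

In a training loop one would typically compute the loss on the clean input, backpropagate to obtain `grad`, and then add the loss on the perturbed embedding before the optimizer step; that loop is omitted here.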

Building Robust QA System with Few Sample Tuning

In this study we aim to modify three sub-optimal general practices of DistilBERT fine-tuning specifically for the question-answering task, in order to improve both the prediction stability and the performance of a model trained on few-sample out-of-domain datasets. We have implemented bias correction for the optimizer, re-initialization of the last transformer block, and an increase in the number of training iterations. With smaller sample datasets in the repeated experiment, the major finding is that the F1 score of the model is improved by re-initialization but not by the other two changes. We also show that the stability of the finetuned model's performance is improved by these changes, even though the improvements are not all statistically significant. In addition, we carry out an additional augmentation step of synonym substitution on the training datasets and show that both F1 and EM (Exact Match) scores are improved in the repeated experiments, with or without last-layer re-initialization. Finally, we build a robust ensemble model based on six models that include data augmentation with and without last-layer re-initialization. Our model achieved 43.096/62.205 (EM/F1) on the out-of-domain test datasets.

Comparing Mixture of Experts and Domain Adversarial Training with Data Augmentation in Out-of-Domain Question Answering

Generalization is a major challenge across machine learning; Question Answering in Natural Language Processing is no different. Models often fail on data domains in which they were not trained. In this project, we compare two promising, though opposite, solutions to this problem: ensembling specialized models (a Mixture of Experts approach) and penalizing specialization (Domain Adversarial Training). We also study the supplementary effects of data augmentation. Our work suggests that Domain Adversarial Training is a more effective method at generalization in our setup. We submit our results to the class leaderboard where we place 20th in EM.

Rediscovering R-NET: An Improvement and In-Depth Analysis on SQUAD 2.0

Question-answering is a discipline within the fields of information retrieval (IR) and natural language processing (NLP) that is concerned with building systems that automatically answer questions posed by humans. In this project, we address the question-answering task by attempting to improve the R-NET model. Specifically, our goals are to 1. reproduce R-NET and evaluate its performance on SQuAD 2.0 compared to that on the original SQuAD dataset and 2. change certain features of the R-NET model to further improve its accuracy on SQuAD 2.0. We present an implementation of R-NET using LSTMs instead of GRUs, larger embedding and hidden dimensions, higher dropout, and more layers, which achieves an improvement in performance over our baseline R-NET model.

Reimplementing Dynamic Chunk Reader

Some SQuAD models calculate the probability of a candidate answer by assuming that the probability distributions for the answer's start and end indices are independent. Since the two do depend on each other, it should be possible to improve performance by relaxing this assumption and instead calculating the probability of each candidate answer span's start and end indices jointly. We do so by reimplementing the Dynamic Chunk Reader (DCR) architecture proposed in Yu et al. (2016), which dynamically chunks and ranks the passage into candidate answer spans using a novel Chunk Representation Layer and Chunk Ranker Layer. We implemented this model on the SQuAD 2.0 dataset instead of Yu et al.'s SQuAD 1 implementation. Our model performed more poorly than the baseline, which may indicate that the DCR architecture does not apply well to the SQuAD 2.0 task, or that we may have misinterpreted certain implementation details from the original paper.
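
The contrast between independent and joint span scoring can be sketched as follows. This is an illustrative toy, not the DCR architecture itself: the function names are hypothetical, and the per-span scores that a chunk ranker would produce are simply passed in as a dictionary.

```python
import math


def candidate_spans(n_tokens, max_len):
    """Enumerate (start, end) pairs, end inclusive, up to max_len tokens long."""
    return [(s, e)
            for s in range(n_tokens)
            for e in range(s, min(s + max_len, n_tokens))]


def best_span_independent(p_start, p_end, max_len):
    """Independence assumption: score a span by p_start[s] * p_end[e]."""
    spans = candidate_spans(len(p_start), max_len)
    return max(spans, key=lambda se: p_start[se[0]] * p_end[se[1]])


def rank_spans_jointly(span_scores):
    """Joint view: one score per candidate span, normalized over all spans.

    span_scores: dict mapping (start, end) to an unnormalized score
    (assumed to come from some chunk-ranking model).
    """
    m = max(span_scores.values())
    exps = {span: math.exp(score - m) for span, score in span_scores.items()}
    z = sum(exps.values())
    return {span: v / z for span, v in exps.items()}
```

The joint formulation lets a model prefer a span whose start and end are individually unremarkable but coherent together, at the cost of scoring a number of candidates that grows with the passage length times `max_len`.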

ALP-Net: Robust few-shot Question-Answering with Adversarial Training, Meta Learning, Data Augmentation and Answer Length Penalty

While deep learning has been very successful in question answering tasks, models trained on one specific dataset can easily perform badly on other datasets. To overcome this, we propose ALP-Net, a robust question answering system that can adapt to new tasks with few-shot learning using answer length penalty, data augmentation, adversarial training and meta learning. 1. First, we propose a new answer length penalty that penalizes the model if the predicted answer is too long, as the baseline QA model tends to generate very long answers. This simple optimization proved to be very effective in shortening the answers and improving Exact Match. 2. We also applied data augmentation to generate new data for low-resource datasets through synonym replacement and word addition. With data augmentation, the model is less likely to learn brittle features such as the occurrences of certain words and fixed answer positions, leading to improved F1. 3. ALP-Net also adopts adversarial training. We applied a discriminator to determine whether the features learned by the model are domain specific. With adversarial learning, models can learn domain-agnostic features that can be applied to unseen domains. We found that while adversarial training is effective in the few-shot learning task, it should not be used on out-of-domain training data, so that the model keeps its domain knowledge. 4. We also tried meta learning to adopt the mean of different sets of model parameters learned from data of different domains. However, it did not perform well, and we found that it is hard to learn general knowledge across domains for question answering tasks. Among these approaches, data augmentation and answer length penalty contribute the most to our model performance, allowing us to achieve 60.962 F1 and 43.005 EM on the out-of-domain test data.
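
The summary does not give the exact form of the answer length penalty, so the following is only one plausible shape for such a penalty, applied here when scoring candidate spans. The function name, the `max_free_len` threshold, and the linear penalty with weight `lam` are all assumptions made for illustration.

```python
def penalized_span_score(start_logp, end_logp, start, end,
                         max_free_len=5, lam=0.1):
    """Span score with a linear penalty on tokens beyond max_free_len.

    start_logp, end_logp: log-probabilities of the start and end positions.
    start, end: token indices of the span, end inclusive.
    max_free_len, lam: hypothetical hyperparameters controlling the penalty.
    """
    length = end - start + 1
    excess = max(0, length - max_free_len)
    return start_logp + end_logp - lam * excess
```

For example, with the defaults a 5-token span is scored by its log-probabilities alone, while each additional token subtracts `lam` from the score, nudging the model toward shorter answers.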

Improving Question Answering on SQuAD 2.0: Exploring the QANet Architecture

In this project, we investigated QANet - an end-to-end, non-recurrent model that is based on the use of convolutions and self-attention. Our first goal was to reimplement the QANet model from scratch and compare its performance to that of our baseline BiDAF - a model that relies on recurrent neural networks with attention. Both QA systems were tested on SQuAD 2.0, which includes both questions that are answerable given a context and questions that are not answerable given the context. Finally, after evaluation of our "vanilla" QANet and investigation of related work, we implemented an extended model called EQuANT. The model adds an additional output to explicitly predict the answerability of a question given the context. Our best model (QANet with tuned hyper-parameters) achieves F1 = 57.56 and EM = 54.66 on the development set, and F1 = 56.76 and EM = 53.34 on the test set.

Bidirectional Attention Flow with Self-Attention

I extended the BiDAF model with various optimization techniques on the SQuAD 2.0 dataset. With character embedding and multi-head self-attention added to the model, my results show an improvement of +4 points on EM and +4 points on F1 compared with the default project. The performance is as expected, but there is also room for improvement. One notable finding is that I could also generate a mask for each word during training to force the attention computation to focus not on the current word but on the other words of the given input. Right after completing the project report, I noticed that other work reported that a pure self-attention is not that helpful without the bias and rank collapse. It seems a pure self-attention layer can be converted into a shallow network.

QANet without Backtranslation on SQUAD 2.0

This paper investigates two different approaches to the question answering problem on the SQuAD 2.0 dataset. We explore a baseline model based on the BiDAF architecture, and improve its performance through the implementation of character embeddings and hyperparameter tuning. Further, we implement variations on the convolution and self-attention based QANet architecture. While the original QANet architecture uses backtranslation to do data augmentation, we explore a simple and effective augmentation method that does not depend on machine translation systems. This involves concatenating contexts together and reusing the same query/answer to generate a new answerable query, and dropping an answer span from the context of an answerable query to create an unanswerable query. The effectiveness of this approach demonstrates the importance of data augmentation for the QANet model. Finally, we form an ensemble model based on our different experiments which achieves an F1 score of 70.340 and an EM score of 67.354 on the test set.
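
The two augmentation steps described above can be sketched on a single SQuAD-style example. The dictionary field names (`context`, `question`, `answer_text`, `answer_start`) and the use of `-1` to mark an unanswerable example are assumptions for this illustration, not the project's actual data format.

```python
def concat_contexts(example, other_context, prepend=True):
    """New answerable example: same question/answer, longer context.

    When the extra context is prepended, the character offset of the
    answer must be shifted by its length plus the joining space.
    """
    new = dict(example)
    if prepend:
        new["context"] = other_context + " " + example["context"]
        new["answer_start"] = example["answer_start"] + len(other_context) + 1
    else:
        new["context"] = example["context"] + " " + other_context
    return new


def drop_answer_span(example):
    """New unanswerable example: remove the answer span from the context."""
    start = example["answer_start"]
    end = start + len(example["answer_text"])
    new = dict(example)
    new["context"] = example["context"][:start] + example["context"][end:]
    new["answer_text"] = ""
    new["answer_start"] = -1  # hypothetical marker for "no answer"
    return new
```

One caveat of the second step is that the answer string may occur elsewhere in the context, in which case the question may still be answerable; a fuller version would check for that.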

"Pointed" Question-Answering

Machine reading comprehension through question-answering is one of the most interesting and significant problems in Natural Language Processing because it not only measures how well the machine 'understands' a piece of text but also helps provide useful answers to humans. For this task, given a paragraph and a related question, the machine's model must select the span from the paragraph that corresponds to the answer using a start index prediction and end index prediction. My baseline model for this task is a Bidirectional Attention Flow (BiDAF) end-to-end neural network, with embedding, encoder, attention, modeling and output layers. Significantly, the output layer involves the probability distribution of the start index token and end index token to be generated independently. However, in order for the model to learn how the end of an answer can depend on the start of an answer, I implement a boundary model of an Answer Pointer layer (introduced by Wang et al, 2017) based on the notion of a Pointer Network (Vinyals et al, 2015) as a replacement for the output layer of the baseline. This enables us to condition the prediction for the end token on the prediction for the start token of the answer in the input text. Further, since a Pointer Network outputs a probability distribution exclusively over locations in the input paragraph (context) at each step instead of outputting a probability distribution over the entire vocabulary, it allows us to improve the model's efficiency in addition to its accuracy. On testing this new model, I obtain an F1 score of 59.60 and an EM score of 55.01 on the development set, which is an improvement on the performance of the baseline - involving both F1 and EM scores of 52.19 on the development set.

SQuAD: To QANet and Beyond

Project summaries unavailable

Experimenting with BiDAF Embeddings and Coattention

We are motivated by the task of question answering, which is a natural application of language models and helps evaluate how well systems understand the meaning within text. Our primary goal is to improve upon the baseline BiDAF model provided to us on the SQuAD 2.0 dataset, namely by experimenting with character-level embeddings, conditional end pointer predictions (Answer-Pointer network), self-attention, and coattention. We think that each of them leads in some way to an intuitive representation of language, linking it to larger aims within the field. Surprisingly, the coattention and self-attention modified models each score comparably to or below the baseline model. Perhaps this hints at the importance of multiple layers for self-attention and word-to-word token interactions, as we only used one layer and a vectorized form of the self-attention from the original R-Net paper. Our character-level embeddings + Answer-Pointer modified BiDAF performs best, scoring EM: 60.23 and F1: 63.56 on the dev set and EM: 58.715 and F1: 62.283 on the test set (compared to the baseline model with EM: 56.61 and F1: 60.24 on the dev set). The improvement might be attributed to a better understanding of out-of-vocabulary words and patterns in the grammatical structure of subsequence phrases. Compared to the baseline, the final model better predicts "No Answer"s and outputs semantically more logical context subsequences. However, the model still struggles with "why" questions and questions that contain different keywords than the context but have synonymous meaning (e.g. "extremely short" in the context, "not long enough" in the question). Based on this error analysis, in the future we would love to explore Euclidean distance between words and better beam search approaches to improve performance, as well as further analyze the failure cases of our self-attention / coattention implementations.

Adversarial Training Methods for Cross-Domain Question Answering

Even though many deep learning models surpass human-level performance on tasks like question answering when evaluated on in-domain test sets, they might perform relatively poorly on out-of-domain datasets. To address this problem, domain adaptation techniques aim to adapt models trained for a task on in-domain datasets to a target domain by making efficient use of samples from the latter. In contrast, domain generalization techniques aim to encourage the model to learn domain-invariant features directly from in-domain data so that it generalizes to any out-of-domain dataset, pushing it to learn task-relevant features and preventing overfitting on in-domain data. We like to compare this approach to the way humans learn a task, as they can generally perform the same task on different domains from only a few examples. However, domain generalization is often performed by augmenting in-domain data with semantic-preserving transformations that challenge the model during training, leveraging some kind of rules or domain knowledge. In contrast, our goal in this project is to explore domain generalization techniques for question answering based on adversarial training, without leveraging any set of rules or domain knowledge, but using adversarial terms to make the regular loss more robust, with or without adopting task-agnostic critic networks. Such an extremely general methodology does not suffer from the limitations of synonym replacement approaches and can be applied to other NLP tasks. Our best variant combines two different and complementary approaches to adversarial training on a DistilBERT baseline, achieving a >3% F1-score improvement over the regular fine-tuning process and outperforming several other adversarial and energy-based approaches.

Context Demonstrations and Backtranslation Augmentation Techniques For a More Robust QA System

Because many real-world NLP tasks rely on user data that is not necessarily guaranteed to be in-distribution, it is critical to build robust question answering systems that can generalize to out-of-domain data. We aim to build a question answering system using context demonstrations and dataset augmentation via backtranslation on top of DistilBERT that is robust to domain shifts. Our method replicates one of the two approaches described in Gao et al. (2020), sampling and appending out-of-domain demonstrations to each training example when finetuning the model. Our method also augments the out-of-domain dataset from which demonstrations are sampled using backtranslation to generate in-distribution training examples. We find that the basic approach of simply appending randomly sampled out-of-domain demonstrations to in-domain contexts does not improve model F1 and EM score performance, but supplementing this approach by adding separator tokens between each demonstration and augmenting the out-of-domain training dataset using backtranslation improves model performance.

Building a QA System (IID SQuAD track)

In this project, we explored different techniques in the encoding layer, the attention layer and the output layer of an end-to-end neural network architecture for question answering. Experimental results show that better performance can be achieved with different enhancements on top of the baseline model. In particular, with extra character embedding and deep residual coattention, we can achieve an EM of 61.17 and an F1 of 64.97, in comparison to an EM of 58.32 and an F1 of 61.78 for the baseline BiDAF model. To better understand the behavior of the best-performing model, we broke down the F1 score distribution for the development set and examined the performance across different context lengths, answer lengths, and question types. Furthermore, by inspecting some of the error examples, we found that the model performs poorly mainly when the question involves reasoning or advanced/complicated sentence structures.

CS224N Default Final Project Report: Building a QA System Using BiDAF and Subword Modeling Techniques

In our project, we attempted to answer the question: How can we best adapt a baseline Bi-Directional Attention Flow (BiDAF) network to answer questions in the SQuAD dataset? Our baseline model achieved 57.54 EM and 60.90 F1 on the dev set. Based on this, we experimented with concatenating character embeddings with word embeddings and with other forms of subword modeling, such as manually constructing a subword vocabulary of size 10,000 by using the Byte-Pair Encoding algorithm and splitting words into subwords. We found that using our subword embedding layer actually decreased performance, likely due to confusion generated when encountering out-of-vocabulary words. Our final system and best-performing model is the BiDAF network with the character embedding layer, where character and word embeddings are concatenated in equal parts (50/50). Our best results achieved 60.595 EM and 63.587 F1 on the dev set and 59.222 EM and 62.662 F1 on the test set.
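
A compact sketch of how Byte-Pair Encoding learns merges and splits words into subwords is given below. It follows the standard greedy pair-merging procedure rather than the authors' code; the function names, the `</w>` end-of-word marker, and the toy word counts in the usage note are assumptions for illustration.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with one symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_bpe(word_freqs, num_merges):
    """Learn up to num_merges merges from a {word: count} dictionary."""
    vocab = {tuple(w) + ("</w>",): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged = tuple(merge_pair(list(symbols), best))
            new_vocab[merged] = new_vocab.get(merged, 0) + freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Split a word into subwords by replaying the learned merges in order."""
    symbols = list(word) + ["</w>"]
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols
```

For instance, learning a single merge from `{"abab": 2, "cd": 1}` picks the pair `("a", "b")`, after which `segment("abc", merges)` yields the symbols `ab`, `c`, and the end-of-word marker. A word made of characters never seen during training falls back to single characters, which is one way out-of-vocabulary words can end up poorly represented.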

SQuAD 2.0: Improving Performance with Optimization and Feature Engineering

In this project, we significantly improved baseline performance on the SQuAD 2.0 question answering task through optimization and feature engineering. Instead of overhauling the original BiDAF network architecture, we focused on extracting as much information as possible from the input data, taking inspiration from the DrQA document reader. We first constructed character-level word embeddings via a 1D Convolutional Neural Network, and then added token and exact match features for both the context and question words. We also conducted thorough hyperparameter searches and experimented with various encoding methods, projection, and drop-out layers. Ensembling our best models by majority vote achieved validation set F1 and EM scores over 7 points higher than the baseline with comparable test set performance (F1=68.753, EM=65.714). Our findings suggest that feature engineering is a particularly effective approach to improve model performance in the absence of pretraining.
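
Majority-vote ensembling over answer strings, as mentioned above, can be sketched in a few lines. The input format (one dictionary of question id to predicted answer per model) and the tie-breaking behavior are assumptions of this sketch rather than details from the project.

```python
from collections import Counter


def majority_vote(predictions):
    """Pick the most common answer per question across models.

    predictions: list of dicts {question_id: answer_text}, one per model,
    all covering the same question ids. On a tie, the answer that appears
    first in model order wins.
    """
    ensembled = {}
    for qid in predictions[0]:
        votes = Counter(p[qid] for p in predictions)
        ensembled[qid] = votes.most_common(1)[0][0]
    return ensembled
```

Listing the strongest single model first makes its answer the fallback whenever the models disagree without a clear majority.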

Building a Robust QA System Via Diverse Backtranslation

While question answering (QA) systems have been an active topic of research in recent years, these models typically perform poorly on out-of-domain datasets. Thus, the goal for our project was to build a question answering system that is robust to distributional shift. Utilizing a pretrained DistilBERT model as our baseline, we tested two adaptation methods: backtranslation and few-sample fine-tuning. Backtranslation, which involves translating input data into an intermediate language before translating back to the original language, is a common data augmentation technique in many NLP tasks. We found that implementing standard backtranslation on out-of-domain training examples yielded significant increases in Exact Match (EM) and F1 scores over our baseline model. We compared these results to several modified backtranslation schemes including one in which we combined backtranslation with techniques from few-sample fine-tuning. Ultimately, we found that combining few-sample fine-tuning techniques with backtranslation did not improve performance. Our best model achieved an EM of 42.225 and F1 of 59.162 on the test set, and an EM of 38.74 and F1 of 51.19 on the development set.

Question Answering with Binary Objective

We added a secondary binary objective of predicting answerability to QANet. As shown in the picture, this objective is computed using the three outputs from the modeling layer in QANet. More specifically, we concatenate the 0th words of m0, m1, and m2 (the outputs of the first, second, and third passes of the modeling encoder) and pass the result through a single feed-forward layer with sigmoid activation. Our results showed that adding this secondary objective resulted in meaningful improvements in both EM and F1 over our implementation of QANet, which mostly follows the official QANet except that we added a projection layer on the output of the context-query attention layer to reduce memory usage. We were also able to reproduce the performance gains from adding character-level encoding, replacing the RNN with multi-head self-attention and convolutions, and applying layer-wise dropout (stochastic depth).
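
As a rough illustration of the answerability head described above, the following sketch concatenates the position-0 vectors of m0, m1, and m2 and applies one linear layer with a sigmoid. It is written in plain Python for readability. The names `answerability_prob` and `binary_cross_entropy`, the label convention (1 = answerable), and the use of binary cross-entropy as the loss are assumptions for illustration only.

```python
import math


def answerability_prob(m0, m1, m2, weights, bias):
    """m0, m1, m2: lists of per-position vectors (seq_len x hidden).

    Concatenates the vectors at position 0 and applies a single
    feed-forward (linear) layer followed by a sigmoid.
    """
    features = m0[0] + m1[0] + m2[0]  # list concatenation, length 3 * hidden
    logit = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def binary_cross_entropy(prob, label, eps=1e-12):
    """Loss for the binary objective (assumed; label 1 = answerable)."""
    return -(label * math.log(prob + eps) + (1 - label) * math.log(1 - prob + eps))


# Toy inputs: sequence length 2, hidden size 2.
m0 = [[0.5, -0.5], [0.1, 0.2]]
m1 = [[1.0, 0.0], [0.3, 0.4]]
m2 = [[0.0, 2.0], [0.5, 0.6]]
prob = answerability_prob(m0, m1, m2, weights=[0.0] * 6, bias=0.0)
loss = binary_cross_entropy(prob, label=1)
```

How this loss is weighted against the span loss is not stated in the abstract, so that part is left out of the sketch.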

Extended BiDAF with Character-Level Embedding

With the rise of NLP and ML, we have seen much progress on the task of machine comprehension and on building robust question answering systems. We focus on investigating and improving the BiDAF model, starting by extending the baseline model with character-level word embeddings. We then ran experiments using the improvements recommended in section 5.11 of the default project handout. Two major goals were accomplished: we implemented character-level embeddings, and we adjusted the dropout rate and learning rate, in addition to other hyperparameters, in order to improve our model. With our best model, we achieved an F1 score of 65.106 and an EM score of 61.369 in the non-PCE division.

SQuAD - Refined Implementation of Contextually Enriching Passage Sequences (SQUAD-RICEPS)

Our default project took on the task of SQuAD 2.0 Question Answering using inspiration from an approach described in Christopher Clark's 2017 paper, "Simple and Effective Multi-Paragraph Reading Comprehension". We combine the embedding, encoding, and bi-attention of BiDAF with an additional two layers of self attention. Our findings see an improvement when using a TriLinear attention layer on top of a Multiheaded Scaled Dot Product Self Attention layer. While we had promising results with character embeddings on the dev set, we were unable to refine our implementation of character embeddings to improve our model. We were able to produce an EM score of 59.5 and an F1 score of 62.7 which improved on the BiDAF baseline's score of 56.3 and 59.4.

Building a Robust QA system with Data Augmentation

Pre-trained neural models, such as our baseline model fine-tuned on a BERT-based pre-trained transformer to perform natural language question answering, usually show high levels of accuracy with in-context data, but often display a lack of robustness with out-of-context data. We hypothesize that this issue is not primarily caused by the pre-trained model's limitations, but rather by the lack of diverse training data that might convey important contextual information in the fine-tuning stage. We explore several methods to augment standard training data with syntactically informative data, generated by randomly replacing the grammatical tense of data, removing words associated with gender, race, or economic means, and replacing words only in question sentences with synonyms from a lexicon. We found that the augmentation method that performed the best was changing the grammar of more than one word in every question. Although it yielded an increase of less than 1 point in the F1 and EM scores, we believe that if we also applied this method to the context and answer training data we would see more significant improvements. We were also surprised that the method of removing associations with gender, race, or economic status performed relatively well, given that we removed a lot of words from the dataset.

Implementations of R-NET and Character-level Embeddings on SQUAD

While there have been many new and exciting developments in solving the SQuAD challenge over recent years, I decided to focus on the fundamentals in my final project approach. What better way to practice and reinforce classical deep learning concepts such as recurrent neural networks, convolutional networks, and self-attention than implementing R-NET with added character-level word embeddings? My experiments showed that character-level embeddings enrich the understanding of word components and improve key evaluation metrics. My implementation of R-NET also exhibits an additional lift in model performance on SQuAD 2.0. However, the limitations of R-NET are also highlighted, as it struggles to identify unanswerable questions, especially when similar phrases exist in both question and passage.

Stanford CS224N SQuAD IID Default Project

Being able to answer questions about a given passage marks a significant advancement in artificial intelligence. This task also has incredible practical utility, given the great need to have a personal assistant on our phones that can answer simple questions about world facts. In this project, we attempt to build a state-of-the-art model for question answering on the SQuAD 2.0 dataset via combining several different deep learning techniques. We iterated off of the baseline BiDAF model with various improvements such as feature engineering, character embeddings, co-attention, transformer models, and more. We had mixed success in getting all of these methodologies to fully run as anticipated and found many to not work as well as we had hoped. But we still managed to make significant improvements over the baseline by combining some of what we had implemented and performing a hyperparameter search. Our final model was quite successful on this front, achieving an F1 score of 63.517 and an EM score of 59.966 over the baseline's 58 F1 score and 55 EM score.

Building a QA system (Robust QA track)

While there have been great strides made in solving fundamental NLP tasks, it is clear that the models which tackle these problems fail to generalize to data coming from outside the training distribution. This is problematic since real-world applications require models to adapt to inputs coming from previously unseen distributions. In this paper, we discuss our attempt to create a robust system for extractive question answering (QA). We use a BERT variant as our baseline, and attempt four methods to improve upon it. Our first method is a model that uses the Mixture-Of-Experts (MoE) technique described in the "Adaptive Mixtures of Local Experts" paper and the Robust QA Default Project handout. The second is an original inference-time procedure which predicts the answer span that maximizes the expected F1 score. The third approach is to produce more out-of-domain training examples via data augmentation. Our final and best-performing method is an Adversarial Training model described in "Domain-agnostic Question-Answering with Adversarial Training". The MoE model and expected-F1-maximization strategy fail to outperform the baseline's F1 score of 47.098, achieving F1 scores of 44.870 and 44.706 on the validation set, respectively. Training the baseline with augmented data produces an F1 score of 48.04. Domain Adversarial Training gives the best results when coupled with data augmentation, yielding an F1 score of 51.17 on the validation set. However, we see that on the test set, none of our models were able to beat the baseline's F1 score of 60.240.
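
One way to read "predict the span that maximizes expected F1" is sketched below. This is not the authors' procedure, whose details the abstract does not give. The sketch assumes that the model's start and end probabilities define an independent distribution over gold spans, that F1 is measured on token positions, and that no-answer handling is ignored. The names `span_f1` and `max_expected_f1_span` are hypothetical.

```python
def span_f1(pred, gold):
    """Token-position F1 between two inclusive (start, end) spans."""
    overlap = min(pred[1], gold[1]) - max(pred[0], gold[0]) + 1
    if overlap <= 0:
        return 0.0
    pred_len = pred[1] - pred[0] + 1
    gold_len = gold[1] - gold[0] + 1
    # F1 = 2PR / (P + R) simplifies to 2 * overlap / (pred_len + gold_len).
    return 2.0 * overlap / (pred_len + gold_len)


def max_expected_f1_span(p_start, p_end, max_len=15):
    """Return the span with the highest expected F1 and that expectation.

    The distribution over gold spans is assumed to be
    P(i, j) proportional to p_start[i] * p_end[j] for i <= j < i + max_len.
    """
    n = len(p_start)
    spans = [(i, j) for i in range(n) for j in range(i, min(n, i + max_len))]
    weights = {s: p_start[s[0]] * p_end[s[1]] for s in spans}
    total = sum(weights.values())
    best_span, best_score = None, -1.0
    for cand in spans:
        score = sum(w * span_f1(cand, gold) for gold, w in weights.items()) / total
        if score > best_score:
            best_span, best_score = cand, score
    return best_span, best_score


best_span, expected_f1 = max_expected_f1_span([0.1, 0.6, 0.3], [0.1, 0.3, 0.6])
```

The loop is quadratic in the number of candidate spans, so a practical version would restrict candidates to the top-scoring starts and ends.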

Building a Robust QA system

Researchers today prioritize their time by building increasingly complex models that are harder to interpret and debug. The goal of this project is for us to discover how noninvasive techniques can be equally as effective. We explore how accuracy improves with hyperparameter tuning, various different methods of learning rate decay, and layer freezing. We also analyze the effects of data-side augmentations such as backtranslation, synonyms, masked learning, and upsampling. The last area of exploration is an altered loss function that biases against length. Our main conclusions support that fine tuning and data augmentation methods were the most critical in improving performance on question answering systems under domain shifts. We see that data augmentation (back translation and synonym translation) however can sometimes be too noisy depending on how many sequences of languages we filter through, suggesting that future work looks into understanding an optimal number of languages. We have inconclusive results on the quality of MLM and upsampling our dataset as we see marginal improvement at best from these methods, potentially suggesting that they are not worthwhile pursuing for such few sample finetuning. Lastly, we see that for future work further investigation into our added loss function could be potentially useful in regularizing response length.

RobustQA: Benchmarking Techniques for Domain-Agnostic Question Answering System

Despite all the hype about the performance of large pretrained transformers like BERT and RoBERTa, it has been shown that Question Answering (QA) tasks still suffer when there is a large discrepancy between the training and testing corpora. The goal of our project is thus to build a question answering system that is robust to out-of-distribution datasets. We approach this challenge through data augmentation, where we hope to add label-preserving invariances to the fine-tuning procedure, reducing the learned features specific to the in-domain data while increasing the amount of out-of-domain data so that our QA model can generalize more broadly. Specifically, we paraphrased both the in-domain and out-of-distribution training sets by back-translating each query and context pair through multiple languages (Spanish, Russian, and German) using architectures that include a two-layer neural machine translation (NMT) system and pretrained language transformers. After back-translation, we iterate over all contiguous subsets of words in the context sentence to find an approximate answer span that is the most similar to the original gold answer, and we filter out examples with Generalized Jaccard similarity scores below 0.65 to ensure data quality. By fine-tuning the DistilBERT baseline on these augmented datasets, our best model achieved 51.28 F1 and 35.86 EM on the development set and 59.86 F1 and 41.42 EM on the test set.
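
The answer-span recovery step can be sketched as follows. For simplicity this sketch uses plain set-based Jaccard similarity on tokens rather than the Generalized Jaccard score used in the project, and the function names, the `max_len` cap, and the example sentences are illustrative assumptions.

```python
def jaccard(a_tokens, b_tokens):
    """Plain set Jaccard similarity (a stand-in for Generalized Jaccard)."""
    a, b = set(a_tokens), set(b_tokens)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def recover_answer_span(context_tokens, gold_answer_tokens, threshold=0.65, max_len=30):
    """Scan every contiguous token window and keep the most similar one.

    Returns ((start, end), score) with `end` exclusive, or (None, score)
    when the best score falls below the threshold and the example should
    be filtered out.
    """
    best_span, best_score = None, 0.0
    n = len(context_tokens)
    for start in range(n):
        for end in range(start + 1, min(n, start + max_len) + 1):
            score = jaccard(context_tokens[start:end], gold_answer_tokens)
            if score > best_score:
                best_span, best_score = (start, end), score
    if best_score < threshold:
        return None, best_score
    return best_span, best_score


# A back-translated context in which the original answer wording has drifted.
context = "the tower was finished in the year 1889".split()
gold = "built in the year 1889".split()
span, score = recover_answer_span(context, gold)
```

Here the window `context[4:8]` ("in the year 1889") is selected with a similarity of 0.8, which clears the 0.65 threshold.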

Dataset Augmentation and Mixture-Of-Experts Working In Concert For Few-Shot Domain Adaptation Transfer Learning

Despite the significant improvements in NLP in the last few years, models can still fail to work well on test sets which differ, even a small amount, from their training sets. Few shot learning is an important goal in creating generalizable neural network models. In this paper we explore ways to increase the few shot learning performance of a model by implementing a few variations meant to improve generalizability; specifically we measure the effects of data augmentation and mixture of experts on a pre-trained transformer BERT model. Mixture of experts is a technique in which separate models are trained to be responsible for different sub tasks within a problem. We find that this change is able to remove the interference between out-of-domain datasets during training and increase performance from F1 48.43 to 51.54. Data augmentation applied for NLP is a technique in which words within a piece of text are added, removed, or replaced in an effort to increase the variance in training data. This method was found to be a valuable tool in further improving expert learning, increasing the overall F1 score further to 52.07, however it did not improve the baseline model when used on its own.
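
The word-level augmentation described above (removing and replacing words) can be sketched as below. The synonym table is a hypothetical stand-in for whatever lexicon a real pipeline would use, and the function names and parameters are illustrative.

```python
import random


def random_deletion(tokens, p, rng):
    """Drop each token independently with probability p (keep at least one)."""
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]


def synonym_replacement(tokens, synonyms, n, rng):
    """Replace up to n tokens that have an entry in the synonym table."""
    out = list(tokens)
    candidates = [i for i, t in enumerate(out) if t in synonyms]
    rng.shuffle(candidates)
    for i in candidates[:n]:
        out[i] = rng.choice(synonyms[out[i]])
    return out


rng = random.Random(0)  # fixed seed so the augmentation is reproducible
question = "what is the quick fox called".split()
toy_synonyms = {"quick": ["fast"]}  # hypothetical one-entry lexicon
replaced = synonym_replacement(question, toy_synonyms, n=1, rng=rng)
shortened = random_deletion(question, p=0.2, rng=rng)
```

For extractive QA, applying such edits to the context requires re-locating the answer span afterwards, which is why augmenting only the question is the simpler option.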

Longer-term dependency learning using Transformers-XL on SQuAD 2.0

I propose an application of the Transformer-XL attention model to the SQuAD 2.0 dataset, by first implementing an architecture similar to that of QANet, replacing the RNNs of the BiDAF model with encoders, and then swapping the self-attention layer for that of Transformer-XL. In traditional transformers, there exists an upper limit on dependency length equal to the length of the context. The Transformer-XL addresses this issue by caching the representations of previous segments to be reused as additional context for future segments, thus increasing the context size and allowing information to flow from one segment to the next. This longer-term dependency capture can be particularly useful when applying transformers to domains outside of natural language. Only a small improvement is shown with the Transformer-XL / QANet combined model compared to the baseline BiDAF, but increased performance is expected with additional parameter fine-tuning.

BiDAF with Character and Subword Embeddings for SQuAD

In this paper, we have implemented subword embeddings and character-level embeddings on top of the word embeddings in the starter code. For the character embeddings, we followed the approach outlined in the BiDAF paper [1]. The characters' representation vectors were randomly initialized and then passed through a convolutional neural network. We then applied the ReLU function and downsampled the result with the maxpool function to get the representation vector for every word. For the subword embeddings, we utilized an implementation of the Byte Pair Encoding algorithm [2]. It segments a word by grouping the character sequences that occur most frequently in its training data. We then looked up the representation vector for each subword, which is trained using the GloVe algorithm [3] (the segmentation and vector representation are both implemented in the Python library bpemb). We utilized the maxpool function to get the representation vector of each word, and then used a linear transformation to convert the input features to match the hidden layers. Finally, we concatenated the three types of embeddings and passed them through the Highway Networks. Among the different types of models we experimented with, the model with the concatenation of word embeddings and character-level embeddings performs the best on the SQuAD v2.0 dev set: EM=61.39, F1=65.05.

References
[1] Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. Bidirectional attention flow for machine comprehension. arXiv preprint arXiv:1611.01603, 2016.
[2] Benjamin Heinzerling and Michael Strube. BPEmb: Tokenization-free pre-trained subword embeddings in 275 languages. arXiv preprint arXiv:1710.02187, 2017.
[3] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532-1543, 2014.
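
The character-embedding pipeline in this abstract (convolution over character vectors, ReLU, then max-pooling into one vector per word) can be sketched in plain Python as follows. The function name, the toy filter values, and the assumption that a word is at least as long as the kernel (otherwise it would need padding) are illustrative and not taken from the project.

```python
def char_cnn_word_vector(char_vectors, filters, biases):
    """Build one word vector from its character vectors.

    char_vectors: list of L character embeddings, each of size d.
    filters: list of F filters, each a k x d list of weights.
    Returns a list of F values: ReLU(conv) max-pooled over positions.
    Assumes L >= k for every filter (pad short words beforehand).
    """
    word_vector = []
    for filt, bias in zip(filters, biases):
        k = len(filt)
        activations = []
        for start in range(len(char_vectors) - k + 1):
            window = char_vectors[start:start + k]
            total = bias + sum(
                w * x
                for filt_row, char_row in zip(filt, window)
                for w, x in zip(filt_row, char_row)
            )
            activations.append(max(0.0, total))  # ReLU
        word_vector.append(max(activations))  # max over character positions
    return word_vector


# Three characters with 2-dimensional embeddings, two filters of width 2.
chars = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
filters = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[-1.0, -1.0], [-1.0, -1.0]],
]
word_vec = char_cnn_word_vector(chars, filters, biases=[0.0, 0.0])
```

The same max-pooling idea applies to the subword branch, where the maximum is taken element-wise over the subword vectors of a word.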

Improved QA systems for SQUAD 2.0

We worked on the default project: Building a question-answering system (IID SQuAD track). Motivated by recent publications (such as "Attention is All You Need," "Machine Comprehension Using Match-LSTM and Answer Pointer," and "Convolutional Neural Networks for Sentence Classification"), we decided to extend the baseline BiDAF model with implementations of a character embedding layer, an answer pointer decoder in place of the original output layer, and a self-attention layer immediately after the bidirectional attention flow layer. We experimented with two versions of character embedding layers, and found that back-to-back convolutional layers allowed for better performance. Our implementations dramatically improved learning speed in the training process. Through multiple rounds of training with various hyperparameters, we achieved F1 scores of 64.83 on the dev set and 63.37 on the test set. We anticipate that this work will aid in the continuing development of efficient question answering systems.

Meta Learning on Topics as Tasks for Robust QA Performance

Project summaries unavailable

Building a QA System (IID SQuAD Track)

I implemented three NLP models : (a) a 4-layer 6 attention heads transformer encoder model, (b) QANet model and (c) extending the baseline BiDAF model with character embeddings for the question-answering task on the SQuAD dataset. The transformer encoder model (Fig (a)) is fed the sequence: "" where and are two special tokens indicating the start of the question and start of context respectively. To allow the model to predict no-answer, the context is prepended with a special (out-of-vocabulary) token. The output of the 4-layer transformer encoder is fed to a feedforward layer which is again fed to two different feedforward layers each followed by softmax, to predict the start and end position of answer in the context. The QANet Model (Fig (b)) replaces the LSTM encoder in BiDAF with self-attention and depthwise separable convolution. The model uses an encoder block (on right in Fig (b)) which contains multiple depthwise separable convolution layers followed by self attention and feedforward layer. The embedding layer (with character embeddings) and Context-Query attention are same as in BiDAF. The output of Context-query attention is fed to a stack of three encoder blocks, where the output of first two and first & third are used to predict start and end position of answer respectively through a projection layer followed by softmax. The transformer encoder model achieves EM and F1 score of 52.19 and 52.19 respectively while for the QANet model the scores are 57.28 and 60.59 respectively on the dev set. The QANet model was trained for 28 epochs and I believe that training it for longer (like 40 epochs) is likely to improve its performance. Adding character embedding to the baseline BiDAF model improves the EM and F1 scores from 55 and 58 to 59.6 and 63.14 respectively on dev set.

Building QA Robustness Through Data Augmentation

While question and answering (QA) models have achieved tremendous results on in-domain queries, recent research has brought into question the ability of these Q&A models to generalize well to unseen data in other domains. To address this, we aim to build a robust question answering system, which trained on a set of in-domain data can then be adapted to unseen domains given few training samples. Our main approach is the field of data augmentation. In this work, we conduct a survey of existing data augmentation methods, including backtranslation, synonym replacement, and synonym insertion, as well as introduce a mixed data augmentation method (MDA) combining the previous three. For examples of backtranslation, synonym replacement, and synonym insertion, please see the displayed figure. The figure displays three examples for how one sentence might be augmented using each data method. In particular, we explore the efficacy of data augmentation in the task of question answering. We find that data augmentation provides moderate gains on our out of domain validation and test sets and that certain methods such as backtranslation and synonym replacement provide larger improvements compared to others. Overall, we confirm that data augmentation is a simple, generalizable technique with a wide variety of different methods that can effectively aid in improving the robustness of Q&A models in the face of unseen domains with few training examples.

The Efficient BiIDAF

In recent years, massive pre-trained language models have dominated the state-of-the-art leaderboards across many NLP tasks, including the Question Answering task on SQuAD 2.0. In this project, we travel back to a successful traditional approach known as Bi-Directional Attention Flow (BiDAF), which uses a sequence-to-sequence network. We identify the shortcomings of this model and implement a multi-stage hierarchical end-to-end network that addresses them. More specifically, the original model uses a sequence-to-sequence network such as an RNN to encode the information of the query/context into a vector. Even though RNNs are known to be quite effective, they have a few major bottlenecks, namely non-parallelizability of the network due to its seq-to-seq/time-step based computation, lack of transfer learning support, and vulnerability to vanishing/exploding gradients. We handle these shortcomings of RNNs by replacing them with transformer encoders. Additionally, we implement a few recent techniques to improve the vanilla encoder network, namely Spatial Positional Encoding instead of traditional Absolute Positional Encoding, ScaleNorm instead of traditional LayerNorm, and a Feedforward Network with Gated Linear Unit instead of the traditional Feedforward Network with ReLU. Looking beyond the RNN, we replace the query-to-context and context-to-query Attention Flow with Cross-Attention using a Multi-headed Attention mechanism. We show that multi-headed Cross-Attention works better than the traditional Attention Flow layer. Finally, we introduce pre-trained character embedding vectors that were extrapolated from the existing GloVe pre-trained word embeddings. We also show that this improves the baseline BiDAF model by a considerable amount. Lastly, we show the results of our final model on the validation set and compare its performance with the baseline BiDAF model.
We observe that our model performs better than the original BiDAF in terms of both latency and accuracy. Our model is also highly extensible, since encoders and multi-head attention do not suffer from traditional seq-to-seq bottlenecks and are amenable to transfer learning.
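
Two of the encoder changes named above, ScaleNorm and the Gated Linear Unit, have compact definitions that can be sketched in plain Python. ScaleNorm rescales a vector to a learned length g, and a GLU multiplies one linear projection element-wise by the sigmoid of another. The function names, the epsilon value, and the toy weights are illustrative assumptions; the project's actual layer sizes and initialization are not given in the abstract.

```python
import math


def scale_norm(x, g, eps=1e-5):
    """ScaleNorm: g * x / ||x||, with a single learned scalar g."""
    norm = math.sqrt(sum(v * v for v in x))
    return [g * v / max(norm, eps) for v in x]


def glu(x, W, b, V, c):
    """Gated Linear Unit: (xW + b) * sigmoid(xV + c), element-wise."""
    def affine(M, bias):
        return [
            sum(xi * M[i][j] for i, xi in enumerate(x)) + bias[j]
            for j in range(len(bias))
        ]

    values = affine(W, b)
    gates = [1.0 / (1.0 + math.exp(-z)) for z in affine(V, c)]
    return [v * s for v, s in zip(values, gates)]


x = [3.0, 4.0]
normed = scale_norm(x, g=2.0)  # vector rescaled to length 2
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
# With a zero gate projection every gate is sigmoid(0) = 0.5.
gated = glu(x, identity, [0.0, 0.0], zeros, [0.0, 0.0])
```

Compared with LayerNorm, ScaleNorm has one learned parameter per layer instead of two per feature, which is part of its appeal for a latency-focused model.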

Embedding and Attending: Two Hearts that Beat as One

Neural attention mechanisms have proven to be effective at leveraging relevant tokens of the input data to more accurately predict output words. Moreover, incorporating additional embedding information significantly boosts performance and provides greater granularity of tokens at the character and word level. For these reasons, we focused on implementing various models that primarily concern the embedding and attention layers. In our project, we implemented three different attention mechanisms (co-attention from Dynamic Coattention Networks, key-query-value self-attention, and R-Net self-attention) for Question Answering (QA). Our goal was to produce a model that is highly performant compared to the baseline BiDAF model on the Stanford Question Answering Dataset (SQuAD 2.0). We combined these attention mechanisms with character-level embeddings to provide more local contextual information, and finally enhanced these embeddings by including additional input features (part-of-speech and lemmatized forms of words). Lastly, we conducted a series of hyperparameter tuning experiments to determine the hyperparameters that result in the greatest F1/EM scores. Augmenting the baseline with these techniques produced a significant improvement. Our most performant model obtained an F1 score of 65.27 and an EM score of 61.77 (an increase of 5.6% and 5.5%, respectively).

Investigating the effectiveness of Transformers and Performers on SQuAD 2.0

In this project, I explored aspects of the Transformer architecture in the context of question answering on SQuAD 2.0, the Stanford Question Answering Dataset. I split this exploration into several phases, which built upon each other. In Phase 1, I gained familiarity with the default baseline (based on BiDAF, a recurrent LSTM-based algorithm) by upgrading it to support character-level embeddings, in addition to the existing word-level embeddings. This resulted in a 2-point performance increase on all scoring metrics. In Phase 2, I incrementally refactored the baseline from BiDAF into QANet, a question answering architecture which is similar in structure but uses convolution and Transformers instead of recurrent neural networks. After hyperparameter tuning, I found this improved performance by an additional 3.5 points on all scoring metrics. In Phase 3, I replaced the Transformer with an architectural variant, the Performer, which aims to solve the issue of quadratic scaling in vanilla Transformers' runtime and memory usage by using kernel methods to approximate the self-attention calculation. I found that this was effective within QANet, enabling linear scaling from hundreds to tens of thousands of tokens, with minimal impact on performance. In Phase 4, I prepared to make use of this scale to support open-domain question answering. I wrote a TF-IDF based document retriever, which returned the most similar Wikipedia page to the current context passage. I found this to be reasonably effective in locating similar passages. Finally, in Phase 5, I fed this new input into QANet via a new, large Background input, which supplemented the existing Context and Question inputs. I upgraded QANet to support this by adding a Context-Background attention layer and a Query-Background attention layer alongside the current Context-Query attention layer. This appears to start training correctly, with training and validation loss both decreasing over time.
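
The Phase 4 retriever can be illustrated with a small TF-IDF sketch using only the standard library. The smoothed IDF formula, the whitespace tokenization, and the three toy "pages" are assumptions for illustration; the abstract does not describe the retriever's exact weighting.

```python
import math
from collections import Counter


def build_vectorizer(docs):
    """Return a function mapping a token list to a sparse TF-IDF dict."""
    n = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed IDF (an assumed variant): log((1 + n) / (1 + df)) + 1.
    idf = {tok: math.log((1 + n) / (1 + df)) + 1.0 for tok, df in doc_freq.items()}

    def vectorize(tokens):
        counts = Counter(tokens)
        return {tok: c * idf[tok] for tok, c in counts.items() if tok in idf}

    return vectorize


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_tokens, docs):
    """Index of the document most similar to the query."""
    vectorize = build_vectorizer(docs)
    query_vec = vectorize(query_tokens)
    scores = [cosine(query_vec, vectorize(doc)) for doc in docs]
    return max(range(len(docs)), key=scores.__getitem__)


pages = [
    "paris is the capital of france".split(),
    "the eiffel tower stands in paris".split(),
    "tokyo is the capital of japan".split(),
]
best_page = retrieve("how tall is the eiffel tower".split(), pages)
```

The rarer terms "eiffel" and "tower" carry more weight than "is" and "the", so the second page is selected.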

DA-Bert: Achieving Question-Answering Robustness via Data Augmentation

Pretrained models are the basis for modern NLP Question-Answering tasks; however, even state-of-the-art models are heavily influenced by the datasets they were trained on and don't generalize well to out-of-domain data. One avenue for improvement is augmenting the training dataset to include new patterns that may help the model generalize outside of its original dataset. In this paper, we explore improving model robustness in the question-answering task, where we have a query, a context (i.e. passage), and an answer span that selects a portion of the context. We utilize various data augmentation techniques including adding noise to our contexts and backtranslating (translating text to a pivot language and then back) both the queries and contexts. We find that leveraging the technique of backtranslation on the queries, both on in-domain and out-of-domain training datasets, greatly improves model robustness and gives a 3.7% increase in F1 scores over our baseline model without data augmentation. Further, within this approach of backtranslation, we explore the linguistic effect of particular pivot languages and find that using Spanish adds the greatest robustness to our model. We theorize that Spanish and potentially other Romance languages' linguistic similarity to English gives clearer and more helpful translations than other high-resource languages with different roots.

Exploring the Architecture of QANet

Before the advent of QANet, dominant question-answering models were based on recurrent neural networks. QANet shows that self-attention and convolutional neural networks can replace recurrent neural networks in question-answering models. We first implemented a version of QANet using the same architecture as that of the original QANet model, and then we conducted experiments on hyperparameters and model architecture. We incorporated attention re-use, gated self-attention, and conditional output into the QANet architecture. Our best QANet model obtained 59.3 EM and 62.82 F1 on the evaluation set. The ensemble of the two best QANet models and one BiDAF model with self-attention mechanism achieved 62.73 EM and 65.77 F1 on the evaluation set and 60.63 EM and 63.69 F1 on the test set.

Question Answering with Co-attention and Transformer

In this project, we implemented several improvements to a question answering system on the SQuAD dataset, including 1) QANet, 2) coattention, and 3) RNet. We built the models from scratch and evaluated them on the EM and F1 scores. Our main goal is to explore various techniques for question answering systems. In this process, we were able to practice implementing complex models according to their descriptions in the literature. We first implemented the co-attention layer, which did not improve the model performance. We then added character-level embeddings to the baseline model, which improved the EM score to 60.59 and the F1 score to 64.17. After that, we implemented QANet, which uses convolutions to capture the local structure of the context and a self-attention mechanism to model the global interactions between text. We built the QANet incrementally and implemented several model components. We eventually saw major improvements in both EM and F1 scores (64.49 and 69.62) compared to the baseline BiDAF model and BiDAF with character-level embeddings. At the same time, we implemented the Self Matching layer and the Pointer Network described in the RNet paper. The self-matching mechanism helps refine the attention representation by matching the passage against itself, which effectively encodes information from the whole passage. This is implemented on top of character-level embeddings and the baseline. We tested several modifications of the RNet architecture, including different gated attention recurrent networks and output layers. While Self Matching improved the performance, the Pointer Network caused vanishing gradients. The self-matching layer combined with character-level embeddings improved the performance to 62.06 (EM) and 65.53 (F1).
Among all techniques, QANet gives the best performance. To our understanding, the reason is that QANet can capture local and global interactions at the same time with its complex model architecture containing both convolutions and an attention mechanism.

Meta-learning with few-shot models Analysis Final Project

This project focuses on understanding the various elements of meta-learning and few-shot models and the effectiveness of different implementation approaches. Using the default RobustQA project as a baseline, we explored different implementations of the meta-learning algorithm LEOPARD and evaluated their impact on prediction accuracy. We also experimented with the eval-every parameter to understand how fast each implementation can learn when initially presented with the out-of-domain questions. We found that the multiple-datasets implementation of the LEOPARD algorithm yields the best few-shot result. On the first evaluation at step 0 (after 1 batch of data for learning), this implementation already achieves an EM score of 34.55 (on the validation set), compared to the ~32 EM scores that the other implementation and the baseline obtain. However, after the models are trained for longer, we found that the baseline can actually achieve a better overall EM score, with 42.202 on the test set. Although the differences in overall test set accuracy are very small across implementations, we found that the simpler implementation yields better accuracy in the long run. Our key finding is that the design of a few-shot learning algorithm or model is a trade-off between few-shot accuracy and the highest achievable overall accuracy.

Extending a BiDAF model with DCN for Question Answering

Our goal in this project is to improve the performance of the Bidirectional Attention Flow (BiDAF) model for the NLP task of question answering on the SQuAD 2.0 dataset. To do this, we 1) integrate character-level embeddings into the baseline BiDAF model and 2) replace the default attention layer with a coattention layer. While adding character-level embeddings was shown to improve the baseline BiDAF model's EM and F1 scores substantially, their addition to the DCN model actually decreased its scores slightly. Moreover, transforming the BiDAF model into a Dynamic Coattention Network (DCN) decreased the model's performance. Thus, the best model architecture we found is BiDAF with character-level embeddings. Future work includes tuning hyperparameters, experimenting with data processing techniques, adding optimizations like the Adam optimizer, and exploring different forms of attention.

QANet for Question Answering on SQuAD2.0

In this project, we study the application of a QANet architecture to question answering on the SQuAD2.0 dataset. Question answering consists of training models to answer questions posed in natural language from either a provided or a general context. The QANet architecture, originally presented in 2018, was a top performer on the original SQuAD dataset before the advent of pre-training. While the original SQuAD dataset only contained answerable questions, the creators of the dataset published the updated SQuAD2.0 dataset, which contains unanswerable questions, and demonstrated that while this had little effect on human performance, it greatly reduced the effectiveness of existing models. We study how the QANet model fares on this dataset compared with a BiDAF baseline model, another high-performing model. We show that QANet's effectiveness drops, but that simple modifications to the original architecture allow significant improvements in overall performance. We also study the benefits of ensembling different architectures to improve final performance. We achieve EM and F1 scores of 63.415 and 66.734 on the test dataset.

Robust QA with Model Agnostic Meta Learning

One model, called BERT (Bidirectional Encoder Representations from Transformers), has achieved the current state of the art on metrics such as GLUE score, MultiNLI accuracy, and F1 score on the SQuAD v1.1 and v2.0 question answering datasets. BERT is pre-trained using unlabeled natural language data via a masked language model (MLM) method; it is then fine-tuned for next-sentence prediction and question answering tasks. Successfully adapting BERT to low-resource natural language domains remains an open problem. Previous approaches have included using multitask and meta-learning fine-tuning procedures. Using a variant of the Model Agnostic Meta Learning (MAML) algorithm, researchers were able to show that meta-learning procedures had a slight advantage over multitask models in low-resource domain adaptation. However, the researchers experimented with only a few task distributions p(T) for the MAML algorithm, and while the results did show an improvement over multitask models, performance for certain task distributions on specific tasks was somewhat counterintuitive. In this paper, suggestions from a recent paper in the International Conference on Learning Representations (ICLR) are implemented to stabilize training of a MAML-type algorithm on a pre-trained variant of BERT called DistilBERT. Several task distributions and other MAML-specific hyperparameter initializations are implemented and analyzed, and a classifier is trained to predict the out-of-domain dataset type to better leverage task-specific fine-tuning. The image included indicates that certain tasks, like predicting for the race and relation extraction datasets, are distinguishable and that a MAML algorithm might not be able to leverage data from one to help the other. However, another task, like predicting on the duorc dataset, which is shown to be fairly indistinguishable from the other two datasets, might be able to help the other two tasks during training.

Exploring Improvements to the SQuAD 2.0 BiDAF Model

We have explored different deep learning based approaches to the question answering problem on SQuAD 2.0 using an improved version of the BiDAF model. Our baseline was provided by the default project starter code, and is a modified BiDAF that uses only word embeddings and runs on SQuAD 2.0. We explored three areas of improvement: character embeddings, conditioning the end prediction on the start prediction, and adding a self-attention layer. We found that the biggest improvement came from combining conditioning of the end prediction on the start prediction with self-attention, which achieved F1 and EM scores of 65.285 and 61.758 on the test set, respectively. The model with character embeddings scored 59.96 EM and 63.24 F1, and the model with character embeddings and self-attention scored 63 EM and 66.2 F1 (both on the dev set). In our error analysis, we discovered that, in general, all models performed well on questions that begin with "When" and performed poorly on questions that begin with "What" and "The". Our future work includes investigating how further extensions, like transformers, co-attention, and different input features, affect performance. Overall, this project was very educational, as it allowed us to read through numerous papers that outlined breakthrough improvements to this problem, and enabled us to implement the methods described in the papers ourselves.
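
Conditioning the end prediction on the start prediction can be illustrated with a small decoding sketch. It assumes the span score factorizes as p(start) * p(end | start); the table `p_end_given_start` and the `max_len` cap are hypothetical inputs chosen for illustration, not details taken from the project.

```python
def best_span(p_start, p_end_given_start, max_len=15):
    """Return (start, end, score) maximizing
    p_start[i] * p_end_given_start[i][j] over spans with i <= j.

    p_end_given_start[i][j] is an assumed conditional probability that
    the answer ends at j given that it starts at i.
    """
    best = (0, 0, -1.0)
    n = len(p_start)
    for i, ps in enumerate(p_start):
        # Only consider ends at or after the start, within max_len tokens.
        for j in range(i, min(i + max_len, n)):
            score = ps * p_end_given_start[i][j]
            if score > best[2]:
                best = (i, j, score)
    return best


p_start = [0.1, 0.7, 0.2]
p_end_given_start = [
    [0.2, 0.3, 0.5],
    [0.0, 0.4, 0.6],
    [0.0, 0.0, 1.0],
]
start, end, score = best_span(p_start, p_end_given_start)
```

In a trained model the conditional table would come from an output layer that sees the start distribution; here it is simply given.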

Domain-Adversarial Training For Robust Question-Answering

In this project, we created a domain-adversarial model to improve upon the baseline DistilBERT model on the task of robustly answering reading comprehension questions across domains. The adversarial model works by adding a discriminator, which is trained to decide, based on the last layer of our question-answering model, which domain the question came from. Our question-answering model then tries not only to answer questions correctly but also to fool the discriminator as much as possible, which forces it to prioritize, in this final hidden layer, features of the question and context that are not domain-specific. Our model got an EM score of 41.353 and F1 score of 59.803 on the test set.
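
The two competing objectives can be sketched as follows. This is one common formulation, assumed here for illustration: the discriminator minimizes a cross-entropy loss on the true domain, while the question-answering model adds a penalty equal to the KL divergence between a uniform distribution and the discriminator's output. The function names and the weight `lam` are hypothetical, and the abstract does not state which adversarial loss was actually used.

```python
import math


def discriminator_loss(domain_probs, true_domain):
    """Cross-entropy of the discriminator on the true domain index."""
    return -math.log(domain_probs[true_domain])


def confusion_loss(domain_probs):
    """KL(uniform || domain_probs); zero when the discriminator's
    output is uniform. Assumes every probability is strictly positive."""
    k = len(domain_probs)
    return sum((1.0 / k) * math.log((1.0 / k) / p) for p in domain_probs)


def qa_model_loss(qa_nll, domain_probs, lam):
    """QA loss plus a weighted penalty for being domain-identifiable."""
    return qa_nll + lam * confusion_loss(domain_probs)
```

With this choice the question-answering model is rewarded when the discriminator cannot tell domains apart, which matches the intuition described above.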

Sesame Street Ensemble: A Mixture of DistilBERT Experts

In this project, I attempt to finetune a pre-trained DistilBERT model to better handle an out of domain QA task. As there are only a few training examples from these outside domains, I had to utilize various techniques to create more robust performance: 1) implemented a mixture of local experts architecture and 2) finetuned a number of hyperparameters to perform best over this few shot learning task. Specifically, a separate DistilBERT model was finetuned on each of the in-domain datasets to act as an expert. The finetuning approaches focused on reinitializing a variable amount of final transformer blocks and training for a longer period. These two approaches were then synthesized to produce the final model. The results were negative. I speculate that this is because the domains covered by the experts were too distinct from that of the out-of-domain datasets. In future work, I would like to use data analysis to group similar training examples (across predefined datasets) to hopefully lead to more focused experts.

QA System with QANet

Question answering has always been an active field of Natural Language Processing (NLP) research. In the past few years, the most successful models have been primarily based on Recurrent Neural Networks (RNNs) with attention. Though a lot of progress has been made, an RNN's operations cannot be parallelized because of its sequential nature, which makes both training and inference slow. In addition, with linear interaction distance, RNNs have difficulty learning long dependencies. This is a severe problem in QA systems, since the contexts are usually long paragraphs. Motivated by these problems, in this project we implemented a QA model based on the Transformer, hoping to achieve both accurate and fast reading comprehension. Among all QA problems, we focused on reading comprehension, which is to select a span of text from the given context to answer a given question. Instead of LSTMs, this model uses convolution layers and self-attention to form encoders. Given a paragraph of context and a question, it outputs the probability of each context word being the start or end of the answer. However, contrary to our expectations, this model did not perform very well. It is slow because of its large number of parameters, and its accuracy cannot match that of BiDAF because of overfitting.

Improving the Robustness of QA Systems through Data Augmentation and Mixture of Experts

Despite the stunning achievements of question answering (QA) systems in recent years, existing neural models tend to fail when they generalize beyond the in-domain distributions. This project seeks to improve the robustness of these QA systems to unseen domains through a combination of Easy Data Augmentation (EDA) and Mixture of Experts (MoE) techniques. As a baseline, we finetuned a pre-trained DistilBERT model with the Natural Questions, NewsQA and SQuAD datasets using the default configurations and evaluated the model performance on the out-of-domain datasets, including RelationExtraction, DuoRC, and RACE. After obtaining our second baseline by including a small number of training examples from our out-of-domain datasets, we ran two rounds of hyperparameter tuning through random search. Based on the best performing set of hyperparameters, we then augmented our out-of-domain datasets using the EDA techniques and analyzed the effects of each technique through a series of experiments. Finally, we implemented an MoE model with three experts and a two-layer bi-directional LSTM followed by a linear layer as the gating function. Both the data augmentation technique and the mixture-of-experts approach demonstrated the capability to improve the robustness of DistilBERT-based QA systems, and a combination of the two methods brings even further improvement. The combined approach increased the F1 and EM scores on the dev set by 15.03% and 14.87%, respectively, compared to the baseline, and achieved an F1 score of 62.062 and an EM score of 42.317 on the test leaderboard.
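
How a gating function combines three experts can be sketched in a few lines. The sketch assumes each expert outputs a probability distribution over context positions and that the gate produces one logit per expert; in the project the logits would come from the BiLSTM and linear layer, whereas here they are plain numbers, and the names `softmax` and `mix_experts` are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_experts(expert_probs, gate_logits):
    """Weighted average of per-expert distributions over positions,
    with weights given by the softmax of the gate logits."""
    weights = softmax(gate_logits)
    n_pos = len(expert_probs[0])
    return [
        sum(w * probs[i] for w, probs in zip(weights, expert_probs))
        for i in range(n_pos)
    ]


experts = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
mixed = mix_experts(experts, gate_logits=[0.0, 0.0, 0.0])
```

Because the gate weights sum to one, the mixture is again a valid distribution over positions.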

Towards a Robust Question Answering System through Domain-adaptive Pretraining and Data Augmentation

Large pretrained language models have shown great success on a wide range of tasks in the past few years. These large language models are trained on enormous corpora, and it is now an open question whether they are robust to domain shift. We find in this paper that the domain of question answering (QA) problems has a significant impact on the performance of these fine-tuned LMs and that these fine-tuned QA models are still sensitive to domain shift at test time. This potentially causes problems in many real-world applications where broad or evolving domains are involved. So, how can we improve model robustness? In this paper, we offer two potential solutions. First, we propose to continue pretraining on the target domains. This second phase of pretraining helps the model focus on information that is relevant to the problem. We find that domain-adaptive pretraining helps improve out-of-domain test performance. In some cases, we might have a small amount of additional training data from the test domain. We propose to use data augmentation tricks to make maximal use of these data for domain adaptation. We find that data augmentation tricks, including synonym replacement, random insertion and random deletion, can further improve the performance on out-of-domain test samples. Our work shows that the improvements in performance from domain-adaptive pretraining and data augmentation are additive. With both methods applied, our model achieves a test performance of 60.731 in F1 score and 42.248 in EM score. The experiments and methods discussed in this paper will contribute to a deeper understanding of LMs and efforts towards building a more robust QA system.
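
The three augmentation tricks named above can be sketched on a token list. The `SYNONYMS` table is a hypothetical toy stand-in for a real thesaurus, and the sketch ignores the answer span, which an extractive QA pipeline would need to protect or re-locate after augmentation.

```python
import random

# Hypothetical toy synonym table; a real system would use a thesaurus.
SYNONYMS = {"quick": ["fast", "rapid"], "big": ["large", "huge"]}


def synonym_replacement(tokens, n, rng):
    """Replace up to n tokens that have an entry in SYNONYMS."""
    out = list(tokens)
    candidates = [i for i, t in enumerate(out) if t in SYNONYMS]
    rng.shuffle(candidates)
    for i in candidates[:n]:
        out[i] = rng.choice(SYNONYMS[out[i]])
    return out


def random_insertion(tokens, n, rng):
    """Insert a synonym of a random eligible token at a random
    position, repeated n times."""
    out = list(tokens)
    for _ in range(n):
        eligible = [t for t in out if t in SYNONYMS]
        if not eligible:
            break
        word = rng.choice(eligible)
        out.insert(rng.randint(0, len(out)), rng.choice(SYNONYMS[word]))
    return out


def random_deletion(tokens, p, rng):
    """Drop each token with probability p, keeping at least one."""
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]


rng = random.Random(0)
tokens = "the quick dog saw a big cat".split()
replaced = synonym_replacement(tokens, 1, rng)
inserted = random_insertion(tokens, 2, rng)
deleted = random_deletion(tokens, 0.5, rng)
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible across runs.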

QA System Using Feature Engineering and Self-Attention (IID SQuAD track)

Machine reading comprehension is an exceedingly important task in NLP and is a desired feature in many of the latest consumer and research projects. Therefore, using this task as motivation, we set out to build a reading comprehension model that performed well on the SQuAD 2.0 question answering dataset. To do this, we built upon the existing BiDAF machine comprehension model given to us by the CS224n staff. Our contributions to this model are a character embedding layer on top of the existing word embedding layer, a self-attention layer, and added features to the character and word embeddings, which include part-of-speech (POS) tags, named entity recognition (NER) tags, and dependency tags. After implementing these layers, we found that character embedding with additional input features performed the best, with an F1 dev score of 64.38 and an EM dev score of 61.29. On the test set we achieved F1 and EM scores of 62.17 and 59.04, respectively.

Coattention, Dynamic Pointing Decoders & QANet for Question Answering

The task of question answering (QA) requires language comprehension and modeling the complex interaction between the context and the query. Recurrent models achieved good results using RNNs to process sequential inputs and attention components to cope with long-term interactions. However, recurrent QA models have two main weaknesses. First, due to the single-pass nature of the decoder step, models have issues recovering from incorrect local maxima. Second, due to the sequential nature of RNNs, these models are often too slow for both training and inference. To address the first problem, we implemented a model based on the Dynamic Coattention Network (DCN) that incorporates a dynamic decoder that iteratively predicts the answer span. To improve the model efficiency, we also implemented a transformer-based recurrence-free model (QANet), which consists of a stack of encoder blocks including self-attention and convolutional layers. On the Stanford Question Answering Dataset (SQuAD 2.0), our best QANet-based model achieves 68.76 F1 and 65.081 Exact Match (EM) on the dev set and 66.00 F1 and 62.67 EM on the test set. A high-level model comparison of DCN and QANet is illustrated in the image.

Default Final Project: RobustQA Track

Our goal is to build a question answering system that can adapt to unseen domains with only a few training samples from the domain. We experimented with several approaches, including a mixture-of-experts approach and various techniques to better fine-tune the pre-trained model. Although we are able to outperform the baseline, we found that model architecture is less important when it comes to improving performance. Relevant training data is by far the most important factor. Various fine-tuning techniques also help to some extent.

Robust Question Answering Through Data Augmentation and TAPT

In this project, we aimed to improve on the given baseline model, which is a DistilBERT pretrained transformer, as much as possible in order to make it more robust to out-of-domain data for the task of QA. In order to do this, we experimented with a variety of extensions to the baseline, among which are Task-Adaptive Pretraining and data augmentation. We found that, of our various attempts, data augmentation improved on the baseline results the most. Our best model performed better than the baseline by 0.287 points for the F1 score and 0.941 points for the EM score on the test set.

Transformer Exploration

In this project we build a question answering model for the SQuAD 2.0 dataset. Beginning with a baseline BiDAF model, we make two extensions to improve the model. In the first extension we add character embeddings to match the model in the original BiDAF paper. Next we swap out the LSTM encoder for the more parallelizable Transformer block. After creating our word and character embeddings we add in positional encodings. Next we apply a single transformer encoder block featuring convolution and self-attention to the embeddings of the context and the query. We then perform bidirectional attention, before applying three more transformer blocks in the modeling layer. Finally we output a prediction of the answer, or no answer if one does not exist.
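
The positional encoding step can be sketched as below, assuming the fixed sinusoidal variant from the original Transformer; the abstract does not say which variant was used, and learned position embeddings are an equally plausible choice. The function name and arguments are illustrative.

```python
import math


def positional_encoding(seq_len, d_model):
    """Sinusoidal encodings: even dimensions use sin, odd use cos,
    with wavelengths growing geometrically with the dimension index."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe


pe = positional_encoding(seq_len=4, d_model=6)
```

Each row would be added element-wise to the embedding of the token at that position before the encoder block.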

Extended QA System on SQuAD 2.0

Our motivation is to build a Question Answering (QA) system that gives answers that are as specific and as accurate as possible, which is in itself an art but is based on the science of Natural Language Processing (NLP). The main goal of our project is to produce a QA system that works well on the SQuAD 2.0 dataset and performs better than the baseline Bidirectional Attention Flow (BiDAF) model. To better capture the context from a more expressive set of answers and understand the interactions between the question and the document, we utilized the coattention mechanism by encoding the two-way attention outputs together through a bidirectional recurrent neural network (RNN). We experimented with enriching the embedding layer by concatenating character embeddings with the existing word-level embeddings, modifying the attention layer with coattention from Dynamic Coattention Networks (DCN), and adding to the output layer an Answer Pointer, which conditions the end of the answer span on the start position. Our best performing single model obtained F1/EM scores of 63.40/59.87, both better than the baseline. Adding character embeddings and the answer pointer gave us a performance boost over the BiDAF baseline model. On the other hand, dynamic coattention from DCN did not beat the attention and modeling layers combined in the baseline BiDAF model but was worth trying. To further improve the performance of our model, we built ensemble models that fine-tune the dropout rates, and the best one achieved F1/EM scores of 64.21/60.81.

Character Embedding and Self Attention Mechanism with SQuAD

In this project, we have demonstrated the effectiveness of character embedding. According to our experimental results, adding a Context2Context self-attention mechanism does not improve the performance of the BiDAF model. The BiDAF model with character embedding performs well with its Context2Query attention and Query2Context attention. Adding self-attention to this model introduces additional interference, because the context words attend not only to the query words but also to the context words themselves, which slightly reduced the model's performance. In future work, we can add additive attention to the BiDAF model to see how it compares to the two attention implementations we use. In addition, there are plenty of modern techniques, including Transformer and Reformer, that can be further explored to find the best performing model on the SQuAD challenge.

Domain Adaptive Adversarial Feature Disentanglement for Neural Question Answering

Learning-based Question Answering systems have achieved significant success with the help of large language models and pre-trained model weights. However, existing approaches assume that data is drawn i.i.d. from the same distribution, which is violated in the more realistic scenario where test-time text and questions come from different distributions. Deep networks have been used to learn transferable representations for domain adaptation, which has shown success in various vision tasks. In this project, we study the problem of domain-adaptive question answering using various techniques, including Data Augmentation, Layer Re-initialization and Domain Adversarial Alignment. Specifically, we propose to use a Wasserstein-stabilized adversarial domain alignment scheme on the DistilBERT backbone with the last layer reinitialized, to train on both the data-rich in-domain QA datasets and data-augmented out-of-domain (OOD) datasets, followed by a finetuning stage on data-augmented OOD datasets. We have conducted extensive experiments to demonstrate the effectiveness of our proposed method in bringing a significant performance boost for the task of domain-adaptive Question Answering. We also conducted carefully designed ablation studies to show the performance gain resulting from each of the proposed components. Our proposed model addresses the problem of domain-adaptive question answering from various perspectives, including data, model architecture, and training scheme. The evaluation results on the provided OOD validation datasets show that our proposed method brings an 8.56% performance improvement compared to the vanilla baseline using DistilBERT without any of these domain-adaptive designs.

Data Augmentation for Robust QA System

In this project, we identify the trade-offs between different data augmentation strategies for a Robust QA System. For in-domain datasets, we need to sample the datasets first to avoid overfitting and then use more advanced data augmentation techniques, such as back-translation and abstract summary augmentation, to generate more diverse datasets in order to help the model learn from unseen data. For out-of-domain datasets, we need to use data augmentation techniques that generate similar datasets, such as spelling augmentation and synonym augmentation. Also, we need to repeat the data augmentation multiple times in order to increase the proportion of out-of-domain data. The number of iterations needs to be chosen carefully because it may also slightly affect the final performance of the Robust QA System.

Improving the Performance of Previous QA Models

Question answering is a challenging problem that tests the ability of language processing models to comprehend natural language. In this project, we implemented two models, BiDAF and QANet, to solve the Stanford Question Answering Dataset (SQuAD) 2.0. We experimented with different methods to improve the performance of these models, including adding character embedding layers, data augmentation, and ensemble modeling. Finally, we compared the results across different experiments and gave an analysis of our models. In the end, our best model achieved an F1/EM score of 68.71/65.38 on the test leaderboard.

Building a QA System using R-net

Question answering is an important problem for research in natural language processing, for which many deep learning models have been designed. Here we implement R-Net and evaluate its performance on SQuAD 2.0. While R-Net itself performs worse than BiDAF, its attention mechanism showed strong capability compared to BiDAF's, as shown in the image. We have also experimented with an ensemble model using BiDAF and R-Net that achieved better performance than the baseline BiDAF. Our study suggests that a promising future direction is to combine BiDAF and R-Net to build better models.

Robust QA System with Task-Adaptive Pretraining, Data Augmentation, and Hyperparameter Tuning

Despite their significant success, transformer-based models trained on massive amounts of text still lack robustness to out-of-distribution data. In this project, we aim to build a robust question answering system by improving the DistilBERT model. To accomplish this goal, we implement task-adaptive pretraining (TAPT), model tuning such as transformer block re-initialization and increasing the number of training epochs, and ensemble methods. We also use data augmentation techniques to enable the model to generalize well even with limited data in the domains of interest.

RobustQA: Adversarial Training with Hyperparameter Tuning

In this project, I used adversarial training and hyperparameter tuning to build a question answering system that can adapt to unseen domains with only a few training examples from the domain. From a high-level perspective, there are two model architectures: the baseline model provided by the starter code and my own adversarial model. To compare the performance of the two model architectures, I experiment with ADAM debiasing, various batch sizes, and weight decay tuning.

Multi-Task Learning and Domain-Specific Models to Improve Robustness of QA System

In this CS224N course project, we develop a Robust Question Answering (QA) language model that works well on low-resource out-of-domain (OOD) data from three domains. Our approach is to take the DistilBERT model pre-trained on the high-resource in-domain dataset and then perform multi-task training. We implement a multi-task training model that uses unlabeled text from the OOD data for the Masked Language Model objective as well as labeled QA data from the high-resource setting. The model jointly trains on unlabeled text and QA data to preserve the QA representation from high-resource data and adapt to the low-resource OOD data. We also explore data augmentation techniques such as synonym replacement, random word deletions and insertions, word swapping, and back-translation to expand our out-of-domain dataset. Finally, we use Domain-Specific Models to have separate models for different datasets and observe that we get the best result on different datasets using different strategies. As a result, we achieved scores of 59.203 F1 and 42.362 EM on the test set, and 54.41 F1 and 41.62 EM on the validation set.
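
The Masked Language Model objective on unlabeled text starts from a masking step, sketched below. This is a simplified version that only substitutes a mask token; BERT's original recipe also sometimes keeps the token or swaps in a random one. The names `MASK` and `mask_tokens` and the masking rate are assumptions for illustration.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob, rng):
    """Return (masked_tokens, labels). labels holds the original token
    at masked positions and None elsewhere, so the loss is computed
    only where a token was hidden."""
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


rng = random.Random(0)
tokens = "the model is pretrained on unlabeled text".split()
masked, labels = mask_tokens(tokens, 0.5, rng)
```

In a joint setup like the one described, batches of masked text and batches of labeled QA examples would each contribute a loss term to the same model.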

Building a QA system (IID SQuAD track)

In order to improve our baseline model, we experimented with many approaches and methods. We started by adding a "Character Embedding Layer", which allows us to condition on the internal morphology of words and better handle out-of-vocabulary words. Then we focused on improving our attention layer by trying different approaches. We developed a "Co-Attention Flow Layer", which involves a second-level attention computation, attending over representations that are themselves attention outputs. Furthermore, we added "Self-Matching Attention" from R-Net, which consists of extracting evidence from the whole passage according to the current passage word and question information. We also experimented with an idea from "QANet", adapting ideas from the Transformer and applying them to question answering, doing away with RNNs and replacing them entirely with self-attention and convolution. Then, we tried a new idea: adding another BiDAF layer that accounts not only for the interactions between the context and question but also for the ones within the context. We wanted to account for the Context-to-Context interaction as well, since this provides valuable information about the co-dependence between different words in the context. To put this idea into practice, we added another BiDAF layer performing a self-attention process like the one between the context and the query. The input to this layer is the representation we get from the first BiDAF attention layer and the context word representations we get from the first encoder. The output of this layer accounts not only for the interactions between the context and question but also for the ones within the context. This is the model that provided the highest score. We have also been experimenting with additional gates and nonlinearities applied to the summary vector after the attention step. These gates and nonlinearities enable the model to focus on important parts of the attention vector for each word. Our devised model "Double BiDAF" achieved the best score of 63.03 on the validation set. This is notable because we made only a small change to the model architecture and it yielded such an improvement.

Robust QA with Task-Adaptive Pretraining

It is often hard to find a lot of labeled data to train a QA (question answering) model. One possible approach to overcome this challenge is to use TAPT (task-adaptive pretraining), in which the model is pretrained further using the unlabeled data from the task itself. We implement the TAPT technique to make a QA model perform robustly on a task with low-resource training data by first pretraining on the larger unlabeled dataset. We then fine-tune the model with a smaller labeled dataset. The results are mixed. Although a preliminary model that is pretrained on just the out-of-domain train data performed better than the baseline, additional pretraining using more out-of-domain data performed worse than expected.

Mixture of Experts and Back-Translation to improve QA robustness

This work improves the generalization of a DistilBERT-based Question Answering (QA) model with the addition of a Mixture of Experts (MoE) layer as well as through data augmentation via back-translation. QA models generally struggle to perform in contexts that differ from those present in the model's training data. As a step towards addressing this limitation, our MoE implementation effectively learns domain-invariant features without explicitly training each expert on individual subdomains. We also apply top-k sampling back-translation and introduce a new technique to more effectively retrieve the answer span from the back-translated context. We find that the addition of the MoE layer yields an improvement of 3.19 in F1 score on an out-of-domain validation set, with back-translation granting a further 1.75 in F1 score. This represents a net improvement of 10.1% over the DistilBERT baseline.

A Dynamic Chunk Reader with Character Level Embeddings for Question Answering

In 2016, Yu et al. proposed an end-to-end neural reading comprehension model, known as a Dynamic Chunk Reader (DCR), for question answering. In this model they chose to input word embeddings as well as several other semantic and linguistic features, such as parts of speech and capitalization, into their initial encoding layer. A natural follow-up to this is to experiment with different inputs to the encoding layer. One possibility is to input character embeddings in addition to the word embeddings. This paper describes a model that re-creates the DCR model from scratch and the creation of a character-level embedding using CNNs to feed into the DCR model.

Robust QA System with xEDA: Final Report

We present xEDA: extended easy data augmentation techniques for boosting the robustness of question answering systems to shifts in data domains. xEDA extends existing data augmentation techniques by drawing inspiration from techniques in computer vision. We evaluate its performance on out-of-domain question answering tasks and show that xEDA can improve performance and robustness to domain shifts when a small subset of the out-of-domain data is available at train time. xEDA consists of masking, extended random deletion, extended random insertion, and simple extended random insertion. We discovered that xEDA can help build a question answering system that is robust to shifts in domain distributions if a few samples of out-of-domain datasets are available at train time. In particular, by applying xEDA to out-of-domain datasets during training, we were able to increase the performance of our question answering system by 6.1% in terms of F1 and by 14.9% in terms of EM when compared to the provided baseline on the dev set. Moreover, using 40% of the out-of-domain train datasets augmented via xEDA achieved the same performance as using 100% of the out-of-domain train datasets. Our analysis also suggests that a smaller augmented dataset may lead to better performance than a larger non-augmented dataset in some cases. Given the simplicity and wide applicability of xEDA, we hope that this paper motivates researchers and practitioners to explore data augmentation techniques in complex NLP tasks.

Robust QA on out of domain dataset over pretraining and fine tuning

We have seen tremendous progress on natural language understanding problems over the last few years. Meanwhile, we face the issue that models learned on a specific domain cannot be easily generalized to a different domain. I explored different models to build a robust question answering system that can be applied to out-of-domain datasets. The models explored, each with and without fine-tuning, are the baseline, adding a dataset prefix to the question, switching the question and context, and shortening the question and context in the model input. Different fine-tuning settings, such as the number of epochs, the batch size, and the Adam learning rate, were explored to find the best model performance. The best model achieved 40.367 EM and 58.467 F1.

Recurrence, Transformers, and Beam Search - Oh My!

Question answering on the IID SQuAD 2.0 dataset is a proving ground for natural language processing systems. In this project, we explore recurrent and transformer-based architectures for SQuAD 2.0. We implement several improvements on the baseline BiDAF and the canonical transformer QANet. Our best model, BiDAF with character embeddings and beam search output, scores F1 62.291 and EM 59.493. Finally, we suggest further directions for research in self-attention and modeling/predicting NA answers.

BiDAF with Explicit Token Linguistic Features

How do you do reading comprehension? When I learned reading comprehension with English as my second language, I was taught a few tricks. One important trick is to find word correspondences between the text and the question. Another trick is to use information such as part of speech and sentiment of known words to infer meaning of other unknown words. In this project, I explore the effectiveness of those tricks when applied to SQuAD, by supplying BiDAF with explicit linguistic features from the tokenizer as part of the input. I found that although effective at improving the scores, using those features is prone to overfitting if not regulated.
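
The word-correspondence trick can be turned into an explicit input feature, as in the sketch below. It assumes a case-insensitive exact match between context and question tokens; the function name is hypothetical, and the project's actual features come from its tokenizer and may be richer (for example lemma matches or part-of-speech tags).

```python
def exact_match_features(context_tokens, question_tokens):
    """Return one binary flag per context token: 1 if the lowercased
    token also appears in the question, else 0."""
    question_vocab = {t.lower() for t in question_tokens}
    return [1 if t.lower() in question_vocab else 0 for t in context_tokens]


context = "The Eiffel Tower is in Paris".split()
question = "Where is the Eiffel tower ?".split()
flags = exact_match_features(context, question)
```

Each flag would typically be concatenated to the corresponding token's embedding before the encoder.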

Building a QA system (IID SQuAD track)

Question answering is an intriguing NLP task, as it provides a measurement of how well the model can understand the text and perform different kinds of logical reasoning. This project aims to build a question answering system based on the BiDAF model that works well on the Stanford Question Answering Dataset 2.0 (SQuAD 2.0). We examine the effect of character-level embedding, a self-attention mechanism, an answer pointer, and transformer blocks. After model comparison and hyperparameter search, our best model, with character-level embedding, self-attention, and GRU layers, achieves an F1 score of 63.408 and an EM score of 60.456 on the CS224N internal test set of SQuAD 2.0.

SQuAD 2.0 with BiDAF++ and QANet

In this project, we produced a question answering system for SQuAD 2.0. To improve task performance, we explored two kinds of models. The first is the baseline BiDAF model, which we modified by adding character embeddings and implementing co-attention layers. We conducted thorough experiments to evaluate the effect of each component. The second is QANet, a Transformer-based model that uses only convolutional and self-attention layers and has no RNN component. We implemented this model from scratch and obtained some results in our experiments. Our best result comes from the BiDAF-related model, which achieved an F1 score of 64.96 and an EM score of 61.70 on the validation set, and an F1 score of 64.712 and an EM score of 60.997 on the test set.

Meta Learning with Data Augmentation for Robust Out-of-domain Question Answering

Natural language understanding problems have gained much popularity over the years, and current models often have poor generalizability on out-of-domain tasks. This robust question answering (QA) project aims to remedy this situation by using Reptile, a variant of meta-learning algorithms. In this project, the primary goal is to implement the Reptile algorithm for question answering tasks to achieve better performance than the baseline model on the out-of-domain datasets. After the Reptile implementation is validated, the secondary goal of this project is to explore how various hyperparameters affect the final performance. Once we believed that Reptile was optimally tuned, we worked on the data provided for this project. First, we merged the in-domain validation dataset into the training data, then we added data augmentation to further tap into the potential of the out-of-domain training data. Reptile outperforms the vanilla BERT model, and Reptile with data augmentation increases the score even further. The best F1 score is 59.985 and the best EM score is 42.225. On the out-of-domain validation dataset, these scores are more than 12% and 22% higher than the baseline scores, respectively.

BiDAF with Self-Attention for SQUAD 2.0

The primary goal of this work is to build a QA system that improves upon a baseline modified BiDAF model's performance on the SQuAD 2.0 dataset. To achieve this improvement, two approaches are explored. In the first one, the modified BiDAF model's embedding layer is extended with character-level embeddings. In the second approach, a self-attention layer is added on top of the existing BiDAF attention layer. The performance of these two approaches is evaluated separately and also when combined into a single model. The model with character embeddings yielded the best performance on the test set, achieving an EM score of 56.872 and an F1 score of 60.652. The self-attention model performed below expectations overall, though it was the best model when it came to performance on unanswerable questions.

Better Learning with Lesser Data: Meta-Learning with DistilBERT

While pre-trained transformer models have shown great success in recent years, they require a large amount of task-specific data to fine-tune. In our project, we experimented with a variant of the MAML algorithm, namely Reptile, in a low-resource QA program. In contrast to the normal training procedure, the MAML algorithm trains the model with a double-loop structure. In the inner loop, the program goes through meta-batches, with T tasks in each. For each of the tasks in the inner loop, a submodel is made and updated k times. After the k descents have been made for the T submodels, they are collected and processed in the Metalearner Reptile, where the next descent on the meta-model is determined. At first glance, this training protocol appears similar to multi-task learning, since both expose the model to multiple tasks, which enables transfer learning. Beyond that similarity, one major distinction of MAML is that it makes use of the k-th gradients, which enables SGD to access higher-order terms in the loss function, thereby allowing the MAML algorithm to find a better initialization than the other methods and to descend at a much faster rate on any downstream task, as shown in figure (1). Furthermore, the reason MAML can find a better model initialization than multi-task learning is that it can avoid overfitting to any one task, which is known to be a tendency in multi-task learning. At the end of the study, we introduce a cost-to-improvement ratio, evaluating whether the additional accuracy gain of MAML can justify the increase in runtime. Although there is an absolute gain in accuracy from MAML, we express our reservations regarding the comparative advantage of MAML, since this 1-point increase in accuracy comes at a large sacrifice of runtime.
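
The double-loop structure described above can be illustrated with a minimal sketch of a Reptile-style update. This is not the project's implementation: the toy quadratic tasks, the function names (`sgd_steps`, `reptile_step`, `quadratic_task`), and the learning rates are all assumptions chosen so that the example runs with the standard library alone. Each task gets a submodel that takes k gradient steps, and the meta-model then moves toward the average of the adapted submodels.

```python
def sgd_steps(theta, task_grad, k, lr):
    """Inner loop: copy the meta-parameters and take k gradient steps on one task."""
    phi = list(theta)
    for _ in range(k):
        grad = task_grad(phi)
        phi = [p - lr * g for p, g in zip(phi, grad)]
    return phi


def reptile_step(theta, task_grads, k=5, inner_lr=0.1, meta_lr=0.5):
    """Outer loop: move theta toward the mean of the T adapted submodels."""
    adapted = [sgd_steps(theta, tg, k, inner_lr) for tg in task_grads]
    mean_phi = [sum(vals) / len(adapted) for vals in zip(*adapted)]
    return [th + meta_lr * (m - th) for th, m in zip(theta, mean_phi)]


def quadratic_task(center):
    """Hypothetical toy task: loss = 0.5 * ||phi - center||^2, gradient = phi - center."""
    return lambda phi: [p - c for p, c in zip(phi, center)]


# One meta-batch with T = 2 toy tasks (assumed for illustration only).
tasks = [quadratic_task([1.0, 0.0]), quadratic_task([0.0, 1.0])]
theta = [0.0, 0.0]
for _ in range(50):
    theta = reptile_step(theta, tasks)
```

On these two toy tasks the meta-parameters settle midway between the task optima, which is the kind of initialization from which either task can be reached in a few steps. Note that this sketch is first-order: it only uses the parameters reached after k steps and does not differentiate through the inner loop.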

Answer Pointer Inspired BiDAF And QANet For Machine Comprehension

Imagine that you are trying to find the answer to a question given a context paragraph. This kind of task falls into one of the hottest topics in NLP: machine comprehension. With the help of emerging high-performance GPUs, deep learning for machine comprehension has progressed tremendously. RNN-based methods, such as Match-LSTM and Bidirectional Attention Flow (BiDAF), and transformer-like methods, such as QANet, keep pushing the performance boundary of machine comprehension on the SQuAD datasets. Our team proposes to improve the performance of the baseline BiDAF and the QANet models on SQuAD 2.0. We replace the original output layer of BiDAF and QANet with Answer Pointer inspired output layers and add character-level embedding and a ReLU MLP fusion function to the baseline BiDAF model. We achieve significantly better performance using ensemble learning with majority voting on the modified BiDAF, QANet1, and QANet3 models. Specifically, the ensemble achieves an F1 score of 66.219 and an EM score of 62.840 on the test dataset, and an F1 score of 68.024 and an EM score of 64.561 on the validation dataset.
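
As a rough illustration of ensemble learning with majority voting, the sketch below picks, for each question, the answer string that most models agree on. The function name `majority_vote`, the dictionary format, the example predictions, and the tie-breaking rule (the earliest model in the list wins) are assumptions for illustration; the abstract does not specify how ties were resolved.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model predictions by majority vote.

    predictions: list of dicts, one per model, mapping question id -> answer
    string ("" stands for "no answer"). Ties go to the earliest model listed.
    """
    ensemble = {}
    for qid in predictions[0]:
        answers = [p[qid] for p in predictions]
        counts = Counter(answers)
        best = max(counts.values())
        # First answer, in model order, that reaches the top vote count.
        ensemble[qid] = next(a for a in answers if counts[a] == best)
    return ensemble


# Hypothetical predictions from three models.
bidaf = {"q1": "Paris", "q2": ""}
qanet1 = {"q1": "Paris", "q2": "1999"}
qanet3 = {"q1": "in Paris", "q2": ""}
result = majority_vote([bidaf, qanet1, qanet3])
```

Ordering the models from strongest to weakest makes the tie-breaking rule fall back on the best single model, which is one reasonable design choice among several.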

Robust Question Answering: Adversarial Learning

In the NLP task of question answering, state-of-the-art models perform extraordinarily well, at human performance levels. However, these models tend to learn domain-specific features from the training data, and consequently perform poorly on test data from other domains. To mend this issue, we adopt the adversarial training approach to learn domain-invariant features in existing QA models. In this approach, the QA model tries to learn hidden features that leave the discriminator, which tries to classify the domain of the question-answer embedding from those hidden features, unsure of its prediction, thereby learning domain-invariant features. The intuition is that if the QA model can confuse the discriminator, then the features it has learned are not easily attributable to a specific domain. The QA model's loss depends on its own errors in answer prediction (the QA loss) as well as how well the discriminator predicts the domain (the adversarial loss). We study modifications to this model, in particular the impact of the weight on the adversarial loss on the model's performance. We also study other techniques, such as data augmentation and answer re-ranking, in order to make our model more robust. Our work is limited in that we only train models on a subset of the training data available to us due to the cost of training time. However, we can conclude that changing the weight of the adversarial loss results in marginal changes in performance. Furthermore, although the adversarial model exhibits improvements over our baseline, data augmentation proves to be a more effective technique for making the model robust on out-of-domain data given the subsampled training data.
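
The weighted combination of the QA loss and the adversarial loss can be sketched as follows. The abstract does not state which adversarial term was used; this sketch assumes the KL divergence between a uniform distribution over domains and the discriminator's predicted distribution, which is zero exactly when the discriminator is maximally unsure. The names `kl_uniform`, `qa_model_loss`, and `adv_weight` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_uniform(domain_logits):
    """KL(uniform || discriminator prediction) over the candidate domains."""
    probs = softmax(domain_logits)
    u = 1.0 / len(probs)
    # Clamp probabilities to avoid division by zero on extreme logits.
    return sum(u * math.log(u / max(p, 1e-12)) for p in probs)


def qa_model_loss(qa_loss, domain_logits, adv_weight):
    """Total loss for the QA model: answer-prediction loss plus weighted adversarial term."""
    return qa_loss + adv_weight * kl_uniform(domain_logits)


# A confused discriminator adds no penalty; a confident one does.
confused = qa_model_loss(1.0, [0.0, 0.0, 0.0], adv_weight=0.5)
confident = qa_model_loss(1.0, [5.0, 0.0, 0.0], adv_weight=0.5)
```

Varying `adv_weight` corresponds to the weight on the adversarial loss whose effect the abstract reports as marginal.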

Importance Weighting for Robust QA

Machine Reading Comprehension (MRC) Question Answering (QA) systems are commonly used within conversational agents and search engines to support users' information needs while saving users the effort of navigating documents, when the information need is a question for which the user seeks an answer. While state-of-the-art approaches have been shown to be successful for QA on a general domain, enterprise retrieval problems, where the information need for QA exists in domains that are specialized and have limited or no annotated data, remain open. In this work we address adaptation to new specialized domains with very little training data for MRC-QA, focusing on importance weighting. We propose two features for importance weighting that are applicable in an unsupervised setting, and present preliminary results comparing importance weighting with transfer learning.

Question-Answering with QANet for SQUAD 2.0

Our task for this project is to design a question-answering system for the SQuAD 2.0 dataset that improves upon the BiDAF baseline model. To do this, we experiment with QANet, a transformer-based architecture. We also reintroduce character-level embeddings on top of the provided BiDAF model, as well as a self-attention layer. Our best QANet model achieved 61.47/64.81 EM/F1 scores on the test set.

Combining QANet and Retro-Reader Models

Our task is to design a machine reading comprehension (MRC) model that can accurately solve question answering problems from the Stanford Question Answering Dataset (SQuAD). For our model, we aimed to 1) implement the QANet model, which is one of the highest performing non-pretrained models, and 2) extend QANet with a verification module inspired by Zhang et al. (2020) to better identify unanswerable questions and improve performance on SQuAD 2.0. We explored variants on both the QANet architecture as well as the Retro-Reader Architecture experimenting with different values for hyperparameters and our best single model achieved an F1/EM score of 66.10/62.28 on the development set and 64.422/60.659 on the test set. We explored a variant on the Retro Reader architecture that involved training one model to always predict an answer and training a separate model that does all the answerability prediction. Despite not significantly improving the performance of the model, through our error analysis, we gained deep insights into what components degraded model performance and developed potential hypotheses for future improvements. In particular when testing the Retro QANet model, we discovered that the Intensive QANet model was prone to false negatives and false positives thus we hypothesize that the main shortcoming of our model is its reading comprehension ability. Overall, we explored the application of retro reader and verification techniques to one of the highest performing non-PCE models and experimented with parameters and the architecture.

QaN I have Your Attention? Exploring Attention in Question-Answering Model Architectures

In this project, we build non-pre-trained models for the question-answering task on the Stanford Question Answering Dataset (SQuAD) 2.0, exploring the effect of attention on the result. We explore the performance of deep learning model architectures that utilize attention: BiDAF (context-query attention), Dynamic Co-Attention (second-level attention) and QANet (self-attention). We explored the baseline BiDAF model and improved it through character embeddings and co-attention, and we also re-implemented QANet. We ensembled results and obtained the highest performance of F1 67.96, EM 64.41 for single model dev, F1 70.66, EM 67.87 for ensemble dev, and F1 68.39, EM 65.44 for ensemble test. We performed analysis on the single model and ensembles to better understand the model mechanisms and performance.

Robust Question Answering via In-domain Adversarial Training and Out-domain Data Augmentation

How can a Question Answering model trained on Wikipedia solve examination questions correctly? Cross-domain Question Answering is challenging since QA models are usually not robust enough to generalize well to out-of-domain datasets. We would like to explore the effectiveness of domain-related information on QA model robustness. We leverage potential domain information, both domain-specific and domain-invariant, from the text data. During training on the in-domain training set, we explore adversarial training by experimenting with three adversarial functions. We add a domain classifier to distinguish different domains. Meanwhile, the QA model fools the domain discriminator to learn domain-invariant feature representations from the in-domain training set. In addition to the domain-invariant learning from the in-domain training, we also propose a data augmentation method that can retain high-level domain information by using named entity recognition and synonym replacement. Out-of-domain datasets are scarce and we want to make the most of them. This augmentation method is applied to the out-of-domain training set, and we expect that it will let the model learn domain-specific information from the out-of-domain datasets. To give better insights into our adversarial training and augmentation methods, we conducted several experiments and provide our analysis in this report.

Comparing Model Size and Attention Layer Design Impact on Question-Answer Tasks

In this project, we explore the use of various Neural Language Models applied to Question Answer tasks from the SQuAD dataset. We're specifically interested in exploring the transition from RNN-based models to transformer-based models. RNN Neural Language Models were dominant in language tasks for many years, but the introduction of the transformer demonstrated that the drawbacks of RNN models could be overcome by using architectures that optimize for larger, more parallelizable models. In this work, we compare the impact of expanding model size with the impact of changing attention layer implementations using a Bi-Directional Attention Flow baseline model. We find that model size has a significantly greater impact on model performance on the SQuAD dataset, but larger models fail to improve performance on unanswerable question-answer examples.

Pretraining of Transformers on Question Answering without External Data

Can recent Transformer-based pretraining approaches still perform effectively on question answering without external data and large computational resources? We find that an ELECTRA-style MLM objective can significantly reduce the computational cost of pretraining, and the train-test discrepancy can be reduced by using a small vocabulary size and question augmentation. These methods can boost the F1 score of a Transformer model on the SQuAD 2.0 task from (far below) 52.2 to just over 60.4 on a development set. However, the Transformer model relies mostly on textual similarity between the question and context, rather than on language understanding, to predict answers. The model still performs worse than a baseline BiDAF model, suggesting that the ability of current state-of-the-art training objectives and model architectures to learn effectively from limited data is still severely lacking. We hope that future methods, even with a general model architecture and objective, are able to perform well in a low-resource setting, and that this should also lead to approaches that learn more quickly, effectively, and generally by learning patterns, rather than correlations, that capture the meaning of language.

Data Augmentation: Can BERT Do The Work For You?

Data augmentation has proven effective in analyzing a neural model's robustness and improving it by re-training with augmented data. Because of text data's discrete feature space, most data augmentation techniques require querying multiple systems for language knowledge and meticulous augmentation rule design by researchers. This paper aims to explore the effectiveness of an automatic, black-box data augmentation method using language models, bert context rewriter, and to compare it with another augmentation algorithm, token reorderer, which uses Universal Sentence Encoder's semantic knowledge. Given a baseline question answering model, we employ the DistilBERT masked language model (mlm) to rewrite masked context data and evaluate whether re-training with the augmented data can improve the robustness of the baseline model. This augmentation relies on the existing language knowledge learnt by DistilBERT mlm and does not use additional hand-crafted rules. We also explore how different configurations, including masked token percentage and additional mlm fine-tuning, affect our method's effectiveness. Preliminary experiments show that both our methods obtain improved performance on the out-of-domain dev set over the baseline and reduce the performance gaps between in-domain and out-of-domain datasets. However, token reorderer's performance is consistently better than bert context rewriter's in both out-of-domain evaluation (+2.9 F1/+2.9 EM versus +1.9 F1/+1.6 EM) and reducing in-domain/out-of-domain gaps (-5.3 F1/-4.8 EM versus -1.7 F1/-2.5 EM), and it is therefore more effective in improving the baseline model's robustness.

RobustQA Using Data Augmentation

This project aims to explore possible improvements and extensions to the RobustQA Default baseline provided by the CS224N Winter quarter staff. Our goal is to create a domain-agnostic question answering system given DistilBERT as a pre-trained transformer model. The main method attempted in this paper is that of Task Adaptive Fine Tuning (TAPT), which entails a pre-training step utilizing the Masked Language Modeling task. This method was combined with experimentation on hyperparameters (batch size, number of epochs, and learning rate) to produce the highest-achieving model. Specifically, a pre-trained MLM model with a batch size of 32 yielded an EM of 42.75 and F1 of 61.14, which are each around 2 points higher than the baseline metrics.

More Explorations with Adversarial Training in Building Robust QA System

In real-world Question Answering (QA) applications, a model is usually required to generalize to unseen domains. It was found that an Adversarial Training framework, where a conventional QA model is trained to deceive a domain-predicting discriminator, can help learn domain-invariant features that generalize better. In this work we explored more discriminator architectures. We showed that by using a single-layer Transformer encoder as the discriminator and taking the whole last-layer hidden states from the QA model, the system performs better than the originally proposed simple Multilayer Perceptron (MLP) discriminator taking only the hidden state at the [CLS] token of the BERT QA model.

Probability-Mixing: Semi-Supervised Learning in Question-Answering with Data Augmentation

The Probability-Mixing method proposed in this project consists of label guessing and label interpolation. We need to prepare three types of data for this semi-supervised problem: labeled data, mixed data, and unlabeled data. Suppose we have the labeled data "The project is too hard". We first use GloVe to find similar words and replace them to get something like "That work is super difficult", which is our unlabeled data. Then, for each word position, we randomly select the word from one of the two samples and obtain "That work is too difficult". We can then linearly interpolate the labels for the mixed data for both mean square loss and cross-entropy loss. In this project, our experiments demonstrate that sequential order information does not necessarily help query-context matching, and excessive sequential order information in BiDAF's RNN can lead to overfitting. To alleviate overfitting and add more variety to the training samples, we propose four data augmentation methods that do not introduce non-negligible label noise, which improve the F1 scores of BiDAF and the QANet with 8 heads by at least 2 points. We also propose the Probability-Mixing method to prevent the model from memorizing the context, which significantly improves its ability in query-context matching. This method reduces the FPR from 0.3 to 0.18 and increases F1(TP) by 4 points for the QANet model, making it a much better model at preventing the generation of misleading information by the question-answering system.
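
The word-level mixing and label interpolation described above can be sketched roughly as follows. Several details are assumptions, not taken from the abstract: the two samples are token-aligned and of equal length, each position picks either source with probability 0.5, and the interpolation weight is the fraction of tokens kept from the labeled sample. The names `mix_tokens` and `interpolate_labels` and the example label vectors are hypothetical.

```python
import random


def mix_tokens(labeled, unlabeled, rng):
    """At each position keep the labeled token or take the unlabeled one at random.

    Returns the mixed tokens and lam, the fraction taken from the labeled sample.
    """
    assert len(labeled) == len(unlabeled)
    keep = [rng.random() < 0.5 for _ in labeled]
    mixed = [a if k else b for a, b, k in zip(labeled, unlabeled, keep)]
    lam = sum(keep) / len(keep)
    return mixed, lam


def interpolate_labels(true_label, guessed_label, lam):
    """Linear interpolation between the true label and the guessed label."""
    return [lam * t + (1.0 - lam) * g for t, g in zip(true_label, guessed_label)]


labeled = "The project is too hard".split()
unlabeled = "That work is super difficult".split()
mixed, lam = mix_tokens(labeled, unlabeled, random.Random(0))

# Hypothetical label distributions: a one-hot true label and a model-guessed label.
target = interpolate_labels([0.0, 1.0, 0.0], [0.2, 0.6, 0.2], lam)
```

Because both label vectors sum to one, the interpolated target is still a valid distribution and can be used with either mean square loss or cross-entropy loss.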

Gated Self-Attention for SQuAD Question Answering

Machine comprehension and question answering are central questions in natural language processing, as they require modeling interactions between the passage and the question. In this paper, we build on the multi-stage hierarchical process BiDAF described in Seo et al. (2017)'s Bi-Directional Attention Flow for Machine Comprehension. We utilize tools from the R-Net model described in R-Net: Machine Reading Comprehension with Self-Matching Networks, testing different combinations of model components. We experiment with different types of encoding, such as using a Gated Recurrent Unit (GRU) or a Convolutional Neural Network (CNN), and attention mechanisms, such as comparing context-query attention layers and contemplating the usage of gates. We ultimately introduce a modified form of BiDAF which utilizes both an LSTM and a CNN in its encoding layer, as well as BiDAF's context-query attention layer followed by R-Net's self-attention layer. We conduct various experiments on the SQuAD datasets, yielding competitive results on the CS224N SQuAD Leaderboard.

The Unanswerable Gap: An Exploration of Approaches for Question Answering on SQuAD 2.0

In this project, we implemented models that were trained and evaluated using the Stanford Question Answering Dataset (SQuAD). For a majority of our models, we incorporated character-level embeddings in order to strengthen the system's understanding of the semantics and syntax of each context and question. Our implementations fall into two main categories: modifying the baseline Bidirectional Attention Flow (BiDAF) model and implementing the Dynamic Coattention Network from scratch. We found that the baseline BiDAF model with character-level embeddings performed the best and received an EM/F1 score of 61.771/65.089 on the test set.

Self-attention and convolution for question answering on SQuAD 2.0: revisiting QANet

QANet was the first Question Answering model that combined self-attention and convolution, without any use of Recurrent Neural Networks. Convinced by the "Attention is all you need" motto (or, more accurately in this context, the "You don't need RNNs" motto), we were naturally interested in seeing how this applies to the specific task of Question Answering. In this project, we therefore tackle the Question Answering task on the SQuAD 2.0 dataset using different variations of the QANet architecture. We first re-implement the QANet model, and then explore different versions of the architecture, tweaking some parameters such as attention mechanisms and model size. We then propose 3 ensemble models with different inference methods: our best model, using a novel two-step answerability prediction based inference method, achieves 71.21 F1/ 68.14 EM on the development set, and 69.04 F1 / 65.87 EM on the test set.

Exploration of Attention and Transformers for Question and Answering

The project was intended to be an exploration of the convBERT model without pretraining, but after training a base BERT model (encoder-only Transformer) and achieving very low performance, the objective shifted towards trying to understand transformers and attention for Question Answering. Experiments on both hyperparameters and network architecture were done on the BERT model, with the conclusion that this model will either overfit or not converge. We suggest the hypothesis that, without large-corpus pretraining, simple self-attention on a concatenated context and question has big deficiencies compared with explicit cross-attention when learning SQuAD. A QANet model was also trained for purposes of comparison and analysis.

Building a Robust QA System

Robustness to domain shifts is very important for NLP, as in the real world, test data are rarely IID with training data. This NLP task is to explore a Question Answering system that is robust to unseen domains with few training samples. In this task, the three out-of-domain datasets show very different characteristics, and they are trained with different in-domain datasets which are more beneficial for their challenges. Multiple transfer learning models are mixed in different ways: mixture of logits, mixture with custom output, and mixture with more features. Three majority vote strategies were used to ensemble the models.

Attention-aware attention (A^3): combining coattention and self-attention for question answering

Attention has been one of the biggest recent breakthroughs in NLP, paving the way for the improvement of state-of-the-art models in many tasks. In question answering, it has been successfully applied in many forms, especially with recurrent models (in encoder-decoder fashion). Co-attention and multi-head self-attention have been two interesting attention variations, but to the best of our knowledge a larger study trying to combine them has never been conducted. Hence, the purpose of this paper is to experiment with different attention-based architecture types for question answering, as variations on one of the first successful recurrent encoder-decoder models for this task: BiDAF. We implement a variation of the attention layer, starting with a multi-head self-attention mechanism applied to the query and the context tokens separately, as provided by the encoder layer. Then, these contextualized tokens, added to the input tokens through a skip connection, are passed to a trilinear cross-attention and used to compute two matrices: a context-to-query matrix and a context-to-query-to-context matrix. These two matrices are concatenated with the self-attended context tokens into an output matrix. In addition, we provide our model with a character embedding, which proves to have an important positive impact on performance, as well as a conditional output layer. We test the performance of our model on the Stanford Question Answering Dataset 2.0 and achieve EM = 62.730 and F1 = 66.283 on the dev set, and EM = 60.490 and F1 = 64.081 on the test set. This provides +7.26 EM score and +6.95 F1 score compared to our coattention baseline, and +4.72 EM score and +4.97 F1 score compared to our BiDAF baseline.

Improve DistilBERT-based Question Answering model performance on out-of-domain datasets by Mixing Right Experts

In this work, we built an MOE model by mixing 7 DistilBERT-based QA expert models that are task-fine-tuned on in-domain training datasets. We built data insight by carefully examining performance correlation across in-domain datasets and out-of-domain datasets, and found that domain fine-tuning on a small target out-of-domain dataset whose distribution differs considerably from that of the in-domain training dataset does not necessarily translate into out-of-domain performance on the target dataset. We carefully select a set of expert models for each out-of-domain set by leveraging the aforementioned data insights. We achieved an F1 score of 61.7 (ranked 6th out of 74 on the test leaderboard) and an EM score of 44.4 (ranked 2nd out of 74 on the test leaderboard) on the out-of-domain test datasets as of March 19, 2021.

Improved Robustness in Question-Answering via Multiple Techniques

Question-answering models are one of the most promising research areas in NLP. There has already been much study and development on how to accurately search for the correct answer of unanswered questions when the question is in the training domain. The usage of pretrained language models, such as ELMo, GPT, BERT, etc., enables knowledge gained from pretraining to be transferred to the new model. Although some models might outperform humans on in-domain data, they perform poorly on out-of-domain data. For real-world applications, we want a QA model to both cover various domains and generalize well to out-of-domain data, hence the idea of domain generalization is proposed. For most current QA models, additional data is required to learn new domains, and models tend to overfit to specific domains. Since it's impossible for a QA model to train on all domains, it's crucial to apply different techniques to build domain-agnostic QA models that can learn domain-invariant features instead of focusing on domain-specific features when there's limited training data. In this project, we are given three in-domain datasets (SQuAD, Natural Questions, NewsQA) and three out-of-domain datasets (DuoRC, RACE, RelationExtraction). By reading through papers, we've learned different techniques, such as adversarial training, data augmentation, task-adaptive pretraining, etc., that might help with domain generalization. We first apply them individually to the given baseline model, DistilBERT, compare their F1 and EM scores, and analyze their performance. Then, we apply several combinations of the techniques and further explore the performance of these combinations. Eventually, the task-adaptive pretraining model gives us the best result, an increase of 2.46 in F1 score and an increase of 3.92 in EM score compared to the baseline.

QANet on SQUAD 2.0

Project summaries unavailable

Multi-Phase Adaptive Pretraining on DistilBERT for Compact Domain Adaptation

While modern natural language models such as transformers have made significant leaps in performance relative to their predecessors, the fact that they are so large usually means that they learn small correlations that do not improve the model's predictive power. As a result, such models fail to generalize to other data, thus hampering performance in real-world cases where data is not independently and identically distributed (IID). Luckily, the use of domain-adaptive pretraining (DAPT), which involves pretraining on unlabeled target domain data, and task-adaptive pretraining (TAPT), which entails pretraining on all of the unlabeled data of a given task, can dramatically improve performance on large models like RoBERTa when the original and target domain distributions have a small amount of overlap. Consistent with the Robust QA track of the default project, this report investigates and tests the hypothesis that TAPT in tandem with DAPT (also known as multi-phase adaptive pretraining, or MAPT) can improve performance on the target domain for smaller transformers like DistilBERT on the question answering task, especially in the presence of domain shift. The final results show that the use of TAPT can lead to a slight increase in Exact Match (EM) performance without DAPT. However, implementing DAPT, even with the use of word-substitution data augmentation, significantly degrades the performance of the model on the held-out target domain dataset.

QANet+: Improving QANet for Question Answering

In this work, we build a question answering (QA) system and apply it to the Stanford Question Answering Dataset, version 2.0. Our goal is to achieve strong performance on this task without using pre-trained language models. Our primary contribution is a highly performant implementation of the QANet model. Additionally, we experiment with various modifications to this architecture. Most notably, we show that modifying the output layer, such that the answer span's ending position prediction is a function of the starting position prediction, yields significant improvements over the original design. Using a QANet ensemble, we reach an F1 score of 71.87 and an EM score of 68.89 on an unseen test set (rank #1 out of 100+ submissions to the test leaderboard for the IID SQuAD Track of CS 224N at Stanford, Winter 2021).

Robust Question Answering System

Pretrained models like BERT achieve good performance when we fine-tune them on resource-rich QA tasks like SQuAD. However, when we apply the model to out-of-domain QA tasks with different question and passage sources, the performance degrades badly. We discovered that the domain change in passage source is the main contributor to worse performance. We investigated ways to improve the robustness of pretrained QA systems by experimenting with different optimizers and with freezing and re-initializing model layers during training. We found that AdamW is the best optimizer for training on out-of-domain QA datasets, and that freezing just the embedding block of DistilBERT improves model performance the most.

Augmenting BiDAF with Per-Token Features

The DrQA document reader showed that adding per-token features (e.g., part-of-speech and named entity recognition tags) to a question answering model significantly improves performance on the SQuAD benchmark. I add six features to a baseline BiDAF model and explore the benefit of applying attention not only to the LSTM hidden state, but also to these per-token features. I verify the benefit of applying self-attention to these features and find that the augmented model significantly improves upon the baseline in terms of metrics and train time. My best model achieves a test score of (62.06 EM, 64.89 F1) compared to a baseline of (59.33 EM, 62.09 F1), and reaches its best performance in half the training steps.
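
As a rough illustration of what per-token features look like in practice, the sketch below computes two features that need no external tagger and appends them to each token's embedding. The abstract does not list the six features used, so the two shown here (a question exact-match flag and a normalized term frequency, both in the spirit of DrQA) are assumptions for illustration only, as are the function names.

```python
from collections import Counter


def token_features(context_tokens, question_tokens):
    """Return [exact_match_flag, term_frequency] for each context token."""
    question_vocab = {tok.lower() for tok in question_tokens}
    counts = Counter(tok.lower() for tok in context_tokens)
    total = len(context_tokens)
    features = []
    for tok in context_tokens:
        key = tok.lower()
        in_question = 1.0 if key in question_vocab else 0.0
        features.append([in_question, counts[key] / total])
    return features


def augment_embeddings(embeddings, features):
    """Concatenate each token's feature vector onto its embedding."""
    return [list(emb) + list(feat) for emb, feat in zip(embeddings, features)]
```

For a context of n tokens with d-dimensional embeddings, the augmented input has shape n by (d + 2), and the downstream encoder's input size would need to grow accordingly.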

Improving Robustness of Question-Answering System Using Domain-adaptive Pretraining, Adversarial Training, Data Augmentation and Finetuning

Previous work has shown that question answering (QA) systems based on neural language models (NLMs) are highly sensitive to the knowledge domain of their training data and often perform poorly on out-of-domain QA tasks. In this project, we combine several published methods to improve the robustness of a QA system on out-of-domain data: domain adversarial training, domain-adaptive pretraining, finetuning on few samples, and data augmentation. We applied these methods to improve the robustness of our baseline model on out-of-domain test datasets, given two groups of training datasets: three large in-domain datasets and three very small out-of-domain datasets. We evaluated the effects of these methods both individually and in combination, and found that while individual methods generate mixed results, combining them yields the greatest improvement in the robustness of the baseline model on the out-of-domain datasets. We also include a qualitative analysis of our results, shedding some light on the real capabilities of our model.