Aspect-Target Sentiment Classification for Cyberbullying Detection

Cyberbullying detection is a challenging task to tackle, given the complex nature of the problem and the lack of Natural Language Processing (NLP) literature when it comes to addressing this issue. For a piece of text to be considered cyberbullying, it not only has to be associated with a negative sentiment, but must also be targeted. This motivates the use of Aspect-Target Sentiment Classification (ATSC), which evaluates the sentiment of a given piece of text with respect to an aspect-target within it. In particular, we make use of the BERT-ADA transformer architecture, fine-tuned on the hatespeech-twitter dataset, to demonstrate its superior ability to detect cyberbullying in comparison to other state-of-the-art sentiment analysis baselines. Additionally, we make use of Named Entity Recognition (NER) to extract aspect-targets from tweets that do not explicitly "@" the username handles of other users. The code is available on GitHub: https://github.com/sharanramjee/cyberbullying-atsc
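As a rough illustration of the NER-based aspect-target extraction step, the sketch below uses spaCy's pretrained English pipeline to pull person entities out of a tweet when no "@" handle is present; the model name and fallback logic are assumptions for illustration, not the authors' exact pipeline.

```python
# Minimal sketch of NER-based aspect-target extraction, assuming spaCy's
# small English model; the authors' actual extraction rules may differ.
import re
import spacy

nlp = spacy.load("en_core_web_sm")  # hypothetical choice of pipeline

def extract_aspect_targets(tweet: str) -> list[str]:
    """Return candidate aspect-targets: explicit @mentions if present,
    otherwise PERSON entities found by NER."""
    mentions = re.findall(r"@\w+", tweet)
    if mentions:
        return mentions
    doc = nlp(tweet)
    return [ent.text for ent in doc.ents if ent.label_ == "PERSON"]

print(extract_aspect_targets("John is such a loser, nobody likes him"))
# -> ['John'] (candidate target for aspect-target sentiment classification)
```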

A Neural Model for Text Segmentation

Text segmentation is the task of dividing a document of text into coherent and semantically meaningful segments which are contiguous. This task is important for other Natural Language Processing (NLP) applications like summarization, context understanding, and question-answering. The goal of this project is to successfully implement a text segmentation algorithm. We take a supervised learning approach to text segmentation and propose a neural model for this task. We aim to extend this task to podcasts by using existing transcription services. Our model obtained a Pk score (described below) of 6.54 on the Wiki-50 dataset, which was an improvement over our baseline score of 69.23. We experimented with self-attention as a modification to our model.
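For reference, the Pk metric mentioned above is computed by sliding a window of size k over the predicted and reference segmentations and counting how often they disagree about whether the two window endpoints fall in the same segment; a minimal pure-Python sketch (not the authors' exact implementation) is below.

```python
# Minimal sketch of the Pk segmentation metric (lower is better).
# Segmentations are given as lists of segment lengths, e.g. [3, 2, 4].

def _segment_ids(segment_lengths):
    """Map segment lengths to a segment id per sentence position."""
    ids = []
    for seg_id, length in enumerate(segment_lengths):
        ids.extend([seg_id] * length)
    return ids

def pk(reference, hypothesis, k=None):
    ref = _segment_ids(reference)
    hyp = _segment_ids(hypothesis)
    assert len(ref) == len(hyp)
    if k is None:
        # Conventional choice: half the mean reference segment length.
        k = max(1, round(len(ref) / (2 * len(reference))))
    errors = 0
    total = len(ref) - k
    for i in range(total):
        same_ref = ref[i] == ref[i + k]
        same_hyp = hyp[i] == hyp[i + k]
        errors += same_ref != same_hyp
    return errors / total

print(pk([3, 2, 4], [3, 6]))  # ~0.29 for this toy example
```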

Consistent Estimation of the Average Treatment Effect with Text as Confounder

I show that embedding representations of text can be used to construct a root-n consistent estimator of the average treatment effect under confounding. I explore using both GloVe embeddings and transformer-based document embeddings to integrate text data into the double machine learning framework for causal inference. Using a large dataset of consumer complaints from 2018-2021 published by the CFPB, I estimate the causal effect of a complainant identifying themselves as an older American on the probability that their complaint is resolved with monetary or non-monetary compensation. I show that including a representation of text reduces the treatment effect estimate.
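As a sketch of the double machine learning idea with text embeddings as the confounder representation, the partialling-out estimator below cross-fits nuisance predictions from the embedding features and then regresses residual on residual; the ridge and logistic nuisance models are illustrative assumptions, not the exact specification used in the project.

```python
# Minimal sketch of a DML partialling-out ATE estimator with text
# embeddings X as confounders; nuisance models are illustrative choices.
import numpy as np
from sklearn.linear_model import Ridge, LogisticRegression
from sklearn.model_selection import cross_val_predict

def dml_ate(X, t, y, folds=5):
    """X: (n, d) document embeddings, t: binary treatment, y: outcome."""
    # Cross-fitted nuisance predictions to avoid overfitting bias.
    y_hat = cross_val_predict(Ridge(), X, y, cv=folds)
    t_hat = cross_val_predict(LogisticRegression(max_iter=1000), X, t,
                              cv=folds, method="predict_proba")[:, 1]
    y_res, t_res = y - y_hat, t - t_hat
    # Final-stage regression of residualized outcome on residualized treatment.
    return np.sum(t_res * y_res) / np.sum(t_res * t_res)

# Usage: theta = dml_ate(doc_embeddings, older_american_flag, resolved_flag)
```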

Music Genre Classification using Song Lyrics

In this project, we aim to classify songs into genres using their lyrics. This task is challenging even for humans, and there is often debate about where a song fits, since genre boundaries are not clearly defined and genres overlap. After preprocessing our data, we trained our own GloVe embeddings of the song lyrics and created different visualizations to better understand our data. As a baseline, we used our GloVe embeddings in two logistic regression models to classify lyrics into genres. Then, we balanced our dataset so that there was a very similar number of lyrics for each of the genres. Finally, using our GloVe embeddings, we trained an LSTM model and a bidirectional LSTM model. Our best LSTM model achieved an accuracy of 68%.

Cyclical Pre-Training for Cryptocurrency Price Predictions

Is it possible to use recent news headlines to predict the prices of extremely volatile cryptocurrencies in the future? The answer was found to be yes, with the best results being 5% above chance. How was this done? Initial baseline approaches with BERT failed to achieve an accuracy better than flipping a coin. Instead, cyclical pre-training with GPT and simplified question answering proved to be the best strategy. The idea was that the model would hold onto pertinent price information between pre-training on a day's worth of news and fine-tuning on past-price questions. What does this do for the field of NLP? This work is related to sentiment analysis with a temporal aspect, which has applications beyond the financial sector, such as election prediction and potentially early disease outbreak detection. This strategy may also help certain models improve their abilities with numeracy and cause and effect.

Predicting Doctor's Impression For Radiology Reports with Abstractive Text Summarization

Predicting the doctor's impression (a summary) for radiology reports saves doctors and patients tremendous time otherwise spent manually digging through the reports. But there are few pre-trained language models for summarization, especially for radiology datasets. We address abstractive summarization of the free-text radiology reports in the MIMIC-CXR dataset by building ClinicalBioBERTSum, which incorporates domain-specific BERT-based models into the state-of-the-art BERTSum architecture. We give a well-rounded evaluation of our model performance using both word-matching-based metrics and semantics-based metrics. Our best-performing model obtains a ROUGE-L F1 score of 57.37/100 and a ClinicalBioBERTScore of 0.55/1.00. With comprehensive experiments, we show that domain-specific pre-trained and fine-tuned encoders and sentence-aware embeddings can significantly boost the performance of abstractive summarization for radiology reports. Our work also provides a set of pre-trained transformer weights that could further facilitate practitioners' future research with radiology reports.

Zero-Shot Cross-Lingual Discrete Reasoning

Discrete reasoning, including addition, subtraction, counting, sorting, etc., remains a challenging aspect of machine reading comprehension (MRC). In addition, the lack of parallel MRC data in languages other than English has led to increasing research interest in cross-lingual transfer learning. In light of studies from both sides, we tackle the task of zero-shot cross-lingual discrete reasoning using the DROP dataset and its manual translations into German and Chinese, and show that 1) a multilingual BERT model can be configured to solve discrete reasoning tasks, and 2) the knowledge of discrete reasoning can be transferred cross-lingually to German and Chinese to a certain extent, even without any parallel training data.

SOTA Attention Mechanism and Activation Functions on XLNet

We re-implemented XLNet, a state-of-the-art transformer model, from scratch, and experimented with SOTA activation functions, including GELU and Mish, as well as the Attention on Attention (AoA) mechanism. We analyzed the effect of the above techniques on our XLNet model by evaluating its pretraining behavior. We found that Mish improves model training by smoothing out the learning curve, and that AoA improves model performance by building a strong relationship between the query and the traditional attention vector. We then implemented these building blocks on the original XLNet model to see if the positive effects generalize to the larger XLNet model. We pretrained, finetuned, and evaluated the model on SQuAD 2.0, and concluded that Mish and AoA benefit XLNet's performance, especially when computing power is limited.

Understanding Gender-coded Wording in Job Postings with Word-vectors and BERT

Biased gender-coded words still exist in job advertisements today. These words can strongly influence candidates' perception of the job, discourage diverse candidates from applying, and even reduce their sense of belonging to the occupation. Gaucher et al. (2011) provide two lists of known gender-coded words in job advertisements: one of masculine-coded and one of feminine-coded words. However, these lists are likely incomplete and may miss more subtle or sentence-level gender coding. In this paper, I propose that by using word vectors and BERT, we can discover additional gender-coded words and detect gender bias at the grammatical/sentence level.
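One simple way to surface additional gender-coded candidates with word vectors is to score each vocabulary word by its cosine similarity to the masculine and feminine seed lists; the sketch below assumes pretrained vectors are already loaded into a dictionary and the seed words are examples, so it only illustrates the idea rather than the paper's exact method.

```python
# Sketch: score words by similarity to gender-coded seed lists.
# `word_vectors` is assumed to map words to numpy arrays (e.g. GloVe vectors
# loaded elsewhere); the seed words are illustrative examples.
import numpy as np

masculine_seeds = ["competitive", "dominant", "leader"]
feminine_seeds = ["supportive", "collaborative", "nurturing"]

def _centroid(words, word_vectors):
    return np.mean([word_vectors[w] for w in words if w in word_vectors], axis=0)

def _cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def gender_coding_score(word, word_vectors):
    """Positive -> closer to masculine seeds, negative -> closer to feminine."""
    v = word_vectors[word]
    return (_cosine(v, _centroid(masculine_seeds, word_vectors))
            - _cosine(v, _centroid(feminine_seeds, word_vectors)))

# Candidate new gender-coded words: vocabulary items with the largest |score|.
```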

Wall Street vs r/wallstreetbets: Exploring the Predictive Power of Retail Investors on Equity Prices

The COVID-19 pandemic has accelerated the rapid growth of retail investors as legitimate market participants. In a David v. Goliath showdown, retail investors organizing via the subreddit "wallstreetbets" helped orchestrate a short squeeze of GameStop stock, leading to a meteoric rise in the equity's value and massive losses among some notable hedge funds. This research explores whether 'chatter' on that subreddit provides any predictive ability on a stock's price the following day. Specifically, we examine performance on GameStop and Tesla stock, two of the most heavily discussed stocks on the subreddit. Our top models perform at roughly 52 and 58 percent accuracy, moderately improving upon our 'random' baseline of ~50 percent.

Peripheral Artery Disease Prediction using Medical Notes

In this project, we develop a fine-tuned BERT model on medical notes to predict Peripheral Artery Disease (PAD). PAD, or atherosclerotic occlusive disease of the lower extremities, affects 8-12 million American adults and more than 200 million people worldwide. The prevalence of PAD is as high as 12-30% in patients over the age of 65 years, and annual Medicare expenditures related to the treatment of PAD alone total $4 billion. PAD is a highly morbid condition that can lead to limb loss secondary to acute or chronically progressive lower extremity ischemia. Moreover, PAD can lead to a 6-fold increased risk of premature mortality and major adverse cardiovascular and cerebrovascular events (MACCE). To date, standard machine learning algorithms such as logistic regression and random forest have been applied to EHR data for the classification of PAD. We investigate whether deep learning produces a more accurate classification of PAD than standard machine learning algorithms.

Dissecting Language's Effect on Bottleneck Models

Language has been shown to help improve models' ability to generalize to unseen abstract concepts. However, it is still unclear why and how language helps models generalize. Inspired by the results in Learning with Latent Language (L3), we aim to answer several crucial questions about language's role in facilitating learning and to improve L3's performance on few-shot classification, where models need to learn to quickly adapt to unseen tasks by learning from a set of similar tasks. We first demonstrate that accurate descriptions of spatial relationships can massively improve models' performance on few-shot classification by providing correct guidance encoded in natural language. To improve model performance, we focus on two directions: 1) enhancing the model's visual reasoning by providing more informative language guidance on spatial relationships, and 2) enhancing the model's ability to fuse different modalities. Our results demonstrate that 1) we can achieve comparable classification performance by using a simple concept retrieval mechanism, and 2) a larger image model can dramatically improve classification accuracy.

Model Compression for Chinese-English Neural Machine Translation

State-of-the-art neural machine translation (NMT) models require large amounts of compute and storage resources, with some of the smallest NMT models clocking in at several hundred megabytes. This large size makes it difficult to host NMT models in resource-constrained environments like edge and mobile devices, requiring that the user rely on either a stable internet connection or an offline dictionary. Our goal was to compress a pre-trained NMT model to be as small as possible while minimizing the reduction in translation accuracy. Using a pre-trained Chinese-to-English MarianMT model, Opus-MT, we tested several size reduction techniques and observed their impact on memory size, processing speed, and BLEU. We found that the combination of embedding compression and layer quantization achieved significant levels of model compression (56x!) with only a 3% drop in our translation accuracy. The implementation has a high reward-to-effort ratio, and can be applied to any pre-trained NMT model.
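As a hedged sketch of the layer-quantization half of this recipe, PyTorch's dynamic quantization can convert the linear layers of a pretrained MarianMT checkpoint to int8 weights in a few lines; the checkpoint name and layer set below are illustrative assumptions rather than the exact compression pipeline used here.

```python
# Sketch: post-training dynamic quantization of a MarianMT model's linear
# layers to int8 (one of several compression techniques explored above).
import torch
from transformers import MarianMTModel, MarianTokenizer

name = "Helsinki-NLP/opus-mt-zh-en"  # assumed pretrained checkpoint
tokenizer = MarianTokenizer.from_pretrained(name)
model = MarianMTModel.from_pretrained(name)

quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

batch = tokenizer(["你好，世界"], return_tensors="pt")
out = quantized.generate(**batch)
print(tokenizer.batch_decode(out, skip_special_tokens=True))
```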

Conversational and Image Recognition Chatbot

This project proposes a chatbot framework that combines natural language processing and image recognition technology. Within this framework, a neural encoder-decoder model is used with a Late Fusion encoder and two different decoders (generative and discriminative). We use an encoder-decoder CNN architecture for the fusion of images and a ResNet [15] architecture for object detection and localization. Localization of objects or persons is done with a Mask R-CNN model, which not only localizes the object but also provides a mask for the localized object. We use the COCO dataset for training; images are fused together to obtain a combined, more informative output for detecting a doubtful presence. Training the complete encoder-decoder network with self-attention stabilized training considerably, further decreased the loss, and improved performance on the chosen metric on the validation set. The chatbot is able to detect objects in an image, describe and recognize the image, and then answer questions about it; integrating the self-attention model further improved performance. The basic workflow is that, given an image (I), the current question (Q), and a history of questions and answers (H), the agent should be able to generate the answer to the current question. The purpose of this project is to use natural language processing and computer vision models to efficiently identify and answer questions about any image, along with follow-up questions; this could be applied in organizations, schools, hospitals, and military settings. I believe this area can have a huge impact on natural language processing and visual recognition in industry and academia.

Government Document Classification

Steve Ballmer, after leaving Microsoft as its second CEO, pursued several new projects that reflected his more personal interests. One of these was founding USAFacts, a non-profit organization seeking to provide "a data-driven portrait of the American population and government's impact on society". One issue that USAFacts has sought to explore is the analysis of legislative actions, or bills. Thousands of bills are proposed each year across legislative bodies; in the last Congress (the 116th), more than 14,000 bills were introduced. For the past two years, the USAFacts team has manually created a dataset detailing the counts of legislative actions that our government is taking to address a variety of different topical areas, such as healthcare, immigration, and energy and environment. This has been a very labor-intensive and costly process, and the team is ultimately limited in the scope of documents that they are able to categorize. In this project, I looked for an effective way to classify legislative documents with a trained NLP model. I found that a fastText model was able to achieve an accuracy of 84% on a custom-scraped dataset of 10,000 documents, beating out other models such as a hierarchical attention network and a convolutional neural network.

Sentence-BERT for Interpretable Topic Modeling in Web Browsing Data

Nowadays, much intellectual exploration happens through a web browser. However, the breadcrumbs that trail all this activity are largely unstructured. Common browsers retain lists of browsing history, which is typically timestamped; however, because this data exists as a flat list of URLs, titles, and timestamps, it is largely neglected and difficult to explore semantically. To overcome this challenge, topic modeling and document clustering are techniques used to manipulate and search collections of text for information retrieval and potential knowledge discovery. In this work, I leverage Sentence-BERT (SBERT) to build expressive embeddings that form an interpretable space for topic modeling within my own browsing history data. After qualitative analysis, topic clusterings made from SBERT web page embeddings outperform those made from Doc2Vec-based document embeddings. This method shows promise as a tool for semantically exploring one's browsing history, or more broadly, other diverse collections of documents and text.
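A minimal version of this pipeline, assuming the sentence-transformers library and an off-the-shelf MiniLM checkpoint (the project's exact SBERT variant and clustering settings may differ), encodes page titles and clusters the embeddings into topics:

```python
# Sketch: embed browser-history page titles with SBERT and cluster into topics.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

titles = [
    "How to fine-tune BERT for text classification",
    "Best hiking trails near Lake Tahoe",
    "Sentence-BERT: Sentence Embeddings using Siamese Networks",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint
embeddings = model.encode(titles, normalize_embeddings=True)

clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
for title, cluster in zip(titles, clusters):
    print(cluster, title)
```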

Comparing Task-Specific Ensembles to General Language Models in Distinguishing Elaborated and Compact Language Codes within CoCA-CoLA

Project summaries unavailable

Abstractive Summarization of Long Medical Documents with Transformers

Summarizing long documents has proven itself a difficult NLP task to perform using current transformer architectures. Transformers have context windows which limit them to processing only short to mid-length sequences of text. For our project, we employed a multi-step method for long document summarization. First, an extractive summarizer extracts key sentences from the original long text, and then an abstractive summarizer summarizes the extracted sentences. This allows us to work around the context window limitation and condition our final summary on text from throughout the whole document. We expanded upon previous implementations of this method by leveraging Transformers for both the extractive and abstractive steps. In particular, we show that our model quantitatively improves performance in the extractive step, and qualitatively provides more context and readability to the abstractive step.

Learning Links Between Low Level and Preferred Medical Terminology with GNNs

Relational databases like MedDRA can be valuable tools for learning technical medical language. In my project, I was interested in leveraging the tree structure of the MedDRA ontology to learn node embeddings for the natural language phrases contained at each level of the ontology. I begin by inferring a graph structure from the hierarchical relations defined between phrases in MedDRA. I then use a Graph Neural Network (GNN) to learn node embeddings which can be used to predict relationships between "low level terminology" - i.e., the kind of phrases used when doctors are talking to patients - and "preferred terminology" - i.e., standardized technical medical jargon. In this project I explored the effects of modeling the ontology using different graph structures (both homogeneous and heterogeneous). I further explore the effectiveness of pairing the GNN with various BERT models. My clearest finding is that the use of a heterogeneous GNN significantly outperforms a standard GNN in all experimental settings. I found that the heterogeneous GNNs are, on average, able to achieve approximately 70% accuracy in predicting the links between low level terminology and the appropriate preferred terminology. Additionally, and somewhat surprisingly, I found that using pretrained BERT models - either specialized medical BERT models like BlueBERT, or the standard base BERT model - to initialize node embeddings did not noticeably outperform random node feature initialization.

iReason: Multimodal Commonsense Reasoning using Videos and Natural Language with Interpretability

Did the dog jump because the girl threw the frisbee? Or are the two events unrelated? Ask a 10-year-old this question and note how easy it is for them to answer. Why, you ask? Humans have a pretty good sense of causality, which is the science of understanding cause and effect among events. Can we impart this knowledge to AI models? Can we get them to understand the commonsense reasoning behind causal relationships that humans find so easy to reason about? Could AI models use this knowledge to get better at certain tasks? Our work seeks to answer just that! While recent models tackle the task of mining causal data from either the visual or textual modality, there is no widespread research that mines causal relationships by juxtaposing the visual and language modalities. Under the visual modality, images offer a rich and easy-to-process resource for mining causality knowledge, but videos are denser and consist of naturally time-ordered events. Also, textual information offers details that could be implicit in videos. Enter iReason, a framework that infers commonsense knowledge using two of the most important modalities humans use to be cognizant of the world around them -- videos and language. By blending causal relationships with the input features to an existing model that performs visual cognition tasks (such as scene understanding, video captioning, and video question-answering), better performance can be achieved owing to the insight causal relationships bring. Furthermore, iReason's architecture integrates a causal rationalization module to aid interpretability, error analysis, and bias detection. Using a two-pronged comparative analysis comprising language representation learning models (BERT, GPT-2) as well as current multimodal causality models, we demonstrate that iReason outperforms the state-of-the-art.

CASCADE + BERT: Using Context Embeddings and Transformers to Predict Sarcasm

Sarcasm is a form of verbal irony, in which a writer or speaker states the opposite of their intended message in order to mock or show contempt. Sarcasm is commonly used online, and being able to detect sarcasm is crucial to understand and classify online pieces of text. In this work, we attempt to classify Reddit comments (from the SARC corpus) as either sarcastic or sincere. Sarcasm is heavily reliant on context -- information about the world or a text author. Therefore, we propose a BERT model, augmented with three different context types: discourse context, user context, and community context. After testing this model, we found that the addition of user context shows increased performance compared to training without user context. However, incorporating community context did not improve performance. We were able to achieve a maximum accuracy of 75.8 percent.

Extracting, and not extracting, knowledge from language models for fact-checking

Fact-checking is a challenging and useful classification task in which a model evaluates the truthfulness of a natural-language claim. Researchers have taken a variety of approaches to building automated fact-checking systems, but recent work has introduced a paradigm that queries a language model to extract factual knowledge. We implement and evaluate several extensions to that pipeline, including alternate strategies for masking the claim and selecting the tokens that the language model predicts for masks. Evaluating these alternatives on a well-known fact-checking dataset, we find that they have minimal impact on overall performance. Motivated by this finding, we construct a drastically simplified version of the pipeline - removing the language model - and find that its accuracy changes little. While its performance remains below the state of the art, these surprising results highlight difficulties in extracting knowledge from language models and introduce a new (to our knowledge) kind of entailment-based fact-checking baseline that involves no language model, corpus, or knowledge base.

Enhancing Cherokee-English Translation System

Cherokee is an extremely low-resource language, which means that there is little parallel Cherokee-English data to analyze. This creates the challenge of finding data-efficient models and useful data augmentation methods for performant machine translation. We propose using transfer learning with neural machine translation models and Inuktitut, a language with similar properties to Cherokee, to improve BLEU scores. We attempt to augment a baseline NMT system by comparing subword-level and character-level embeddings, tuning the vocabulary size and number of iterations per epoch for the parent model in transfer learning, and augmenting the data with copied monolingual data. In aggregate, we find a total improvement of 0.96 BLEU. When looking at the model's performance over different sentence lengths, we find no relationship between sentence length and BLEU score.

Carl: An Empathetic Chatbot

Project summaries unavailable

Assessment of Neural Machine Translation Performance Based on a new Sentence Embedding Cosine Similarity Metric

Project summaries unavailable

NLP for Stock Market Prediction with Reddit Data

Reddit, and the WallStreetBets subreddit in particular, has become a very hot topic in the capital markets since the beginning of 2021. The discussions on these forums show the potential to influence the stock market. My project builds a model to forecast market movement based on the rich text data from Reddit. Specifically, I have explored sentence embedding, document embedding, CNN-based models, and sentiment analysis methods to leverage the text of posts and comments for market forecasting. This project has tested and compared several types of model architectures. So far, the results show that the model can slightly improve performance over a naive forecasting method.

Exploring Knowledge Transfer in Clinical Natural Language Processing Tasks

Clinical-domain NLP tasks generally involve information extraction, text classification, and text summarization from clinical notes or electronic health records. However, due to limited data resources, many NLP tasks in the clinical domain have not been studied as extensively as general-domain NLP tasks. Since transfer learning has achieved great success in many NLP applications, and it especially helps in applications where the sub-tasks share knowledge while data is limited, this project aims to gain insight into knowledge transfer across multiple clinical NLP tasks and to analyze the impact of joint learning on the performance of individual tasks. Our main contributions are twofold: (1) we train a multi-task model based on clinical-BERT on a variety of NLP tasks, including named-entity recognition (NER), sentence entailment, and text classification, and analyze the performance of the model under different task settings; (2) we train an NER model with different entity label annotations and investigate whether knowledge transfers between different entity labels within the same dataset. Our results show that the multi-task model achieves improved results on tasks that share knowledge (e.g., the same task type or a similar data distribution) and that adding different entity annotations can benefit model performance on named-entity extraction. These fundamental findings shed light on how to use transfer learning to improve clinical-domain NLP applications.

Selectively Editable Language Models

Pretrained language models abound in NLP applications, but they are largely static in time and retain stale representations of the world around them. We seek to selectively edit knowledge learned by a language model without affecting its outputs on unrelated samples. This project explores approaches that alter model understanding of named entities through novel training techniques applied to DistilGPT2, a pretrained language model with 82M parameters. We build on methods first developed in general Model-Agnostic Meta-Learning (MAML) frameworks, which allow us to train model parameters on base language model objectives as well as a secondary "adaptability" task. Our results show that this technique improves knowledge editing with less performance degradation on unrelated samples than standard fine-tuning approaches.

Improving Medical Knowledge in the Automated Chest Radiograph Report

The clinical writing of unstructured reports from chest radiograph imaging is error prone, due to its lack of standardization and the repetitive daily report writing, and such errors can prove fatal. A system that generates reports can assist clinicians in reducing errors. Such a system could further be used as a training tool for medical education as well as in the global setting to promote medical accessibility in low-resource areas. Current medical report generation efforts employ the BLEU metric, which has been shown to score clinically meaningless, yet grammatical, random reports better. Further, state-of-the-art methods for this task pretrain the visual extractor on ImageNet, which has been shown to generalize poorly for medical-domain applications. We seek to study the benefit of pretraining with a visual extractor trained specifically on chest radiographs. We also combine both feature extractors to study how this extra input information to the generating model can improve the semantic medical accuracy of the resulting reports. We test each model by evaluating BLEU(1-4) metrics and F1 score performance on each of 14 possible labels as defined by CheXpert for chest radiographs. We find that while the chest radiograph feature extractor model and the double feature model result in lower BLEU scores, they perform better across specific F1 scores and total F1 score. We provide evidence to suggest that the choice of an ImageNet, domain-specific, or combined feature extractor depends on which medical knowledge is most important for the application. This supports further investigation of using a combined domain-specific feature extractor with an ImageNet-pretrained feature extractor for medical image captioning tasks.

Fine-tuning of Transformer Models for High Quality Screenplay Generation

Screenplays contain semantically and structurally rich text: the average movie screenplay is thousands of words (tokens) long and contains long-range dependencies between entity relations and contextual plot elements throughout. Large-scale pre-trained language models (like GPT-2) perform very well in open-domain text generation when the generated outputs are only ten to a few hundred tokens long. This project aims to test how well current large transformer models perform at producing long, coherent texts for the task of movie screenplay generation. We compared the outputs of several different models, such as GPT-2, GPT-2 finetuned for 1 epoch, GPT-2 finetuned for 3 epochs, and a recently published non-monotonic, progressive generation approach (ProGeT), to see which model and architecture could best support high-quality screenplay generation. Generated screenplays were evaluated using traditional n-gram-based statistical similarity scores (BLEU, MS-Jaccard, TF-IDF Distance, Frechet BERT Distance), a context-embedding-based similarity metric called BERTScore, and human evaluation. We found that the non-monotonic generation approach performed best on a set of automated evaluation metrics, including BERTScore. Analyzing all model outputs, we see that the ProGeT model produces outputs that read most similarly to human-written screenplays.

ALBERT: Domain Specific Pretraining on Alternative Social Media to Improve Hate Speech Classification

Fringe online communities like 4chan and Parler are havens of hate speech and tend to develop unique vocabularies that are challenging for hate speech classifiers to decode. These vitriolic environments normalize hateful dialogue proven to elicit real-world violence, making hate speech classification an important issue for both online and offline safety. We perform hate speech classification on three domain-specific hate speech datasets from Twitter, Reddit, and Gab. Our aim is to improve hate speech classification within these fringe communities using transfer learning, by pretraining BERT models on domain-specific corpora from Parler and 4chan. We build off related works by using the BERT base uncased model as our baseline. We contribute to these works by pretraining the BERT model on larger corpora and on corpora of supposedly similar domains to the finetuning datasets, to explore where improvements are possible. We also modified and evaluated the performance of the domain-specific exBERT model. The Parler and 4chan models showed accuracy improvements over the baseline on two of the three hate speech datasets (Gab and Reddit). Importantly, improvements were observed in datasets of a similar domain to our pretraining corpora. The baseline model performed best on the multiclass Twitter hate speech dataset, potentially illustrating domain-specific pretraining's inability to classify hate speech accurately outside of its specific domain. Notably, the Parler model also achieved results similar to the baseline model for the Twitter dataset, which we attribute to the size of the pretraining corpus. Our exBERT model did not show improvements over the baseline due to limitations in the existing exBERT codebase. Future work includes exploring models and datasets that can make classification improvements without the computational requirements needed to train on large corpora.

BioXtract: Learning Biomedical Knowledge From General and Random Data

The privacy of medical documents and protected healthcare information can oftentimes limit the accessibility of accurate biomedical natural language processing models. Distillation can be used to transfer knowledge from these models, but it typically relies on having related data to distill on. In this work, we investigate the distillation of BERT-based biomedical models using transfer datasets from varying domains, including general data, randomized general data, and biomedical data. We find that general data can be used to learn task-specific biomedical knowledge, especially when we can initialize student models with similar weights to the teacher. We observe that randomized general data can also be used to transfer knowledge, but it is not as effective as general data. We hope that our findings bring attention to both the benefits and potential dangers of the widespread use of mixed-domain pretraining in NLP, particularly relating to models that continue their pretraining process on private data.
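As a hedged illustration of the distillation setup described above, a standard soft-label distillation loss combines a temperature-scaled KL term against the teacher's logits with the usual task loss; the temperature and weighting below are common defaults, not necessarily the values used in this work.

```python
# Sketch: knowledge distillation loss on a transfer batch (general, random,
# or biomedical text); teacher and student are any BERT-style classifiers.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend soft-target KL divergence with hard-label cross entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Usage: loss = distillation_loss(student(**batch).logits,
#                                 teacher(**batch).logits.detach(), labels)
```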

Race-Blind Charging

Our work explores the problem of redacting racial information from free-text police incident narratives used by prosecutors to make charging decisions. In addition to describing the incident leading to an arrest or citation, these reports often contain the race and physical description of the suspect. Recent studies have shown that there is reason for concern that the judgments made by prosecutors using these reports may suffer from explicit or implicit racial bias. In this paper, we apply several deep learning approaches to the problem of obfuscating a suspect's race through redaction. We make use of pre-trained models to mitigate data availability issues, and ultimately show that the use of unsupervised pre-trained models fine-tuned on downstream tasks, like named entity recognition, are competitive with the performance of past algorithms designed for this problem, and notably, do not require labeled data or additional human inputs.
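A stripped-down version of the NER-based redaction idea, using an off-the-shelf spaCy pipeline plus a small keyword list of race descriptors (both of which are illustrative assumptions, not the models or term lists used in the study), might look like:

```python
# Sketch: redact person names and explicit race descriptors from an incident
# narrative using pretrained NER; the keyword list is a toy illustration.
import re
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed off-the-shelf pipeline
RACE_TERMS = re.compile(r"\b(white|black|hispanic|asian|latino)\b", re.IGNORECASE)

def redact(narrative: str) -> str:
    doc = nlp(narrative)
    redacted = narrative
    # Replace named persons detected by NER (reverse order keeps offsets valid).
    for ent in reversed(doc.ents):
        if ent.label_ == "PERSON":
            redacted = redacted[:ent.start_char] + "[PERSON]" + redacted[ent.end_char:]
    # Replace explicit race descriptors.
    return RACE_TERMS.sub("[RACE]", redacted)

print(redact("Officer observed a white male, later identified as John Doe."))
```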

Text Ads Generation Using Deep Neural Network

Automatic text summarization has been a well-researched NLP topic in recent years. It is possible to build machine learning models that are capable of distilling crucial information from a larger piece of text and condensing it into a smaller one. Text summarization using deep neural networks has become an effective approach, and there are many use cases for the technique. One possible use case is text ad generation in online search advertising. Many advertisers publish text ads with titles that are not effective. Less effective ad titles result in a lower chance of user conversion (clicks), which is harmful to advertisers and also a waste of hosting resources for the ads marketplace. It is therefore worthwhile to build a model that can rewrite the ad title by summarizing the ad content. In this paper, we study and leverage several state-of-the-art text summarization models, compare their performance and limitations, and finally propose our own solution that could outperform the existing ones.

Intelligent Text Compression

Compression is an essential tool that enables efficient transmission of information across the globe. Inspired by the Poesia et al. paper "Pragmatic Code for Autocomplete," we built a system that applies techniques used for the autocomplete task to the task of natural language text compression. Utilizing a model's learned contextual language understanding, we formulated a method that compresses and decompresses text in a way that achieves a greater compression rate than standard methods (i.e., gzip) with high accuracy (low information loss). Through our experiments we explored and characterized the trade-off between decompression accuracy and compression size, and framed the compression problem as a word-level NLP masked language problem. We created two compression algorithms (one based on TF-IDF, the other inspired by the Poesia et al. paper) and found that the more aggressive compressor (Poesia) does the best job of reducing file size, though at the expense of decompression accuracy. We also found that larger compression vocabularies increase compression, again at the expense of decompression accuracy. While none of our compressors outperform gzip alone, they all outperform standard compressors when applied in tandem with them. When evaluating our systems holistically on both objectives, we found that TF-IDF with N=300 does the best job of balancing the trade-off between decompression accuracy and compression size, as it reduces file size, when applied with gzip, to 42.11% of the original size while maintaining 99.81% overall compression accuracy.
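A toy version of the TF-IDF compressor described above simply drops the least informative tokens and marks their positions so a masked language model can try to restore them at decompression time; the sketch below is a simplified illustration under those assumptions, not the project's actual algorithm (a real compressor would also encode the mask positions compactly rather than writing out a mask token).

```python
# Sketch: TF-IDF-guided word dropping for "intelligent" compression.
# Dropped positions become a mask token that a masked LM would later infill.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat across the yard",
]

vectorizer = TfidfVectorizer()
vectorizer.fit(corpus)
idf = dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_))

def compress(sentence: str, keep_ratio: float = 0.6) -> str:
    words = sentence.split()
    # Rank positions by informativeness (here just IDF for simplicity).
    scored = sorted(range(len(words)), key=lambda i: idf.get(words[i], 0.0))
    n_drop = int(len(words) * (1 - keep_ratio))
    dropped = set(scored[:n_drop])
    return " ".join("[MASK]" if i in dropped else w for i, w in enumerate(words))

print(compress("the dog chased the cat across the yard"))
```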

SleepTalk: Textual DeepDream for NLP Model Interpretability

We propose SleepTalk, a technique for improving interpretability of pre-trained NLP models. Greater interpretability of black box neural networks is imperative for mitigating bias, developing trust in implemented models, and advancing intuition for better transfer learning. Thus, SleepTalk provides an approach to gain human-interpretable data on the learned representations of neurons in large neural networks. We also augment SleepTalk to be used on the adjacent task of unsupervised textual style transfer; synthesizing output text from just a content reference input and a style reference input. We assess the interpretations of SleepTalk and its behavior at different layers and qualify its resulting outputs of pre-trained NLP models. Finally, via results from SleepTalk, we suggest similarities between NLP neural network models and layers of the human brain's temporal and parietal lobe, structures critical to the formation of thoughts.

Evaluating Extractive Text Summarization with BERTSUM

In this paper we dive into how effectively a pre-trained BERT model trained on the CNN and DailyMail dataset can summarize news content. The focus is on the evaluation of the BERTSUM algorithm using the ROUGE metric and a LEAD-3 baseline. This paper comes to the conclusion that evaluating these text summarization models through ROUGE metrics, or other metrics such as BLEU scoring for question answering, is hard to quantify. Additionally, our results show that the BERTSUM model gives better ROUGE recall scores than LEAD-3 summarizations.
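For context, the LEAD-3 baseline just takes the first three sentences of an article, and ROUGE can be computed with the rouge-score package; the sketch below (with a made-up article and reference pair) illustrates how such a comparison is typically set up rather than reproducing this paper's exact evaluation.

```python
# Sketch: LEAD-3 baseline summary scored against a reference with ROUGE.
from rouge_score import rouge_scorer

def lead3(article: str) -> str:
    """Naive LEAD-3: first three sentences, split on periods."""
    sentences = [s.strip() for s in article.split(".") if s.strip()]
    return ". ".join(sentences[:3]) + "."

article = ("The city council approved a new transit plan on Tuesday. "
           "The plan adds two bus lines downtown. Funding comes from a federal grant. "
           "Critics argue the plan ignores suburban riders.")
reference = "Council approves transit plan adding two downtown bus lines funded by a federal grant."

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, lead3(article))
print(scores["rougeL"].recall, scores["rougeL"].fmeasure)
```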

Understanding Emotion Classification In Audio Data

In 2020, 51% of US adults used a voice assistant and in every situation the smart device was ignorant of the users' vocal emotions. To help address this problem, our project empirically identifies the best audio representations and model architectures to use for spoken language emotion classification through extensive experimentation. We propose three new architectures which beat the existing state-of-the-art audio sentiment classification systems for our dataset (RAVDESS). We demonstrate that MFCCs and Mel spectrograms are the most important audio representations for this task and that Tonnetz representation is a decently powerful accuracy booster. Lastly, we reveal large gender disparities in classification accuracies for the most complex models.
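As a small sketch of the audio representations compared above, librosa can extract MFCCs, Mel spectrograms, and Tonnetz features from a RAVDESS clip in a few calls; the file path and feature sizes below are illustrative assumptions, not the exact preprocessing used in the project.

```python
# Sketch: extract the audio representations discussed above with librosa.
import librosa
import numpy as np

y, sr = librosa.load("ravdess_sample.wav", sr=22050)  # hypothetical clip

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)             # (40, T)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)   # (128, T)
log_mel = librosa.power_to_db(mel)                             # log-scaled spectrogram
tonnetz = librosa.feature.tonnetz(y=librosa.effects.harmonic(y), sr=sr)  # (6, T)

# A simple fixed-length feature vector for a classifier: mean over time.
features = np.concatenate([mfcc.mean(axis=1), log_mel.mean(axis=1), tonnetz.mean(axis=1)])
print(features.shape)  # (40 + 128 + 6,)
```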

Clinical Note Generation to Address Physician Burnout

Physician burnout contributes significantly to the decreasing quality and personalization of clinical visits. Burnout in large part can be attributed to tedious and inefficient current EHR systems, as each visit can cost several hours of documentation afterwards. In this paper, we build ClinicalGPT-2, a language model that helps generate clinical note contents, which could be deployed as part of an auto-complete system to increase efficiency of the clinical visit documentation process. We are one of the first to utilize the GPT-2 architecture for this objective, and results show that a small GPT-2 model finetuned on the MIMIC-III clinical note corpus can replicate note structure quite dependably. Furthermore, it often fills in contents of reasonable length and semantic appropriateness. The same model struggles to handle medical abbreviations, special characters, and nuanced formatting, illustrating the importance of data quality and pre-processing. Our findings hold great implications for the feasibility of using such models in text-prediction software under real-life clinical settings.

Evaluating BERT on Question Exhaustivity Inference

We as pragmatic listeners are able to infer the intended exhaustivity of a question, even when the question is semantically underspecified. For example, if somebody were to ask you "Where can I get coffee around here?", you would probably answer by providing the names of a few nearby coffee shops. Even though it wasn't specified anywhere in the question, you would know that the questioner most likely doesn't want an exhaustive list of every coffee shop in the area. In this work, we explore the extent to which BERT can learn these kinds of exhaustivity judgements. We first show that BERT, finetuned on a small dataset of questions and human judgements of exhaustivity, can predict these judgements with high accuracy (r = 0.65, r = 0.59, r = 0.76, and r = 0.7 across the categories of interest). We also provide evidence that the model learns associations between the linguistic features that previous work has identified as influencing how people perceive exhaustivity, and the magnitude of the judgement.
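Since the exhaustivity judgements here are continuous ratings, the fine-tuning step can be framed as single-output regression on top of BERT; the Hugging Face setup below is a hedged sketch of that framing (the model name, optimizer settings, and toy batch are assumptions, not the authors' configuration).

```python
# Sketch: fine-tune BERT to regress a scalar exhaustivity judgement per question.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=1, problem_type="regression"
)

questions = ["Where can I get coffee around here?",
             "Which employees attended the training?"]
ratings = torch.tensor([[0.2], [0.9]])  # toy human exhaustivity ratings in [0, 1]

batch = tokenizer(questions, padding=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
out = model(**batch, labels=ratings)  # MSE loss via problem_type="regression"
out.loss.backward()
optimizer.step()
print(float(out.loss))
```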

BigBirdFLY: Financial Long text You can read

The development of new architectures makes it possible to process long input windows of text at once, overcoming both memory and computational constraints. New developments have pushed maximum input windows to 65k+ tokens, compared to BERT's 512-token limit. We aim to explore, compare, and improve state-of-the-art long-window architectures to summarize long texts. We consider BERT (512 tokens), GPT-3 (2,048 tokens), and BigBird (4,096 tokens), and focus on the financial narrative domain, summarizing 100- to 200-page documents. We aim to test models with different maximum input sizes, exploring their benefits and limitations. Long input windows allow wider context to be included in the summarization process, avoiding out-of-context sentence extraction that can change sentence-level semantics. We compare extractive and abstractive methods on key aspects of the financial context, such as numerical accuracy and summary semantics. We show that extractive methods (BERT-based) can retain sentence-by-sentence accuracy from the text; however, the extraction process can produce fragmented summaries, which can lead to misleading interpretation. We also show that abstractive methods (by introducing BigBirdFLY, a wide-context summarization method based on BigBird) can produce fluent summaries. Using human evaluation, we reveal that BigBirdFLY produces summaries more similar to human-generated summaries and excels on the human evaluation criteria, whereas extractive methods score highly on automatic metrics (ROUGE). Finally, we explore how enhanced greedy sentence-selection methods that exploit the long input window in a single step compare to recursive solutions based on Reinforcement Learning.

Whiskey GPTaster

Project summaries unavailable

Question Answering System Implementation Using QANet Architecture

Reading comprehension and question answering are critical natural language tasks that many modern NLP models benchmark against. Early reading comprehension models were mostly RNN-based; as a result, their training and inference speeds were slow. More recently, there has been a shift toward using attention layers to improve the performance of those models through bidirectional attention. QANet attempts to bring the latest NLP innovations to the question answering problem domain. The QANet authors introduce a novel architecture that achieves both fast and accurate performance. Drawing inspiration from the Transformer paper, the QANet encoder consists exclusively of convolution and self-attention. The feed-forward nature of this architecture drastically improves training and inference performance while maintaining good accuracy. In my project, I implemented the baseline (unaugmented) QANet model from scratch and evaluated it on the SQuAD 1.1 dataset. My analysis of QANet's performance showed that, overall, the model still makes errors that are common to question answering deep learning models. For example, it still struggles with inference, bridging, and analogy, as well as logical reasoning. I also observed that the model tends to get confused by multiple prolific context entries, which causes it to discard relevant context information while focusing on the phrases that are most similar in structure. Finally, QANet also tends to fail when there seems to be more than one viable answer to a question. Possible ways to alleviate these issues could be to add more self-attention heads to the QANet architecture while decreasing the number of encoder blocks. I believe that could help the model better learn global dependencies and could also improve its ability to perform logical reasoning. Moreover, using pre-trained contextual embeddings, or adding other features to the input vectors (e.g., named entity types), might also improve the model's performance.

Low or no resource domain adaptation for task specific semantic parser

Semantic parsers are a critical component of voice agents (like Amazon's Alexa or Apple's Siri). Deep learning models can convert natural language to semantically parsed text, but they require large datasets to learn each individual topic (or domain) in which they must make predictions. Since there is no limit to the number of possible topics a conversational bot may need to handle, developing a robust semantic parser becomes challenging. This bottleneck can be overcome with domain adaptation techniques, where the semantic parser trains on large source-domain data and can adapt to a new target domain with very little target-domain training data. However, the effect the choice of source domain has on the model's prediction capability in the target domain is much less explored, and that is what we study in this work. We conclude from this study that the choice of source domain is critical for domain adaptation tasks and can significantly increase or decrease the prediction accuracy of semantic parser models operating on a specific target domain. We also identify a method to choose the best-fit source domain for a specific target domain using a cosine similarity score. Furthermore, we propose a novel method of designing a semantic parser model with no target-domain training data or target-domain training that can still make good predictions in the target domain.
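A simple way to realize the cosine-similarity-based source selection described above is to embed a sample of utterances from each domain, average them into a domain vector, and pick the source domain whose vector is closest to the target's; the encoder choice and toy utterances below are assumptions for illustration, not the project's exact setup.

```python
# Sketch: pick the best-fit source domain for a target domain via cosine
# similarity between averaged utterance embeddings.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder

source_domains = {
    "music": ["play some jazz", "skip this song"],
    "weather": ["will it rain tomorrow", "how hot is it outside"],
}
target_utterances = ["set an alarm for seven", "remind me to call mom"]

def domain_vector(utterances):
    emb = model.encode(utterances, normalize_embeddings=True)
    v = emb.mean(axis=0)
    return v / np.linalg.norm(v)

target_vec = domain_vector(target_utterances)
scores = {name: float(domain_vector(utts) @ target_vec)
          for name, utts in source_domains.items()}
print(max(scores, key=scores.get), scores)
```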

DeepAiNet: Deep NLP-based Representations for a Generalizable Anime Recommender System

Traditionally, recommendation systems require a long history of user-item interactions in the form of a large preference matrix to perform well, making them impractical without large datasets. We aim to build a successful content-driven recommendation system that takes a hybrid ground between collaborative filtering (CF) approaches based on a preference matrix and nearest-neighbor approaches based on self-supervised embeddings. Specifically, we develop a deep learning, NLP-based anime recommender system named DeepAniNet on top of representations of anime shows called anime2vec. We explicitly train our model to reconstruct user-anime relevance scores for shows with few or zero interactions. Our goal is to demonstrate that deep NLP approaches can extract rich content features to improve both a recommender system's performance and its ability to generalize to new users and anime.

What Did You Just Say? Toward Detecting Imperceptible Audio Adversarial Attacks

Modern adversarial attacks against automatic speech recognition (ASR) systems are vicious and often undetectable to the naked ear. An adversary might play what appears to be classical music but is in fact transcribed by a smartphone as "Hey Google, send $1000 to 123-456-7890". We propose a two-fold mechanism for automatically detecting such attacks. First, we train an LSTM over the output logits of Mozilla's DeepSpeech (Bi-LSTM) model on both adversarial and non-attacked examples to classify such examples, and achieve 99.1% validation accuracy. Second, we pass the raw audio through DeepSpeech twice, once under a 512-frame window MFCC transform and once under a 256-frame window, and compute both the character error rate (CER) and word error rate (WER) between the two output transcriptions. We set a threshold on the CER, classifying everything over 40% error between the transcriptions as adversarial and everything under it as benign, and are able to achieve 99.3% validation accuracy. Our results show that a) adversarial attacks against models are well characterized by a model's own feature representations, and a detector can easily be trained on them, and b) iterative optimization-based attacks (such as the Carlini-Wagner audio adversarial attack used here), which seek to create minimal noise, are highly vulnerable to disruption and rely on an exact inference process, which can be broken by, e.g., altering a preprocessing hyperparameter.
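The second detection mechanism reduces to an edit-distance computation between two transcriptions and a simple threshold; a self-contained sketch using the 40% CER threshold from above, with a toy pair of transcriptions, is given below.

```python
# Sketch: flag an input as adversarial if the CER between transcriptions
# produced under two different MFCC window sizes exceeds a threshold.

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (character level)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    return edit_distance(reference, hypothesis) / max(1, len(reference))

def is_adversarial(transcript_512: str, transcript_256: str, threshold: float = 0.4) -> bool:
    return cer(transcript_512, transcript_256) > threshold

print(is_adversarial("hey google send money", "a over grass send the bunny"))
```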

I Have(n't) Read And Agree To The Terms Of Service

People interact with legalese on a daily basis in the form of Terms of Service, Cookie Policies, and other agreements that (for the most part) are mindlessly accepted. In order to give people transparency into the services they use, we aim to develop an abstractive summarizer which simplifies these legal documents. Leveraging an existing dataset of human interpretations of Terms of Service, known as TOS;DR, we developed a two-step pipeline involving extractive summarization followed by text simplification. We use state-of-the-art, transformer-based methods for both steps of this pipeline. Specifically, for extractive summarization we use the CNN/DM BertExt model presented by Liu et al. For text simplification, we used the ACCESS model presented by Martin et al. We first established a baseline of our extractive summarizer performance via ROUGE, our text simplifier performance via SARI and FKGL, and the performance of our pipeline end-to-end via ROUGE. These baselines were established using the pretrained models provided by the authors of these papers. We then employed a variety of data cleaning and data augmentation techniques, as well as model finetuning, to improve upon these baseline results. We demonstrate that these techniques result in improved model performance on legalese, both when these models are used in isolation and when they are used end-to-end. While we did improve on our baseline results, our qualitative analysis demonstrates that these models are not near the level required for a production-level system, and our results demonstrate shortcomings in the adaptability of these large transformers when pretrained on specific, parallel datasets.

CheXGB: Combining Graph Neural Networks and BERT for automated radiology report labeling

Healthcare systems wish to utilize the large quantities of unlabeled free-text radiology reports for training medical image models. Automated labelers allow healthcare systems to annotate tens of thousands of reports without expensive labor from doctors, which would enable many hospitals around the world to train AI systems on their data. We propose CheXGB, an automated labeler that combines global information, encoded by a heterogeneous graph of the free-text reports and their associated words from a large chest X-ray dataset (MIMIC-CXR), with local context information encoded by BERT. The input to CheXGB is a heterogeneous graph consisting of reports (both labeled and unlabeled) and words. First, all heterogeneous graph nodes are fed through TextGCN, while only labeled reports are passed to BERT. Second, attention is performed over the output of BERT and the nodes corresponding to the labeled reports. Finally, the output of the attention layer is passed through a linear layer for multi-label class prediction. Using explicit global relations encoded by a graph neural network provides inputs that purely NLP models are not trained to provide, which is particularly useful in the data-sparse regime we study. We find that variants of CheXGB outperform CheXbert -- the current state of the art in radiology report labeling -- in 13 out of 14 classes and improve the average kappa across tasks from 0.830 to 0.843.

Recursive Transformer: A Novel Neural Architecture for Generalizable Mathematical Reasoning

Recent works in deep learning investigate whether neural models can learn to reason mathematically. A common finding is that models seem to take non-logical shortcuts when generating an answer, causing them to fail to generalize to more complex arithmetic problems. We create a model that is capable of reducing a complex problem into its respective subparts, taking logical intermediate steps to arrive at an answer. We do so by introducing a recursive framework into the traditional transformer architecture in two different approaches: 1) a strongly supervised variant which teacher-forces each recursive step, and 2) a weakly supervised approach which does not constrain the model's intermediate solutions. The strongly supervised approach not only successfully learns complex addition and subtraction but also demonstrates its ability to extrapolate by performing well when the number of operators increases. We also found that some of the models trained with our approach learned human-interpretable representations of numbers, as well as attention parameters that illustrate their problem-solving process. These results are a testament to the promise of the recursive transformer approach.

Seeking Higher Truths and Higher Accuracies with Multilingual GAN-BERT

Buddhist scriptures are often intentionally written to mirror the style of prior scriptures and quote prior texts verbatim. Moreover, the Buddhist canon is not uniform, split across many languages and schools. We therefore set out to build a model that accepts text from various languages and predicts the overall branch of Buddhism the text originates from, as well as the specific school of origin, formulated as two separate multi-class problems, respectively. In an effort to incorporate and improve upon state-of-the-art approaches in low-resource NLP tasks, we re-implemented and refined the GAN-BERT architecture to investigate methods to enhance fine-tuning for BERT. We also investigate the performance of standalone BERT, mBERT and LSTM models. We report that the LSTM model without pretrained embeddings obtains the highest accuracy on the 17-class classification task.

Quant-Noisier: Second-Order Quantization Noise

Memory and compute constraints associated with deploying AI models on the edge have motivated the development of methods that reduce larger models into compact forms. One such method, known as scalar quantization, does so by representing a neural network with fewer bits. For instance, instead of using 32 bits for each parameter, we can represent them with 8 or fewer. However, naively applying quantization to a neural network after training often leads to severe performance regressions. To address this, Google published a method called "Quantization Aware Training," which applies simulated training-time quantization so the model learns robustness to inference-time quantization. One key drawback of this approach is that quantization functions induce biased gradient flow through the network during backpropagation, preventing the network from best fitting the learning task at hand. Facebook AI Research (FAIR) addressed this issue by proposing "Quant-Noise," in which simulated quantization is applied during training to a fixed proportion of parameters, called the "noise rate," rather than to all of them. FAIR's method set a new state of the art for quantization. Our method, "Quant-Noisier," builds upon this technique by using a variable noise rate instead of a fixed one, which we term "second-order noise." We craft four candidate functions to vary the noise rate during training and evaluate the variants with 129 experiments: 3 datasets, 3 quantization schemes, several methods, and 3 random seeds for most trials. Quant-Noisier with a stochastic second-order noise variant outperforms Quant-Noise on two out of three quantization schemes for all three tested datasets. Moreover, on two of the datasets, our method at 4x compression matches or exceeds the performance of even the uncompressed model. We hope that our novel compression approach improves the tractability of model training and inference for a wide range of embedded computing applications.
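To make the "second-order noise" idea concrete, the sketch below applies simulated int8 quantization to a random subset of a weight tensor, where the subset's size (the noise rate) is itself resampled each call instead of being a fixed constant; this is an illustrative reimplementation of the general Quant-Noise mechanism, not FAIR's code or the exact Quant-Noisier schedule.

```python
# Sketch: Quant-Noise-style training noise with a *variable* (second-order)
# noise rate: each step, a random fraction p of weights is fake-quantized.
import torch

def fake_quantize(w: torch.Tensor, bits: int = 8) -> torch.Tensor:
    """Uniform symmetric fake quantization of a weight tensor."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    return torch.round(w / scale).clamp(-qmax, qmax) * scale

def quant_noisier(w: torch.Tensor, mean_rate: float = 0.5, std: float = 0.2) -> torch.Tensor:
    """Apply fake quantization to a random subset of weights whose proportion
    is drawn around mean_rate (the stochastic second-order noise idea)."""
    p = float(torch.empty(1).normal_(mean_rate, std).clamp(0.0, 1.0))
    mask = (torch.rand_like(w) < p).float()
    # Straight-through estimator: forward uses quantized values on the masked
    # subset, backward passes gradients through unchanged.
    return w + mask * (fake_quantize(w) - w).detach()

weight = torch.randn(4, 4, requires_grad=True)
noisy = quant_noisier(weight)
noisy.sum().backward()  # gradients still flow to all of `weight`
```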

GLARE: Generative Left-to-right AdversaRial Examples

img
Although well-studied in computer vision, adversarial examples are difficult to produce in the NLP domain, partially due to the discrete nature of text. Previous approaches (drawing on constrained methods such as rule-based heuristics and synonym substitution) have attained relatively limited success, largely because they consider neither syntactic nor semantic structure. As a result, adversarial examples yielded by these models often suffer from a lack of grammaticality, idiomaticity, and overall fluency. Recently, transformer models have successfully been applied to adversarial example generation, vastly outperforming previous state-of-the-art approaches. Current transformer-based textual adversarial frameworks use the masked language models (MLMs) BERT or RoBERTa to generate word-level replacements, essentially re-purposing their pretext task (masked token prediction). Yet this is not an ideal fit: ultimately, an MLM's objective is not text generation. This is, however, precisely the explicit objective of another class of models: generative language models. Therefore, we propose a novel textual adversarial example generation framework based on generative LMs (rather than MLMs). Those familiar with GPT-2 may ask: unlike BERT, doesn't it lack bidirectional context? To address this shortcoming, we adopt the ILM (infilling language model) framework introduced by Donahue et al. (2020), which allows the model to read the entire sentence before infilling. Our method (GLARE) generates word- and span-level perturbations of input examples using a fine-tuned ILM model combined with a word importance ranking algorithm. Notably, our algorithm can easily insert spans of arbitrary length, something that neither CLARE nor previous approaches achieve. Armed with the best of both worlds (ease of generation and bidirectional context), GLARE outperforms CLARE, the current SOTA, on a variety of metrics (attack success rate and cosine similarity between the perturbed and original text) while simultaneously increasing output fluency.
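The word importance ranking component can be illustrated with a simple leave-one-out probe: delete each token and measure how much the victim classifier's confidence in the gold label drops. The sketch below uses a hypothetical `predict_proba` toy model and is meant only to show the ranking step, not GLARE's full perturbation pipeline.

```python
def rank_word_importance(tokens, true_label_prob, predict_proba):
    """Rank tokens by how much deleting each one lowers the victim
    model's probability for the gold label (leave-one-out saliency)."""
    scores = []
    for i in range(len(tokens)):
        ablated = " ".join(tokens[:i] + tokens[i + 1:])
        drop = true_label_prob - predict_proba(ablated)
        scores.append((drop, i, tokens[i]))
    return sorted(scores, reverse=True)  # largest probability drop first

# Toy victim model: "probability of positive sentiment" rises if "great" is present.
def toy_predict_proba(text: str) -> float:
    return 0.9 if "great" in text else 0.4

tokens = "the movie was great fun".split()
print(rank_word_importance(tokens, toy_predict_proba(" ".join(tokens)), toy_predict_proba))
```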

Document Matching for Job Descriptions

img
This project trains an encoder to obtain embeddings of job descriptions, with the goal of classifying them into standardized job roles.

Template-free organic retrosynthesis with syntax-directed molecular transformer networks

img
Retrosynthesis, the process of identifying precursors that can be used to synthesize a product, is one of the fundamental problems in organic chemistry. The advent of generative deep learning models has rapidly improved template-free retrosynthesis planning, where a retrosynthetic step can be modeled as a sequence-to-sequence task between the string representations (SMILES) of the molecules involved in the reaction. However, many existing methods either prune reaction datasets of important stereochemical information, or they output SMILES strings that are often not syntactically correct. We address both of these issues by developing a syntax-directed molecular transformer (SDMT), trained on template- and rule-free reaction data without removal of stereochemical designations. SDMT adds a lightweight modification to the traditional transformer architecture by using the syntactic dependency tree of the input SMILES string to restrict self-attention. SDMT performs competitively in accuracy with the current state-of-the-art text-based and graph-based retrosynthesis models, while outperforming them in invalid-SMILES rate. We show that SDMT more consistently outputs syntactically and semantically valid SMILES strings across all top predicted results, and that it can be used as an effective way to directly integrate the syntactic structure of SMILES strings into transformer models for reaction prediction.

MedDRA2Vec: Training Medical Graph Embeddings for Clinical NLP

img
Creating word embeddings for biomedical terminology is an up-and-coming field. While vast amounts of data are available (encoded in texts, medical codes, ontologies, and patient data), researchers have struggled to create medical term embeddings that are widely used, reproduced, and studied, since many primary data sources of clinical notes and patient data have restricted access. In response to these challenges, our research derives embeddings from an open-access resource, namely MedDRA, the Medical Dictionary for Regulatory Activities. We use two different embedding methods, Poincaré and Node2Vec, and evaluate the resulting embeddings using cosine concept similarity and the prediction of diagnoses for a patient visit given the diagnoses of a previous visit. Our research shows that on the concept similarity and patient diagnosis tasks, MedDRA embeddings are comparable to, and in some cases better than, BioBERT embeddings. The latter embeddings are derived from a BERT model that is additionally trained on large biomedical corpora, namely PubMed articles. The MedDRA embeddings also performed close to the accuracy of Snomed2Vec embeddings on the patient diagnosis task, although slightly worse. Snomed2Vec embeddings are trained using the SNOMED-CT ontology, which is one of the largest open-access biomedical ontologies. Our research shows that ontologies other than SNOMED-CT can be used to derive competitive medical term embeddings. We also provide our full code base and resulting embeddings for reproduction and further research within the biomedical NLP field.

Exploring RoBERTa's theory of mind through textual entailment

img
Can transformer models reason about the thoughts of other humans the way we can? Within psychology, philosophy, and cognitive science, theory of mind refers to the cognitive ability to reason about the mental states of other people, thus recognizing them as having beliefs, knowledge, intentions, and emotions of their own. In this project, we construct a natural language inference (NLI) dataset that involves theory-of-mind inferences related to knowledge and belief. We test the dataset on RoBERTa-large finetuned on the MNLI dataset. Experimental results show that the model struggles with such inferences, even after attempts at further finetuning.

Six Approaches to Improve BERT for Claim Verification as Applied to the Fact Extraction and Verification Challenge (FEVER) Dataset

img
BERT has been used in various research on fact extraction and verification tasks, such as tweet classification, hate speech detection, and fake news detection. However, BERT suffers from various issues when applied to claim verification, which can help detect and classify misinformation. The goal of our project is to apply the BERT model to the FEVER (Fact Extraction and Verification) task, specifically claim verification, and to suggest and implement six approaches for improving the original BERT model. We aim to gain valuable insights into the effectiveness of various model improvements for claim verification and hope to support efforts to combat the spread of misinformation on the internet with our experiments. We conducted an end-to-end analysis of improvements to BERT for claim verification on the FEVER task, ranging from pre-processing evidence via data augmentation (synonym replacement and back-translation), to changing the transformer settings (BERT vs. DistilBERT and the number of epochs), to post-processing the results neurally. Our modifications did not result in significant changes to the FEVER score, and the BERT baseline remained the best-performing model. Applying our neural aggregation layer, however, did improve performance for the DistilBERT model. This may be because BERT is a large model with a lot of pre-trained knowledge, so our changes to the fine-tuning process and aggregation layer have less impact on its performance than on the smaller DistilBERT model.

Analysis of Bias in U.S. History Textbooks Using BERT

img
U.S. History textbooks have a profound influence on children's social understanding of the United States. This is the reason activists and social scientists analyze textbooks for issues of bias and representation. Computational NLP methods can provide more holistic analyses compared to traditional qualitative studies. Our research supplements prior word2vec analyses of gender word relations in 15 U.S. History textbooks used in Texas by taking advantage of BERT's versatility with two studies. First, we compare BERT's embeddings between gender and interest words (related to home, work, and achievement). Second, we mask out the gender word in each context and evaluate BERT's ability to predict the correct gender in different contexts. We repeat both studies on fine-tuned and pretrained BERT. Furthermore, our analysis is done with all textbooks taken as a collective, as well as stratified by the historical time period discussed. Overall, we find that the textbooks contain idiosyncrasies that tend to associate women with "home" and "work" contexts more strongly than "achievement," and that these trends stay relatively constant over the historical time periods discussed.
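The second (masked-prediction) study can be sketched with the Hugging Face fill-mask pipeline, comparing the scores BERT assigns to gendered pronouns in a masked context; the model name and example sentence below are stand-ins, not the fine-tuned models or textbook passages used in the analysis.

```python
from transformers import pipeline

# Pretrained BERT as a stand-in; the study also repeats this with a fine-tuned model.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

context = "After finishing the housework, [MASK] went back to work on the farm."
# Restrict predictions to gendered pronouns and compare their scores.
for pred in fill_mask(context, targets=["he", "she"]):
    print(pred["token_str"], round(pred["score"], 4))
```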

Toxicity Detection: Does the Target Really Matter?

img
Toxic content moderation is key to keeping the Internet safe. We show that the toxicity level of a comment can be better assessed by also considering its target (i.e., towards whom it is directed). As the cherry on the cake, the target can also be automatically predicted from the comment with high precision.

Translating Natural Language Questions to SQL Queries

img
Text-to-SQL models have the potential to democratize data analytics by making queries as simple as asking natural language English questions. Sequence-to-sequence models have performed well at the Text-to-SQL task on datasets such as WikiSQL. However, most prior work does not examine the generalizability of these models to unfamiliar table schemas. We build on the ideas introduced by Chang et al. to improve a sequence-to-sequence dual-task learning model so that it generalizes better on a zero-shot testbed consisting of schemas the model has never encountered before. We use the pre-trained BERT-based TAPAS transformer model to encode more expressive table representations for the schema, in addition to the existing BiLSTM-based encodings. Additionally, we use techniques from semantic parsing research, such as the coverage mechanism and more flexible attention algorithms, to propose a model that achieves an accuracy improvement of more than 5% over the base dual-task sequence-to-sequence model on the zero-shot test set.

Data Augmentation and Ensembling for FriendsQA

img
Rapid progress on Question Answering (QA) has been made in recent years. However, widely used QA benchmarks, such as SQuAD, Natural Questions, and NewsQA, mostly consist of passages from Wikipedia or other online sources, which cover only one category of human language. Another crucial aspect of language comes in the form of everyday conversations, and understanding them is equally important for better machine comprehension of human language. In this paper, we explore FriendsQA, a question answering dataset that contains 1,222 dialogues and 10,610 open-domain questions based on transcripts from the TV show Friends. It is the first dataset that challenges span-based QA on multiparty dialogue with daily topics. We aim to improve model robustness and performance on the FriendsQA dataset via data augmentation and ensembling. We generated 4 new training datasets of well-paraphrased contexts and questions through back-translation. We proposed a novel method for finding answers in paraphrased text using the sum of word embeddings: when looking for answers in the back-translated context, we compare phrases by taking the sum of their word embeddings before computing the normalized mean squared error. This method effectively compensates for the disadvantage of sentences that are paraphrased well but are longer than the originals. We trained BERT on the augmented datasets and then ensembled BERT-large models, pushing state-of-the-art F1 / EM scores on FriendsQA from 69.6 / 53.5 to 72.08 / 54.62.
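A simplified NumPy illustration of that span-matching idea, with a toy embedding table standing in for pretrained vectors: represent the original answer and every candidate span in the back-translated context by the sum of their word vectors, then keep the span with the lowest normalized mean squared error.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy word-embedding lookup; the project would use pretrained vectors instead.
vocab = {w: rng.normal(size=50) for w in
         "joey ate the last slice of pizza final piece".split()}

def span_vector(words):
    """Sum of word embeddings for a phrase (unknown words are skipped)."""
    return sum((vocab[w] for w in words if w in vocab), np.zeros(50))

def normalized_mse(u, v):
    return np.mean((u - v) ** 2) / (np.mean(u ** 2) + np.mean(v ** 2) + 1e-12)

answer = "the last slice of pizza".split()
context = "joey ate the final piece of pizza".split()

# Score every candidate span in the back-translated context against the answer.
best = min(
    ((normalized_mse(span_vector(answer), span_vector(context[i:j])), context[i:j])
     for i in range(len(context)) for j in range(i + 1, len(context) + 1)),
    key=lambda x: x[0],
)
print(best)  # lowest-NMSE span and its score
```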

Ranked Keywords to Story Generation

img
This project attempts the following task: given a set of ranked keywords, construct a coherent short story. The goal is for the model to use all the keywords while still being grammatically and logically correct. One could imagine this task being used to inspire writers with creative story ideas. For example, given the words 'josh, streets, living, adopted, happy', the model could output: Josh is a black dog. He was living on the streets. A nice man stopped when he saw Josh. He became attached to Josh. So the man adopted Josh, and Josh is very happy with his new family. To solve this task, we use both the traditional method of finetuning large pretrained language models as well as the recently introduced Plug and Play Language Model (PPLM) strategy, which leverages the power of pretrained language models without finetuning them. For the Plug and Play strategy, we introduced custom attribute models to guide language models to generate stories containing the desired keywords, especially those with higher rank. Unlike the original PPLM paper, which focuses on perturbing the generation of zero-shot unconditioned language models, we experimented with zero-shot, low-resource, and fine-tuned language model choices, and compared the relative improvement in the PPLM generations. We found that finetuned language models perform much better than the default PPLM approach, but our custom combination of a finetuned language model with an attribute model performed the best overall. Finally, we performed error analysis on all our approaches and found that, in spite of introducing more grammar mistakes, PPLM improved keyword usage, reduced the number of contradictory sentences in the stories, and generated stories with better endings.

Predicting Hedge Fund Holdings from 10K/Q Text Analysis

img
Project summaries unavailable

Multilingual CheXbert: Radiology Report Labeling in Spanish

img
Automatic label extraction from free-text radiology reports enables efficient and large-scale training of natural language processing models for the medical setting. The current state-of-the-art label-extraction model, CheXbert, has been shown to work well on English-language radiology reports, but has not yet been tested in the multilingual setting. In this work, we explore how well Multilingual BERT performs on Spanish-language radiology reports. We find that regardless of whether the model is finetuned on English reports or Spanish reports, Multilingual BERT offers no real performance gains over English BERT when evaluating on Spanish-language reports. Furthermore, we show that while finetuning on human-labeled reports is better than finetuning on automatically-labeled reports, finetuning first on automatically-labeled reports and then further finetuning on human-labeled reports offers the best results.

Are Captions All You Need? Investigating Image Captions for Multimodal Tasks

img
In today's digital world, it is increasingly common for information to be multimodal: images or videos often accompany text. Sophisticated multimodal architectures such as ViLBERT and VisualBERT have achieved state-of-the-art performance on vision-and-language tasks. However, existing vision models cannot represent the contextual information and semantics of images the way transformer-based language models can for text. Fusing the semantically rich information coming from text thus becomes a challenge. In this work, we study the alternative of first transforming images into text using image captioning. We then use transformer-based methods to combine the two modalities in a simple but effective way. We perform an empirical analysis on different multimodal tasks, describing the proposed method's benefits, its limitations, and the situations where this simple approach can replace large and expensive handcrafted multimodal models.

Classifying Emotions in Real-Time

img
Currently, deep learning systems have difficulty understanding human emotion in real time. This difficulty has negative implications in a variety of real-world situations, such as chatbots and virtual assistants. The goal of this project is to address this by building a system that can understand human emotion in real-world dialogues. To tackle this problem, we take advantage of the EmotionLines corpus, which consists of dialogues labeled by utterance. We define our task to be real-time utterance-level emotion recognition (ULER), with real-time meaning that our system can only see previous utterances within a dialogue. Ultimately, we were able to both build a series of multi-level models and fine-tune BERT on a few different tasks to improve on the CNN baseline from the EmotionLines paper. Finally, in anticipation of future work, we collected a dataset of brief dialogues between users and virtual assistants labeled by errors. Our hope is that by using real-time ULER, future systems can learn to associate user emotions, such as surprise or anger, with virtual assistant errors.

GameWiki: Aspect Extraction for Video Games

img
In this project we aim to 1) predict whether a game review is helpful or not and 2) extract aspects from helpful game reviews. This is useful for gamers in identifying the most interesting as well as the most disliked aspects of a game before purchasing it. A ULMFiT model is trained on Steam reviews to identify whether a review is helpful, and the trained model is then used to predict whether Metacritic reviews are helpful. The predicted reviews are split by the top 3 genres based on the number of games: action, sports, and fantasy. The predicted Metacritic reviews for each genre are fed into the aspect extraction model. For aspect extraction, we use an unsupervised neural attention model. Traditional topic models for aspect extraction tend not to produce highly coherent aspects and have limited interpretability, since they assume all words are generated independently. This model improves coherence because it uses neural word embeddings, which capture the distribution of word co-occurrences. Further, the interpretability of the aspects has been improved by splitting the dataset by genre: the model was able to extract more granular aspects particular to each game genre, as opposed to aspects generated from the entire gaming dataset.

Automatically Neutralizing Ableist Language in Text

img
Ableism is the systemic oppression of, or discrimination against, people with disabilities. It is often reinforced through language that perpetuates harmful biases and stigmatizes those with disabilities. However, such language can often be difficult to detect due to its pervasiveness in mainstream media. To address this issue, we introduce the first parallel corpus of ableist language, as well as a model for natural language generation that automatically brings ableist text into a neutral point of view. Our corpus contains 1500 sentence pairs that originate from movie scripts, news articles, and speech transcripts. Our language generation model is a concurrent system that utilizes a BERT encoder to identify and replace ableist words and phrases as part of the language generation process. In addition, we contribute a self-training pipeline that can generate more training data for the task of neutralizing ableism, as well as a novel evaluation method to more quantitatively assess a model's prowess at reducing bias. Human evaluation and our novel evaluation method suggest that these data and models are a first step towards the automatic identification and reduction of ableism in text.

Transformers for Textual Reasoning and Question Answering

img
Transformers are the predominant architecture of choice for current neural reasoning tasks. This is because of their strong performance, but also because of their convenience: with the advent of large pretrained models such as BERT and GPT, one can simply fine-tune them on different tasks. However, canonical transformer models have been found to learn and rely on heuristics at evaluation time when trained on simple training sets, such as SQuAD or RuleTakers, that require only local phrase matching or shallow textual reasoning. In particular, the high performance transformers achieve on these tasks does not demonstrate an ability to learn long-range relations or a holistic understanding of the text. We propose making the transformer's attention mechanism sparser so as to encourage generalized learning, as well as using a semi-synthetic dataset to generate training and testing examples that encourage robustness and inference. Our results demonstrate that these changes yield improvements in performance on difficult reasoning tasks, generalizability, and learning efficiency.

PopNet: Evaluating the Use of LSTMs and GPT-2 for Generating Pop Lyrics From Song Titles

img
Many artists now use lyricists to write the lyrics for their songs. We thought that it would be interesting to implement models which are able to take the place of a pop lyricist and generate song lyrics. Currently, LSTM models have been used to generate lyrics and verses, but not songs, and GPT-2 models have been shown to be effective on creative text generation problems, but have not yet been used to generate song lyrics. We implemented LSTM models and fine-tuned GPT-2 models to take in a song title and then generate either 1) a line of lyrics, 2) the lyrics for a song verse, or 3) the lyrics to an entire song, because we thought it would be interesting to characterize the behavior of LSTMs at generating longer pieces of text and to employ GPT-2 on a new task. Through perplexity scores, BERT scores, and human evaluation results, as well as qualitative evaluation, we see that our fine-tuned GPT-2 and LSTM models greatly outperform our baseline, the out-of-the-box pre-trained GPT-2 model, in generating pop lyrics, verses, and songs. Through our human evaluation, we find that a fine-tuned GPT-2 is able to generate realistic pop lyrics and verses, and decent pop songs. The fine-tuned GPT-2 model outperformed the LSTM model in all three generation tasks, most likely due to the difficulty the LSTM cell state has in preserving upstream information and the difficulty LSTM attention has in identifying different relevant parts of the input for different parts of the output.

Fake News Detection and Classification with Multimodal Learning

img
In recent years, the prevalence of fake news has increased significantly with the rapid progress in digitization and the rise of social media. It has harmed our society greatly by spreading misinformation and escalating social issues. To combat the spread of misinformation across multiple modalities, we experimented with various new multimodal machine learning models and multimodal feature fusion techniques to improve the current benchmark on fake news detection with the Fakeddit dataset. Although the baseline results from the dataset authors are already quite impressive, we believe more sophisticated visual/language feature fusion strategies and multimodal co-attention learning architectures could capture more of the semantic interactions and associations between the visual and language features that come in pairs in fake news. The understanding of visuals should be conditioned on the text, and vice versa. This belief motivated us to explore several new approaches to this problem, including mBERT, MuRel, and ViLBERT, after implementing the baseline model as a benchmark. Our experiments demonstrate the importance of learning associations between the two modalities and aligning visual and text signals in the fake news detection task. In addition, visually grounded language understanding has been shown to be transferable and pretrainable across different vision-and-language tasks.

Moral Style Transfer

img
Moral reframing is the process of framing a statement in a way that is consistent with an individual's moral values in order to garner support from that individual. Prior studies have shown that morally reframed messages help in persuasion on a range of topics, from environmental protection to reducing political polarization. We developed a moral classifier and a generator to morally reframe texts, which is the first NLP approach to do so. Specifically, we developed a moral classifier that can predict the underlying morals in a text using a BART-based architecture, and a moral reframer that can perform moral reframing of an input text based on a target set of morals using a variational autoencoder architecture. Our best moral reframer model achieves an F1 score of 0.998 on our test set and a BERTScore (using BART) of 0.972, thus being able to perform moral reframing while preserving content.

Applying Transformers and NLP Computational Techniques to America in One Room

img
Project summaries unavailable

How Low Can You Go? A Case Study in Extremely Low-Resource NMT

img
Neural Machine Translation (NMT) has improved dramatically in the past decade, with many NMT systems for high-resource languages approaching human-quality translations. However, many of the world's languages are low-resource, with very little digitized parallel data available to train NMT models. Although there have been many advancements in developing techniques for low-resource NMT, many languages still have orders of magnitude less data than those used in the associated studies. One such extremely low-resource language is Cherokee, which has fewer than 15,000 parallel Cherokee-English sentences available. We present a case study that evaluates the efficacy of common low-resource NMT techniques on Cherokee-English (ChrEn) translation. We analyze the performance of data augmentation, noisy self-training, back-translation, aggressive word dropout, pre-trained word embeddings, pre-trained decoders, mT5, and additional LSTM layers in improving a ChrEn NMT system. We find that pre-training the decoder with 100,000 monolingual English sentences and back-translation using 5,000 English sentences offer a 0.9 and 0.8 BLEU score improvement over the baseline, respectively, while noisy self-training and aggressive word dropout provide inconsistent benefits in this extremely low-resource setting.

Annotating Sparse Risk Factors in Clinical Records with BERT

img
Though there is an abundance of medical information collected in patient clinical records, these records are typically in the form of fragmented free text, such that the task of extracting the relevant pieces can be costly. In this project, we revisit the 2014 i2b2 challenge for identifying risk factors for heart disease in clinical records, focusing on annotating smoking status and family history of cardiovascular disease, two of the most difficult risk factors in the challenge due to the sparsity of their less common classes. The teams participating in the 2014 challenge applied a combination of hand-written rules and classifiers such as SVMs; the objective of this paper is to adapt more recently developed transformer models to this task in order to evaluate their suitability and to understand whether they can be trained as a substitute for the more explicit reasoning in rule-based systems. Fine-tuning BERT, as well as Clinical BERT and BlueBERT (two BERT-initialized models further pre-trained for the clinical and biomedical domains), we find that Clinical BERT and BlueBERT achieve slightly higher F1 scores than BERT, but within the margin of error. Moreover, we find that basic oversampling and class-weighting approaches to addressing the class imbalance do not improve the overall performance of the BERT models on this task, as the tradeoff weakens the models' performance on the more common classes. The span of text extracted from a clinical record as most relevant to the risk factor, and the length of that span, do significantly impact performance, however; for the smoking risk factor, with simple heuristics for extracting the relevant part of a clinical record, BERT models achieve performance comparable to many of the highest-scoring models from the 2014 challenge.
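The class-weighting baseline mentioned above can be sketched as inverse-frequency weights passed to a standard cross-entropy loss; the label distribution and weighting scheme below are illustrative, not the exact configuration used in these experiments.

```python
import torch
from collections import Counter

# Hypothetical label distribution for a sparse risk factor (e.g. smoking status).
labels = (["unknown"] * 800 + ["non-smoker"] * 150 +
          ["current smoker"] * 40 + ["past smoker"] * 10)
classes = sorted(set(labels))          # class order must match the label indices used in training
counts = Counter(labels)

# Inverse-frequency weights, normalized so the average weight is 1.
weights = torch.tensor([len(labels) / (len(classes) * counts[c]) for c in classes])
loss_fn = torch.nn.CrossEntropyLoss(weight=weights)

print(dict(zip(classes, [round(w, 2) for w in weights.tolist()])))
```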

ChePT: Applying Deep Neural Transformer Models to Chess Move Prediction and Self-Commentary

img
Traditional chess engines are stateful; they observe a static board configuration, and then run inference to determine the best subsequent move. Additionally, more advanced neural engines rely on massive reinforcement learning frameworks and have no concept of explainability: moves are made that demonstrate extreme prowess, but oftentimes make little sense to the humans watching the models perform. We propose fundamentally reimagining the concept of a chess engine by casting the game of chess as a language problem. Our deep transformer architecture observes strings of Portable Game Notation (PGN), a common string representation of chess games designed for maximum human understanding, and outputs strong predicted moves alongside an English commentary of what the model is trying to achieve. Our highest-performing model uses just 9.6 million parameters, yet significantly outperforms existing transformer neural chess engines that use over 70 times as many parameters. The approach yields a model that demonstrates strong understanding of the fundamental rules of chess, despite having no hard-coded states or transitions as a traditional reinforcement learning framework might require. The model is able to draw (stalemate) against Stockfish 13, a state-of-the-art traditional chess engine, and never makes illegal moves. Predicted commentary is insightful across the length of games, but suffers grammatically and contains numerous spelling mistakes, particularly in later game stages. Our results are an insight into the potential for natural language models to gain traction on tasks traditionally reserved for reinforcement learning models, while additionally providing a degree of insight into the decisions made. These findings significantly build on the work of Noever et al., Jahmtani et al., and Zang et al.

Translating Code-Switched Texts From Bilingual Speakers

img
Code-switched language is commonly found in interactions between bilingual individuals, yet it has not been optimized for in NMT. A recent machine translation paper concluded that it would be interesting to implicitly identify the language of foreign word segments and carry out the translation with an appropriate translation system. The goal of this project is therefore to create a model best suited for code-switching translation tasks based on this intuition. Specifically, given an input of bilingual, or code-switched, text, we want to create a model that outputs a translation of the text in one desired language. We experiment with two approaches for this task. The first is an LID-Translation Model Pipeline approach: a two-model pipeline that 1) uses a language identification (LID) model to determine which words in a bilingual text need to be translated and 2) translates these identified words via a standard translation model. This approach includes a translation model we fine-tuned ourselves and was motivated by the fact that our code-switched dataset did not have ground-truth translations. Second, during the course of our project a new code-switched dataset was released, so we also trained a Direct-Translation Bilingual Model on the newly released bilingual data. This bilingual model was also tested with and without the LID-Translation Pipeline. Our results show two main findings. First, the LID-Translation Model Pipeline performs better than a direct translation pipeline. Second, the Direct-Translation Bilingual Model performs better than regular translation models. These results suggest that there is significant potential for machine translation optimized for code-switched data, particularly with the recent rise in bilingual corpus availability.

Data Augmentation for ASR using CycleGAN-VC

img
There is a significant performance gap in ASR systems between black and white speakers, which is attributed to insufficient audio data from black speakers available for models to train on. We aim to close this gap by using a CycleGAN-based voice converter to generate African American Vernacular English utterances from generic American English utterances as a data augmentation strategy. By using a two-step adversarial loss and a self-supervised frame-filling task, we were able to noticeably improve the qualitative performance of our CycleGAN-based voice conversion pipeline. In spite of this, we could not establish CycleGAN-based voice conversion as a reliable method for data augmentation. While this project was challenging, it was especially rewarding to conduct this line of research, which has the ultimate goal of ensuring that marginalized voices are heard.

Learning Representations of Eligibility Criteria in Clinical Trials Using Transformers

img
A clinical trial's eligibility criteria can have a significant impact on the successful completion of the study, as they determine essential factors such as recruitment efficiency, patient withdrawal rates, and translational power. Most inclusion and exclusion criteria are written in free text, which makes a systematic review and analysis of these criteria prohibitive on a large scale. In our project, we address these issues by learning standardized representations of eligibility criteria using transformers. In particular, we pretrain a BERT model on a large unlabeled corpus of eligibility criteria acquired from ClinicalTrials.gov. Using Named Entity Recognition (NER) as a proxy for the quality of our representations, we show that our pretrained model (ecBERT) outperforms other publicly available biomedical BERT models, suggesting the benefit of domain-specific representations for eligibility criteria.

Fine-Tuning Transformer-XL on Clinical Natural Language Processing

img
Although many applications based on mining clinical free text have been developed, state-of-the-art transformer-based models have not been applied in clinical NLP. We aim to address the long-range dependencies in clinical free text, caused by its different sections, with the Transformer-XL model by fine-tuning it on MIMIC-III clinical text. Having requested and cleaned the MIMIC-III clinical text with self-developed rules, we prepared the data for training classifiers to predict diagnostic codes for 8 common cardiovascular diseases. We used the Hugging Face API to fine-tune and evaluate Transformer-XL on the MIMIC-III dataset and compared the results with baseline methods, including bag-of-words and TF-IDF. Transformer-XL outperformed bag-of-words and TF-IDF on 3 of the 6 tasks for which we have results so far. Furthermore, Transformer-XL was fine-tuned for only 1 epoch, so we believe a more thoroughly fine-tuned model has promising potential to predict diagnostic codes more accurately. Better diagnostic-code accuracy aids the structuring of free-text clinical notes, which benefits downstream machine learning tasks such as survival prediction and multi-modality data fusion, because structured diagnostic codes can be fed into machine learning models more easily than unstructured text.

Investigating Techniques for Improving NMT Systems for Low Resource Languages

img
Neural Machine Translation (NMT) has become the standard for machine translation tasks; however, NMT systems encounter many technical challenges when trained on low-resource language pairs. In this paper, we investigate how different subword and word representations, as well as different data augmentation techniques, can improve NMT performance on low-resource languages. For our baseline, we train an encoder-decoder based seq2seq NMT model on a scarce Nepali-English dataset. Then, we compare different subword and word representations, such as Byte Pair Encoding (BPE) and a reduced vocabulary set. Finally, we augment our training data with back-translation of monolingual data, transfer learning from Hindi, and noisy data. In addition, we propose a new variant of back-translation for low-resource NMT that exceeds the performance of traditional back-translation methods. We find that BPE was the best-performing subword representation. For data augmentation, we find that transfer learning and noisy data give reliable improvements, while back-translation requires careful management of noise levels. By utilizing our novel variant of back-translation alongside BPE and auxiliary data methods in combined models, we are able to increase in-domain performance by +4.55 BLEU and out-of-domain performance by +3.93 BLEU compared to the baseline.
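The BPE subword representation can be illustrated with a tiny Sennrich-style merge learner on a toy English word list; the actual experiments learn merges over the Nepali-English training data with a much larger merge budget, so this is only a sketch of the mechanism.

```python
import re
from collections import Counter

def learn_bpe(corpus_words, num_merges=10):
    """Learn BPE merge rules from a word-frequency dict (toy Sennrich-style version)."""
    vocab = {" ".join(word) + " </w>": freq for word, freq in corpus_words.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)         # most frequent adjacent symbol pair
        merges.append(best)
        pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(best)) + r"(?!\S)")
        vocab = {pattern.sub("".join(best), word): freq for word, freq in vocab.items()}
    return merges

# Toy corpus; a real low-resource setup would use the Nepali training text.
corpus = Counter("low lower lowest newer newest wider widest".split())
for merge in learn_bpe(corpus, num_merges=8):
    print(merge)
```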

Pseudocode to Code Translation Using Transformers

img
Pseudocode-to-code translation is an open field of research, with work impacting a variety of disciplines. We approach the task by employing transformers for pseudocode-to-C++ translation and conduct a comparative study with earlier published results using LSTMs. We use human-annotated C++ programs and corresponding pseudocode, made available by previous work, which provide pairs of pseudocode and gold-code line translations. We frame our research problem as line-by-line pseudocode-to-code translation, decomposing whole-program translation into smaller pieces, which allows us to treat program synthesis as a search problem over candidate line translations. We experimented with different architectures, tokenizers, and input types. While training a BERT-to-BERT encoder-decoder model (under our time constraints) was not able to produce syntactically correct translations, the pretrained BART model was able to reach state-of-the-art results once finetuned on our dataset. Furthermore, we observed additional benefit from feeding not only the pseudocode for the current line but also 1) the pseudocode of the N preceding lines and 2) the code of the N preceding lines, for N=5 and 10. This leverages the transformers' ability to learn long-term dependencies and supports the hypothesis that cross-line context is relevant for the task at hand.
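The cross-line context construction can be sketched as simple string assembly before tokenization: concatenate the pseudocode and already-translated code of the N preceding lines with the current pseudocode line. The separator tokens below are illustrative placeholders, not the exact input format used in the experiments.

```python
def build_input(pseudocode_lines, code_lines, i, n=5):
    """Assemble the model input for line i: the N preceding pseudocode lines,
    the N preceding (already translated) code lines, then the current pseudocode line.
    The <nl> and <ctx> separators are hypothetical placeholders."""
    start = max(0, i - n)
    prev_pseudo = " <nl> ".join(pseudocode_lines[start:i])
    prev_code = " <nl> ".join(code_lines[start:i])
    return f"{prev_pseudo} <ctx> {prev_code} <ctx> {pseudocode_lines[i]}"

pseudo = ["read integer n", "set total to 0",
          "for i from 1 to n add i to total", "print total"]
code = ["int n; cin >> n;", "int total = 0;",
        "for (int i = 1; i <= n; i++) total += i;"]
print(build_input(pseudo, code, i=3, n=2))
```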

Adversarial Approaches to Debiasing Word Embeddings

img
In recent years, word embeddings have become ever more important in the world of natural language processing: techniques such as GloVe and Word2Vec have successfully mapped words to n-dimensional vectors that store precise semantic details and improve the quality of translation and generative language models. Since word embeddings are trained on human text, however, they also reflect unwanted gender and racial bias accumulated over decades of societal history. In this work, we propose that bias can be mitigated through the use of Generative Adversarial Networks. We experiment with two different problem formulations. First, we experiment with a discriminator that attempts to identify the gender bias of a vector, paired with a generator that minimizes the discriminator's performance on the task. Second, we experiment with a discriminator attempting to complete word analogies and identify the gender bias of the analogy, paired with a generator that only minimizes the discriminator's ability to identify the gender bias. Preliminary results on the WEAT scoring system show that both methods were successful in eliminating bias on commonly used job words; qualitative analysis of similar words also shows that racially or gender-charged synonyms were considered less relevant to the debiased vector.

Summarizations and Dragons

img
Summarization models are often trained and tested on constrained datasets, which could limit their ability to handle less structured or out-of-domain data, like conversations. Unlike existing datasets, the Critical Role dataset (CRD-3) of Dungeons and Dragons game transcripts is unstructured and conversational, but it still is centered around a shared game-playing goal and is of sufficient size to properly train and test these models \cite{crd3}. In this paper, I examine the dataset and identify the characteristics that differentiate it from previous data. Then, I implement an expanded version of the pointer-generator summarization model, and evaluate its performance on this dataset in order to identify which model choices and architectures are well-suited to the dataset, as well as the limitations this dataset reveals.

Hierarchical, Feature-Based Text Generation

img
We introduce the creation and use of full-text, distributed feature maps as the basis for the hierarchical generation of long-form text. The problem of story generation, a challenge that consists of generating narratively coherent passages of text about a particular topic, can be described as one of the most difficult challenges currently posed in text generation, as stories require long-range dependencies, creativity, and a high-level plot. Previous efforts note that story generation often fails to meet one or more of these requirements; generated stories are frequently repetitive and typically lack any kind of broader arc. We find that the use of automatically generated "emotion maps" as a basis for hierarchical generation achieves perplexity scores comparable to previous efforts, despite using a numerical input rather than a textual one. Additionally, we introduce a new story generation dataset, consisting of 100,000 one-thousand-word stories, each paired with a series of tags containing genre, character, and other feature information. We demonstrate that the use of fully quantifiable feature maps as a conditional basis for generation achieves results comparable to the state of the art on multiple datasets. We also introduce a method for quantifying the feature-map/story relationship, and use this metric to show that the feature maps have a limited but extant relationship to the generated text. Future use of quantitative analysis in hierarchical generation will aid researchers in effectively constructing and using first-step prompts for story generation.

Exploring the Limits of the Wake-Sleep Algorithm on a Low-Resource Language

img
Machine Translation (MT) remains an open sub-problem in Natural Language Processing (NLP). Recently developed systems have achieved impressive, near-human performance on certain languages, but these systems are heavily dependent on large parallel corpora in the source and target languages. Thus, this approach is impractical for languages with sparse bodies of text; the real challenge lies in creating machine translation systems that are able to provide high-quality, syntactically correct translations with small amounts of parallel data. This work aims to utilize the wake-sleep back-translation algorithm introduced by Cotterell and Kreutzer \cite{Cotterell} to generate synthetic training data that is then used with OpenNMT's pre-built models to achieve high-quality Yoruba-English translations. We also explore the effectiveness of the wake-sleep algorithm when used in conjunction with in-domain data (data on a subject similar, or very closely related, to the original training data) versus out-of-domain data (data not necessarily related in topic to the training data). We examine the performance differences between a smaller, in-domain dataset and a larger, less topical out-of-domain dataset and compare those results to our baseline. We find that performing the wake-sleep algorithm on a small, in-domain dataset leads to a decrease in BLEU score of about 5 points when compared to the baseline model (14.47). Training on an out-of-domain dataset leads to a 6-point decrease in BLEU when compared to the baseline.

Continuous Integrate-and-Fire Speech-Text Fusion for Spoken Question Answering

img
Project summaries unavailable

Using Recurrent Convolutional Neural Network (RCNN) to Predict S&P 500 Movements

img
Financial news releases and time series are important data sources for predicting equity market directions and movements. Existing forecasting methods mostly work with each input independently. Nowadays, with increasing computational capabilities and artificial intelligence techniques, we can combine the predictive power of sentiment analysis with systematic signals and models. In this project, we utilize information from daily top financial news headlines and seven technical indicators constructed from historical S&P 500 returns: Stochastic %K, Stochastic %D, Momentum, Rate of Change, Williams %R, A/D Oscillator, and Disparity 5. These hybrid inputs extend the ability to predict short-term equity market directional movements, achieving higher accuracy than a single-source system. This project implements an RCNN architecture with an attention mechanism. We examine various models with or without the attention mechanism or technical indicator inputs, as well as the effect of different embedding methodologies for the financial news headlines. Among those variations, we find that including technical indicators is positively related to better performance. In addition, embedding the news headlines by averaging the vector representations of each word in the headline outperforms a model that only uses word embeddings padded to the maximum headline length. Furthermore, after including the attention mechanism, the model achieves the highest accuracy among the models we implemented, offering a novel refinement of the baseline architecture.
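For reference, here is a pandas sketch of standard textbook definitions for most of the listed indicators (the A/D Oscillator is omitted, and the window lengths are illustrative; the exact parameterizations used in the project may differ).

```python
import numpy as np
import pandas as pd

def indicators(close: pd.Series, high: pd.Series, low: pd.Series, n: int = 14):
    """Common definitions of several of the indicators above; windows are illustrative."""
    lowest, highest = low.rolling(n).min(), high.rolling(n).max()
    stoch_k = 100 * (close - lowest) / (highest - lowest)       # Stochastic %K
    stoch_d = stoch_k.rolling(3).mean()                         # Stochastic %D
    momentum = close - close.shift(n)                           # Momentum
    roc = 100 * close / close.shift(n)                          # Rate of Change
    williams_r = -100 * (highest - close) / (highest - lowest)  # Williams %R
    disparity5 = 100 * close / close.rolling(5).mean()          # Disparity 5
    return pd.DataFrame({"%K": stoch_k, "%D": stoch_d, "Momentum": momentum,
                         "ROC": roc, "%R": williams_r, "Disparity5": disparity5})

# Toy price series as a stand-in for S&P 500 data.
rng = np.random.default_rng(0)
close = pd.Series(100 + rng.normal(0, 1, 60).cumsum())
high, low = close + rng.random(60), close - rng.random(60)
print(indicators(close, high, low).tail(3).round(2))
```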

Identify Semantically Similar Queries in Virtual Assistant Dialogues

img
For production voice assistants, it is currently hard to accurately measure feature usage from usage logs, because if a query is not parsed with the correct intent by the existing natural language understanding system, it is unlikely to be counted towards the correct feature downstream. We translate this problem into a paraphrase mining problem: given an input example query, can we find other similar queries asking for the same feature in the raw-text dialogue corpus? We leverage Sentence-BERT (SBERT), a finetuned BERT variant that produces meaningful sentence embeddings useful in common semantic textual similarity tasks, to produce embeddings for raw-text user queries. We can then find similar user queries directly by comparing the cosine similarity of their embeddings. Inspired by the fact that the entities in a sentence are often strong indicators of its semantic meaning, we tried improving SBERT's performance by emphasizing entities in the input texts, leveraging named entity recognition techniques. In our experiments, we found that simple preprocessing of the input texts, prepending entity tag strings in front of entities, can visibly improve model performance in identifying similar feature queries.
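A minimal sketch of the retrieval step with the sentence-transformers library; the model name, entity-tag format, and example queries are illustrative stand-ins rather than the production setup.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # stand-in for the SBERT variant used

def with_entity_tags(query: str, entities: dict) -> str:
    """Prepend entity-type tags to emphasize entities (format is illustrative)."""
    tags = " ".join(f"[{etype}]" for etype in entities.values())
    return f"{tags} {query}" if tags else query

corpus = [
    with_entity_tags("play jazz music in the kitchen", {"kitchen": "ROOM"}),
    with_entity_tags("what's the weather in Paris", {"Paris": "CITY"}),
    with_entity_tags("turn on the lights in the bedroom", {"bedroom": "ROOM"}),
]
query = with_entity_tags("start some jazz in the living room", {"living room": "ROOM"})

corpus_emb = model.encode(corpus, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)
scores = util.cos_sim(query_emb, corpus_emb)[0]
print(sorted(zip(scores.tolist(), corpus), reverse=True)[0])  # most similar stored query
```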

CloneBot: Personalized Dialogue-Response Predictions

img
Our project task was to create a model that, given a speaker ID, chat history, and an utterance query, can predict the response utterance in a conversation. The model is personalized for each speaker. This task can be a useful tool for building speech bots that talk in a human-like manner in a live conversation. Further, we successfully used dense-vector encoding clustering to retrieve relevant historical dialogue context, a useful strategy for overcoming the input-length limitations of neural models when predictions require longer-term references to the dialogue history. In this paper, we implement a state-of-the-art model using pre-training and fine-tuning techniques, built on a transformer architecture with multi-headed attention blocks, for the Switchboard corpus. We also show how efficient vector clustering algorithms can be used for real-time utterance predictions that require no training and therefore work on offline and encrypted message histories.
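The history-retrieval idea can be sketched with off-the-shelf sentence embeddings and k-means: cluster past utterances, then pull the cluster whose centroid is closest to the current query as extra context. The encoder name and cluster count below are assumptions for illustration, not the project's exact pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # stand-in dense encoder

history = [
    "we should book the cabin for the ski trip",
    "I'll bring the board games again",
    "did you ever fix the brakes on your car",
    "the mechanic said the brake pads were fine",
    "let's leave for the mountains friday morning",
]
query = "what time are we driving up on friday?"

hist_emb = encoder.encode(history)                  # (n_utterances, dim)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(hist_emb)

# Retrieve the cluster whose centroid is closest to the query embedding.
q = encoder.encode([query])[0]
dists = np.linalg.norm(kmeans.cluster_centers_ - q, axis=1)
nearest = int(np.argmin(dists))
context = [u for u, c in zip(history, kmeans.labels_) if c == nearest]
print(context)   # relevant past turns to prepend to the model input
```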