Dallas Card

Postdoctoral Researcher, Stanford University
Email: dcard@stanford.edu
GitHub, Twitter, Medium
Google Scholar

I am a postdoctoral researcher working with Dan Jurafsky and Daniel McFarland as part of the Stanford NLP Group and the Stanford Data Science Institute. I received my Ph.D. from the Machine Learning Department at Carnegie Mellon University, where I was advised by Noah Smith.

My research centers on making machine learning more reliable and responsible, and on using machine learning and natural language processing to learn about society from text.


Updates



Selected Publications


Causal Effects of Linguistic Properties
Reid Pryzant, Dallas Card, Dan Jurafsky, Victor Veitch, and Dhanya Sridhar
In Proceedings of NAACL, 2021.
Abstract Paper BibTeX

We consider the problem of using observational data to estimate the causal effects of linguistic properties. For example, does writing a complaint politely lead to a faster response time? How much will a positive product review increase sales? This paper addresses two technical challenges related to the problem before developing a practical method. First, we formalize the causal quantity of interest as the effect of a writer's intent, and establish the assumptions necessary to identify this from observational data. Second, in practice, we only have access to noisy proxies for the linguistic properties of interest -- e.g., predictions from classifiers and lexicons. We propose an estimator for this setting and prove that its bias is bounded when we perform an adjustment for the text. Based on these results, we introduce TextCause, an algorithm for estimating causal effects of linguistic properties. The method leverages (1) distant supervision to improve the quality of noisy proxies, and (2) a pre-trained language model (BERT) to adjust for the text. We show that the proposed method outperforms related approaches when estimating the effect of Amazon review sentiment on semi-simulated sales figures. Finally, we present an applied case study investigating the effects of complaint politeness on bureaucratic response times.
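For intuition, here is a heavily simplified sketch of the adjust-for-text idea, not the TextCause algorithm itself: it swaps in a bag-of-words ridge regression for BERT, assumes the (noisy) treatment labels are given, and uses made-up toy data.

```python
# Heavily simplified sketch of the adjust-for-text idea (NOT the TextCause
# algorithm): a bag-of-words ridge regression stands in for BERT, and the
# (noisy) treatment labels are assumed to be given.
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge

def adjusted_ate(texts, treatment, outcome):
    """Outcome-regression estimate of E[Y(1) - Y(0)], adjusting for the text."""
    X = CountVectorizer().fit_transform(texts)
    T = np.asarray(treatment, dtype=float).reshape(-1, 1)
    y = np.asarray(outcome, dtype=float)

    # Fit a single outcome model on [text features, treatment indicator].
    model = Ridge(alpha=1.0).fit(hstack([X, T]), y)

    # Predict each document's outcome under T=1 and T=0, then average the difference.
    ones = np.ones_like(T)
    y1 = model.predict(hstack([X, ones]))
    y0 = model.predict(hstack([X, 0 * ones]))
    return float(np.mean(y1 - y0))

# Hypothetical toy data: a politeness proxy and response time in days.
texts = ["thank you for your help", "fix this now", "please advise", "respond immediately"]
polite = [1, 0, 1, 0]
response_days = [2.0, 5.0, 3.0, 6.0]
print(adjusted_ate(texts, polite, response_days))
```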



With Little Power Comes Great Responsibility
Dallas Card, Peter Henderson, Urvashi Khandelwal, Robin Jia, Kyle Mahowald, and Dan Jurafsky
In Proceedings of EMNLP, 2020.
Abstract Paper Code BibTeX

Despite its importance to experimental design, statistical power (the probability that, given a real effect, an experiment will reject the null hypothesis) has largely been ignored by the NLP community. Underpowered experiments make it more difficult to discern the difference between statistical noise and meaningful model improvements, and increase the chances of exaggerated findings. By meta-analyzing a set of existing NLP papers and datasets, we characterize typical power for a variety of settings and conclude that underpowered experiments are common in the NLP literature. In particular, for several tasks in the popular GLUE benchmark, small test sets mean that most attempted comparisons to state of the art models will not be adequately powered. Similarly, based on reasonable assumptions, we find that the most typical experimental design for human rating studies will be underpowered to detect small model differences, of the sort that are frequently studied. For machine translation, we find that typical test sets of 2000 sentences have approximately 75% power to detect differences of 1 BLEU point. To improve the situation going forward, we give an overview of best practices for power analysis in NLP and release a series of notebooks to assist with future power analyses.
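As a rough illustration of the kind of power analysis the paper advocates, the simulation below estimates the power to detect a one-point accuracy difference between two models on a test set of a given size. The independence assumption and the unpaired two-proportion z-test are simplifications for illustration, not the paper's analysis.

```python
# Rough simulation-based power estimate for comparing two classifiers on a
# shared test set. Simplifying assumptions (not from the paper): the two
# models' errors are independent, and an unpaired two-proportion z-test is used.
import numpy as np
from scipy.stats import norm

def estimated_power(n_test, acc_a, acc_b, alpha=0.05, n_sims=5000, seed=0):
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sims):
        correct_a = rng.binomial(n_test, acc_a)
        correct_b = rng.binomial(n_test, acc_b)
        p_pool = (correct_a + correct_b) / (2 * n_test)
        se = np.sqrt(2 * p_pool * (1 - p_pool) / n_test)
        z = (correct_b - correct_a) / (n_test * se) if se > 0 else 0.0
        p_value = 2 * norm.sf(abs(z))  # two-sided
        rejections += p_value < alpha
    return rejections / n_sims

# E.g., power to detect a one-point accuracy difference (88% vs. 89%)
# with a test set of 2,000 examples:
print(estimated_power(n_test=2000, acc_a=0.88, acc_b=0.89))
```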



Detecting Stance in Media On Global Warming
Yiwei Luo, Dallas Card, and Dan Jurafsky
In Findings of EMNLP, 2020.
Abstract Paper Code BibTeX

Citing opinions is a powerful yet understudied strategy in argumentation. For example, an environmental activist might say, "Leading scientists agree that global warming is a serious concern," framing a clause which affirms their own stance ("that global warming is serious") as an opinion endorsed ("[scientists] agree") by a reputable source ("leading"). In contrast, a global warming denier might frame the same clause as the opinion of an untrustworthy source with a predicate connoting doubt: "Mistaken scientists claim [...]." Our work studies opinion-framing in the global warming (GW) debate, an increasingly partisan issue that has received little attention in NLP. We introduce the Global Warming Stance Dataset (GWSD), a dataset of stance-labeled GW sentences, and train a BERT classifier to study novel aspects of argumentation in how different sides of a debate represent their own and each other's opinions. From 56K news articles, we find that similar linguistic devices for self-affirming and opponent-doubting discourse are used across GW-accepting and skeptic media, though GW-skeptical media shows more opponent-doubt. We also find that authors often characterize sources as hypocritical, by ascribing opinions expressing the author's own view to source entities known to publicly endorse the opposing view. We release our stance dataset, model, and lexicons of framing devices for future work on opinion-framing and the automatic detection of GW stance.



Explain like I am a Scientist: The Linguistic Barriers of Entry to r/science
Tal August, Dallas Card, Gary Hsieh, Noah A. Smith, and Katharina Reinecke
In Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI), 2020.
Abstract Paper BibTeX

As an online community for discussing research findings, r/science has the potential to contribute to science outreach and communication with a broad audience. Yet previous work suggests that most of the active contributors on r/science are science-educated people rather than a lay general public. One potential reason is that r/science contributors might use a different, more specialized language than used in other subreddits. To investigate this possibility, we analyzed the language used in more than 68 million posts and comments from 12 subreddits from 2018. We show that r/science uses a specialized language that is distinct from other subreddits. Transient (newer) authors of posts and comments on r/science use less specialized language than more frequent authors, and those that leave the community use less specialized language than those that stay, even when comparing their first comments. These findings suggest that the specialized language used in r/science has a gatekeeping effect, preventing participation by people whose language does not align with that used in r/science. By characterizing r/science's specialized language, we contribute guidelines and tools for increasing the number of contributors in r/science.




On Consequentialism and Fairness
Dallas Card and Noah A. Smith
Frontiers in Artificial Intelligence, 2020.
Abstract Paper BibTeX

Recent work on fairness in machine learning has primarily emphasized how to define, quantify, and encourage “fair” outcomes. Less attention has been paid, however, to the ethical foundations which underlie such efforts. Among the ethical perspectives that should be taken into consideration is consequentialism, the position that, roughly speaking, outcomes are all that matter. Although consequentialism is not free from difficulties, and although it does not necessarily provide a tractable way of choosing actions (because of the combined problems of uncertainty, subjectivity, and aggregation), it nevertheless provides a powerful foundation from which to critique the existing literature on machine learning fairness. Moreover, it brings to the fore some of the tradeoffs involved, including the problem of who counts, the pros and cons of using a policy, and the relative value of the distant future. In this paper we provide a consequentialist critique of common definitions of fairness within machine learning, as well as a machine learning perspective on consequentialism. We conclude with a broader discussion of the issues of learning and randomization, which have important implications for the ethics of automated decision making systems.



Show Your Work: Improved Reporting of Experimental Results
Jesse Dodge, Suchin Gururangan, Dallas Card, Roy Schwartz, and Noah A. Smith
In Proceedings of EMNLP, 2019.
Abstract Paper Code Press BibTeX

Research in natural language processing proceeds, in part, by demonstrating that new models achieve superior performance (e.g., accuracy) on held-out test data, compared to previous results. In this paper, we demonstrate that test-set performance scores alone are insufficient for drawing accurate conclusions about which model performs best. We argue for reporting additional details, especially performance on validation data obtained during model development. We present a novel technique for doing so: expected validation performance of the best-found model as a function of computation budget (i.e., the number of hyperparameter search trials or the overall training time). Using our approach, we find multiple recent model comparisons where authors would have reached a different conclusion if they had used more (or less) computation. Our approach also allows us to estimate the amount of computation required to obtain a given accuracy; applying it to several recently published results yields massive variation across papers, from hours to weeks. We conclude with a set of best practices for reporting experimental results which allow for robust future comparisons, and provide code to allow researchers to use our technique.
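The core calculation can be sketched in a few lines: given validation scores from N completed hyperparameter trials, the expected maximum over n simulated trials follows from the empirical distribution of scores, assuming trials are sampled i.i.d. with replacement. The toy scores below are made up for illustration.

```python
# Expected validation performance of the best model found after n hyperparameter
# trials, estimated from N observed trial scores (assuming trials are drawn
# i.i.d. with replacement from the empirical distribution of scores).
import numpy as np

def expected_max_performance(val_scores, n):
    v = np.sort(np.asarray(val_scores, dtype=float))  # v_(1) <= ... <= v_(N)
    N = len(v)
    i = np.arange(1, N + 1)
    # P(max of n draws equals the i-th smallest observed score)
    prob_max = (i / N) ** n - ((i - 1) / N) ** n
    return float(np.sum(v * prob_max))

# Hypothetical usage: validation accuracies from 50 random-search trials.
scores = np.random.default_rng(0).uniform(0.78, 0.86, size=50)
for budget in (1, 5, 10, 25, 50):
    print(budget, round(expected_max_performance(scores, budget), 4))
```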



Variational Pretraining for Semi-supervised Text Classification
Suchin Gururangan, Tam Dang, Dallas Card, and Noah A. Smith
In Proceedings of ACL, 2019.
Abstract Paper Code BibTeX

We introduce VAMPIRE, a lightweight pretraining framework for effective text classification when data and computing resources are limited. We pretrain a unigram document model as a variational autoencoder on in-domain, unlabeled data and use its internal states as features in a downstream classifier. Empirically, we show the relative strength of VAMPIRE against computationally expensive contextual embeddings and other popular semi-supervised baselines under low resource settings. We also find that fine-tuning to in-domain data is crucial to achieving decent performance from contextual embeddings when working with limited supervision. We accompany this paper with code to pretrain and use VAMPIRE embeddings in downstream tasks.
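For intuition, here is a minimal sketch of the general recipe (pretrain a bag-of-words variational autoencoder on unlabeled text, then reuse its latent representation as document features); the layer sizes, training loop, and stand-in data are placeholders rather than the released VAMPIRE implementation.

```python
# Minimal sketch of the recipe: pretrain a unigram (bag-of-words) document VAE
# on unlabeled text, then use its latent representation as features downstream.
# Layer sizes and training details are placeholders, not the released VAMPIRE code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BowVAE(nn.Module):
    def __init__(self, vocab_size, hidden_dim=256, latent_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(vocab_size, hidden_dim), nn.ReLU())
        self.to_mu = nn.Linear(hidden_dim, latent_dim)
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)
        self.decoder = nn.Linear(latent_dim, vocab_size)

    def forward(self, bow):  # bow: (batch, vocab) word counts
        h = self.encoder(bow)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        log_probs = F.log_softmax(self.decoder(z), dim=-1)
        recon = -(bow * log_probs).sum(dim=-1)                   # multinomial NLL
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1)
        return (recon + kl).mean(), mu

# Pretraining sketch over unlabeled bag-of-words batches (stand-in random data):
model = BowVAE(vocab_size=5000)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for bow in [torch.rand(8, 5000).round() for _ in range(3)]:
    loss, doc_features = model(bow)   # reuse `doc_features` (mu) downstream
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```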



The Risk of Racial Bias in Hate Speech Detection
Maarten Sap, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A. Smith
In Proceedings of ACL, 2019.
Abstract Paper Press BibTeX

We investigate how annotators' insensitivity to differences in dialect can lead to racial bias in automatic hate speech detection models, potentially amplifying harm against minority populations. We first uncover unexpected correlations between surface markers of African American English (AAE) and ratings of toxicity in several widely-used hate speech datasets. Then, we show that models trained on these corpora acquire and propagate these biases, such that AAE tweets and tweets by self-identified African Americans are up to two times more likely to be labelled as offensive compared to others. Finally, we propose dialect and race priming as ways to reduce the racial bias in annotation, showing that when annotators are made explicitly aware of an AAE tweet's dialect they are significantly less likely to label the tweet as offensive.



Deep Weighted Averaging Classifiers
Dallas Card, Michael Zhang, and Noah A. Smith
In Proceedings of the ACM Conference on Fairness, Accountability, and Transparency (ACM FAT*), 2019.
Abstract Paper Code Blog Post BibTeX

Recent advances in deep learning have achieved impressive gains in classification accuracy on a variety of types of data, including images and text. Despite these gains, however, concerns have been raised about the calibration, robustness, and interpretability of these models. In this paper we propose a simple way to modify any conventional deep architecture to automatically provide more transparent explanations for classification decisions, as well as an intuitive notion of the credibility of each prediction. Specifically, we draw on ideas from nonparametric kernel regression, and propose to predict labels based on a weighted sum of training instances, where the weights are determined by distance in a learned instance-embedding space. Working within the framework of conformal methods, we propose a new measure of nonconformity suggested by our model, and experimentally validate the accompanying theoretical expectations, demonstrating improved transparency, controlled error rates, and robustness to out-of-domain data, without compromising on accuracy or calibration.
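The core prediction rule can be sketched compactly: label probabilities are a normalized, distance-weighted sum over embedded training instances. The particular embedding network and Gaussian kernel below are illustrative stand-ins, not the paper's exact architecture.

```python
# Minimal sketch of the weighted-averaging prediction rule: label probabilities
# are a normalized, distance-weighted sum over embedded training instances.
# The embedding network and Gaussian kernel are illustrative stand-ins.
import torch
import torch.nn as nn

class WeightedAveragingClassifier(nn.Module):
    def __init__(self, input_dim, embed_dim=32):
        super().__init__()
        self.embed = nn.Sequential(nn.Linear(input_dim, 64), nn.ReLU(),
                                   nn.Linear(64, embed_dim))

    def forward(self, x, ref_x, ref_y_onehot):
        z = self.embed(x)                     # (batch, d) query embeddings
        z_ref = self.embed(ref_x)             # (n_ref, d) training embeddings
        weights = torch.exp(-torch.cdist(z, z_ref) ** 2)   # Gaussian kernel
        class_weights = weights @ ref_y_onehot             # (batch, n_classes)
        probs = class_weights / class_weights.sum(-1, keepdim=True).clamp_min(1e-12)
        return probs, weights  # weights over training points explain each prediction

# Hypothetical usage with random stand-in data (20 reference points, 3 classes):
model = WeightedAveragingClassifier(input_dim=10)
ref_x, ref_y = torch.randn(20, 10), torch.eye(3)[torch.randint(0, 3, (20,))]
probs, weights = model(torch.randn(4, 10), ref_x, ref_y)
```

Since every prediction is an explicit weighted sum over training points, the weights double as an explanation, and their overall magnitude gives an intuitive notion of how credible the prediction is.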



Neural Models for Documents with Metadata
Dallas Card, Chenhao Tan, and Noah A. Smith
In Proceedings of ACL, 2018.
Abstract Paper Code Tutorial BibTeX

Most real-world document collections involve various types of metadata, such as author, source, and date, and yet the most commonly-used approaches to modeling text corpora ignore this information. While specialized models have been developed for particular applications, few are widely used in practice, as customization typically requires derivation of a custom inference algorithm. In this paper, we build on recent advances in variational inference methods and propose a general neural framework, based on topic models, to enable flexible incorporation of metadata and allow for rapid exploration of alternative models. Our approach achieves strong performance, with a manageable tradeoff between perplexity, coherence, and sparsity. Finally, we demonstrate the potential of our framework through an exploration of a corpus of articles about US immigration.
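As a gross simplification of how metadata can enter such a model, the sketch below conditions the encoder on covariates and adds a covariate-specific deviation to the topic-based word distribution; the dimensions and covariate setup are placeholders, not the released implementation.

```python
# Gross simplification of a neural topic model that incorporates metadata:
# the encoder conditions on covariates, and a covariate-specific deviation is
# added to the topic-based word distribution. Dimensions are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MetadataTopicModel(nn.Module):
    def __init__(self, vocab_size, n_topics=20, n_covariates=4, hidden_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(vocab_size + n_covariates, hidden_dim), nn.ReLU())
        self.to_mu = nn.Linear(hidden_dim, n_topics)
        self.to_logvar = nn.Linear(hidden_dim, n_topics)
        self.topic_words = nn.Linear(n_topics, vocab_size, bias=False)
        self.covariate_words = nn.Linear(n_covariates, vocab_size, bias=False)

    def forward(self, bow, covariates):
        h = self.encoder(torch.cat([bow, covariates], dim=-1))
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        theta = F.softmax(z, dim=-1)            # document-topic proportions
        logits = self.topic_words(theta) + self.covariate_words(covariates)
        recon = -(bow * F.log_softmax(logits, dim=-1)).sum(dim=-1)
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1)
        return (recon + kl).mean()
```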



The Importance of Calibration for Estimating Proportions from Annotations
Dallas Card and Noah A. Smith
In Proceedings of NAACL, 2018.
Abstract Paper Code BibTeX

Estimating label proportions in a target corpus is a type of measurement that is useful for answering certain types of social-scientific questions. While past work has described a number of relevant approaches, nearly all are based on an assumption which we argue is invalid for many problems, particularly when dealing with human annotations. In this paper, we identify and differentiate between two relevant data generating scenarios (intrinsic vs. extrinsic labels), introduce a simple but novel method which emphasizes the importance of calibration, and then analyze and experimentally validate the appropriateness of various methods for each of the two scenarios.
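To make the contrast concrete, the sketch below compares a naive classify-and-count estimate with one that averages calibrated probabilities over the target corpus; the specific calibration method (Platt-style scaling via scikit-learn) is chosen here for illustration and is not necessarily the method proposed in the paper.

```python
# Contrast between naive "classify and count" and averaging calibrated
# probabilities when estimating a label proportion in a target corpus.
# Platt-style calibration via scikit-learn is one illustrative choice,
# not necessarily the method proposed in the paper.
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression

def estimate_proportion(X_train, y_train, X_target):
    # Naive: apply a classifier to the target corpus and count predicted positives.
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    naive = clf.predict(X_target).mean()

    # Calibrated: average calibrated probabilities over the target corpus.
    calibrated = CalibratedClassifierCV(
        LogisticRegression(max_iter=1000), method="sigmoid", cv=5)
    calibrated.fit(X_train, y_train)
    prob_based = calibrated.predict_proba(X_target)[:, 1].mean()
    return float(naive), float(prob_based)
```

The intuition is that averaging well-calibrated probabilities is less sensitive to where the decision threshold happens to fall than counting hard predictions.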



Friendships, Rivalries, and Trysts: Characterizing Relations between Ideas in Texts
Chenhao Tan, Dallas Card, and Noah A. Smith
In Proceedings of ACL, 2017.
Abstract Paper Blog Post BibTeX

Understanding how ideas relate to each other is a fundamental question in many domains, ranging from intellectual history to public communication. Because ideas are naturally embedded in texts, we propose the first framework to systematically characterize the relations between ideas based on their occurrence in a corpus of documents, independent of how these ideas are represented. Combining two statistics - cooccurrence within documents and prevalence correlation over time - our approach reveals a number of different ways in which ideas can cooperate and compete. For instance, two ideas can closely track each other's prevalence over time, and yet rarely cooccur, almost like a "cold war" scenario. We observe that pairwise cooccurrence and prevalence correlation exhibit different distributions. We further demonstrate that our approach is able to uncover intriguing relations between ideas through in-depth case studies on news articles and research papers.
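As a toy illustration of combining the two statistics, the snippet below computes within-document cooccurrence (here via pointwise mutual information, one simple choice rather than necessarily the paper's exact measure) and over-time prevalence correlation for a pair of ideas, given a binary document-by-idea matrix and document years; the data are made up.

```python
# Toy illustration of the two statistics for a pair of ideas, given a binary
# document-by-idea matrix and document years. Pointwise mutual information is
# one simple cooccurrence measure; the data below are made up.
import numpy as np

def cooccurrence_pmi(doc_idea, i, j, eps=1e-12):
    p_i = doc_idea[:, i].mean()
    p_j = doc_idea[:, j].mean()
    p_ij = (doc_idea[:, i] * doc_idea[:, j]).mean()
    return np.log((p_ij + eps) / (p_i * p_j + eps))

def prevalence_correlation(doc_idea, years, i, j):
    yearly = sorted(set(years))
    prev_i = [doc_idea[years == y, i].mean() for y in yearly]
    prev_j = [doc_idea[years == y, j].mean() for y in yearly]
    return np.corrcoef(prev_i, prev_j)[0, 1]

# Nine documents, two ideas, three years (made-up data):
doc_idea = np.array([[1, 0], [1, 1], [0, 0],
                     [1, 1], [0, 1], [0, 1],
                     [1, 0], [0, 0], [1, 1]])
years = np.array([2015] * 3 + [2016] * 3 + [2017] * 3)
print(cooccurrence_pmi(doc_idea, 0, 1), prevalence_correlation(doc_idea, years, 0, 1))
```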



Analyzing Framing through the Casts of Characters in the News
Dallas Card, Justin H. Gross, Amber E. Boydstun, and Noah A. Smith
In Proceedings of EMNLP, 2016.
Abstract Paper BibTeX

We present an unsupervised model for the discovery and clustering of latent "personas" (characterizations of entities). Our model simultaneously clusters documents featuring similar collections of personas. We evaluate this model on a collection of news articles about immigration, showing that personas help predict the coarse-grained framing annotations in the Media Frames Corpus. We also introduce automated model selection as a fair and robust form of feature evaluation.



The Media Frames Corpus: Annotations of Frames Across Issues
Dallas Card, Amber E. Boydstun, Justin H. Gross, Philip Resnik, and Noah A. Smith
In Proceedings of ACL, 2015.
Abstract Paper Data BibTeX

We describe the first version of the Media Frames Corpus: several thousand news articles on three policy issues, annotated in terms of media framing. We motivate framing as a phenomenon of study for computational linguistics and describe our annotation process.



Media Coverage


"Artificial Intelligence Confronts a 'Reproducibility' Crisis" by Gregory Barber. WIRED (2019).

"The algorithms that detect hate speech online are biased against black people" by Shirin Ghaffary. Vox (2019).


About me


I'm originally from Winnipeg, but I have also lived in Toronto, Waterloo, Halifax, Sydney, Kampala, Pittsburgh, Seattle, and now Palo Alto!

I am an occasional guest on The Reality Check podcast! You can hear me in episodes #466 (biased algorithms), #382 (deep learning), #362 (Simpson's paradox), and #227 (fMRI and vegetative states).

I love to travel and sometimes I write about it.


GitHub, Twitter, Google Scholar, CV