I am a Computer Science PhD candidate at Stanford University. My background is in Data Analytics, Data Visualization, and Human-Computer Interaction. My thesis focuses on extracting medically-relevant insights from patient-authored text related to substance abuse. You can watch my thesis defense here.
Jeffrey Heer is my PhD advisor. I also collaborate closely with Sonal Gupta, Christopher Manning, and Anna Lembke. In the past, I have worked with Monica Lam and Sudheendra Hangal on social graph structures and topologies, and with Margo Seltzer on health-related text mining.
Insights from Patient Authored Text: From Close Reading to Automated Extraction. Diana MacLean. PhD Thesis. 2015.
Millions of people collaborate online with others who share their health concerns. In the process, these users perform complex health-related tasks, such as differential diagnosis and treatment comparison. The result is a massive, growing and readily accessible corpus of Patient Authored Text (PAT) that documents patients' behavior outside of the clinical environment. As a result, PAT can provide insights into otherwise obscure topics, such as why patients follow only certain parts of a treatment protocol, or how people self-treat stigmatized conditions such as prescription drug addiction.
Despite the potential value of PAT, attempts to extract medically-relevant insights from it have been limited. PAT is notoriously noisy and challenging to work with, and there is a dearth of methods and tools for processing and analyzing it. Moreover, the specific research questions that PAT can support are not obvious: determining what data PAT encodes, and how these data are encoded, is a challenge in and of itself.
In this thesis, I develop methods for automatically extracting medically-relevant data from PAT. I focus specifically on the topic of addiction: a stigmatized and prevalent medical condition. Building on close readings of source text to inform schema induction, data annotation, and feature engineering, I train classifiers that accurately identify (1) medically-relevant terms in PAT; (2) users' motivations for participating in an addiction-related online health community; (3) users' drugs of choice, and (4) users' transitions through relapse and recovery. Using these classifiers to scale analyses to large PAT corpora, I derive novel insights into the process of addiction, as well as the role that online health communities play in giving users informational and emotional support and, ultimately, in enabling recovery.
In concert, these contributions both underscore PAT's latent value for illuminating poorly understood or clandestine medical topics, and offer viable methods that dramatically improve our ability to realize this value. [Full version forthcoming]
Prescription Opioid Addicts Seek Advice on Opioid Withdrawal from Peers Online. Diana MacLean, Sonal Gupta, Anna Lembke, Christopher D. Manning, and Jeffrey Heer. Pending review. 2015.
Abstract and link for this paper are withheld pending review.
Forum77: An analysis of an online health forum dedicated to addiction recovery. [Honorable Mention] Diana MacLean, Sonal Gupta, Anna Lembke, Christopher Manning and Jeffrey Heer. CSCW 2014.
Prescription drug abuse is a pressing public health issue, and people who misuse prescription drugs are turning to online forums for help. Are such forums effective? We analyze the process of opioid withdrawal, recovery and relapse on Forum77, MedHelp.org's online health forum for substance abuse recovery. Applying Prochaska's Transtheoretical Model for behavior change, we develop a taxonomy describing phases of addiction expressed by Forum77 members. We examine activity and linguistic features across the phases USING, WITHDRAWING and RECOVERING. We train statistical classifiers to identify addiction phase, relapse and whether a user was RECOVERING at the time of her last post. Applying our classifiers to 2,848 users, we find that while almost 50% relapse, the prognosis for ending in RECOVERING is favorable. Supplementing our results with users' own accounts of their experiences, we discuss Forum77's efficacy and shortcomings, and implications for future technologies. [Full paper]
Induced Lexico-Syntactic Patterns Improve Information Extraction from Online Medical Forums. Sonal Gupta, Diana MacLean, Jeffrey Heer, and Christopher D. Manning. JAMIA 2014.
Objective: To reliably extract two entity types, symptoms and conditions (SCs), and drugs and treatments (DTs), from patient-authored text (PAT) by learning lexico-syntactic patterns from data annotated with seed dictionaries.
Background and significance: Despite the increasing quantity of PAT (eg, online discussion threads), tools for identifying medical entities in PAT are limited. When applied to PAT, existing tools either fail to identify specific entity types or perform poorly. Identification of SC and DT terms in PAT would enable exploration of efficacy and side effects not only for pharmaceutical drugs, but also for home remedies and components of daily care.
Materials and methods: We use SC and DT term dictionaries compiled from online sources to label several discussion forums from MedHelp (http://www.medhelp.org). We then iteratively induce lexico-syntactic patterns corresponding strongly to each entity type to extract new SC and DT terms.
Results: Our system is able to extract symptom descriptions and treatments absent from our original dictionaries, such as 'LADA', 'stabbing pain', and 'cinnamon pills'. Our system extracts DT terms with 58-70% F1 score and SC terms with 66-76% F1 score on two forums from MedHelp. We show improvements over MetaMap, OBA, a conditional random field-based classifier, and a previous pattern learning approach.
Conclusions: Our entity extractor based on lexico-syntactic patterns is a successful and preferable technique for identifying specific entity types in PAT. To the best of our knowledge, this is the first paper to extract SC and DT entities from PAT. We exhibit learning of informal terms often used in PAT but missing from typical dictionaries.
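As a rough illustration of the bootstrapping idea in this abstract (label text with a seed dictionary, learn contexts that co-occur with labeled terms, then use those contexts to propose new terms), here is a minimal sketch. It is not the system described in the paper: it assumes single-token terms, uses the immediate left and right words as a stand-in for lexico-syntactic patterns, and runs a single round. The function name `induce_terms` and the thresholds `min_precision` and `min_support` are hypothetical.

```python
from collections import Counter


def induce_terms(sentences, seeds, min_precision=0.5, min_support=2):
    """One bootstrapping round (illustrative only, hypothetical names).

    A "pattern" here is just the (left word, right word) context of a token,
    a simplification of real lexico-syntactic patterns.
    """
    seeds = {s.lower() for s in seeds}
    matched = Counter()   # pattern -> number of tokens it matches
    positive = Counter()  # pattern -> number of matched tokens that are seeds
    occurrences = []      # (pattern, token) pairs seen in the corpus

    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for i in range(1, len(tokens) - 1):
            pattern = (tokens[i - 1], tokens[i + 1])
            matched[pattern] += 1
            if tokens[i] in seeds:
                positive[pattern] += 1
            occurrences.append((pattern, tokens[i]))

    # Keep patterns that match seed terms often enough and precisely enough.
    reliable = {
        p for p in matched
        if positive[p] >= min_support and positive[p] / matched[p] >= min_precision
    }

    # Any non-seed token matched by a reliable pattern is a candidate new term.
    return {tok for p, tok in occurrences if p in reliable and tok not in seeds}


if __name__ == "__main__":
    corpus = [
        "i take aspirin daily",
        "i take ibuprofen daily",
        "i take metformin daily",
        "i walk home daily",
    ]
    print(induce_terms(corpus, {"aspirin", "ibuprofen"}))
```

In an iterative setting, the proposed terms would be added to the dictionary and the round repeated; the precision threshold is what keeps noisy contexts from drifting the dictionary.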
BodyDiagrams: improving communication of pain symptoms through drawing. Amy Jang, Diana MacLean and Jeffrey Heer. CHI. Toronto. 2014.
Thousands of people use the Internet to discuss pain symptoms. While communication between patients and physicians involves both verbal and physical interactions, online discussions of symptoms typically comprise text only. We present BodyDiagrams, an online interface for expressing symptoms via drawings and text. BodyDiagrams augment textual descriptions with pain diagrams drawn over a reference body and annotated with severity and temporal metadata. The resulting diagrams can easily be shared to solicit feedback and advice. We also conduct a two-phase user study to assess BodyDiagrams' communicative efficacy. In the first phase, users describe pain symptoms using BodyDiagrams and a text-only interface; in the second phase, medical professionals evaluate these descriptions. We find that patients are significantly more confident that their BodyDiagrams will be correctly interpreted, while medical professionals rate BodyDiagrams as significantly more informative than text descriptions. Both groups indicated a preference for using diagrams to communicate physical symptoms in the future. [Full paper]
Identifying medical terms in patient-authored text: a crowdsourcing-based approach. Diana MacLean and Jeffrey Heer. JAMIA 2013.
As people increasingly engage in online health-seeking behavior and contribute to health-oriented websites, the volume of medical text authored by patients and other medical novices grows rapidly. However, we lack an effective method for automatically identifying medical terms in patient-authored text (PAT). We demonstrate that crowdsourcing PAT medical term identification tasks to non-experts is a viable method for creating large, accurately-labeled PAT datasets; moreover, such datasets can be used to train classifiers that outperform existing medical term identification tools. [Full paper]
GraphPrism: Compact Visualization of Network Structure. Sanjay Kairam, Diana MacLean, Manolis Savva and Jeffrey Heer. AVI 2012.
Visual methods for supporting the characterization, comparison, and classification of large networks remain an open challenge. Ideally, such techniques should surface useful structural features (e.g., effective diameter, small-world properties, and structural holes) not always apparent from either summary statistics or typical network visualizations. In this paper, we present GraphPrism, a technique for visually summarizing arbitrarily large graphs through combinations of 'facets', each corresponding to a single node- or edge-specific metric (e.g., transitivity). We describe a generalized approach for constructing facets by calculating distributions of graph metrics over increasingly large local neighborhoods and representing these as a stacked multi-scale histogram. Evaluation with paper prototypes shows that, with minimal training, static GraphPrism diagrams can aid network analysis experts in performing basic analysis tasks with network data. Finally, we contribute the design of an interactive system using linked selection between GraphPrism overviews and node-link detail views. Using a case study of data from a co-authorship network, we illustrate how GraphPrism facilitates interactive exploration of network data. [Full paper]
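To make the facet construction in this abstract more concrete, the sketch below computes one per-node metric over increasingly large neighborhoods and bins it into one histogram row per radius. Everything specific here is an assumption for illustration rather than the paper's definition: the metric (fraction of other nodes reachable within k hops), the equal-width binning, and the names `k_hop_sizes` and `facet`.

```python
from collections import deque


def k_hop_sizes(adj, source, max_k):
    """Number of nodes within k hops of source (excluding source), for k = 1..max_k."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == max_k:
            continue  # do not expand beyond the largest radius
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return [sum(1 for d in dist.values() if 1 <= d <= k) for k in range(1, max_k + 1)]


def facet(adj, max_k, num_bins):
    """Stacked multi-scale histogram: row k-1 holds the distribution at radius k.

    Assumed metric (hypothetical): share of the other nodes reachable within k hops.
    Requires a graph with at least two nodes.
    """
    n = len(adj)
    rows = [[0] * num_bins for _ in range(max_k)]
    for node in adj:
        for k, size in enumerate(k_hop_sizes(adj, node, max_k)):
            # Integer arithmetic avoids floating-point edge cases at bin borders.
            b = min(size * num_bins // (n - 1), num_bins - 1)
            rows[k][b] += 1
    return rows


if __name__ == "__main__":
    # A 4-node path graph: 0 - 1 - 2 - 3
    path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
    for row in facet(path, max_k=2, num_bins=3):
        print(row)
```

Stacking the rows from smallest to largest radius gives a compact picture of how quickly neighborhoods grow across the graph, which is the general flavor of summary the abstract describes.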
Mining the Web for Medical Hypotheses: A Proof-of-Concept System. Diana MacLean and Margo Seltzer. International Conference on Health Informatics 2011.
As the prevalence of blogs, discussion forums, and online news services continues to grow, so too does the portion of this Web content that relates to health and medicine. We propose that everyday, medically-oriented Web content is a valuable and viable data source for medical hypothesis generation and testing, despite its being noisy. In this paper, we present a proof-of-concept system supporting this notion. We construct a corpus comprising news articles about the drugs Vioxx, Naproxen and Ibuprofen that were published between 1998 and 2002. Using this corpus, we show that there was a significant link between Vioxx and the concept “Myocardial Infarction” well before the drug was withdrawn from the market in 2004. Indeed, within the Vioxx-related content, the concept ranks amongst the top 3.3% in terms of importance. When compared with the Naproxen and Ibuprofen control literatures, the term occurs significantly more frequently in the Vioxx-related content. [Full paper]
Groups Without Tears: Mining Social Topologies from Email. Diana MacLean, Sudheendra Hangal, Seng Keat Teh, Monica S. Lam and Jeffrey Heer. IUI 2011.
As people accumulate hundreds of "friends" in social media, a flat list of connections becomes unmanageable. Interfaces agnostic to social structure hinder the nuanced sharing of personal data such as photos, status updates, news feeds, and comments. To address this problem, we propose social topologies, a set of potentially overlapping and nested social groups, that represent the structure and content of a person's social network as a first-class object. We contribute an algorithm for creating social topologies by mining communication history and identifying likely groups based on co-occurrence patterns. We use our algorithm to populate a browser interface that supports creation and editing of social groups via direct manipulation. A user study confirms that our approach models subjects' social topologies well, and that our interface enables intuitive browsing and management of a personal social landscape. [Full paper]
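The co-occurrence idea in this abstract can be illustrated with a very small sketch: treat the recipient set of each message as a candidate group and keep the sets that recur. This is only a simplified stand-in, not the algorithm from the paper (which also handles overlapping and nested groups); the function name `candidate_groups` and the `min_count` threshold are hypothetical.

```python
from collections import Counter


def candidate_groups(messages, min_count=2):
    """Return recurring recipient sets as (group, count), most frequent first.

    `messages` is a list of recipient lists, one per message (assumed input shape).
    Single-recipient messages are ignored because they do not define a group.
    """
    counts = Counter(
        frozenset(recipients) for recipients in messages if len(set(recipients)) >= 2
    )
    frequent = [(group, c) for group, c in counts.items() if c >= min_count]
    # Sort by descending count, then by member names so the order is deterministic.
    frequent.sort(key=lambda gc: (-gc[1], sorted(gc[0])))
    return frequent


if __name__ == "__main__":
    history = [
        ["ann", "bob"],
        ["bob", "ann"],
        ["ann", "bob", "cy"],
        ["cy"],
        ["ann", "bob", "cy"],
        ["ann", "bob"],
    ]
    for group, count in candidate_groups(history):
        print(sorted(group), count)
```

Note that the smaller group {ann, bob} is nested inside {ann, bob, cy}; a fuller treatment would relate such groups to each other instead of listing them independently.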