Podcast episode: Maria Antoniak
June 27, 2022
With Chris Potts
Birth narratives, stable static representations, NLP for everyone, AI2 and Semantic Scholar, the mission of Ukrainian Catholic University, and books books books.
Show notes
- Maria's website
- Maria on Twitter
- Semantic Scholar
- Elliott Ash
- ETH Zurich Center for Law and Economics
- Text As Data (TADA) 2022
- David Mimno
- A computational reading of a birth stories community
- r/BabyBumps
- Roger Schank
- Nate Chambers
- ICWSM 2022 workshop: BERT for Social Sciences and Humanities
- Measuring Word Similarity with BERT (Sephora Makeup Reviews)
- Melanie Walsh
- word2vec
- BERT
- Nick Vincent's Twitter thread on Meta's OPT-175B filtering strategies
- Stemming
- Alexandra Schofield
- LDA
- LSA
- GloVe
- Evaluating the stability of embedding-based word similarities
- Narrative datasets through the lenses of NLP and HCI
- Belmont report
- Casey Fiesler
- Naive Bayes
- Allen Institute
- CORD-19 dataset, which appeared March 16, 2020!
- Books books books
- Pushkin Press
- New York Review Books
- Posthumous Memoirs of Brás Cubas
- And Then There Were None
- Stanisław Lem
- Jeff VanderMeer
- Italo Calvino
- Jorge Luis Borges
- xkcd
- War and Peace
- Middlemarch
- Beloved
- Novelist Cormac McCarthy's tips on how to write a great science paper
- Blood Meridian
- No Country for Old Men (book)
- No Country for Old Men (movie)
- The Road
- Taking a visual walk through Burnt Norton
- Ukrainian Catholic University
- Support Ukraine Now: Real Ways You can Help Ukraine
- Let Ukraine Speak: Integrating Scholarship on Ukraine into Classroom Syllabi
- Ukraine Trust Chain
- spilka
- World Central Kitchen
- Caritas Ukraine
- Science for Ukraine
- Data Science Crash Course: Interview Prep
Transcript
Chris Potts:All right! Welcome, dear listeners! My guest today on the CS224U podcast is Maria Antoniak. Maria is a PhD student at Cornell, in Information Science there. Before that, she did a Master's in Computational Linguistics at the University of Washington, and before that she was an undergrad in Liberal Studies at Notre Dame.
Maria's work sits at the intersection of a lot of different fields – NLP, digital humanities, cultural analytics, healthcare, and more. Her work is notable for its use of NLP methods to achieve rich new insights into language and culture, and she's been working hard to empower others in adjacent fields to benefit from NLP as well.
In the fall, Maria will take up a post as a Young Investigator at the Allen Institute on the Semantic Scholar team. And she's currently in Switzerland as a Summer Fellow at ETH Zurich.
Maria, welcome to the podcast! Thank you so much for doing this. How are you settling in in Zurich? I hope I'm not keeping you from an amazing hike or a cool jazz club or something like that.
Maria Antoniak:Thank you so much for having me. It's really nice to be here to chat with you, and to take part in this series, which I've benefited from listening to.
Chris Potts:Oh, wonderful!
Maria Antoniak:Yeah, I'm here in Zurich. I'm a Summer Fellow at the Center for Law and Economics at ETH, where I'm working with Elliott Ash this summer, who's a professor here at the Center. And we're working on some text-as-data projects related to narrative. We're both interested in how to define what narrative is, how we can measure narrative. And if we can measure narrative, how we could use such a metric to study other types of data sets.
So, that's broadly what I'm interested in, and working on this summer. But it's also just really great to be here in Zurich and meet a lot of new people. I didn't really realize ahead of time, but Zurich has a big Google office, multiple universities, so it feels a bit like an academic crossroads. And there are a lot of wonderful people here who – maybe, I know their papers or their internet presence, but never have had a chance to meet in person. So that's been really wonderful.
I haven't been going to many jazz concerts, but definitely hiking a lot. That's been my plan every weekend. It's been really good weather, and I'm just enjoying also being in Europe after a long time in lockdown in the U.S. And remembering many nice things about being in Europe, like public transportation.
Chris Potts:It does seem like a thriving NLP scene there too, that's great. So you can have long meditative hikes, while you talk about deep things related to language and culture. That sounds wonderful. How long will you be there for?
Maria Antoniak:Just one month.
Chris Potts:But that's good. That creates some urgency, because then everyone has to meet with you right away, as opposed to thinking that they can just be casual about it.
Maria Antoniak:Exactly. That's the plan.
Chris Potts:And is the connection with Elliott Ash something new related to law and economics, a new dimension to the narrative work?
Maria Antoniak:We had met before. There's a conference called Text as Data. That's actually a plug for Text as Data. I'm not sure when exactly the podcast will go up, but the deadline for the next Text as Data conference is coming up soon. And actually my advisor, David Mimno, will be hosting it this year at Cornell Tech. It's a really friendly community of people – a lot from political science and journalism, I would say – who use NLP methods as a tool in their research.
We had met there before and it just happens that we have this shared interest in narratives, and stories, and how people frame narratives and stories. And you could see how, from a political perspective, that might be interesting. And then also from my perspective – more from digital humanities – how that might be interesting. And we have a shared interest in trying to develop that kind of tool.
Chris Potts:I mean, absolutely. Storytelling around current events seems absolutely crucial to understanding them and thinking about their trajectory and so forth. Are there any particular events that you all are focused on now?
Maria Antoniak:No, not right now. Right now, we're focused more on narrative at a higher level. And again, how can you measure narrative-ness of a piece of text. But I definitely have thoughts in mind about where I might want to try out such a metric. For me, that's definitely healthcare stories, online healthcare communities. Being able to pick out, for example, places where people are telling stories, or journaling, would be to my mind really useful.
Chris Potts:Oh, absolutely. So the primary thing right now is to have good tools for detecting and characterizing? And then you've got some information that could be used for lots of different purposes. Is that the vision there?
Maria Antoniak:That's the vision right now. But we'll see where it goes.
Chris Potts:Well, you've already been successful at this, right? We could talk a little bit about your paper on narratives around pregnancy, highly topical. Do you want to say a bit about it?
Maria Antoniak:Sure, yeah. In that paper, published at CSCW a few years ago, we explored people's true narratives of giving birth, as shared on a sub-Reddit called r/BabyBumps. And that sub-Reddit includes a lot of content about pregnancy and childbirth. But in this case it was easy to pick out the stories, because they were labeled in the post title as a birth story.
I think it's useful to think of these birth stories as a particular genre of writing with its own conventions. And a lot of our project focused on that. What are these narrative conventions of these birth stories in this community? Can we automatically extract the narrative structure, the shared narrative structure from these stories? And, once we have that shared narrative structure, can we look for outlier stories – stories that deviate from the norm in this community. Or rather, maybe I shouldn't say norm, but from the average in this community.
In the second part of the project, we looked at power relationships between the people who are discussed in these stories. Often there's a nurse, or a doctor, partner, family members. And we know that birth can often be a disempowering and even traumatizing medical experience. And negative experiences have been tied in prior work in healthcare literature to a lack of control. So we're really interested in who was framed by these authors as having control or lacking control. And we found that the authors framed themselves as lacking control, and framed the doula, in particular (if a doula was included in the birth, which isn't always the case), as having a lot of power in these stories. So that in broad strokes is what that project was about.
Chris Potts:Fascinating. I'm interested in the discoveries about the outlier stories, because I guess I have an intuition that, in literary history, say, the outlier stories, the ones that are parodies of the norm and things like that, tend to be the ones that survive and seem great to us. What's special about the outlier stories in this work?
Maria Antoniak:Well, what we do is we examine the words used in the post titles – the bigrams used in the post titles, specifically – because we had noticed that the authors of the stories used the post titles as a kind of tagging or labeling for the stories. Not an official flair system, but, kind of organically, they have come up with labels that they insert into the post titles, as the title's framing for the whole story.
So we looked at which bigrams were correlated with more or less likely narrative sequences in these stories. In the stories that had less likely narrative sequences, the bigrams in the post titles were usually about negative experiences. They might say "negative", a "negative story", unplanned events like "emergency C-section", or "emergency surgery". And more surprisingly, perhaps – what really surprised me – was a bigram that was strongly associated with these stories: "happy ending". So what is otherwise a fairly negative list of terms also includes "happy ending". And I think that reflects, perhaps, a shared desire when telling these stories in this community to reframe what may have been a scary or negative experience in a way that emphasizes, perhaps, that the author persevered. Or, in the end, there was a happy ending, despite everything else that happened. And the author is taking back control of the narrative.
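[As a rough illustration of this kind of title-bigram analysis – not the paper's actual method – here is a minimal Python sketch. The story fields and the smoothed log-odds scoring are assumptions made for the example.]

```python
# Hypothetical sketch: which post-title bigrams are associated with stories whose
# narrative sequences score as unlikely ("outliers") under some upstream model?
import math
import re
from collections import Counter

def title_bigrams(title):
    tokens = re.findall(r"[a-z]+", title.lower())
    return set(zip(tokens, tokens[1:]))  # unique bigrams per title

def associated_bigrams(stories, quantile=0.1, min_count=5, top_k=20):
    # Treat the stories with the least likely narrative sequences as "outliers".
    cutoff = sorted(s["sequence_loglik"] for s in stories)[int(len(stories) * quantile)]
    outlier, rest = Counter(), Counter()
    for s in stories:
        target = outlier if s["sequence_loglik"] <= cutoff else rest
        target.update(title_bigrams(s["title"]))
    n_out, n_rest = sum(outlier.values()), sum(rest.values())
    scores = {}
    for bigram, count in outlier.items():
        if count + rest[bigram] < min_count:
            continue
        # Smoothed log-odds of the bigram appearing in outlier vs. other titles.
        scores[bigram] = (math.log((count + 0.5) / (n_out + 0.5))
                          - math.log((rest[bigram] + 0.5) / (n_rest + 0.5)))
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]

# `stories` would be a list of dicts with hypothetical keys, e.g.:
# {"title": "Emergency c-section but a happy ending!", "sequence_loglik": -42.3}
```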
Chris Potts:That's fascinating, because I guess that's a very common literary arc as well, right, that those authors are engaging with? Are those stories also the ones that you can detect as most popular in some sense, or are they just outliers in your technique?
Maria Antoniak:We didn't at all measure the popularity of the stories. There could be a whole follow-up study looking at the number of comments, the type of comments, how many up votes, how many down votes each of these stories received, but we didn't explore it in that work.
Chris Potts:I was just thinking that there could be two sides to that. On the one hand, we're probably attracted to these stories that are unusual in some aspect. And the happy ending is always nice, especially if it emerges from the depths of despair. But those stories might not be representative of whatever we're looking at. So as we move into law and economics, where the average story might be the one that we really want to think about, it could be distorted by artful storytellers, for example.
Maria Antoniak:I think what is actually universal to all these stories, or the vast majority of the stories, regardless of whether the narratives are likely or unlikely, is this dramatic arc, dramatic tension. And it's what makes the stories really interesting to read. That's how I originally became interested in these stories myself. I started reading them on my own and was really gripped by them, and by the tension and the narrative arc.
To me, they read almost like mystery novels. They're formulaic. You have a broad sense of where the story is going to start, where the story is going to end. Maybe some pivotal moments that are integral to this genre, and that you know to expect, and that are very satisfying to then read. But you don't know exactly how or which twists and turns this particular story will take. And that's what makes it fun to read. So I think that's true – not just the outlier stories, but most of these birth stories.
Chris Potts:And you mentioned that you were helped in this research by the known topical coherence of the sub-Reddit that you were in. Is that a next frontier for you – detecting these narratives when you don't even know what the text is about?
Maria Antoniak:Yeah, that would be very nice to be able to do. But I foresee it being very difficult. It would be really great if you could just pull out every story from Reddit, for example, across every sub-Reddit and every topic. There are methods to do this, but how well they work across different domains is an open question.
Chris Potts:Well, I assume you're undaunted by the challenge! I think of this as one of the hardest problems in NLP, stretching all the way back to work by Roger Schank on "scripts" and things like that. And then more recent research by Nate Chambers, for example, on trying to detect narratives in Twitter threads. Even finding simple stuff like – this is the normal course of the meals people have in a day – can be very difficult, to say nothing of the kind of ambitious narratives that you're characterizing.
How did you get into the research? Did your advisors warn you that it was going to be difficult?
Maria Antoniak:I mean, to take it back again to the birth story – what made this possible was the domain. Because it's true, script learning, or extracting event sequences, is very difficult. And often you want to do this from novels, which are unconstrained in topic and characters, so it is very difficult. What was powerful about these medical stories is that they have shared biological sequences. Doctors are trained, for example, to follow certain sequences of events or certain procedures. And so, because of that, we're able to do a lot more, and to try out a lot more than we might be able to otherwise. So, it's a plug for this kind of data and why it's so valuable for us in NLP.
I don't remember anyone specifically telling me that it's going to be too hard. But, I mean, it's good to work on hard problems and problems that interest you, I guess.
Chris Potts:Oh, absolutely yes!
There was another piece of your summer adventures that I wanted to ask about, which is this workshop that you did at ICWSM. It looked incredible. Let's talk a little bit about it. First of all, did you reach the audience you were hoping to reach? I gather that you were looking to get outside of standard NLPers with the techniques you were reviewing. Were those people in the audience?
Maria Antoniak:Yeah, I think so. I wish we had had more time to talk to everyone at that particular workshop. It was very well attended, which was great, but it meant we didn't get to talk to everyone who was there. But, of the people I did talk to, we had a really wide range – people from the CDC, researchers of social media, people from industry. We were invited based on that ICWSM workshop to give another one at Bell Labs.
This was the latest iteration out of many. So this was originally what we called the "BERT for Humanists" tutorial and focus group. This grew out of a grant with my advisor, David Mimno, and Melanie Walsh, who was a postdoc in our lab when we started, but is now a professor at the iSchool at UW.
We had noticed that there are some tools, like word2vec, that have really gotten picked up by people working in the digital humanities and in computational social science. But then there are other tools – and in particular, we were thinking about Transformer-based models like BERT – that have, of course, completely overtaken NLP, but we haven't seen get picked up as much in other fields like the digital humanities. We were a little curious about that. So the first part of this project was a focus group where we invited people working in the digital humanities to talk to us about their experiences with these models. Like, "have you tried to use BERT? What happened? What was hard? If you haven't even tried, why was that? What barriers? Or why did you think it might not be useful for your research?"
So we had this focus group, and then off of that we created that first tutorial, "BERT for Humanists". And then we've iterated on that a few times. ICWSM is a conference for web and social media. So we modified it to be more web and social media focused. We added a new notebook, and all these materials are available on our website. In this new notebook, we explore distances between the contextualized vectors and compare them to static word embeddings. The data set that we used is a set of Sephora reviews, which is a really fun example. I recommend taking a look.
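[A rough sketch of the kind of comparison that notebook describes: BERT's contextual vectors for the same word in two different review-like sentences, versus a static embedding, which would assign one vector regardless of context. The model choice and sentences below are illustrative assumptions, not the tutorial's actual materials.]

```python
# Sketch: the contextual vector for "light" differs across contexts,
# while a static embedding like word2vec would not.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def contextual_vector(sentence, word):
    """Return BERT's contextual vector for a word that is a single word piece."""
    enc = tok(sentence, return_tensors="pt")
    tokens = tok.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]   # (num_pieces, 768)
    return hidden[tokens.index(word)]

a = contextual_vector("this foundation gives light coverage", "light")
b = contextual_vector("the travel bottle is light and compact", "light")
print(torch.nn.functional.cosine_similarity(a, b, dim=0))
# A static embedding would give similarity 1.0 here by construction,
# since it assigns the same vector to "light" in every context.
```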
We also tried to build into this version a few more stopping points, where we would stop and reflect together – not just on learning, like providing this as a tool or method to people working on web and social media data, but also on what people working on web and social media data can offer back to us about this tool. And a particular place where I think we had some good reflection on that was around the social media data that is often used, for example, for pre-training large models.
There's a recent example that was brought to my attention, I think by Nick Vincent's Twitter feed, that Meta's recent large model, OPT-175B, is pre-trained on Reddit comments – but only those comments that are part of the longest comment threads. And you could imagine that people at ICWSM, researchers who care deeply about these online communities, might have a lot to say about that, and a lot of useful and helpful things to say about decisions like that. So that was one of the focuses of this particular tutorial that wasn't in our previous versions of this tutorial.
Chris Potts:Oh, that's wonderful. We'll link to all this stuff. It'll be fun for me to track it down.
For the first focus group that you did, what kinds of things did they say about their experiences with contextual models?
Maria Antoniak:Of course, infrastructure. For digital humanities and computational social science research, and even for NLP researchers, it can be difficult to get the resources you need to work with these kinds of models. So in the tutorial we use Google Colab. And it's also an intro to Google Colab, and also an intro to what GPUs are. Why do we use them, why do we need them?
So some of it was infrastructure. Some of it was being unsure. And I think this is still an open research question, which is: why would I use a pre-trained model when I am interested in my specific data set that I have hand-curated over a decade? Why would I want to influence my representation of my data set with all this other data that I don't even know what it is? So that was a challenge that I certainly don't have a full answer to. But I think more work on that topic would be really interesting.
And then also again, because there's such a focus on the data set in humanities and social science research, the pre-processing of the data was a bit surprising, I guess. In retrospect, it doesn't seem as surprising, but at the time it was a little surprising. Even during the tutorial, when we showed the slide with the transformation, for example, of a poem – like here's the text of the poem, now here is the poem, truncated and word-pieced, and in the form that it's going to go to BERT. That really, I think, alarmed some people. And it makes sense. Again, when you care deeply about each word that is being used in your data set, I'm sure it feels very un-intuitive to then chop up your data and change it in this way. A lot of the questions we got were like, "But why? Why would you do that?" And those are great questions. And there's been a lot of research afterward on alternatives to word pieces, for example.
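[A small illustration of the transformation described on that slide, assuming the Hugging Face tokenizer for bert-base-uncased: the text is split into word pieces, mapped to IDs, and truncated to the model's maximum length. The poem line is just an example.]

```python
# Sketch: what a text looks like after word-piecing and truncation.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")

line = "April is the cruellest month, breeding lilacs out of the dead land"
pieces = tok.tokenize(line)
print(pieces)            # rare words split into pieces, e.g. 'cruellest' -> several '##' pieces
print(tok.convert_tokens_to_ids(pieces))

# Long documents are also truncated to the model's maximum length (512 pieces here):
enc = tok(" ".join([line] * 100), truncation=True, max_length=512)
print(len(enc["input_ids"]))   # 512, regardless of how long the original text was
```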
Chris Potts:That's so interesting, though. Of course we're all thinking about not having to truncate our data. We all know that's tragic. But the word piece part is interesting because I feel like I would be very happy if one of my main contributions was to just get people outside of linguistics to stop running the Porter Stemmer on their data, which they do all the time, and they seem to accept that it has completely distorted the texts that they started with. And the word piece thing seems much less bad. I mean, in fact, setting aside the truncation, there's really no loss of information in that way. And I think you still get the benefits that I guess were coming from doing the Porter Stemmer. What do you think made them so shocked?
Maria Antoniak:Well, it might just be something about the word pieces – when you break things up into word pieces, they don't necessarily look like what you might think of as word pieces, or as stems of words, in the way that the Porter Stemmer's output does. The Porter Stemmer might look very intuitive, but it might not actually help you out in whatever your task is. And yeah, I was thinking about this earlier when I was thinking about this question, which is that it is true: in the humanities, you're very focused on the data set, and that can also lead you to want to create a lot of customized approaches to your custom data.
Sometimes it's actually more helpful – and I think this is where NLP can help a bit – to approach it more simply. We don't need to stem things, actually; leave the stems on. The stems are helpful. We see that in topic modeling, for example. Alexandra Schofield, one of my former lab mates, who's now a professor at Harvey Mudd, has great work on exactly this: stemming, stop-word removal, and duplicate documents, and how these affect or don't affect topic models.
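[For readers who haven't seen it, here is what the Porter Stemmer does to a handful of words, using NLTK's implementation; the word list is just an example.]

```python
# Sketch: Porter stemming rewrites words into forms that are often not real words.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["stories", "narrative", "traumatizing", "emergency", "happily", "doula"]
print({w: stemmer.stem(w) for w in words})
# e.g. 'stories' -> 'stori', 'happily' -> 'happili', 'emergency' -> 'emerg'
```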
Chris Potts:That's so interesting though. I feel this too sometimes: Before, they got to do LDA or word2vec, and they could be very thoughtful about how they tokenized and what they kept in, and that could be bringing domain expertise. And I'm thinking that now, what we're going to say to them is, "Don't do anything to your data, but do a lot of very boring stuff with hyper-parameter tuning so you can do some domain-specific unsupervised pre-training on your corpus." And you've gone from being a domain expert to being someone who's kind of tending this deep learning machine in some mysterious way.
Maria Antoniak:It's true. And one of our goals in the tutorials was to, at least, build some intuition about those parameters, and how they might relate to the data set. Of course, it's not always so direct or so simple. But at least considering: how large is your data set? How many times might we want to iterate over the data set? What are the dangers of over-fitting? That's where the knowledge of the data set is still useful. But it's true: we're entering an era where maybe you don't even worry about parameters. We just ask GPT-3 what the answer is.
Chris Potts:It makes me think also that maybe we as NLPers need to think more about how you would stitch together the representations corresponding to some word pieces, so that a social scientist or humanist could recover the intuitive word form or place name, if it's multi-word, or whatever it is they need to deal with. And then, the work that they're doing could be on that kind of output layer of whatever BERT has processed.
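[One way to do that stitching, sketched under the assumption of a Hugging Face fast tokenizer: group word-piece vectors by the word they came from and mean-pool them, so downstream analysis can work with whole words again. This is an illustration, not a packaged tool.]

```python
# Sketch: recover one vector per original word by pooling its word-piece vectors.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def word_level_vectors(text):
    enc = tok(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]        # (num_pieces, 768)
    pieces_by_word = {}
    for pos, word_id in enumerate(enc.word_ids()):
        if word_id is None:                               # skip [CLS] / [SEP]
            continue
        pieces_by_word.setdefault(word_id, []).append(hidden[pos])
    # Mean-pool the pieces of each word and recover its surface form from offsets.
    words = {}
    for word_id, pieces in pieces_by_word.items():
        start, end = enc.word_to_chars(word_id)
        words[text[start:end]] = torch.stack(pieces).mean(dim=0)
    return words

vecs = word_level_vectors("the anesthesiologist arrived quickly")
print({w: v.shape for w, v in vecs.items()})   # one 768-d vector per surface word
```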
Maria Antoniak:Yeah, so there's definitely room for more tools, and especially visualization tools. I think, again, one part of our tutorial was a notebook where we show how to visualize distances between these word vectors. I found that part of the tutorial fun and illuminating. And it would be great if there were some tools. I think that's also why word2vec was so successful in getting ported to other disciplines – it was relatively easy to get started with. And there were a lot of public tools and visualization tools to work with it.
Chris Potts:Oh, absolutely. I mean, I was thinking, if we could systematize some of what we know about domain-specific, unsupervised additional pre-training into some kind of software package that other people could use, we would benefit, of course, because I think very little is understood about that, and then they would have a kind of space to work in, to get good representations for their corpus. But that will require tooling and not just insights, I guess.
Maria Antoniak:Yeah. I would love for someone to build stuff like that.
Chris Potts:For the question about uptake – I guess my guess would've been that it just takes these other fields a long time to come to terms with these methods, because the methods are so important. It's like the foundation for all their scientific results. And so, in my research lifetime, it was like, they came to terms with LDA and probabilistic topic models, and did a lot of work with those models. And then they came to terms with word2vec. I guess that was easier, because it's like, okay, more vector representations at some level. And then just when they were coming to terms with that and what the methods would be, we offered BERT, which really is a paradigm shift in how you think about representations. And so it will just be a long time before they use BERT. Is there anything to that?
Maria Antoniak:Yeah, of course, it takes time. Same as it can take NLP time to adopt new things. Part of this might be pandemic time slippage, but we've had BERT for a while now. I mean, within our fast-moving field, it's been a little while. And so, whether or not it's literally taken longer, I guess we would have to do some study of how these different methods end up in other fields, and how long after first publication. But it does seem like there were at least some different reasons why uptake might be taking a little while. Again, the increased need for infrastructure, maybe less familiarity with Transformer-based methods – whereas word2vec and some of its variations are based on matrix factorizations that are perhaps easier to understand. As you said, maybe that related more cleanly to other methods, like topic modeling or LSA, that were already familiar. I think it was worthwhile thinking about why – what were the challenges with BERT? Which we talked about before.
Chris Potts:But don't your social scientist and humanist friends laugh when you say four years is a long time? Some of them might have had papers under review for that whole time!
Maria Antoniak:Of course!
Chris Potts:I mean, I myself kind of feel like NLP would do well to slow down a little bit and be a little more critical in acceptance of new results, headline results, and things like that. We could take a lesson from these adjacent fields.
Maria Antoniak:Of course. There's so much to read and so little time, and it seems I'll never catch up!
Chris Potts:I think, though, even if BERT has some value, these fields will continue to benefit from methods like word2vec and GloVe. Having static representations of words, it's incredibly powerful, words and concepts maybe. And you have a really important TACL paper from... Oh, I've lost track of time ... but with David Mimno, on the stability of methods like word2vec. What's the reception to that been? Has it found the audience you were looking for, this paper? And we could talk about the core results if you want.
Maria Antoniak:Yeah. So, I mean, the core result of that paper was that if you are relying on, for example, cosine distances between word vectors from methods like word2vec or GloVe as evidence in your research, it is not sufficient to run the model once, get the distances once, and report ranked lists of words, which was common. These ranked lists of words are actually unstable across iterations – in particular, as we showed, for small data sets and when we compare bootstrap samples of the data sets. So, if you have multiple copies of a particular document, or if you, in another training run, leave out a particular document, that can have a large effect – or, we'll say, a significant effect – on those ranked lists of words.
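[A minimal sketch of that kind of stability check – not the paper's exact experimental setup: train word2vec on bootstrap samples of a corpus and measure how much the ranked nearest-neighbor lists overlap across runs. The corpus file name and the query word are placeholders.]

```python
# Sketch: how stable are word2vec nearest-neighbor lists under bootstrap resampling?
import random
from gensim.models import Word2Vec

def bootstrap(docs, rng):
    # Resample documents with replacement, as in a bootstrap sample of the corpus.
    return [rng.choice(docs) for _ in range(len(docs))]

def top_neighbors(docs, query, topn=20):
    sentences = [doc.lower().split() for doc in docs]
    model = Word2Vec(sentences, vector_size=100, window=5, min_count=5, workers=1, seed=0)
    return [w for w, _ in model.wv.most_similar(query, topn=topn)]

def jaccard(a, b):
    return len(set(a) & set(b)) / len(set(a) | set(b))

# `birth_stories.txt` is a hypothetical file with one document per line.
with open("birth_stories.txt") as f:
    corpus = [line.strip() for line in f if line.strip()]

rng = random.Random(0)
runs = [top_neighbors(bootstrap(corpus, rng), "doctor") for _ in range(5)]
overlaps = [jaccard(runs[i], runs[j]) for i in range(5) for j in range(i + 1, 5)]
print(sum(overlaps) / len(overlaps))   # low average overlap = unstable ranked lists
```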
When we presented that work, there were two other papers at that time that were on similar topics, on kind of different angles to this stability problem with these distances between word vectors. And I think together those papers have had the effect that I think we wanted, which was to encourage people to think more about the stability problem, and to report more – to measure stability or report the significance of these results, rather than only reporting the ranked lists of words.
So I think it has reached the audience that we wanted it to reach. It was my first real published work. And so, it was interesting to see people cite the paper, and get a notification that someone has cited your work. And then you go and look at their interpretation of what you wrote in your paper. This is probably a universal experience, but it was not always, perhaps, what I was trying to communicate. And I think that was a good lesson about science communication.
When we first publicized that paper, I wasn't as active on Twitter. Now I would think of doing a thread, and trying to really highlight the key takeaways in that thread, because, like it or not, tools like Twitter can help us get that core message out.
Chris Potts:Well, I'm curious about this now. What were the reactions, the takes on your paper that shocked you?
Maria Antoniak:No, there was nothing that shocked me. Just that, maybe, the take-away would be slightly different, or the emphasis might be a bit different. For example, stability. We suggest that there's instability across these different runs. You could frame that as a problem. Like it's bad that there's this instability. Or you could frame it as just, there's instability and we need to report measurements on it. These are different framings and it will depend a bit perhaps, again, on how focused you are on a specific data set.
Take what I just said about bootstrapping. When I've described that to a few people and talked about this paper, again from the digital humanities and social sciences context, they say, "Why would you leave out parts of your data set that you've carefully curated?" And that's a really interesting perspective. I don't know that it's right or wrong. But just, it's not what I had in mind when I wrote the paper.
Chris Potts:That's what I was anticipating: lines and papers that say, "we do not adopt word2vec and related methods because of their instability." And then they cite you. And then they say we're going to use a standard unigram language model to estimate something about word relationships. And you're kind of broken-hearted because there was something better.
Maria Antoniak:Yeah. And maybe there's a bit to be said there also about evaluation papers, and papers that make some critique. It doesn't mean I don't like word2vec! I really like word2vec! It's a great tool! Its output can seem magical in some ways. It's great. But it's not magic. We need to maybe evaluate our results. That's what we were trying to say. Not that word2vec is bad. I don't think this happened so much in this particular case, but maybe there's a glimmer of that, of how critique papers can be received.
Chris Potts:I'm interested to do a little bit more of a deep dive on all these different communities you interact with. Because my take about this is, social science, that's easy. We have a lot of kind of common methodologies and goals and so forth. Harder for me are humanists, because I think very often humanists are not driven by quite the kind of scientific hypotheses that we are. And that can lead to a kind of disconnect, as they're searching for something that's actually quite different from what I presume their goals are. Have you encountered any of that? Or what are the goals of the digital humanities?
Maria Antoniak:Oh, well I'm not going to try and answer the last question, because someone will certainly try to debate me! But, I would say – about how the goals are similar or different to goals, in NLP for example – they're somewhat the same and somewhat different. So one thing that we've already talked a bit about is pre-processing and a strong focus on the data set, and thinking hard about pre-processing in a way that we are doing less of now in many parts of NLP. And wanting to try out, for example, highly customized pre-processing methods like stemming.
Another thing that I've thought of is that, I think in the humanities there's a stronger sense of being invested in a single, or in a particular, data set. And that you're situated in that data set. That's the context you're working from. And that actually reminds me also of, more HCI or ethnographic approaches as well. Where again, you're really embedded in a particular context. This is on my mind because we have a workshop paper that we're going to be presenting at NAACL this summer on narrative and NLP and HCI approaches to narrative. As we were working on this from the HCI perspective, I was also thinking, well, this is also reminding me of the digital humanities perspective, caring a lot about a particular data set. Whereas in NLP, well, we might have a particular data set, but it's more that it's a shared data set, so I'm not necessarily invested in that data set, other than that I want to test my model on it and build some abstraction. I'm not necessarily trying to learn about that data set.
Another difference is in the goals. I think often in the humanities, there might not be one right answer. You might not be seeking an answer, but rather you might be exploring a data set. And your goal might instead be the creation of a lens or a view of that data set, in the midst of many other possible lenses and views of even that same data set. And that feels different to me. And I think that is the key, thinking about methods like topic modeling. What's the point of topic modeling? Well, there can be different points to topic modeling. But one of them, I think, is this more exploratory goal. You're creating a view of the data to explore. Asking whether or not it's the right view is maybe not the right question to ask.
Chris Potts:Oh, that's tremendously helpful to me – that idea of a lens, of a new way of thinking amongst many others. That's what I was thinking. That's why this can be difficult. You press the humanist:
"But what's your claim about 19th century Irish novels?" And they say, "No, I want a new way to experience and read them and talk about them with people." "But what's the core hypothesis?" It just feels like you're talking past each other. And I think that's right, this is a new way of experiencing the works that our methods can offer.
Maria Antoniak:Yeah. But we can end up using the same tools for these very different purposes. And when that happens, some funny things can happen. So, with the word vectors, if the word vectors are meant to be used as features downstream, well, maybe it doesn't matter as much if they're unstable. If we're using them as evidence about the biases of the authors of a particular corpus, now we might want to think very carefully about their stability, for example.
This also makes me think about reproducibility. I'm thinking about these shared data sets in NLP versus these particular data sets in the humanities, and maybe the need for some more reproducibility in the digital humanities. It's very difficult. I mean, it is just difficult to create shared data sets. A lot of data is under copyright. So that's an area where I again see a difference, and maybe some room for people to work together on creating some new shared resources.
Chris Potts:Right. Is evidence different though also? I mean, of course I want to reproduce your research if I'm a digital humanist. But I guess I could also just say, as an aesthetic experience, this just doesn't resonate with me. It's just not how I read these novels or it's not what I think was intended by the authors or whatever. And on those grounds, say that the result doesn't have value. Which would be very different from us just saying, "Well, what's your evidence?" Right?
Maria Antoniak:Yeah. I mean, maybe we could reframe that as domain expertise, which is something I think we can understand. It's a little bit difficult. Can you get three English professors in a room to annotate your data? That's challenging, but you can do that. I mean, it's domain expertise, even if it's an aesthetic judgment, and you could query people for that.
Chris Potts:That sounds like a fascinating process. So do you think they would agree to annotate based on aesthetic experiences? You'd have to give them a very wide open vocabulary, right? It can't be a five-star scale, I'm guessing?
Maria Antoniak:Ah, yeah. I guess it depends. We would need to be more specific. I mean, there's work on measuring suspense, for example. There it's a little bit less expert-driven – I don't know that they have used expert readers quite as expert as this – but there's definitely been work on measuring reading experiences and these more emotional responses to literature. I don't know off the top of my head of a work that has labeled data via a large cohort of humanities professors. But maybe it's been done and I'm just not aware of it.
Chris Potts:Your remarks about data are interesting to me also, because I was thinking, in NLP, for a lot of things people do, we like to pretend that there aren't actually people represented in the data. It's just like some anonymous process has produced a lot of texts that we benefit from. And to the extent that we attend to the individuals, it's kind of like, well, we just think of them as representations or data points. Obviously digital humanists are not going to do that with their novels. And I was thinking that your research might also tend toward the humanists in that regard. I'm thinking in particular of the birth stories work, where certainly the people that are represented are kind of really central to what you're doing. Do you think about that data set differently, or your own data, differently as a result?
Maria Antoniak:Yeah. My view is that we shouldn't forget the individuals and the authors of the data sets, and there are ethical reasons to do that. And there can also be engineering reasons to not forget about them.
From the ethical side, there's a whole thorny question about when it's okay to use data that's posted publicly online and how consent works in those situations, just as one example. And, in those cases, I think that maybe this again goes back to the idea that caring about the individual data set is also caring about your individual research problem. So taking time with your research team to think through the research question, the research goals, and the data set, and the spectrum of possible benefits and harms. We discussed this in the birth stories paper, and we use the principles from the Belmont report as a framework to think about those risks and benefits in that particular case, to those particular authors of that data set.
It would be great if there were more IRB guidance to work through that. I know a lot of places that's still not the case. But I think it's really important to do, especially when, in the case of the birth stories, you have individuals who have important stories to tell and are taking a lot of time and effort to try to communicate those stories. For example, if I had created a copy of that data and put it somewhere else to make it available to other researchers without the knowledge or consent of those authors, I don't think that would've fit either with the goals of that community or my own research goals. So, in that sense, it's very important to keep the individual authors in mind.
Then engineering-wise, questions about representation and bias. Are we modeling what we think we're modeling? If we don't know who the authors are – if we have no care for who they are or what our data actually looks like – it's going to be really hard to answer those questions.
Again, how stable are our results? How can we characterize the data points? And in particular, I think we should know as computational linguists that it's not enough to just report averages in our papers. We also want to see the data and see examples of the language that we're modeling. So zooming back in on individual examples, individual stories, individual texts, where possible, can help validate and enrich our quantitative results. In both these framings, I think individual stories and keeping track of individual authors is really important.
Chris Potts:Oh, that's great. So it sounds like you're balancing the pressure for open science and the importance of privacy and respect for participants by kind of finding a way to disclose what the scientist needs, while in open dialogue with the people who contributed the data, or something like that?
Maria Antoniak:Yeah. It's a tension. And exactly – for each project, you will have to find that balance.
Chris Potts:And the thing you said about consent is so important. Our field is going to have to reckon with the fact that it is false to assume that posting something online is implicit consent to be part of a scientific experiment. And you can tell that people out in the world actually believe the opposite. They do not regard their public posting as consent to be part of a science experiment. In lots of communities, people are objecting on GitHub that their code is being used to train models. And people object when they've discovered they were part of sort of a social sciences NLP experiment that manipulated them in a forum or something, right?
Maria Antoniak:Yeah, I think it's really complicated. Which could sound like a cop-out, but there's great work from, for example, Casey Fiesler's lab at Colorado, where they survey people about their preferences in different settings. I think they have a paper on Twitter users, and also a paper on fan fiction authors, for example, where they ask them about their preferences. "Did you know that your data might be used in this way? How do you feel about it?" And it's actually... I was surprised at the amount of nuance in people's opinions, and the range of people's opinions. And the different cases in which it might be okay to use their data, or it might not be okay to use their data.
Who has control of a particular data set? Is it the moderators of the community? Is it a majority vote of the community? Does each person in the community need to individually give consent? And do we even have the mechanisms to do any of that as researchers? There's a lot to think through, and I would say it's complicated.
Chris Potts:Mm-hmm (affirmative). There's one other group that I'm curious about that you interact with, and that's journalists. And so you've done some workshops that are trying to help journalists in some way. And I was wondering if you could talk about that, and what their goals have been and things like that.
Maria Antoniak:Yeah. I did this one course with a journalism school, and I tried to give them some tools. It was supposed to be an intro to NLP. Here's how to train a classifier, and run a classifier on your own data set. Here's a topic model and how you might use it to explore a large and unfamiliar data set. But I was also trying to give them some critical lens to view NLP and machine learning research. And hammer home that, as journalists, please don't just accept at face value what researchers, or especially tech companies, are telling you about this work.
In that particular tutorial, I included a bunch of resources and public sources and publications on data and AI ethics, and critical studies of these fields. And I also talked a bit about, of course, stability of results and significance tests. But also a type of storytelling that can arise, especially with these unsupervised methods. Where it can be easy to convince yourself that what you're seeing is what you wanted to find. But if you had, again – like with our ranked list of words – if we had come up with a different ranked list of words on the next training iteration, we may have come up with an equally plausible story.
Within a two-hour workshop, I tried to build some skepticism of these methods, while also showing how cool and useful they can be and why we're excited about them. I don't know if I completely succeeded, but that was my goal. And reporting on these topics often frustrates me. So, that's why I was trying to communicate a bit about this skepticism.
Chris Potts:Yeah. I feel all of those tensions. And I guess I would be unsure... Were they trying to use NLP methods to do journalism? Or were you trying to make them more skeptical consumers of news about our results and things like that? Or was it a mixture?
Maria Antoniak:Well, the goal, I think, was more of the former. But then I was trying to do both, or at least weave in pieces of the second, more critical view. As a consumer, how you should take in these statements from tech companies. Or to not be intimidated when they say that we use this complicated model. It might not actually be that complicated. And I've given you the tools in this workshop today to actually understand some of those models yourself, and think through whether what these companies are saying makes sense to you.
Chris Potts:I guess part of that would have to come with kind of demystifying the models for them. Showing them that this thing that was described in very colorful terms is actually just doing this set of multiplications of various numerical representations. That you should remind yourself that at the bottom, it's just a bunch of those very boring, mechanical things and not what might be in their imaginations, which is something more like science fiction.
Maria Antoniak:Yes, definitely. I've run a similar, just kind of like small intro to machine learning for some of the other PhD students in my program. I did this a few years ago. And my proudest moment was – I was trying to do exactly this, like break down some of these models, make them not intimidating, just some linear regression, logistic regression, Naive Bayes – and so, I was so proud when later one of my friends, they had this light bulb: "Naive Bayes is really naive." "Yes! You got it! You got it!"
Chris Potts:Well, I've done that too. And I feel proud when people say things like, "So all that's happening is I'm learning this vector of weights, that's it?" Because you can tell then that they thought so much more was happening as part of this process.
Maria Antoniak:And it's funny because, on the one hand, to me as a researcher in my chosen fields, I think these methods are cool. I think they're exciting. And actually, in their simplicity, that is what makes them even more exciting in some ways. So, teaching is hard. And I don't know that I've found the right balance here between, "I want you to be excited about the tool!" and also "I want you to be skeptical about the tool." I don't want you to be afraid of the tool. I want you to be confident in approaching it and looking up more information after this tutorial, and learning more, and diving into the math. But I also don't want you to be so overconfident that you are maybe misunderstanding something, or missing a whole other branch of this field that we didn't have time to talk about today, for example. It's hard.
Chris Potts:I totally agree. And I mean, these are very ambitious teaching situations, where you're trying to reach people with lots of other interests very different from our own. We can't presume anything about what their goals are. I mean, that's why it must be fascinating to just hear what their questions are, and how different their questions are from what you would get if you were just at one of the ACL conferences reporting on this research.
Maria Antoniak:Definitely.
Chris Potts:So your next stop professionally is Washington, for the Allen Institute, where you'll be working on the Semantic Scholar team, right? Do you have a sense for what that work will be?
Maria Antoniak:Yeah, I'm really excited. I've always wanted to spend some time at AI2, and I guess this is going to be the time, finally. And I've always been really excited about Semantic Scholar. I was actually an early user-study participant for Semantic Scholar before I started my PhD, I think before I'd even applied for my PhD. And I've used it in my own research a lot, in gathering papers and writing lit reviews, and just kind of exploring and reading. So I'm really excited.
My work will, I think, be an extension of my current work in NLP and cultural analytics and will cover multiple topics. Some of that will definitely, I hope, involve Semantic Scholar data – whether the publications or the citation graph, for example. And there's a great API for the citation graph, for anyone interested.
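[For anyone curious, here is a small sketch of querying the Semantic Scholar Graph API for a paper's citations. The endpoint, field names, and example ID reflect my reading of the public documentation and should be checked against the current API reference before use.]

```python
# Sketch: fetch the papers citing a given paper via the Semantic Scholar Graph API.
import requests

BASE = "https://api.semanticscholar.org/graph/v1"

def citations(paper_id, fields="title,year,externalIds", limit=100):
    url = f"{BASE}/paper/{paper_id}/citations"
    resp = requests.get(url, params={"fields": fields, "limit": limit})
    resp.raise_for_status()
    # Each item is expected to wrap the citing paper's metadata.
    return [item["citingPaper"] for item in resp.json()["data"]]

# Example usage with an arXiv-style identifier (illustrative, not an endorsement
# of any particular lookup scheme):
for paper in citations("ARXIV:1810.04805")[:5]:
    print(paper.get("year"), paper.get("title"))
```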
But I'm also really interested in the community of scientists using the website. And I'm a big fan of productivity tools in general and organization tools. I have a few blog posts on my website, for example, about this. It's maybe partly a hobby interest that intersects with research. And I'm excited to also be thinking about Semantic Scholar in that sense, as this tool, and how we can make it more useful for the community.
I'm also excited to meet other people at AI2, and the other teams. AI2's grown a lot. It's pretty big now. And there's also the entire University of Washington community next door. And I'm excited to work on a bunch of projects, on my usual topics – healthcare, books, stories, and online communities.
Chris Potts:I think Semantic Scholar is wonderful. I'll confess to you that I have to break a bad habit of going to Google Scholar. It's like I go there without even thinking I've gone there, and type things in. And then I just feel it's not the right thing to be doing. I don't like the BibTeX that comes back, and Semantic Scholar gives me so much more information. Maybe I should just change... Maybe it's just a fact about how my browser is set up. And I'll change my habit if I change the button I click.
Maria Antoniak:You need to type "se" not "sc" into your autocomplete. That's it. It's a one-letter change.
Chris Potts:I really think it just seems like such a vibrant place. And it seems like there's a lot of new things that are happening with the product. And there's so much potential there for empowering people, and maybe getting them to break outside of their citation networks, and things like that I think could be really transformative for the field.
Maria Antoniak:Yeah. They've had a lot of work on exactly that. Breaking out of citation networks, and also accessibility and accessibility tools. Accessible readers, for example. That's really important. Really, really important. These papers, they're not just read by us. They're also read by people who are interested in these topics.
AI2 and Semantic Scholar were involved, I think, in a COVID-related effort to create COVID-related paper data sets and to increase accessibility for researchers. But you could also imagine how we all might be interested in those papers, and in understanding and accessing those papers. So, really important work. I don't know how much my own work will touch that accessibility part, but I'm a big fan of it.
Chris Potts:That COVID data set effort was incredible. CORD-19. Boy that came out like... Could it have been May 2020? [It was actually March 16, 2020!] I mean, almost instantly. And then they've been iterating on it, and it's been growing and changing. Really incredible.
Maria Antoniak:Yeah, really inspiring work. Maybe as linguists and natural language processing people, we don't always have so many opportunities to directly and quickly help others. And it's really great to see our tools used in a good way.
Chris Potts:Absolutely. I had some questions that are more about you, is that all right? Finish up that way?
Maria Antoniak:Yeah, definitely.
Chris Potts:At your website, you've got a blog, and you've got a list of books that you've been reading and some other links to tools and stuff like that. And I was intrigued by all of it.
For the book list, I was wondering whether you sense any themes running through the list of books that you've been reading lately. But that's my first question for you. What's going on with these books? Besides the fact that it looks impressively serious as a reading list.
Maria Antoniak:Yeah, I like having my list of books there. I mean, I hope other people enjoy it. The intention is, it's just something fun, not really research-related, about me. I love when other people have reading lists on their websites, and when it's not necessarily research related. Just the novels that they're reading, non-fiction that they're reading, that's not related to NLP. So, I enjoy seeing it when other people do it. And I also like looking back on my reading. I don't log everything. I read a ton of Agatha Christie novels that I do not log on that list. So there're some things missing. And I also don't usually list research-related texts in there.
But I write them in order. And, if I were going to characterize the whole list, it's a lot of literary fiction, especially psychological fiction. That's just my personal preference and what I like. And I tend to look to smaller publishers to curate things for me. So, for example, I love the Pushkin Press. I love the New York Review Books press, which is a publisher connected to the Review magazine.
I do notice some more specific patterns. You can see on that list in early 2020, there's a spurt of books all about people trapped inside houses. And it wasn't intentional at all. But looking back, that's what I chose to read. It was about Queen Elizabeth under house arrest, siblings trapped in a haunted house. And We Have Always Lived in the Castle. A World War II diary of a woman maintaining her family's home while being bombarded. And an Iris Murdoch novel about a sinister castle by the sea surrounded by inhospitable marshes. So, there's definitely micro themes in there.
Chris Potts:I'm just looking ahead to see whether, in 2021 and 2022, apocalyptic themes become more prominent, or something that I should worry about.
Maria Antoniak:Not yet anyway. Not yet. I am on a micro theme right now, which was again accidental. It's three novels about older men reflecting back on their lives. The most recent was the Posthumous Memoirs of Brás Cubas, which was wonderful, and I highly recommend it. I hadn't heard of this author before, but it's a really fun and whimsical novel about this man's reflections on his life.
Chris Potts:I'm almost sorry that the Agatha Christies aren't all represented. If I was going to pick one Agatha Christie to read, what should it be? Oh, and my follow-up question is, is it an outlier among the Agatha Christie set?
Maria Antoniak:This is a question I would need to prepare better for. I think the natural starting place with Agatha Christie is always, And Then There Were None. It's the classic, and it is very good and very scary. And if you like it you'll probably like her other novels. And if you don't like it, probably you won't like the other novels.
Chris Potts:Are they all in a hotel on an island?
Maria Antoniak:Yes.
Chris Potts:Off the coast of Wales or something? And one by one... Well, I guess I shouldn't... Yeah, okay. I think I've read that.
Maria Antoniak:Mm-hmm (affirmative). I found it terrifying.
Chris Potts:So it is an outlier, because it's not a quiet town where there happens to be a murder. It's something much spookier. Okay.
Maria Antoniak:Yes, it is an outlier. Mm-hmm (affirmative). I would need to think... If anyone listening wants a curated, Agatha Christie reading list, I'm happy to provide it, later.
Chris Potts:Well, we can link to that one. I think that's a great one. One question that I did give you by email. I'm curious if you have an answer is, a recommendation for our colleagues in NLP that is not science fiction or fantasy. Something they should read that might teach them about the human condition in the present moment or something.
Maria Antoniak:This was a very difficult question. I think if we were allowing sci-fi, it would be an easy answer, in my opinion. I really love Stanisław Lem, his sci-fi work. And also more recently, Jeff VanderMeer. Both of them write about scientific processes, about communication, or breakdown of communication. Measurement and breakdown of measurement, that I think is really good to contemplate, as a scientist.
But, if I can't recommend sci-fi, this is harder. I mean, if you're interested in language, then of course, Italo Calvino and Borges, two authors who play with language, and use language to contemplate language. Of course, you should read both of them.
A lot of my reading isn't directly connected to NLP. It's more about life, and living, and decision making. And in that case, the best advice is to read... Well, I would say read fiction. I know a lot of people don't like fiction, or don't read fiction. I think fiction can be very serious and very important to read. And also try to read books by people different from you. That's the joy of reading.
Chris Potts:The Borges recommendation is perfect. That's almost like the literary world's version of xkcd, where there could be a Borges story for every talk intro that you want to have. Those short stories, those are just wonderful.
Maria Antoniak:Yeah, they're fun. I think most people would enjoy them. And again, especially if you're interested in language, I think you'll enjoy Borges.
Chris Potts:And Stanisław Lem. I mean, that's a wonderful set of recommendations. I don't know that work well. What I'm really looking for, I guess, is something that will not make us feel like we're the most important people in the universe, because our technology brings great gains or catastrophic consequences for the world. But rather, something that puts us in our place, so to speak. There's a lot happening that's not related to science and technology that's relevant to people. You know what I mean?
Maria Antoniak:Yeah. I mean, well, maybe go read War and Peace. I mean, I love War and Peace. And I love Middlemarch. These are two big novels. If you want to be taken out of yourself and contemplate the world and what's happening in the world, how the world moves, how people move in the world, how you can make decisions, of course, go read both of those.
A recent book that definitely shook me up, and that I think everyone should read, is Beloved by Toni Morrison. It's a great novel, and it's a difficult novel to read, or I found it very difficult. There's also an audiobook. I actually listened to the audiobook, and she reads it herself, which is really moving. And she has a beautiful voice. So, everyone should listen to or read that book, definitely.
Chris Potts:Oh wonderful!
And I warned you that I guess I have a weird obsession with Cormac McCarthy. My excuse for that is that he does have a wonderful piece that is advice for us as scientific writers. And the central piece of the advice is essentially to make sure that you're telling a story. Which, when I say that to people, they often think it's a sort of cynical take. But I think this is absolutely just essential to communication – to have a narrative. Have you seen that piece?
Maria Antoniak:I have, but a long time ago. I should go reread it. I remember it being really helpful.
Chris Potts:It is helpful in the sense that what he says is, "Figure out what your story is, and then make sure that everything that you say connects with that story." And when you get in the rhythm of that, it is really clarifying. Because for every sentence you're going to write, you think: is it connected to my story or not? And it just helps you kind of make decisions. I found it really useful.
Maria Antoniak:Do reviewers buy this writing style?
Chris Potts:Well, I think they're... That's the part that could sound cynical to people. Which is, it's just inherently easier for people to understand you, and they'll find it more compelling, if you are telling a story. And that sounds like we're manipulating these poor reviewers. But in fact, I actually just think this is part of what it means to be human. So, we should embrace it.
Maria Antoniak:Clarity is good. And at the end of the day, it's not just good for the reviewer, it's good for you too.
Chris Potts:Exactly. We owe it to our ideas to really think carefully about how we're conveying them. The communication part is essential.
But I brought up Cormac McCarthy here because you read two books by him that I don't really like. I don't know what you think about me not really liking them. But those are No Country for Old Men and Blood Meridian.
Blood Meridian, I've read a few times, because I had a college professor who clearly thought it was a masterpiece. And I guess I'm open-minded about that. But every time I finish it, I think, "Well, that was sort of pointlessly violent, with a lot of very obvious verbal tics embedded into it that I find quite distracting." I don't know. What did you think of these books?
Maria Antoniak:Yeah, I don't have straightforward feelings about Cormac McCarthy either. So, what I always say about myself and Cormac McCarthy is that I can only read one every five years. I've actually read one more, which was before I started my list, which was The Road. So I've read The Road, Blood Meridian, and No Country for Old Men, and his books are dark and violent. And I think at times, or depending on your interpretation, despairing. And that can be overwhelming to me while reading.
That said, recently, I think because I had just finished No Country for Old Men, I was thinking back on his other novels that I've read, and how unpleasant the reading experiences were at the time. But actually in my memories, what stuck with me from those novels isn't the violence necessarily, but actually more of the dramatic tension, the high points of the story, the desert scenery, which is gorgeous. I grew up in the desert, and I love that scenery, and I love his descriptions of that scenery. And then of course this clear moralistic vision against this chaotic world that's spun out of control. That's actually what has stuck with me more than the violence.
So even though it was unpleasant while reading, the things that stuck with me were not those most unpleasant things. In comparison to other writers who write beautifully, and it's very easy to read, but you think back on the novel, and actually you remember the ugliest parts. I don't know why that is. Maybe I need to think more about why that is.
Chris Potts:No, I really appreciate that. And you did like, as a book, No Country for Old Men?
Maria Antoniak:So, I love the movie.
Chris Potts:Me too. The movie's tremendously good. Yeah.
Maria Antoniak:And I saw it before I read the book. And the movie tracks so closely to the book. Unfortunately, because I saw the movie before the book, it was a little hard for me to engage with the book. I knew exactly, not just what event was going to happen next, but often what line was going to be said. So that was just a little unfortunate. I wish I'd read the book before I saw the movie. I felt mostly it was a good thriller. I was on the edge of my seat, even though I knew what was going to happen next, and that's a good sign. That's good writing. But yeah, I preferred Blood Meridian.
Chris Potts:For No Country for Old Men, it's even more puzzling, because not only does it align, but the book is written as a kind of screenplay. It has stage directions and stuff in it, in a way that it... Of course the Coen brothers could just literally use it, because it was almost like it had been produced that way. And I guess I just didn't get that.
Maria Antoniak:Yeah. I don't remember the full background of when he was writing. I have some vague idea that this was intentional. It was written because he wanted it to become a movie. But, we need to fact check that claim. I can't remember exactly.
Blood Meridian is different. Blood Meridian is denser, much denser. These long passages, again describing the deserts or speeches that the characters make. These long battle sequences in the desert. And I would say that book shook me much more than No Country for Old Men did. So I'd say overall, he's not my favorite author, but I'll probably read another of his novels in five years.
Chris Potts:Oh yeah. I mean Blood Meridian, as a story, a founding story of the United States, at that level it's absolutely terrifying, and explains a lot. And has weird connections – I mean, this is not accidental – but connections with No Country for Old Men. Which could be kind of a sequel to Blood Meridian. So I guess now, now you're going to get me to read it again. Maybe this time I'll see what a masterpiece it is.
Maria Antoniak:I wonder if I can reread more frequently. I haven't reread any of them. Maybe I should reread Blood Meridian. One thing I do love about his books is exactly that apocalyptic, mythic, parallel universe of Texas and the South, the Southwest. It's beautiful, and dangerous, and it's weird. And it's something that's really his. And I really enjoy entering that world, even if it's very scary.
Chris Potts:We've talked about movies. There's no list of movies at your website. Are you less into movies than books?
Maria Antoniak:I love movies very much. And I was trying to keep a list for a while, but it's much more difficult. I watch more movies than I read books. And I'm very picky about which books I read, but I'm not as picky about which movies I watch. Or I'm picky in a very different way. There are good bad movies. I'm not sure if there are good bad books.
Chris Potts:What's your take on AI assisted art?
Maria Antoniak:Yes.
Chris Potts:You have this nice blog post that has some output from one of these models. I wasn't sure of the full story actually, but I find it very compelling. I follow a bunch of these artists on Twitter and I just think it's really interesting what they're doing.
Maria Antoniak:Yeah. So I feel like there are two things that I have to say about these models. One is that, of course, all the same troubles and dangers that plague our large language models, which are trained on datasets too large to curate or document, also plague the modern versions of these tools. So, we know that these art and text generation models contain not just stereotypes, but the potential to output incredibly disturbing, violent, and harmful content. And these are important problems we need to think about as researchers. And I appreciate the many researchers who are working to document those issues.
For example, trust in expertise, in the media, and in the government is already fragile. And we're on the precipice here of something pretty alarming, and I don't think any of us know exactly what effects it will have on society. These image generation models open the door to all kinds of horrible abuse, like revenge porn. And we need to be brave enough and care enough to talk about how these tech tools contribute to these kinds of terrible issues around the world.
That said, purely as an artistic tool or game, in my personal use of these AI assisted tools for art generation, it sparks an opposite feeling. It sparks a feeling of real creativity in me. I feel good. In the same way, when I am writing a story, or painting a picture, or playing a beautiful video game, it feels a little bit like that to me. Thinking of the prompts, deciding on styles and artists, and then waving this magic wand to create an imaginary world can be really wonderful.
So yeah, the blog post is showing just an excerpt from a poem that I've always had strong mental imagery associated with, for myself. And it seems so fantastic that I could actually try to generate those images, as someone who's not really an artist myself. And I could do it even in the style of one of my favorite Polish painters, who is no longer here and cannot actually create these paintings. And it blows my mind in a really good way.
Chris Potts:And the points you made about the dangers of these models – very well taken. I think that argues for more access by artists. Because I think artists could be the ones who, first of all, are unafraid to confront that stuff, and might also find a way to get the world to confront it in a productive way, so that we as a society can come to terms with all of those difficult things that you mentioned. And in addition, there's just the fact that it can empower people to do things different from what they anticipated, or things they would otherwise be unable to do. It just seems like a wide open creative space.
Maria Antoniak:Yeah. And I mean, certainly involving people from other disciplines and... I said "other disciplines", but actually people who have studied, for example, art or literature for a long time, where we are just entering into those fields. I definitely think involving them is very important.
Chris Potts:Do you have time for two more questions?
Maria Antoniak:Mm-hmm (affirmative).
Chris Potts:So the first is, you spent a year at the Ukrainian Catholic University teaching English, is that right? That was in 2011? Were you already doing NLP at the time or was that in the future?
Maria Antoniak:No, that was just after I had graduated from undergrad, where I studied the humanities. I took some coding classes on the side. But I had not yet heard the words "natural language processing" or "computational linguistics".
Chris Potts:And what was the intellectual life of the university like at the time?
Maria Antoniak:Yeah, so the place where I was working, for the most part (I was also teaching in some other places around the city of Lviv), was the Ukrainian Catholic University.
I think it might be helpful to know a little bit of the history of that university. It gives you one small glimpse into the history of Ukraine and why things might be happening now, the way that they're happening now.
The Ukrainian Catholic University was originally founded in the 1920s. It's been around a long time, in one way or the other. But it went into exile in Rome after its closure in 1944, when its students were arrested or deported by Russia, and many of its graduates and professors ended up in the Gulag.
It wasn't until 1994 that the university was finally recreated in Lviv. And it's a university that's styled after Western universities, both socially – they have a dorm system – and academically – in the types of credit structures that they give their students. And its mission was and is to educate the future leaders of a free and independent Ukraine.
And the university had a great energy while I was there. Everything from my perspective... Of course, I want to be clear, I didn't grow up in Ukraine. My father is Ukrainian. That's how I ended up interested in Ukraine and wanting to spend time there. But from my perspective, there was a really hopeful atmosphere at the university, and also in the city, and where I traveled around the country, that this university could be a place to learn about Ukraine and protect its culture and preserve it for future generations, while also growing these new leaders.
Ukraine has a beautiful and unique culture. And a lot of people have given a lot for its future, and to preserve its identity, its language, its music, and its history. So to see all of these efforts hurt for no reason is really, really hard to understand, from my perspective.
Chris Potts:The website for the university right now, as you can guess, is totally dominated by the experience of the war. Are you in touch with people there right now? How are they doing?
Maria Antoniak:I've been in touch with some of my former students and coworkers just via social media. For example, one particular story that really was terrible to hear, in the early days of the invasion, of this current invasion... I want to be clear this war has been going on for many years. It has just escalated recently. One of my former students has a young child, maybe I think one or two years old. And they were trying to get across the border to Poland, and she had to get out of the car and walk on foot for, I think she said, 20 hours, by herself, to reach safety. This is one of my students – who, as a teacher, you would want to do everything to protect. And what really struck me was, it's the exact story of my grandmother who carried my father on foot on that exact same route to Poland, to safety, 80 years ago. And it just doesn't make sense, why we keep hurting each other like this. It really doesn't make sense.
Chris Potts:Wow. That's a powerful story.
Maria Antoniak:I want to be really clear that what's happening in Ukraine is about Russian imperialism and a hatred for Ukrainian culture. There's a clear aggressor in this situation and there's a clear victim in this situation. How exactly to help in this situation is quite complicated, but certainly we can all donate money to help the victims, the Ukrainian victims in this situation.
Chris Potts:Excellent. We could arrange to add some links to the show notes if you want, just to make those connections easier for people. And that's great.
Maria Antoniak:That would be wonderful.
Chris Potts:A final question, that's maybe looking more toward a brighter future. I really liked your blog post that was kind of oriented toward, I think, people who are going to be applying for jobs, maybe a little outside of the field they were trained in. It's called "Data Science Crash Course: Interview Prep". It's a wonderful list of links. I infer from it that, if I go into this job market, people will definitely ask me about PCA. Do you have advice in connection with this blog post, for people who are trying to apply for these jobs a little outside of their comfort zone?
Maria Antoniak:Yeah, so I actually wrote that post for myself. I mentioned earlier that I really like productivity tools. I've used tools like Evernote or Notion. This is one practical recommendation: gather resources as you find them. So if you find yourself searching for things online, like "what is PCA?", multiple times, maybe save that to one of these tools. And over time you'll build up your own little textbook that you can then study from. So that's what I did. And that's where most of those links come from: building myself this textbook when I was studying for internship interviews in the first couple of years of my PhD.
In terms of advice for students, or people who are maybe interested in getting into natural language processing, or machine learning, or data science from outside: well first, please don't be intimidated. Nobody owns this field. And if anyone makes you feel intimidated, they certainly do not own this field. You're welcome here. And we need lots of people, and we need people with different perspectives, and from different backgrounds, all contributing together if we want to create a future that's going to be good for all of us. Science is for everyone, and computer science is for everyone, if you want to do it.
Some more practical advice: definitely build up your technical skills and don't be afraid to do that. If you're coming from another discipline, maybe if you haven't worked in computer science before or in a long time, it can be... I find sometimes I will self-limit. And when you notice yourself doing that, push a little harder in the direction where you see yourself limiting yourself. But also, try and get some hands-on experience at internships. That's a great way, if you can, to dig in and begin to work on something.
Apprentice yourself to people around you and see how they do their work. What kind of visualizations do they make? What methods? I think PCA is one that some people will just rely on. Like that's always going to be their first visualization of their data. So, check what they're doing, and copy and adopt things from other people who are ahead of you. Put code up on GitHub, create tools for other people – it's a great way to get feedback. But also continue to invest deeply in your other disciplines, if you have other disciplines. Again, we need those other types of expertise as well.
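(For listeners who haven't used PCA as a quick first look at their data, here is a minimal sketch of that kind of first-pass visualization. It assumes scikit-learn and matplotlib are installed, and the iris dataset is only a stand-in for whatever feature matrix you are actually working with.)

```python
# A minimal sketch of a first-pass PCA visualization: project a feature
# matrix down to two dimensions and scatter-plot the result.
# The iris data here is only a placeholder; any (n_samples, n_features)
# array would work the same way.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

X, y = load_iris(return_X_y=True)

# Fit PCA and keep the top two principal components.
coords = PCA(n_components=2).fit_transform(X)

plt.scatter(coords[:, 0], coords[:, 1], c=y, s=15)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.title("First-pass PCA view of the data")
plt.show()
```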
Chris Potts:That's wonderful. Thank you so much, Maria, for doing this. I really enjoyed this conversation.
Maria Antoniak:Thank you for having me. It's been really fun.