Podcast episode: Sam Bowman

February 22, 2023

With Chris Potts

Lessons learned about benchmarking, adversarial testing, the dangers of over- and under-claiming, and AI alignment.

Show notes

Transcript

Chris Potts:Hello. All right. Welcome to the podcast, everyone. Our special guest today is Sam Bowman. I'm absolutely delighted to have Sam on the podcast at last.

Sam is, I'm proud to say, a PhD alum of the Stanford Linguistics Department and the Stanford NLP Group. He was advised by Chris Manning and me, and his thesis work truly opened the doors to building performant natural language inference systems with deep learning.

Sam got his PhD in 2016 and he went on directly to the faculty at NYU, where he's been ever since. He's currently Associate Professor of Data Science, Linguistics, and Computer Science there, and he's on sabbatical now at Anthropic.

By this point, Sam has worked on tons of topics in NLP, and he's also hard at work now on topics and meta-topics for all of AI, including alignment, benchmarking, what we count as progress, and other really big themes.

So, Sam, welcome to the podcast. I thought we might start by reminiscing a little bit. I often use the Stanford Natural Language Inference dataset, SNLI, as an inspiring example of students doing things that frankly seemed too ambitious to me when they were proposed, but then those students go on to achieve things well beyond my, or really anyone's, expectations. So what do you recall about the origins of the SNLI project, and what are your feelings about how it went?

Sam Bowman:Yeah! First, yeah, thanks for the extremely kind intro, and I'm very glad to be chatting again.

All right. Where do I start? This is a long, long backstory. So I think maybe I'll start a little ways into my having gotten interested in NLP and neural networks. We were wrestling with this question of how to evaluate understanding in ungrounded classification models. If you have a model that seems to score reasonably highly on some existing sentiment analysis benchmark or sentence similarity classification benchmark, how can you get a better handle on how much it's actually doing semantics and pragmatics – how much it's doing the work of understanding – without grounded semantics, without having to wrestle with getting some kind of multimodal input into the model, which, especially with the models we were working with at the time, didn't seem super practical.

NLI, or Recognizing Textual Entailment, seemed like the obvious place to start here. It had a history both in NLP and certain pockets of semantics as a coherent framework for thinking about meaning that didn't require grounding. So yeah, we got interested in trying this out. There was a little bit of data out there, but pretty early on in our experiments, we were realizing the data was just really quite far from what we were hoping for. There was one really, really high quality expert curated data set. It was actually more than a decade old by that point. It was just exactly what we wanted, and it was, I think, a hundred examples.

Chris Potts:Was that FraCaS?

Sam Bowman:Yes, yes, yes. At this point, we were training models from scratch. Pre-training of whole neural networks hadn't really shown up yet in NLP. And you can't train anything interesting with that much data. There was SICK. This was another effort that was explicitly designed around neural networks. It seemed a little bit closer to what we were interested in. We had some minor qualms about how it was put together, but it seemed promising. But that was about 5,000 training examples, and we played with it and didn't get it to work particularly well.

Especially discouragingly, the creators of that data set ran a big competition, where they were challenging people to use neural networks to solve that task. I think there were at least a dozen submissions to this competition, and as I remember it, none of the top 10 or so actually used neural networks. No one was able to get this to really work. So it seemed like we needed something, minimally, pretty big, hopefully also good, but minimally pretty big, to get off the ground with this whole line of research.

Chris Potts:Yeah. There were also, at the time, the RTE data sets, on the order of 1,200 examples each across a few rounds. So maybe the best case scenario is like 4,000 examples across all the data releases, right?

Sam Bowman:Yes, that's right. That's right. Yeah. So we had some data we could evaluate on. We had a little bit of training data, but it just seemed like the big problem was not having usable training data at the scale we needed.

Chris Potts:You had some faith, I guess, or you at least wanted to test this idea, that these deep learning models that people had started to really invest in could be best-in-class for the problem and really teach us something or ... is that the idea?

Sam Bowman:Yeah, I think so. So without going into too much of a sidetrack, I started at Stanford as a phonologist, largely, and what got me to really swing over into NLP were some of these early results from Richard Socher, you, Chris Manning, Andrew Ng, showing these tree-structured neural networks were starting to learn to do some interesting things with language. This just felt like this hugely important result.

Chris Potts:This is nominally connected to the class, CS224U. Didn't you tell me that it was actually in that class with Richard guest lecturing that you had your light-bulb moment with all this?

Sam Bowman:Yeah. Yeah. I'd been hearing mumblings of some of these things, but yeah, Richard's guest lecture and just some of these early results. I remember that moment of like, "Wow, okay, this is the first time I've seen a scientific result where I think it's something I have something to say about. It's in a field that I'm kind of following, and I have some idea of how I do research here." It felt like the open follow-up questions building on this were really important and fascinating, and there was a lot to explore there. So that was a big part of switching over.

Chris Potts:So it's like you'd seen Richard doing things that kind of were around sentiment. A lot of that was sentiment, and you had a semanticist in you. You wanted these worlds to come together, and then NLI was the good way to do that at scale with no grounding?

Sam Bowman:Yeah, exactly.

Chris Potts:So we're building toward your need for SNLI. You discovered that you had no data to test this idea about deep learning.

Sam Bowman:Yes, yes, yes, yes. At this point, we were already doing some experiments. We were doing some work on synthetic data, trying to get some sense of what kinds of primitives of reasoning models could learn, but pretty quickly felt like we weren't really convincing anyone but ourselves of anything interesting at all. It's very hard to tell anyone else in an applied field a story using completely synthetic data and environments, so that felt like it wasn't really going to let us answer the questions we wanted to answer.

Yeah, we were struggling with this existing data. So yeah, I don't think the original idea of building a data set was mine. I think either you or Chris Manning sort of planted the seed. There was some off-hand comment in an advising meeting: "Hey, if there really isn't any data, maybe this isn't what you signed up for, but you could just go build the data set." I don't think I took it too seriously at the moment, but over a couple of months I started realizing, "All right, I'm still convinced this research is important. I'm still convinced that the data we have isn't going to let us do it. I guess we have to actually do it."

Chris Potts:I'm just trying to reconstruct this moment. So what could have given Chris or me the confidence to say that? I had been involved with large crowdsourcing projects before then, but they were all data labeling. At the time, there was tons of skepticism that Turkers could do complex things, and that would certainly include creating a sentence – creating a sentence that contradicted some premise. Even the description of it sounds like people would say, "That's impossible." Had any of the three of us done something that would suggest otherwise?

Sam Bowman:I can't think of anything. I don't think that initial comment, that initial conversation, pointed to how we were going to do it. I think it was just the seed idea – that very straightforward comment: "Oh, if this is the obvious bottleneck in the research, go work on that and see if you can get some purchase."

Chris Potts:You know what? There were phases, both before and after, where we thought about trying to curate actual examples. So no one would write anything. It would be labeling. I wonder if that's kind of what we thought – that we could somehow find, naturalistically, entailment pairs, contradiction pairs. Do you think that's what we had in mind? Then we saw that that was going to be hard to work out, and then you took a leap.

Sam Bowman:Yeah, that sounds right. I don't think I have my day-to-day notes from that point in the process, but yeah, that sounds right.

But I can't reconstruct exactly how we got from this initial idea to the actual data collection approach we wound up trying. At some point we ended up with the strategy that I think I did play a decent role in designing, of showing crowdworkers a sentence and asking them to write additional sentences: one that would be entailed by the original, one that would be contradicted by the original, and one that would be neither.

I think before we even piloted this, we just, through strategizing about it, converged on this idea of doing it with concrete visual scenes. So we landed on this idea that, all right, the particular version of contradiction – of what counts as contradiction – is just very confusing to explain in NLI. It doesn't actually correspond to what the word "contradiction" means in everyday language. We don't want to have to give these crowdworkers a semantics course to have them label data. But if you think about these relations in terms of photos, there's an easier way to explain this. You can say, "Oh, two sentences contradict each other if they can't plausibly both be captions of the same photo." So we ended up with this odd data collection format where we were telling people this is a task about images, where you're going to be reasoning about images, but we never actually showed them any images.
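To make the labeling scheme concrete, here is a small illustration in Python of the kind of item this process produces. The premise and the three worker-written hypotheses below are hypothetical examples written for this transcript, not sentences drawn from the actual dataset.

```python
# A hypothetical SNLI-style item: one premise (imagined as a photo caption)
# and three worker-written hypotheses, one per label.
premise = "A man in a red shirt is throwing a frisbee to a dog in a park."

hypotheses = {
    "entailment":    "A person is playing with a dog outdoors.",       # must be true of any photo the premise describes
    "contradiction": "A man is sleeping alone inside his apartment.",  # cannot be true of the same photo
    "neutral":       "The dog belongs to the man in the red shirt.",   # could be true, but the caption doesn't say
}

for label, hypothesis in hypotheses.items():
    print(f"{label:13s} | {hypothesis}")
```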

Chris Potts:Did we ever show these images? We must have considered that, that they would see the image, or no?

Sam Bowman:I think we threw it out pretty early on, on the grounds that the image might skew people's interpretations in ways that the neural network couldn't reconstruct. There'd be some common ground or background that the crowdworkers would be assuming that the model wouldn't have access to, and it would be at some unfair disadvantage.

Chris Potts:Okay, question, though – do you remember: what number did you have in mind as the size of this data set we would produce?

Sam Bowman:I think the original conversations were in the neighborhood of somewhere between 10,000 and 50,000 examples. I think we were really just trying to inch past what was already out there and see if it worked, and then, if that worked, maybe go back and look for funding and make it a little bit bigger if it seemed like that would be valuable. But start with something relatively small and explore.

I don't think that's how the timeline actually wound up working out. My recollection is we did some pretty small pilot experiments, sort of well short of that. Then you'd gotten this lead – someone at Google, I think connected to Ray Kurzweil, had sort of thrown out a comment of, "Oh, if you need money for data for computational semantics, just tell us." So we hastily put together this document, sketching out the plan and showing a few pilot examples, and pretty quickly got enough funding to scale up to what we ultimately did, which is about 500,000 sentence pairs.

Chris Potts:That was a life lesson for me because I think we picked an amount that seemed reasonable and they instantly approved it, which instantly made me think we should have asked for much more.

And then you had an odyssey of being a crowdsourcing manager, for weeks on end.

Sam Bowman:Yeah, it was a pretty short project, at least for how big it was. And spending real amounts of money on research was something I hadn't really done to that point. Probably something like eight or ten weeks to collect almost the entire data set after that first little pilot portion. There was a little bit of experimenting as I went, but it was mostly just I'd sketched out these simple instructions, a simple interface for crowdworkers, and posted that, and was just sort of every few days, or maybe every day, downloading the data, skimming through it, seeing if there were any annotators we wanted to ban, if anyone was just submitting empty strings or something like that, re-posting that data to get re-annotated, fielding a lot of questions from crowdworkers, and posting little clarifications.

I think the day-to-day work for a few months was just kind of monitoring this process and managing people, but then pretty quickly we had a big messy Google Drive folder of CSVs from these various crowdsourcing jobs – from collecting the data and from a validation relabeling of the data – and we were ready to stitch that into what wound up getting released.

Chris Potts:So we did our paper and we won an award. Congratulations to us, which was cool, and actually I think that was important. We were the first to win that EMNLP Best Dataset award, which I feel has really had a positive impact in terms of signaling that the field values dataset contributions. SQuAD was the next year.

Sam Bowman:That sounds right. Yeah.

Chris Potts:It's nice that they're both Stanford, but also really just showed: do a data set and you could have a big impact. I think that's been so great.

Sam Bowman:Yeah.

Chris Potts:But for the paper, we did a good job of being antagonists for this deep learning idea, and I remember us fitting really ambitious, sparse linear models, and having them be really good. That's just a reminder that, even at that time, deep learning wasn't a slam dunk, even with that much data. In fact, with that much data, this kind of cross-product feature-based model, where you take the cross product of all the unigrams in the premise and hypothesis, was really good at the scale that we had created with SNLI. What do you remember about those moments? You must have still had faith deep learning was the right thing, right?
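For readers who want to see what a cross-product unigram baseline looks like, here is a minimal sketch, assuming scikit-learn. It illustrates the general idea only; the feature function and the toy training data are placeholders, not the actual feature set or results from the SNLI paper (which also used unigram, bigram, and overlap features, among others).

```python
# Minimal sketch of a cross-unigram NLI baseline, assuming scikit-learn.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def cross_unigram_features(premise: str, hypothesis: str) -> dict:
    """One binary feature per (premise word, hypothesis word) pair."""
    feats = {}
    for p in premise.lower().split():
        for h in hypothesis.lower().split():
            feats[f"{p}__{h}"] = 1.0
    return feats

# Tiny hypothetical training set, just to make the sketch runnable.
train = [
    ("A dog runs on the beach", "An animal is outdoors", "entailment"),
    ("A dog runs on the beach", "A cat sleeps indoors", "contradiction"),
    ("A dog runs on the beach", "The dog is chasing a ball", "neutral"),
]

X = [cross_unigram_features(p, h) for p, h, _ in train]
y = [label for _, _, label in train]

# A sparse linear model over the cross-product features.
model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X, y)
print(model.predict([cross_unigram_features("A dog runs on the beach",
                                             "An animal is outside")]))
```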

Sam Bowman:Yeah. I don't remember having a really strong bet either way. I think I was just really excited to get to play with these systems, to get to see what they would do, to start trying to push on them, and run new experiments and develop new models. I don't think I had a ton of confidence that what we already had was going to really do well there.

To give credit, because I had just switched into NLP, I hadn't yet learned a lot of the conventional wisdom about building models without neural networks and so Gabor Angeli jumped in and spearheaded that piece of the project.

But yeah, they came out about tied, and that didn't last very long. I think within a few months, we were starting to see cleverer and cleverer neural network architectures pop up, and eventually cleverer and cleverer pre-training methods, that were able to get us into the neighborhood of our guess of human accuracy on that.

Chris Potts:Yeah, that's such a wonderful thing to see because that's what you hope from one of these data sets – that period of rapid innovation where the scores are leaping up by huge amounts. Of course it levels off, but you can see that something was unlocked by having data at this scale.

So then you went directly to NYU and almost immediately did the MultiNLI data set. By that point you were expert at these things, so I presume you were able to knock that out in a couple of weeks? Or what was that like as a project?

Sam Bowman:Yeah, yeah. I'm trying to reconstruct in my head exactly where I was in the end of 2016. Our motivation there was that we'd seen these sort of big leaps of progress on SNLI. I don't think its performance was maxed out, but it was quite high. We'd also seen how impressive of a score you could get with a system that didn't actually feel that impressive as a language tool more broadly. We were getting the sense that having a data set that was so big, but so homogenous, was really worrying. That was, I think, the big fear at the time – that almost all of these 500,000 sentence pairs were dealing with people playing with kids or dogs on a beach or in a park. It was just all these kind of personal everyday photo scenes. So the models were presumably able to, by looking at the train data, just memorize a huge range of specific facts about, okay, if someone talks about Frisbee and dog and then outdoors in the next sentence, then it's got to be an entailment. And all of these heuristics added up to working pretty well.

So we're hoping to at least push further on that. At least build something that would be a bit more demanding, and in particular that could measure model capabilities, generalizing beyond the immediate domain of the training data.

First of all, I guess what we didn't do differently: we collected Multi-Genre NLI, and we ran basically the same process. We didn't use Mechanical Turk, but we used a short-lived competitor that was almost identical. We followed very similar instructions. We kind of just rebranded the playbook. To give credit, there was a lot of fiddly work involved. Adina Williams led a lot of things. But the overall shape of the project was quite similar. But the big thing we did differently is we collected the data in 10 tranches, 10 pieces, that were each drawing the source text from a different setting.

So we had one chunk of the data where all of the sentences that these examples were seeded with were from news articles, another one where they were from transcripts of phone calls from the Switchboard dataset – we just sort of picked 10 of these that we thought reasonably broadly covered modern American English, and collected large amounts of five of them and much smaller amounts of the other five, with the hope that a system would train on only the first five but then test on all 10. So we'd get some picture of the extent to which it's learned English, and the extent to which it's just learned the quirks of Switchboard phone calls.
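As a small illustration of that design, here is a sketch that loads MultiNLI through the Hugging Face datasets library (assuming its multi_nli configuration, which exposes a genre field and separate matched and mismatched validation splits) and checks which genres appear where. This is just an inspection helper, not anything from the original project.

```python
# Minimal sketch, assuming the Hugging Face `datasets` library and its
# `multi_nli` configuration. The training set covers five genres; the
# "matched" dev set draws from those same five, while "mismatched" draws
# from five genres never seen in training.
from collections import Counter
from datasets import load_dataset

mnli = load_dataset("multi_nli")

print(Counter(mnli["train"]["genre"]))                  # the five training genres
print(Counter(mnli["validation_matched"]["genre"]))     # same five genres
print(Counter(mnli["validation_mismatched"]["genre"]))  # five held-out genres
```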

Chris Potts:It's a really cool thing that you did. What you're describing is the matched and mismatched conditions, and the mismatched one is really fascinating because of this signaling of awareness that our systems can appear to be really good, but only because of incidental stuff about how they were trained and the very charitable thing we do of testing them on similar data. You might have mixed feelings about this, but I would claim that this is part of the movement that led to lots of adversarial testing. Do you remember: was it really just the suspicion about systems that led you to create the mismatched condition?

Sam Bowman:Yeah, that's what I recall. I can't pin down exactly the full spectrum of our concerns at that point, but what actually really surprised me – what I thought was the big story of that particular move in designing the data set – is that it didn't wind up actually surfacing anything that interesting. I still don't really know why this is. When we tested systems on that dataset, quite consistently, even with systems that didn't have very much pre-training, that were primarily just trained on those five tranches of data, they would still generalize to the out-of-domain sections of the test set just about as well as to the in-domain ones. So it must have just been that the ways in which the dataset was difficult, the kinds of heuristics the dataset admitted, just wound up working pretty well with the kinds of systems people were testing at the time.

Chris Potts:Two more questions about NLI. First one is just about this, which is: you made the decision for MultiNLI to keep the test set private and for SNLI to distribute it. Do you have feelings now about what's better?

Sam Bowman:I still go back and forth on this, and I think it depends on how exactly you're framing a dataset release. So for SNLI, I think we were really thinking of it as a resource for researchers. We weren't thinking of it as a competition or a game or anything like that, so we just figured a few other people out there might want to try testing neural networks out on this, and we want to enable them to run their own little experiments on this and make their own conclusions.

That was part of how it wound up getting used. But it did also get used in this more competitive, adversarial frame, where people were writing papers where the takeaway was: my method gets a better headline number on SNLI than your method does. This style of paper is sort of famously unsatisfying and hard to draw conclusions from. Some of them incidentally showed some interesting things, but some of them I think were really very, very specific to that exact data set, and in general they created some suspicion about whether we really learned anything that novel or generalizable.

By the time we were building MultiNLI, SNLI had really taken off and sort of gained the status of a competition, and we wanted to make MultiNLI more amenable to playing that role. Part of that was rate-limiting access to the test set. Anyone's attempt to evaluate on the test set was public, and they were limited to running a certain number of evaluations – I think it was something like two a day – in the hopes that if someone was trying to hack the test set by running a huge number of systems on it, there'd be a public record of that. It would be noticed. You couldn't too aggressively exploit it.

Once again, I actually have mixed feelings about how that worked out in this particular instance, because I don't think we saw that much evidence of people doing their research practices differently on SNLI versus MultiNLI, or that much evidence that this technique actually stopped any suspicious behavior or any over-optimization on the test set. In retrospect, I think it was just such a big test set – I think we were testing on 20,000 examples – that it would've been pretty hard to stumble into good performance by accident, or without actually building a system that works. To the extent that we did see research results reported on that data set that weren't interesting or weren't reproducible, I think it wound up being for other reasons. I think there might be some cases where you're releasing a data set, there's a competition behind it, there's a lot of interest in the competition, and maybe it's small or you expect it to be brittle to really aggressive optimization pressure from people trying to do well. But I think, for MultiNLI, that just didn't wind up being the problem that we hit.

Chris Potts:It led to another behavior, which is for MultiNLI, what people do is just put in the dev results in their paper because they don't want to bother with the test set.

Sam Bowman:Yes. Yes. Yes.

Chris Potts:Then you hope against hope that they have some other dev data from the train set that they're using in a way. The human behavior is an important element to me.

I would say that independently of this issue, if you just think about the history of these two data sets, a very clear insight emerges, I think especially from the SNLI one, because you kept this leaderboard up to date so valiantly for so long. Maybe you still do it. You could see that our original idea about having premise and hypothesis encoded as separate representations fell away, and the best models – strictly dominating these sentence-encoding models – were the ones that just kind of fused the two together.

Then you can see that people started adding attention mechanisms across the two, and those models strictly dominated. The very final stage of this, as far as I can tell, is ensembling, and maybe that's just a lesson about machine learning in general, but the attention thing is no joke. The papers that originally motivated attention, in many contexts outside of machine translation especially, were on SNLI, and I claim that that's a big part of why "attention is all you need" in 2023 or whatever.

Sam Bowman:Yeah, I was happy to see people really developed a lot of the ideas around attention in this context. I know the first examples of this were on translation, but yeah, there was some very cool work there.

Chris Potts:Oh, yes.

Sam Bowman:I think one of the only really generalizable lessons – if I'm talking to another researcher who's just getting into this idea of building benchmark data sets, one of the only generalizable lessons I feel like I learned from this process that I'll confidently tell people – is: as much as possible, avoid building a dataset in a way that makes assumptions about what system is going to be used to solve it. I think we came very close to doing this in how we presented SNLI and how we put this leaderboard up on the dataset website. We assumed that people would be using it as a way to test vector representations for sentences, and it turned out to be the case that, if you want to solve lots of tasks in NLP, you don't actually have any part of your network that is a single vector representing the whole sentence.

Fortunately, we'd sort of left in an out – we would still record leaderboard entries that didn't do this. We actually came close to doing this again with the GLUE benchmark later on, where there was a big push to, again, focus on a pretty narrow paradigm – I think at this point it was multitask learning-from-scratch approaches – and at the last minute we changed our mind and decided to allow any technique that could produce a number on the test set. I think that, again, was a big part of what allowed these data sets to actually become sort of accepted benchmarks for what turned out to be an interesting direction in the technology.

Chris Potts:Cool. I want to pick up the GLUE thread. I think that's really important. But final question: what's your view at the current moment about NLI as a task?

Sam Bowman:I'm surprised I haven't been asked this question before, at least that I can think of. I think if I were a bit more of a troll than I am, I would say that it's solved. I don't think that's quite fair.

Chris Potts:Here, let me just offer you one perspective, which is that if you think about formal semantics, this is the most natural thing in the world, because at some level, formal semantics is entirely about context-dependent entailment – and contradiction, but these are all unified notions. And so it's extremely natural to operationalize a lot of that stuff, down to the lexical level, but in terms of full sentence meaning in context too – operationalize it as a task like NLI. I think that's actually why FraCaS exists – because that was all people like Robin Cooper, formal semanticists who wanted to touch computers and do computational semantics, and this is the task they formulated.

But you could also regard that as kind of dogmatic. Yeah, that's the way formal semantics works, but that's not the way human semantics works. Human semantics is more about what's a good answer to a question and stuff like that. NLI is actually a really unnatural task from that perspective. So setting aside whether it's solved or not, that's why I go back and forth on this, but fundamentally I'm really a supporter of this idea.

Sam Bowman:Yeah, no, I think I agree. I agree that NLI is a useful lens on semantics broadly, both sentence-level semantics and just semantics as a whole, and that to the extent that almost all of the things that we're asking contemporary large language models to do are in some significant sense semantic tasks, then it's clearly not solved. There's clearly important headroom there. I think I would say though that NLI sort of feels most intuitive, or fits most neatly into the literature, and into standard practices in these fields, as a way of thinking about sentence-level compositional semantics, as a way of thinking about: is a model adequately keeping track of things like scope.

My read is that's not perfectly solved, but that the biggest, best available systems we have now, the kinds of things that are behind ChatGPT, even though it's not solved, that's sort of not where the central research questions are, at least on the applied side around what these systems are doing when they understand language. I think enough of the building blocks for NLI and the paradigm cases have really pretty robustly emerged in these systems that I think the applied questions are elsewhere. I think there's cool science to do looking at how these things emerge, why they emerge, where they still fail, but I think I'm okay with the fact that it's kind of no longer in quite a central role in how we evaluate the overall semantic competence of these systems.

Chris Potts:That's so interesting. You and I went back and forth a little bit about this on Twitter in the fall where you said that language models in context are good at reasoning about negation – a classic scope taking puzzle – and I said they weren't. We had slightly different things in mind. So you had in mind very naturalistic prompts that are kind of like little stories that the model needs to continue, whereas I was asking it to reason like a semanticist would. What we figured out, I think together, is that we were both right in the sense that these models really struggle with my formulation and are good at yours.

Now, that is certainly eye-opening, and relevant to the mixed feelings I have in general, because it's like, okay, good, it could do your kind of prompt, but you also couldn't trust this thing to read a legal contract because it could not possibly deduce the consequences of the clauses in this contract. So the idea that you'd get legal advice from it, hopeless. It's a disastrously bad reasoner. But if you give it naturalistic discourse, then it looks good.

Sam Bowman:Yeah. This is what's so fun about doing research on these systems, this dynamic. There are all of these capabilities in these systems that are often reasonably robust. I think it's the case that if you set up a naturalistic story that really looks like the kind of text that these models are deeply familiar with, you can ask these models to solve a genuinely novel question where they really do have to work through the reasoning and they'll do it right. But if you give it an exam that's really meant to stress test these capabilities, they fall apart.

So obviously if you want to try to use these systems for things like legal reasoning, it's this huge puzzle of how do you figure out how much of the problem you care about is in the model's comfort zone and how do you expand that comfort zone? As a researcher, it just really feels like this kind of treasure hunt of trying to take the system that has all of these abilities that we don't know about and see if you can spot signs of how they can be elicited.

Chris Potts:Nicely said.

Let's return to GLUE, because this is another important moment in the field that you helped create. So I'm not sure about you, but I and many of my colleagues, when we saw GLUE, thought, this is probably going to be too hard. I think you might even allude to this in the paper – a kind of understated concern that you might have set a benchmark that was going to be too difficult – and yet it nonetheless proved to be a North Star and progress was rapid. Then you released SuperGLUE, same instincts, and I guess progress was even more rapid. What do you make of all of this in retrospect?

Sam Bowman:Yeah, with both of these projects, we were very much in the right place at the right time. I think, in both of these, I was part of some of the very early conversations – in one case at Stanford, in the other, I remember a workshop at EMNLP being sort of formative in planning out the GLUE project. I'm not even sure what this came out of. I think it might have been an informal argument after the end of a panel or something like this. But yeah, in both cases, I was in these very early conversations about what important technical questions might be available to be asked if we figured out the evaluation. In both cases, I think there weren't that many people who actually wanted to put in the sort of messy work of figuring out the evaluation. We were just able to take that opportunity, set something up, and play a selling-shovels-to-gold-miners role in two big waves of NLP research. I'll admit that neither of these was the only effort in its space. Especially with GLUE, I feel bad that it came out at around the same time as this benchmark DecaNLP, which posed an even slightly harder challenge.

Chris Potts:Right.

Sam Bowman:This was from Salesforce Research, and no one picked it up. I think there were a couple of early efforts. It just looked really hard, and I think GLUE was a little bit easier. It was also getting a lot of attention, and it wound up just sort of being the default thing that a lot of people evaluated on for a few years. I think DecaNLP actually did have a very prescient idea behind it, which was this idea of tasks as natural language objects, this idea of the prompt being how you represent a task. I don't think they fully anticipated the way that we think about this now, with things like zero-shot learning and models like GPT-3 or FLAN, but I think they did sort of see a lot of that promise in a way that I completely didn't, and unfortunately they struck too early.

Chris Potts:But that's interesting. You tried in your own way. So you snuck into GLUE some tasks that were meant to be kind of too hard. So there's Winograd challenges in there and there's even a grammaticality section, which is a nice coup for linguists. So you were trying to make it hard and diverse, but it was something people could make rapid progress on.

Sam Bowman:Yeah, yeah. I think our hope with it was that it would be something that had a long gradual slope – that there was room to make progress across a pretty wide range of degrees of competence of the systems being studied, and a pretty wide range of different technical questions around those systems that could all lead to making the number go up. All of that work wound up happening in a pretty short period of time. But I think that did largely pan out.

There was a lot of work that originally ignored the Winograd schemas dataset. There was even a first generation of papers, including the BERT paper, that said GLUE was an eight-task benchmark when in fact there were nine tasks, and just had a footnote saying, "We ignore the Winograd schemas because they're impossible." Then not too long after – I forget if this was starting with RoBERTa or T5 – it turned out that just sort of better trained versions of a lot of these same systems were able to start hitting these really hard tasks.

Chris Potts:I have to ask: have you ever had a paper rejected on the grounds that you didn't get SOTA on GLUE? Because many of us have. I wonder if you have mixed feelings about this.

Sam Bowman:I think I might have. I'm trying to pin it down, but I can think of a couple of paper rejections that did get very close to this.

Shortly after we finished the GLUE project, I did one of these JSALT summer workshops. This was the program that Johns Hopkins organizes, sometimes actually on campus, sometimes elsewhere, but I did the on-campus version, where they brought me and about a dozen other people out to sit in the same room in their Computer Science building for six weeks working on a topic. Ellie Pavlick and Tal Linzen were co-leading this. Ian Tenney from Google helped lead. It was a great, fun team.

What we were trying to do was just come up with multi-task learning setups that would allow us to make progress on GLUE. I think the paper that first came out of that wound up getting rejected from the first place we submitted it. I don't remember quite what the concerns were. I'm not sure that they were simply that we didn't get the state of the art, but that's what came to mind when you mentioned that experience.

Chris Potts:That's the event that gave rise to probing, right?

Sam Bowman:Yes, yes, yes. This was another case of me having not particularly prescient judgment. Ian Tenney and Dipanjan Das, Ian's supervisor, who wasn't around in person but participated a bit, came to me and Ellie with this idea of: what if we explore this kind of generalized task of training a smaller network off of the intermediate states of a larger neural network as a way of understanding what the large neural network is doing. I think my reaction was, "Oh, there could be some fun science there, but that's not really relevant to the question we're getting at, whatever. Do this on your own time."

I think Ellie was significantly more encouraging and saw that this could be pretty important. To give credit, I think the very original idea of probing, at least in the NLP literature, is probably due to Yoav Goldberg's group. I think Yossi Adi was the first author. But I think this paper by Ian Tenney really, really showed the full scope of what was possible and really did the first, I think, comprehensive effort at explaining all of the linguistic work that's going on inside a large neural network. I thought that paper was extremely exciting and was very proud to have been something like 11th author out of 13 on it.
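Here is a minimal sketch of the probing recipe being described – freeze a pretrained network, read off an intermediate layer, and fit a small classifier on top. It assumes the Hugging Face transformers library and scikit-learn; the negation-detection task, the layer index, and the mean-pooling step are illustrative choices, not the setup used in the papers mentioned above.

```python
# Minimal probing sketch: freeze a pretrained encoder, extract a hidden
# layer, and fit a small linear "probe" on top of it.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

# Toy labeled sentences (hypothetical): does the sentence contain negation?
sentences = ["The cat is on the mat.", "The cat is not on the mat.",
             "She finished the report.", "She never finished the report."]
labels = [0, 1, 0, 1]

features = []
with torch.no_grad():
    for s in sentences:
        inputs = tokenizer(s, return_tensors="pt")
        outputs = model(**inputs)
        layer = outputs.hidden_states[8]  # an arbitrary middle layer
        features.append(layer.mean(dim=1).squeeze(0).numpy())  # mean-pool over tokens

probe = LogisticRegression(max_iter=1000).fit(features, labels)
print(probe.score(features, labels))  # real probing would evaluate on held-out data
```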

Chris Potts:Well, for these moments when you're skeptical and then regret it, you can always tell yourself the smart advisor thing to tell yourself – sometimes a skeptical reaction can enliven people's imaginations and make them passionate and lead them to innovation. So you're still playing a positive role, even as you say, "Really?"

Sam Bowman:Yep, yep. I certainly hope so. This is still a common experience for me.

Chris Potts:One more question about datasets, since you have been so involved in this. What do you feel we learned about dataset creation as a field over the last... Well, since SNLI, say, but you could go back further.

Sam Bowman:Hmm. I could take this in a few directions. I think one of the biggest points in the background – and this was not so much something that we learned by building these data sets as something the field has learned – is that, as pre-training works better and better, and as the state of practice in NLP focuses more and more on building a single general-purpose system, the need to collect training data sets as part of benchmarks is becoming less and less substantial.

With SNLI, the whole point of it was that we needed half a million examples in order to get any handle on whether neural networks could show this behavior. Now with BIG-bench, which is in some sense a spiritual descendant, or is playing that role, most of the tasks have zero training examples.

I think that's been a really clear, interesting trend. Paralleling that, if you're moving from creating a dataset that is very large and that is targeting a technology that's pretty immature to creating a much smaller dataset without a big training set that's targeting a technology that may be a little bit more mature, the quality versus quantity trade-off in how you do data collection really changes. A lot of my work with data collection has really followed this progression, where first for SNLI, and more or less for MNLI, we used Mechanical Turk, hiring people over the Internet without much of an effort at qualifying them or deciding if they were the annotators we wanted to work with.

Later, I worked a fair bit on Upwork, which is still crowdwork, but it's crowdwork where you're building individual relationships with annotators. People might submit resumes or work samples. You're really working with people as individual professionals. We subsequently moved on to Surge, which I think is similar, but the platform then does even more work at pre-screening the annotators and matching you with people who are really going to be a good fit for you. So it's even more oriented toward building these close working relationships.

Then finally, I did one data set project recently where basically all of the annotation work was done by late-stage PhD students.

Chris Potts:I agree, and it's wonderful for the field that now you can invest in test sets, dev sets, and not spend tons of time on train sets. One change in my behavior is that, in your era at Stanford, for my NLU course, I would discourage students who said they wanted to create a data set. We almost forbade them from doing it because we knew it would take them the entire time of the course plus, and they wouldn't make any progress on their project. Now I do the opposite. I say, "Hey, spend a few hours creating a few hundred examples from a domain you care about and you're off and running. You might have to do it a few times, but the overall investment is small." The payoff can be large, and we can learn so much more now because we don't have to focus on a single set of predefined tasks. People can be very fluid in what they decide to develop systems for. The culmination of that is all the stuff with in-context learning, where now you design a system via a prompt, essentially, or a strategy for prompting, and now you're really designing systems all the time for lots of tasks.

Sam Bowman:Yeah, I completely agree. I should say I'm teaching a project-based advanced NLP class that's very, very closely modeled on your NLU, and I've had the same experience the last couple of years. The data set projects have gone from being these real sort of nail-biters – "Are they going to submit anything? Are they going to have anything to say?" kinds of projects – to the ones where we learn the most, that are sort of the most successful. That's been fun to watch.

Chris Potts:So now I have to ask about adversarial testing. I'll kick myself if I don't, because you and I have mixed it up a little bit about this too. So I'll claim that, in these contexts, it's incredibly productive for people to think of hard examples that they know will fool their model and evaluate on them as well. That's the essence of adversarial testing for me. But you've taken, in the past, a more skeptical stance on it. What's your current view about this practice that I advocate?

Sam Bowman:Yeah, I've been on record as being skeptical of this particular approach that I think the Dynabench platform highlighted, though I'm not sure it's fundamental to Dynabench, where you collect data in that way: you work to find examples that fool a particular model, and then you publish that as a test set, as a conventional benchmark for use in other people's experiments on other models. I think the way that your choice of model and the choice of model in that subsequent experiment interact breaks a lot of the assumptions of normal experimental design in NLP and is an easy way to get kind of misleading results.

The underlying idea of getting to know a model in this very hands-on way, trying to figure out what breaks it – I think that's still extremely valuable. I think that's very valuable completely qualitatively, as a way of building insights about what these models can do, and also somewhat quantitatively, in that how hard it is to trick a model is a useful measure of that model's robustness and capabilities. So I think that the underlying idea, the underlying method, is great. I've just got some quibbles with how it's interacted with the practice of building these kind of static benchmark data sets.

Chris Potts:Got it. But here's another perspective, and Dynabench just supports this directly, so I want your reaction. So you can see that if you deploy a system into the world in the current moment, the world will be incredibly adversarial with you. We've known this all along, but with all the hype and all the attention, now the world really comes down on your system. If there is a problem with it, people will find the pattern of examples that exposes exactly that problem. Before we do the deployment, it's hard for us to anticipate all of that. We should have diverse teams that are trying to do this in-house to simulate the real world, but there's only so much we can do.

So Dynabench was a mechanism for interactively helping some humans figure out where the danger zones were, and in a goal-oriented way end up constructing examples that were definitely going to cause you some embarrassment in the real world. Then of course, this is all in-house and so you can improve your system before the deployment part. Dynabench supported all of that. So was there something that you have a quibble with in there?

Sam Bowman:No, I think all of that is really great, and yeah, I had many, many public arguments, especially with Douwe, who led this project, that ultimately came down to: this feels so close to something I would really be excited about, for exactly those reasons you described. It was almost this UI decision to foreground these test set numbers evaluated across models, instead of foregrounding either the actual list of examples or this other quantitative metric that Dynabench uses that I really like, the validated model error rate, which is just asking: how many times do you have to interact with a system before you can make it confidently mess up?

Chris Potts:We're fundamentally in agreement. Dynabench having only one model in the loop – I think it would be great to have 10. That just seems like a technological limitation, but I think fundamentally the idea that you give your system some grief before the world gives you some grief is a good idea, and can help us a lot.

Sam Bowman:Yes.

Chris Potts:The reason I wanted to mention this is that you've been involved with the Inverse Scaling Prize, and this does seem like the current moment's version of what we did in the past because this is, at a high level, an attempt to figure out where our biggest and presumably best models are actually kind of falling down on the job compared to ones we might think are simpler. Maybe you could fill in some details on Inverse Scaling and give your perspective on it.

Sam Bowman:Sure, sure, sure. To give full credit, this was a big team project and I was not the lead. Ethan Perez, an NYU alum now at Anthropic, took the helm of this, but I helped out. Yes, this is coming from this interest that I've been quite excited about, and that I think a lot of people have been recently, in what scaling trends can tell us about large models. What do you learn about something like GPT-3 from looking at this progression, from the small models to the medium models, to the big models, to the very big models, that you don't learn by just evaluating that biggest model?

The ubiquitous story there is that, for everything you measure – that gets published, at least – the curves go smoothly: the bigger models do better at the task. Occasionally there's a dramatic jump where the small and medium models do terribly and then the larger model does much better. But largely you just see these smooth trends. This sort of feeds into this assumption that, at least for the kinds of model capabilities that we're tending to measure, everything gets better and everything gets easier – that for this kind of task, with three more years of further investment in scaling, everything will just be great and easy. We were interested in where the exceptions to that were going to be.

Yeah, we wanted to pose this question of, are there areas of model behavior that aren't getting better cleanly in this way? We initially started this as an in-house project with a few NYU collaborators and gradually brought in more and more people, made it a bigger and bigger project as we realized that we weren't finding anything obvious, that all the cases we were finding were sort of tricky and subtle and not exactly showing what we were looking for, and ultimately wound up putting out a public call for ideas here.

I think I'm on record saying this is a little bit of a long shot. There's a few different stories you could tell about why models might get worse as you make them bigger. None of them are all that general or all that compelling. In fact, I think the results we got were a bit of a mix. We had these prize rubrics written up in advance. We gave out a bunch of third place prizes to some really cool efforts, but nothing hit the first place criterion of really telling us something big and important that we had to watch out for that might change how we think about scaling.

Chris Potts:Fascinating. Is it ongoing, so people can still try for first place, or is it closed?

Sam Bowman:It's closed. I'd love to see papers come out, data sets come out, that aim at the same question, but the actual prize ran for about five or six months last year and wrapped up in the fall. I think we're now in the process of writing up the paper recording all of our results.

Chris Potts:Oh, cool.

One more potentially controversial question for you, and then I want to move into talking about you, Sam, a little bit more. The controversial one is the stance that you took, which I think you knew would be controversial, in the paper that's called "The dangers of underclaiming: reasons for caution when reporting how NLP systems fail." So I'm wondering, yeah, what's your overall take on the paper now? You could say a bit about what you were saying in the paper and also your current perspective, I suppose.

Sam Bowman:Yeah, so maybe I'll start with some of the context that this paper was written in. What I was seeing specifically in the NLP research community – by which I mean academia and sort of very academically active labs, like Google and Meta and Microsoft – was this attitude, which I even heard people state explicitly once in a while, that if you overestimate or over-report how impressive your system is, how good your system is, that is deeply irresponsible. That's a serious problem. That's probably research misconduct if it was intentional, and it was something very seriously to be avoided. But the opposite – sort of claiming your system was less capable than it was – was excusable or even desirable. It was sort of perfectly fine to describe your system as not being able to do things that it was able to do if you didn't have some reason to be using that capability.

I think this was largely in reaction to hype from industry and places like OpenAI – this practice was seen as pushing back against irresponsible, real over-claims on the other side of some divide. But this paper was laying out a few reasons that fighting bad science with bad science could backfire, and that ultimately what we want to do is just aim to carefully report what systems can do. The paper sketched out a few reasons that having a false or misleading consensus that our systems were weak in ways that they weren't in fact weak could make our work as a scientific community harder, and could make it harder to navigate a lot of social impacts and policy issues. It could also make it harder to engage with a lot of the real technical issues that are coming up with the biggest, most capable models.

Chris Potts:So I think we're in agreement there that fundamentally the call is really just for accurate reporting, right? No hype, but also no under-hype.

Sam Bowman:Exactly.

Chris Potts:Just trying to get it right, which it's worth reminding people that that is the goal. That said, there is a part of me that worries about the perspective you're taking because you're probably like me in that you get a lot of surprising email from people out in the world trying to do things with AI systems. Even to this day I get messages that might as well ... This isn't a literal case, but they might as well say something like, "I am using your Stanford Sentiment Treebank model to make predictions about job applications and I can't get the Java utility to work." I'm like, "Nevermind the Java utility. Do you have any idea what you are doing? Because it's trained on movie reviews and has no knowledge of professional contexts, and this is absolutely disastrous from the point of view of real world outcomes, you should not be doing this." But something about the world has led them to believe it's reasonable.

Or for the case we discussed before, I'm probably about to get an email that says, "I'm a lawyer. I'm hoping to use ChatGPT to help me with contracts." They may think that's totally reasonable and I'm going to have to explain to them that ChatGPT struggles with exactly the language that you're going to be feeding in. The world, though, has told them a very different story. So there is this temptation to reel it in. I could prevent real harm in the world by being really forthright with these people that I think both of those projects should not even be pursued, which might be under-hyping, because there's probably something that will work. But I feel this urge to tell them, "No way."

Do you not feel that? Because very often ... I saw you give a talk and you expressed kind of the opposite view, which is like, "No, no, we should be more encouraging, because the real danger is that we will fail to tell the world about all the things we've done."

Sam Bowman:A few different things to pick up on here. So I think first I'll say the paper was very much aimed at the immediate research community. I felt like this was something that I wanted to push back on in how researchers were talking to one another, much more than in how ideas were communicated to the press or externally. Because yeah, I completely agree that there is a lot of hype, there are a lot of false claims out there, false assumptions, misunderstandings that do get picked up in the media about what systems are capable of, that lead to exactly these kinds of dumb ideas, like hasty attempts to do resume screening with our models.

Chris Potts:We can't even just blame the media because, we don't have to name names, but we all know that there are people who are leaders in our field who are guilty of over-hyping.

Sam Bowman:Yes. So yeah, I think the over-hyping is real. I was targeting this audience that I think was at least sometimes erring too far in the wrong direction. I was, at least hopefully, not aiming to present the solution as anything particularly easy or straightforward. I think the goal was really just like, we don't get these easy wins. We don't get these kind of easy throwaway PR messages about these systems. We've actually got to do the work of doing these evaluations. Many times we probably will come to these conclusions. I think in fact, if someone is using one of these data sets to build a resume screener, that's an enormous red flag and we can pretty easily point to evidence in a totally rigorous way that, no, that will go wrong in these 10 ways.

But yeah, I think what I was aiming to point out is just that if we're fighting genuinely misleading claims with genuinely misleading claims, that there are some risks on the other side. I think there are some risks to it being a consensus in certain communities of practice that these systems don't do anything. I think also there's this risk that I maybe have been seeing more clearly as I'm getting closer to the industry heart of the large language model community.

There's just the risk of very serious, very capable technical researchers not being taken seriously – the risk of there being a perception, by people outside of this core research community, that we're not interested in engaging with a lot of progress that is genuinely being made, and of creating a split in the field in a way that doesn't help, a split that's almost based on whether your conviction is that models work or that models don't work, when in fact I'd like to be in the middle ground where we're just studying the models and trying to report reasonably accurately what they can do.

Chris Potts:Yeah, that's interesting, that dynamic about being out in the world and us not being taken seriously. In the current moment, I actually feel like we're taken very seriously, and that's just another reason to get it kind of exactly right, which might mean remaining silent for some questions that we just don't know the answers to.

But when you look inward, I personally wince like you do when people say, for example, models can't handle negation and they cite a paper from 2017. I think things have really changed when it comes to models handling negation, especially if we're talking about fine-tuned models.

On the other hand, I also wince when I read casual throwaway statements like "models have solved part of speech tagging." That feels like an over-claim and I know that has consequences for researcher choices. So again, this is just about the middle, about getting it exactly right. And in your case for a lot of your examples, it's just about actually having the evidence that you purport to have for the statement that you want to put in your paper.

Sam Bowman:Yeah, yeah.

Chris Potts:If the fundamental message is kind of like we want to get it right, and that the stakes have become very high for getting it right, then I think we're in total agreement.

Sam Bowman:Yeah, I say things close to that and, if I could add a line to the conclusion, I think one additional comment I'd make along these lines is just that whether we wanted it to or not, the field of NLP and language modeling research has come to encompass this really enormous range of questions and applications that maybe many of us didn't sign up to study, and that the systems that we're working with are systems that are kind of less and less engineered, less and less designed. So the range of things we don't know about them has gotten bigger and bigger.

In many ways the empirical situation we find ourselves in is maybe harder than ever, in that we've managed to build systems that work, but we know less about the state of practice. Maybe I was going towards a slightly too strong claim, but the range of important open research questions is really enormous now.

Chris Potts:Absolutely.

Sam Bowman:Maybe even more so than in the past.

Chris Potts:This is probably a nice hook into this thing you've done. So you are starting or have started the NYU Alignment Research Group and you've carved out an intellectual vision there. I'm just kind of curious if you could inform me a little bit on what on earth AI alignment is. I hear the phrase all the time, I see it a lot on Twitter. It's the most confusing rhetorical situation. I can imagine someone could do a study just on framing, about this phrase, like a linguistic study, and maybe I would be illuminated there, but can you help me navigate this whole mess of usage and concepts?

Sam Bowman:Yeah, I think there's a couple things going on here. I think there are a lot of terms that get thrown around Twitter. There are genuinely just a bunch of different senses that are loaded in very different ways of what people mean by this. I think there is a useful technical term here, and this is why I am willing to use this term in some settings. I think it is potentially confusing in that it relies on a distinction that doesn't fit a lot of intuitions that are common if you've been working with machine learning systems for a long time – it's this alignment/capabilities split.

The intuition behind this is that the work of building a highly capable AI/ML system decomposes into two pieces. There is the piece where you teach the model how the world works, how to interact with it, how language works; in the foundation model paradigm, this is the pre-training piece. Then there's a piece where you communicate to the model what you want.

In the first piece, you're giving the model almost no information about what you actually want it to do. And in the second piece, you are doing almost nothing to improve the model's understanding of the world or the model's ability to act in general. So actually, I think framing it that way might bring up some really clear, straightforward examples of this, which is that if you pre-train a BERT model and then fine-tune it to do textual entailment, you're more or less doing first capabilities work and then alignment work, in that we assume that during pre-training you're mostly teaching BERT about the syntax and semantics of English and some things about the world that are useful for language understanding. Then when you're fine-tuning it, you're mostly just teaching it: What do these three labels mean? How do you do the entailment task?
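
To make the BERT example concrete, here is a minimal sketch of the two-phase split being described, written against the Hugging Face transformers library (the library, the checkpoint name, the example sentences, and the label mapping are assumptions for illustration; the conversation doesn't name any particular toolkit). Loading the pre-trained checkpoint stands in for the capabilities phase; the short fine-tuning step on three-way entailment labels stands in for the alignment-flavored phase.

```python
# Minimal sketch (illustrative, not from the conversation): a pre-trained BERT
# model plus a tiny fine-tuning step for three-way textual entailment.
# Assumes PyTorch and the Hugging Face `transformers` library are installed.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# "Capabilities" phase: load a model that has already been pre-trained on raw text.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3  # entailment / neutral / contradiction
)

# "Alignment"-flavored phase: a fine-tuning step that only communicates the
# task (what the three labels mean), not new facts about the world.
premise = "A soccer game with multiple males playing."
hypothesis = "Some men are playing a sport."
inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
labels = torch.tensor([0])  # hypothetical label id for "entailment"

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
loss = model(**inputs, labels=labels).loss
loss.backward()
optimizer.step()
```

In practice you would, of course, loop over a full entailment dataset and evaluate on held-out data; the point of the sketch is only that the fine-tuning step touches the task specification rather than the model's general knowledge.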

But yeah, so I think alignment points to this kind of general question of what do you do in the second half in a way that emphasizes how it might be hard and how it might be a different kind of machine learning problem from the kinds of problems that we're used to in pre-training and that we're used to in sort of building models from scratch.

Chris Potts:That's what alignment means? That is fantastically technical, and at some level it's just a description of the mundane everyday practice of trying to build and improve a system throughout AI. That can't be what all these people mean. Maybe that's what you mean, but that's just a weird and confusing branding for something we all do, which is just to try to improve things.

Sam Bowman:Yes. So I'll say two things there. First as just a nit-pick, I think trying to improve things really does cut both ways. There are ways of improving ML systems that are largely about making them generally more capable and don't quite fit this frame. But no, so I think what makes it confusing is this bit of the context in which it tends to come up, which is that if you're working on fine-tuning BERT, you can just say you're sort of working on fine-tuning BERT. There are existing terms and concepts to point to this. The reason that this is hard and the reason this is interesting fit in well with the normal way that people talk about machine learning. Alignment tends to be most useful as a concept and tends to come up the most when talking about extrapolating the capabilities of future systems.

I don't think this is really tied up in the definition of the word or why it's useful, but I think people tend to talk about alignment in this context where maybe you imagine a system that already knows perfectly how to do everything you want and it still doesn't actually do that thing. You have to find some way to robustly, confidently communicate to the model how to do this, how to do that thing. I think that framing of the problem suggests an emphasis on certain technical directions and de-emphasizes certain technical directions that I think are actually interesting technical directions to pursue.

Chris Potts:I see. But I just had the vibe that the reason this term is rightly controversial is that alignment is kind of symmetric about, or doesn't say, who you're aligning with.

Sam Bowman:Yes.

Chris Potts:Then the wide open questions are, am I aligning to the needs of current people, current wealthy people, current impoverished people, or am I primarily concerned about future people? Like, say I'm a futurist. Then the same questions arise, and if I do this alignment, what are the costs to the other groups that I just mentioned? If I decide to align to future rich people, there are going to be a lot of losers in the present and in the future. This is hugely consequential, and somehow a lot of that is, I can tell, being snuck into this thing. So you might have the technical definition in mind, but I think when you use the phrase, you are taking on at least a lot of this baggage that I perceive.

Sam Bowman:Yeah, I think some of that's there, some of that I disagree with, but yeah, I agree-

Chris Potts:Well, I'm not trying to agree or disagree. I think all of that is in the mix of this rhetoric at this point. You might be against that, but I don't know that I'm wrong in my characterization, right?

Sam Bowman:Okay. Yeah, I think I'll just quibble maybe with exactly where some of this context is coming from.

Chris Potts:Feel free to quibble. Yeah, that's the whole idea.

Sam Bowman:Yes, yes, yes. So yeah, I think first, in the literature I read that emphasizes this alignment frame, this question of aligning to whom comes up pretty often. I think one reason that this does look like a slightly weird frame, coming very much from technical NLP work – and I have this experience as well – is that it often gets answered in a way that sort of delegates. You'll see the split between technical alignment, which is the problem of making a system do what we want for any definition of "we" – just making a system sort of generically steerable. Then AI governance is usually what it gets called – or potentially future AI governance – where you say, all right, assume we have a system that is capable of doing lots of things. We have techniques for steering it to do what we say. Now who's the we? Now how do you navigate all these dynamics around, for example, the concentration of power that comes with having big general-purpose systems owned by a small number of companies?

I think my sense is that a lot of people who do talk about this do tend to work pretty closely with these issues and with the research communities more on the AI governance, AI policy, responsible AI side that are very much about these dynamics. But we'll write papers that are very much tackling just this technical problem, on the assumption that this factorization is fruitful, that it will let you make some progress that would be hard to make if you lumped everything together.

I'm not sure I completely agree with the factorization. I think there are some interesting research directions that are oriented more toward methods for specifically teaching models to behave in certain ways or to align with certain values. But it does seem like, largely, this problem of getting models to not just go off the rails, to reliably implement some particular goal, does feel hard and probably valuable. It's completely compatible with thinking seriously about what it is that you then put in that instruction slot.

Chris Potts:Got it. Yeah, undoubtedly that's important. I took that to be a shared goal across the whole community. No matter whether you like the phrase or not, we all want systems that succeed in doing what we intuitively wanted them to do. So the correct specification is hugely important.

For you in particular – setting aside this whole swirl of difficult framing issues – for the NYU Alignment Research Group, two questions. What big things do you want to take on, and, as the group fills out, what kinds of people do you imagine are going to be involved?

Sam Bowman:Yeah, actually I'll first maybe just throw in one point that's relevant to this, which is that something that's sort of been a little bit maddening about getting interested in this research community that's very specifically paying attention to frontier models is that there's this really, really loud public fight going on that I think doesn't actually get very close to what most people on either side of the fight seem to be engaging with. I think you'll see people like Eliezer Yudkowsky, who was maybe one of the people on Twitter who popularized this alignment frame. Any time he disagrees with someone in an argument, he makes the rhetorical move of, "Oh, you disagree with me, therefore you're wrong and probably foolish. Let's explore why." It can just come off as incredibly condescending.

I think there are pieces of the responsible AI/ethics scene within computer science that really react to this, that are just very culturally opposed to anything that comes close to that sort of condescending tech-bro attitude around these genuinely really difficult political and social questions. This just devolves into shouting matches. When I talk to people who are actually doing this work, whether or not they emphasize the alignment frame, whether they're emphasizing present-day systems or capabilities we think are going to emerge in the next few years, I think most people that I have interacted with have actually gotten that there is in fact nuance here. So I just wanted to flag that.

Chris Potts:Makes perfect sense. So you have this opportunity, or you're creating this opportunity to do something different under this rubric. That's where the group questions kick in.

Sam Bowman:Yeah. Yeah. So yeah, I think the focus in setting up a group called the Alignment Research Group at NYU is first, just very straightforwardly, to make this explicit commitment to try to focus on these research questions that I think will stand the test of time, that I think are not specifically tied to contingent limitations of current models that are predictably going to go away. I think this is just a useful orientation for doing the kind of technical work I'm interested in.

Practically, this means trying to build bridges in two directions. When I'm hiring, when I'm trying to find collaborators, I'm sort of trying to keep both of these in mind. On one side is the community that more self-identifies as doing alignment research: I think this is a lot of the research and governance organizations at OpenAI, Anthropic (where I'm visiting), DeepMind, and a few smaller nonprofits like ARC and Redwood Research.

So building connections there, engaging with that research thread, and also staying engaged with the kind of more mainstream, grounded, pragmatic NLP research community, and staying engaged with the responsible AI community as it has formed around large language models and NLP. Yeah, my hope is just to find big long-term questions that, as a technical researcher, I can make progress on, and to get people who are in all of these conversations in the room to work on them.

Chris Potts:Got it. But is the important work technical work, or is it like policy work? Should you have people who are essentially lobbyists in Washington, or can we also just look inward all the time and think about how to better specify our systems to achieve what we want? Or does it have to be both?

Sam Bowman:I think the work in the big picture is both. I think the work that I would like to see in the world is very much both. I do plan to be hiring people who are much more literate in these trickier ethical and policy issues, but who are still basically, at least to some extent, computer science researchers. I think I do want to be engaged in policy questions, but as a professor running a lab with training primarily in linguistics and NLP, I'm not in a great position to be running those efforts. That's something where I'm happy to participate in processes that help make sure that policy responses to these things are well informed, but I don't expect to be leading a lobbying agenda there.

Chris Potts:I'm just laughing because I asked Percy a similar question and got a similar answer about the Center for Research on Foundation Models. He was like, "Yeah, this is important work, but I hope I get to just be a professor doing technical machine learning research." I hope there are people in the world who want to do the difficult thing of talking with politicians.

Sam Bowman:Yeah. Yeah. I think there is really a community popping up here. I think it's way smaller than it should be, and I'd encourage anyone listening who wants to go into technically informed governance of foundation models, in a way that wrestles with the ethical issues, to go do it: I think this is going to be super, super important. I think there is a community popping up, and I really have been making an effort to talk to people who are doing this work as often as I can, and I plan to ramp this up as I'm ending my sabbatical.

Chris Potts:Let me ask one more question, because I do like to tie this back to students and being researchers. My question for you is going to be what you're doing now to inspire students. The reason I ask is, I think it's a very exciting moment for all of AI. Lots of stuff is happening, but it can also feel stultifying. It can feel a little like it's the opposite of inspiring if you're trying to chart out a new research direction. I find myself falling into this trap of essentially reducing everything to some kind of in-context learning problem that people can try out on ChatGPT and see what happens.

I have some examples in my back pocket about things that are really different, like explainability and other things, but it does feel like a tough moment in this way. It's exciting and also feels in some ways limiting. So what are you doing with your students and more junior researchers now that you're a very seasoned tenured professor? What are you doing to inspire the more junior people you work with, especially the people who are starting grad school right now?

Sam Bowman:Yeah, yeah. I'll admit, I'm very grateful to be on sabbatical this particular year. I think this is a uniquely tricky problem right now, as a lot of these technologies are starting to really get attention and serious commercial interest. Yeah, I think it's hard. I think the big picture thing I'd say, and I wouldn't say this to everyone, because I think there are definitely people, maybe especially grad students in linguistics, for example, who have a lot of other cool opportunities.

But I think the thing I'd flag for someone who's getting interested in foundation model-type work from a typical technical research perspective is that this moment can be viewed as sort of a disappointment – as the end of an era for research in some ways – but also as this incredible opportunity: we've been trying for generations to build systems that understand language well enough for language to be the default interface to advanced computer systems.

We're basically there. It's definitely not perfect. It can be extremely rocky in some ways, but the building blocks for this are clearly coming into place. The challenge now is to actually figure out what to do with this, figure out what we can build with this, what we can responsibly build with this, and figure out how to do that. That seems like maybe a bigger, harder, more scientifically meaty, conceptually meaty problem than we've been used to grappling with. So again, in some ways I think there's kind of more opportunity than ever. It just doesn't always look like training a model or building a targeted eval set, the kind of work that watching lectures from NLP courses five years ago might have suggested.

Chris Potts:Oh, that's interesting though. So like Richard Socher, when I interviewed him, which was a while back now, before a lot of this hype, he still said that he thought ... If I remember correctly, he thought that this was a kind of moment to do products – that you kind of go in and out, where there are moments when research is the most important thing and moments when putting that into action in the form of products and things is the most important thing. And he felt like we'd reached a moment for products. Is that kind of what you were saying, or did you actually have research questions in mind?

Sam Bowman:I think both. I think this is true. I think the amount of applied work that is important and exciting and requires technical sophistication to do well is greater than ever. So to some extent the message is: learn this stuff, learn it rigorously, know how to do it well, and then go out in the world and do something way more impactful than you might have been able to five years ago. But I think there is a big research angle to this as well. I think really charting out the limitations and capabilities of these systems, in ways that might inform how they're deployed and how they're viewed but that might not be rewarded by commercial incentives, is important work to do there.

Then there's a lot of the work on the messier, more social parts of alignment, the ethical issues around this that we were talking about. I think there are a lot of big open questions there, and a lot of big open questions that really do reward having the deep technical context – knowing precisely what affordances these models will and won't have when thinking about how to steer the impacts that you have in the world, and that feels like scholarly work.

Chris Potts:Oh yeah. No, that feels wonderful and inspiring. I really appreciate that, and that's a wonderful way to end this really outstanding conversation. Thank you so much for doing this, Sam. I really appreciate this.

Sam Bowman:Yeah, thanks so much, Chris.