Podcast episode: Douwe Kiela

April 18, 2022

With Chris Potts and Dhara Yu

Hugging Face, multimodality, data and model auditing, ethics review, adversarial testing, attention as more and less than you ever needed, neural information retrieval, philosophy of mind and consciousness, augmenting human creativity, openness in science, and a definitive guide to pronouncing "Douwe".

Show notes

Transcript

Chris Potts:All right. Welcome everyone. And welcome to Douwe Kiela. Douwe, it's wonderful to have you here. I'll just do a brief introduction, because I just love your background, especially in the context of this course.

So Douwe was an undergrad at Utrecht working in cognitive artificial intelligence and philosophy. After that, he did a Masters in logic at the world-famous University of Amsterdam, which has the premier center for studying formal logic in all the world, I would say.

After that, he did, I guess, another Masters of some kind as well as a PhD at the University of Cambridge. He was previously at Facebook AI Research and now he is the director of research at Hugging Face. And I'm delighted to say that he is newly an adjunct professor in Symbolic Systems here at Stanford.

And in terms of his work, man, if you name a topic in NLP, Douwe has worked on it. Representation learning, adversarial testing, NLI, benchmarking, neural information retrieval, hate speech, question answering, dialogue, and on and on. We could do our entire syllabus for this course just based on work that he's done. It's truly remarkable.

So Douwe, welcome. We have tons of stuff to talk about. And so I just want to dive right in with the first very important question. In the U.S., what name do you give at Starbucks? I'm guessing Don, but I'm not sure.

Douwe Kiela:No, actually, so I used to go for John just because I figured there's no point in even trying, but I've settled on Dow, D-O-W, so I spell it out. So Dow, D-O-W, and then they get it, usually.

Chris Potts:I like that.

Douwe Kiela:Yeah, that's a great question. So I have a pronunciation guide on my website. Usually every conversation I have that's introductory with anyone I meet for the first time, I explain how to pronounce my name. I tell them that it's not a bad thing if you can't pronounce it, it's not your fault, it's my parents'. And so if you want to know, it's Douwe and so it doesn't mean anything, it's a name from the north of the Netherlands, Frisia. I'm not even from there. My parents just liked the name.

Chris Potts:I did my level best with it, but I've noticed that I call you Dow in my head and when I meet you. And in my mind, I still spell it the way you spell your full name. And it's just some kind of mismatch for me, but I did my level best based on the pronunciation at your website.

Douwe Kiela:Yeah. So I tell people to just call me Dow as in Dow Jones. It's easy to remember, at least you get the vowels right. French people call me Duvet, which is not ideal.

Chris Potts:I was tempted to do that. I think I might have done that in the past and you were kind about it. But in all seriousness. Okay, so we should really dive in.

And I'm curious about the present moment for you. So you've just started as the director of research at Hugging Face. You're opening up a local office and leading a new team. Can you tell us a bit about what all this means?

Douwe Kiela:Yeah, so I joined in January, and my mandate is to build out a world-class research lab, which I'm very excited about. And it's an interesting time, I think, to do that because if you look at what's happening in the broader communities, there's a lot of people moving from the big tech corporations to cool hip startups and doing more interesting things. I think that's a general trend, maybe post-COVID, where there's a bit more mobility in the labor market.

And so yeah, what I'm trying to do here is really find the optimal middle ground between the FAIR-style, bottom-up research, where people are just free to do whatever they want, which in my opinion leads to way too much politics. And if you have to compete for resources, then you get issues. An alternative is top-down, which is sort of DeepMind / OpenAI style, which also doesn't really work because I think imposing constraints from the top down really hampers creativity. And I think if you want to have a world-class research lab, then you should encourage creativity.

So we're trying to find this middle ground where we have projects that people work on where there's critical mass, so enough people are interested in working on this together and they just form a team and go and do it, with the broader goal of trying to really change the world. So we don't want to just write little papers, we want to really have a positive impact on the world. So that's what we're building. The research team is about 25, 30 people now. As a company, we're all over the world, so we have people in, I think, 22 or 23 different countries now, which is amazing. Company-wide we're 130 people, I think. So as you can imagine, we're very active on Slack and not very active in meetings, because we are all over the world.

The Palo Alto office is near California Avenue, if you ever want to stop by. It's just in the local WeWork there, it's still very small.

What's it like at the office? So I'm at home now. I was at the office this morning, so generally we're still all struggling with the return to office. It's very hard, even though the office is like a couple minutes from my house. Sometimes it's better to be home.

Chris Potts:So I want to talk about the research projects that you all have cooking, but I'm curious: why bother with an office? We were just talking about Zoom and the prevalence of Zoom still. Yeah, why bother with a physical space since some of the team might hardly ever visit it?

Douwe Kiela:Yeah, it's an experiment actually, so we're not sure if we're going to keep it, but I think there's something really different about in-person interaction and the creativity that arises at the coffee machine when you're just talking about random stuff. And so I'm finding it really hard for my team to create these sorts of serendipitous interactions that lead to creative new insights. With Zoom, there's always a purpose to the meeting and I think that's really not great for creativity.

Chris Potts:Right. Oh, I totally agree. Yeah. And so the way you described your project planning, so to speak, really resonates with me, because it's something I try to get right as an advisor too, where you want to set some terms and provide some guidance, without, though, dictating the terms. How are you doing that? Are you just sort of musing aloud and seeing what sticks, or something more structured?

Douwe Kiela:Yeah, I try to influence not too much. We have some amazing people and I'm sure they can come up with this themselves, but we have some general things that we think are interesting. So I can tell you a bit about the projects that we're currently working on.

Chris Potts:Sure, sure.

Douwe Kiela:The advantage of being at Hugging Face is that we're very open about everything, including what we're working on. So yeah, if you have any questions I'm happy to share.

Chris Potts:I think that's wonderful. Yeah.

Douwe Kiela:We have five big projects at the moment.

One is Big Science, the large language model, which you've probably heard of.

Then we have a multimodal project, so sort of building on the FLAVA project we did when some of us were still at Facebook. We can talk more about that.

Then we have a project on simulated environments and embodied learning. So this is really kind of a moonshot, crazy idea project, which you probably wouldn't really associate with Hugging Face necessarily.

Then we have a project on retrieval, neural retrieval specifically.

And then the final project is organized around the AI ethics folks. We have some really world-class people like Meg Mitchell there. And so what they're working on is really trying to change the world there.

Chris Potts:Very cool. I mean, I could pick your brain about all of those projects. I guess the one that I want to seize on is the multimodal stuff. And could you just share: what's your philosophical perspective on this? Do our language models need to be multimodal to be true language understanding agents, or is this a practical thing? Yeah, just say a bit more.

Douwe Kiela:It's a combination of both, I think. So there's a practical argument just from the sample efficiency, data efficiency standpoint, where if you have multiple views from different modalities on the same abstract ideas, then you should be more data efficient. And so I think that's crucial in human learning and something that we don't really have that much of yet in machine learning. But yeah.

So I guess the other question is, do you need multiple modalities in order to get linguistic meaning in machines?

Chris Potts:Yeah. Yeah.

Douwe Kiela:And I would say yes. I don't know how much though. If you look at my past work, there's been a lot of work on multimodality in general. And I think if you want to get to meaning, you have to have some understanding of the world, and not just like the physical world, but the world as humans experience it.

You can come up with some easy examples, like the smell of coffee: everybody in this Zoom call knows what coffee smells like, but it's very hard to describe it in words, because it's captured in a very primitive part of your brain. And because you know that everybody knows it, you never had to describe this sort of stuff. And that's a bit of a contrived example maybe, but I think a lot of that exists in language, where we just share all of this common experience of the world and we're always assuming that that common experience exists and that's why we can communicate efficiently with each other.

But in terms of machines: maybe not everything needs to be grounded. Even for humans, concepts like democracy are not grounded necessarily. I think they're just abstract concepts, maybe constructed out of concrete concepts if you believe in Lakoff and Johnson and that sort of stuff. But yeah, in order to get to true meaning, you need some experience of the world, but maybe not as much as some people think.

Chris Potts:So I'm actually on board with this, but I guess I'm just curious: there is evidence, for example, that people who have been blind since birth can reason in rich and accurate ways about color terms, even though they've never seen colors before, suggesting that you can achieve something that looks an awful lot like grounding without the actual sensory input that you would expect to need for colors, and maybe the same is true for the smell of coffee.

And so, I guess if you'd followed that line of reasoning, you could be open-minded that a pure text-only model could achieve something like grounding or at least simulate grounding.

Douwe Kiela:Yeah, but so I think visually impaired people still have other common ground -- common perception of our shared environment. And so that's through the auditory modality and proprioception and things like that. If you have at least a few of these kind of connections to a shared reality, and then you can communicate about them, then you can fill in the rest. So it doesn't have to be vision necessarily, but I do think there needs to be some shared perception of a common environment. If that doesn't exist, then you can't build the basic beginning of mappings between concepts and tokens. If you can't establish that first mapping, then I don't think you can learn language.

Dhara Yu:So we have a question from a student, Sylvia. She's curious to hear what type of multimodality specifically -- for example, video and audio, or live streaming data of people during a conversation -- and what might be a good representation of, say, the smell of coffee in the real world?

Douwe Kiela:Yeah. Okay. So that latter question is a question on its own. I have this paper from 2015 on grounding in olfactory semantics.

Chris Potts:Do you, really? Oh, wonderful.

Douwe Kiela:So we built a "bag of chemical compounds" model, where basically it's like a bag-of-words model, but over co-occurrences of chemical compounds. And surprisingly, those representations actually correspond very accurately with things that we associate with smells, like coffee and tea. And so if you look at the similarities between those representations, they're actually more meaningful -- so, more accurate than linguistic representations at the time were -- so that's interesting.
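For readers who want the gist of that model, here is a minimal sketch of the idea (not the actual 2015 pipeline): represent each concept by its co-occurrence counts with chemical compounds and compare concepts by cosine similarity. The compound names and counts below are invented purely for illustration.

```python
# A toy "bag of chemical compounds" model: like bag-of-words, but the
# dimensions are chemical compounds rather than word types.
# All compound names and counts here are made up for illustration.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical co-occurrence counts between concepts and five compounds.
concept_vectors = {
    "coffee": np.array([9.0, 7.0, 6.0, 1.0, 0.0]),
    "tea":    np.array([5.0, 2.0, 1.0, 2.0, 1.0]),
    "rose":   np.array([0.0, 0.0, 0.0, 6.0, 9.0]),
}

for a, b in [("coffee", "tea"), ("coffee", "rose"), ("tea", "rose")]:
    print(a, b, round(cosine(concept_vectors[a], concept_vectors[b]), 3))
# In this toy data, coffee and tea come out far more similar to each other
# than either is to rose -- the kind of structure the olfactory vectors captured.
```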

But to answer the general question: I would like to see all modalities in a single model. And I think that's where we're going as a field. Foundation models and things like that are also connected to this idea of just having one model that does everything. I think that's where we are going to go as a field. The easy thing for us to do, I think, is to just have text, speech (so audio), images, and then video, but then if you support text, you might as well also include code and things like that. And then maybe if you support vision, then maybe put in some graphs and 3D point clouds, and so on.

Chris Potts:So what is the model that you're building? What are you going to pour into it and how big is it going to be? And when can I get my hands on it?

Douwe Kiela:Yeah, so it's hard to say. It might take a while. We're really starting from scratch here just from constructing the data set and things like that. We're trying to find out if there are ways to do this, where you can distribute the data. As you know, Hugging Face is all about open source and open science, and so we don't want to be like OpenAI where we construct this data set that we're not going to share with anyone else. And so it's pretty important that we can find a way to do this where at least other people can reproduce our results and things like that.

Chris Potts:Very cool.

Douwe Kiela:And it is turning out, actually, that that's not an easy question to answer. With text, there's a bit more freedom, I think, from a legal perspective, but for images, copyright is defined and enforced pretty aggressively. And so distributing images like that is not easy.

Chris Potts:And so you mentioned work by Meg Mitchell and others. And I assume that means you're taking very seriously how you would curate even an internal data set that you would use if you were going to release a model. Do you have thoughts? What's the thinking around that at Hugging Face right now?

Douwe Kiela:Yeah, there's a lot of research just happening in this general space of: how does data translate to models and what are the considerations that need to go into this design process? And I think one of the things I would like to see from the specialists in that area is a more general guideline that practitioners can use, even if they're not really informed about the pure ethical considerations. There are so many of these questions. There's a whole community around this stuff. And I would like anyone to be able to just say, "Okay, these are the top five rules. And if I follow these, then at least I'm being considerate about this." And I think that's something that's still lacking a bit.

Chris Potts:That's so interesting. I mean, if there are guidelines, I would love to see them. I have a feeling that in some aspect though, I mean, we're going to be talking about gargantuan amounts of data. At some level, this is going to have to be a data science project that you launch, that's going to involve its own challenges just of scale, where you're trying to assess what are the biases, what are the signals? And the unknown unknowns might really haunt you as the scale goes up, and up, and up.

Douwe Kiela:It's been really hard even with Big Science, where I think we've been very careful there with the data that goes into that model, but even there it's just impossible at that scale to ensure you have clean, high-quality data that doesn't expose any privacy-sensitive information, that doesn't have any bad stuff. But you also need to factor in that blanket removal of stuff is also not the way to do it. What people used to do is just remove all the keywords associated with pornographic websites, for example, which included terms like "gay", so now suddenly your language model doesn't understand what "gay" means, which is a problem. So even in those sorts of questions, it's not easy to find the right answer.

Chris Potts:Oh, yeah. Those are great illustrations. That shows that simple filtering might end up systematically disadvantaging certain groups that you would like to have on an equal footing in the data set, precisely because of other factors and things like that. That's reminiscent of lots of challenges that come up in hate speech detection. And I guess it just shows you how hard this is actually going to be, so it's good that you have a team of experts working on it.

Douwe Kiela:Yeah. Yeah, exactly. We're using them as our internal ethics review board in a way, but yeah, I think they're also very vocal to the outside world about what things should be -- how we should do those things.

Chris Potts:Oh, that's so interesting that you framed it that way -- as an ethics review board. What are your thoughts about that, about ethics review around publishing, but maybe also around releasing of artifacts? Should we be doing something more centralized or do you like the more free-form approach?

Douwe Kiela:Yeah, I like this idea. Maybe there should be an independent ethics review body just in the general community. I think there have been attempts at doing this with ACL, for example, where there's a bit more emphasis on ethical reviews. But yeah, it's hard to say.

In general, I think we're in a bit of an interesting place in the field where people are now realizing how important this stuff is, which I think a lot of us have been saying for a long time, but now people are taking it more seriously. But the current generation of researchers isn't really used to thinking about things from an ethics-first perspective. And I think the new generation of students hopefully will be much better educated in these questions and really always have ethics in any decision they make as something that they think about.

Chris Potts:Totally. My own feeling about it -- I'll be interested in your reaction -- is that it's hard to get it right at the level of centralized review. And so the best thing that we could do -- take, for example, a model release -- is that first part you mentioned: release the data, because that way people can audit the model and look for harms in the kind of free-form, creative way that experts can look for those things, but they can do it also in the dataset, so that they're not indirectly probing for evidence of something, but rather they can just do a simple audit of what we know is a crucial ingredient.

And then the other part would be just normalizing that people are going to discover things about your dataset and about your model that might be harmful, but we want to expose those things and that shouldn't mean that we don't release models, but rather that we just, as I said, normalize, kind of uncovering this information.

Douwe Kiela:Exactly. Yeah.

Chris Potts:Community-wide, and people can have their perspectives about what's important to report and what's unimportant to report. And in that kind of free-form thing, we might discover as a community what we absolutely need to be looking for in every single instance, for example.

Douwe Kiela:Yeah. Yeah, I agree with that. I think the only thing I would add there is that we need to make it easy for people. So I really like the idea of Model Cards, where we document this stuff really in a lot of detail, but if the Model Card is too long, then nobody's going to read the model card.

So I think where we should be going is, and this sounds a bit silly maybe, but a lot of people in our community are used to hill climbing on metrics, and maybe this should be a metric that they can hill climb on, or at least where we can tell people that, "Okay, if you are below this threshold, then you're not doing things the right way." And that's a big problem with ethics: can you really capture it in a value that an engineer would be able to optimize for?

Chris Potts:So this is a nice transition into the topic of benchmarking in general. And I'm very curious for your thoughts on this whole topic.

Let me come at it in one particular way, which is: we were all involved -- you led us in this paper -- on the Dynabench platform. And we have a nice figure in the paper, which I shared with the students on the first day of class, where you show progress on benchmarks over time. And you can see that, for MNIST and Switchboard, it took 20 years to get above our estimate of human performance. For ImageNet, it took about 10. And then GLUE, SuperGLUE, in the present day, these things are conquered almost immediately. It takes less than a year for us to rocket to superhuman performance. And I feel like no one can deny that this is some measure of progress. There is something happening there. And I assume you agree with that, but how do you temper that? What's your perspective on that figure that we created?

Douwe Kiela:Yeah. I mean, we use the figure to motivate a different approach to benchmarking. The current benchmarking paradigm of having static test sets is great, but it comes from a bit of an outdated idea, maybe, coming from Vapnik-style, old-school machine learning, where you assume your train and test distributions are i.i.d. and you want to measure generalization. And so, within those constraints, we understand theoretically what we're doing, but beyond that, we don't really know.

Language is very interesting because language, the way we use it, is a device for strong generalization, where if I generate a sentence that has probably never been said before, like the current one, then you can still understand what it means. That's what we should be measuring, I think, with a lot of these NLU benchmarks. And I don't think that we are really doing that.

And so what we're driving with Dynabench, or what we're proposing with Dynabench, is to try to have humans and models in data collection loops where humans are trying to assess whether a model is actually really robust and whether a model fails maybe in unexpected ways, because we want to know about that if we are going to be deploying these models in really very critical situations. And so the validated model error rate that you get out of this data collection loop, I think, is a much more meaningful signal about the quality of the model than accuracy on a usually very arbitrary test set. So that's one thing.

And the other thing I think is that we should really be moving beyond accuracy as a single metric. Performance, when you ask somebody who's deploying an actual machine learning system, it's not only about accuracy, it's also about how efficient is your model, how fair is your model? Are there any legal risks about deploying your model? Are there any unexpected consequences? Is it really robust? There's all of these questions that you want an answer to, if you're going to be deploying a model in the wild.

And so we should be thinking about those metrics much more, and then we should be thinking about how do we aggregate across metrics to choose the model that is at the Pareto optimum across all of these different things we care about. So if you're an IoT engineer and you have some sort of on-device thing, you probably want to have a very efficient model and maybe you are willing to trade in more accuracy just for efficiency, but if you're an academic and you want to be number 1 on some leaderboard, then maybe you want to focus more on accuracy, but it's always a trade off. And I think that's very important for people to realize.

Chris Potts:And what is the role of adversarial testing in the context of Dynabench, but also conceptually for us given the goals that you just outlined?

Douwe Kiela:I think that adversarial testing is about this question of, can you deploy this model? In academia, I think a lot of people are not really thinking about the applications of the models and they're just thinking, "Okay, this is an interesting test case." And I think a lot of our older work was just interesting stuff to play with. And then suddenly it started becoming really useful and people started deploying these models, maybe in situations where they shouldn't have.

A classic example, also that Chris and I worked on, is sentiment analysis, where there are just sentiment analysis models, binary sentiment analysis models, deployed in banks and places like that, which were trained on movie data or restaurant data -- they don't know anything about financial information, but they're really being deployed there, or they were being deployed there, in just situations where they shouldn't have been deployed, I think.

If those places did some proper adversarial testing, they probably would've realized that there's something wrong with that approach, but because it's still such a young and new field, I think it's very easy for people to have low-hanging, easy impact without actually thinking about what they're doing.

Chris Potts:So that's super interesting to me, and that's certainly an aspect of my own thinking around adversarial testing, which is kind of like: avoiding embarrassment or worse when you deploy a system, and hoping that you find the scary gaps early. But what about the role of adversarial testing and improving models? I mean, what you just described is more or less just saying yes or no on whether we could safely deploy, but is there a path from adversarial assessments and training to actually making these models more robust? What's the state of evidence around that question right now?

Douwe Kiela:Yeah. That was the original hope for the Adversarial NLI project, for example. And Dynabench is also heavily inspired by this idea that, as you are doing the adversarial testing, over time you also collect a lot of useful training information, which you can then use to make your model stronger, and then you can put this new model in the loop again, to do another round of data collection adversarially. And so if your annotators are good enough, then you can get lots of very useful information that is very close to the decision boundary, meaning that you can get a better model out in the end.

In the adversarial NLI paper, we showed that this seems to be the case, but there's so much other stuff going on there that I think what was necessary was to do a proper controlled experiment. So we did this, actually, in a recent paper analyzing dynamic adversarial training data in the limit, and we have a very nice picture there. Eric Wallace is the first author.

We show that if you do normal data collection, it works okay. If you do adversarial data collection, but you always keep the same model in the loop, then it does a bit better. And then if you keep updating the model every round, it does much better. And so this controlled setting is, I think, the first proof that this method really works. We need more validation of this hypothesis, but I think in the long term, if this is right, then people should stop doing standard data collection and they should only do model-in-the-loop data collection, because it's much more efficient -- it's cheaper and it also gets you much higher quality data that allows you to generalize better. So it's really a win-win from what we see.
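To make that loop concrete, here is a rough pseudocode-style sketch of dynamic adversarial data collection as Douwe describes it. The helper functions train, collect_adversarial_examples, and validate are placeholders, not Dynabench APIs.

```python
# Sketch of model-in-the-loop (dynamic adversarial) data collection.
# The helper functions stand in for real training, annotation, and
# validation infrastructure.
def dynamic_adversarial_collection(initial_data, annotators, num_rounds):
    data = list(initial_data)
    model = train(data)  # initial target model
    for _ in range(num_rounds):
        # Annotators write new examples while interacting with the current
        # model, trying to find inputs it gets wrong.
        candidates = collect_adversarial_examples(model, annotators)
        # Keep all examples, fooling or not; other annotators verify labels.
        data.extend(validate(candidates))
        # Update the model every round, so the next round of examples probes
        # a stronger model -- the key ingredient in the controlled comparison.
        model = train(data)
    return model, data
```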

Chris Potts:And there's just one thing that I want to make sure I understand there. So when you describe this model-in-the-loop process, it's creating a data set. Some of the created examples fooled the model and some didn't. You keep all of them, right? Not just the ones that did fool the particular model artifact they were interacting with?

Douwe Kiela:Yeah, we keep all of them. Yeah. I mean, different papers have different approaches. In some cases, if you want to construct a very specific evaluation set based on what current models fail at, then you could just take the ones that fool the model -- this is something we do for Adversarial VQA, for example. But if you want to use it for training data, it would be a bit of a waste to throw away this data that you paid annotators for and that is super useful still, even if it didn't fool that particular model.

Chris Potts:Right. And it might give you some kind of substrate in examples that aren't constructed specifically to fool a model. I mean, I guess that was the original intent, but since they didn't, they might just be normal sorts of examples that a system, deployed or not, might encounter. And that seems absolutely healthy.

Sam Bowman, as you know, has been critical of adversarial testing, but it's always adversarial filtering that he's critical of. And he seems to mean by that that you would be keeping only the adversarial cases, whereas it seems to me, if you keep all the created examples coming from the dynamic you described, it addresses his concern, gives you a larger data set, and probably one that's going to be more robust for training and assessment.

Douwe Kiela:Yeah. Yeah. There's a subtle point there, though. Adversarial filtering, you could do that on the training data, or on the testing data, or on neither. In Adversarial NLI, we do do adversarial filtering on the test set, so the test set consists only of examples that fooled the model and that were verified by other annotators. The reason we did that there is because we really wanted to make this point that in NLP, even though we're saturating all of these benchmarks, we are very far from solving these sorts of problems. It was meant to be a bit of a reality check as well for the field, which I think it was when it came out.

Chris Potts:So Douwe, the students want to know more about Hugging Face, so we have a couple of questions, and I do myself. Dhara, do you want to ask this question from Gabe about Hugging Face in particular?

Dhara Yu:Yeah. So the question from Gabe is, "What is the coolest application of Hugging Face's tools and services that you've seen, but you didn't expect to see happen?"

Douwe Kiela:Hmm. That's a great question. If I have to single out one thing: I don't know if people here have heard of Hugging Face Spaces, but it's a place where you can host demos. So if you've heard of things like Streamlit or Gradio, it's a couple of lines of Python code, and then you can deploy your model to the cloud and let people interact with it. It makes it super easy to demo your model and we've been telling people to use this.
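As a rough illustration of how little code such a demo takes, here is a generic Gradio example wrapping a transformers pipeline; pushed to a Space, roughly this same script becomes a hosted demo. The model choice and interface details are just an example, not a specific Hugging Face Space.

```python
# Minimal Gradio demo: a sentiment classifier with a text box and a label.
import gradio as gr
from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # downloads a default model

def classify(text):
    result = classifier(text)[0]
    return f"{result['label']} ({result['score']:.2f})"

demo = gr.Interface(fn=classify, inputs="text", outputs="text",
                    title="Sentiment demo")
demo.launch()  # serves a local web UI; on Spaces this becomes the hosted app
```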

We have one very famous Twitterer who works at Hugging Face called AK -- you might have heard of him -- and so he does a lot of these model deployments just for fun. And so he did AnimeGAN and ArcaneGAN and things like that. And those were a massive, massive hit. They really went viral way beyond the AI community. And so you can see this in our usage statistics on the website, and there's just this insane spike where demos of AI go viral. And I think this is a very cool thing for the future.

CVPR actually, all of their demos are going to be hosted on Hugging Face Spaces. Maybe this is a broader theme, where I think it's really important that we make our technology accessible also to people who don't come from an AI background so that they can understand what AI is and what it isn't. And a lot of people really don't understand what AI is and I think that's something we can improve through democratizing machine learning, which is what we're doing.

Chris Potts:This makes my next question seem small, but I'm going to ask it anyway: the code base for this course is, for this year, for the first time, completely dependent on Hugging Face for data sets as well as models.

Douwe Kiela:Awesome.

Chris Potts:Am I putting myself at risk?

Douwe Kiela:Ha! I don't see the risk. No. We take education very seriously. Again, going to this theme of democratization, I would go even further and expect that in the future, also computer vision and digital signal processing and other courses are also going to be using Hugging Face for their teaching and their materials. And I think that's a good thing. I mean, the only risk for you would be Hugging Face going outdated.

Chris Potts:Oh, wait. Outdated or offline?

Douwe Kiela:Yeah. Well, that is a risk. Yeah. So there's this example of some countries where, if Facebook goes down, people start complaining that the internet doesn't work. And so maybe this will happen one day with AI researchers and Hugging Face.

Chris Potts:I think that, on the eve of an assignment being due, the equivalent of pulling a fire alarm before the exam for this course would be some kind of denial of service attack on your servers to buy students extra time, because if those servers go down, we're going to have to extend the deadlines. I'm not even sure our autograders would run anymore.

Douwe Kiela:There was actually a bug in transformers a while back where if Hugging Face went down, then you couldn't load your model, even if it was cached locally. So we fixed that bug now, so.
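If you want to guard against that scenario yourself, transformers can be told to load only from the local cache; here is a small sketch (the model name is just an example and must already be cached).

```python
# Load from the local cache only, never from the Hub.
# Setting the environment variable TRANSFORMERS_OFFLINE=1 has a similar effect.
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-uncased"  # example; must already exist in your cache
tokenizer = AutoTokenizer.from_pretrained(model_name, local_files_only=True)
model = AutoModel.from_pretrained(model_name, local_files_only=True)
```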

Chris Potts:I saw that. Yeah.

Douwe Kiela:Be safe.

Chris Potts:If other students have questions about Hugging Face, feel free to ask them now.

I have one more that's kind of related to Hugging Face. I'm going to ask this of all the people I do this talk series with. Have you seen that your colleague at Hugging Face, Sasha Rush, has this bet going: is attention all you need? I'll put the link in the chat. It's particularly interesting because he's at Hugging Face, which sort of made its name as a place where you could work with Transformers, and he's really talking about Transformers. Where do you come down on the question of, "Is attention all you need?"

Let me read the question just so we have it in the background and just so we know the terms. The bet is about this proposition: "On January 1, 2027, a Transformer-like model will continue to hold the state-of-the-art position in most benchmark tasks in natural language processing." And they've got some details on "Transformer-like" and "most benchmark tasks", but I think we have a sense for it. Where do you put your money?

Douwe Kiela:I'm really confused by this bet because I don't know what "Transformer-like" means. And I also don't think that Transformers are necessarily the special thing here. Maybe this is a bit contentious too.

I think what makes this work is attention and attention was introduced in machine translation on LSTMs, and what the Transformers paper does is just get rid of the LSTMs and replace them with feed-forward layers. So I think the Transformers paper gets more credit than it deserves, but maybe that's a bit contentious. The credit should go to the people who invented attention.

But whether attention is all you need, I think in many cases, the answer is probably no for me. I don't know what Sasha means with "Transformer-like" exactly, but I think most of the layers -- and I think this is something that we're going to be seeing more of in the future -- most of the layers in Transformers don't have to have full self-attention, so you don't need to really have every token look at every token, or whatever the equivalent of the token is. I think we can get by with much more efficient ways of doing what is essentially a multiplication of components.

So attention is just a bottleneck. We're pushing everything through a softmax and then we're doing some multiplication based on these softmax weights. So my bet would actually be that you can get rid of attention in most of the layers of the Transformer and it will still work.
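For reference, the softmax bottleneck Douwe is describing is the standard scaled dot-product attention from the Transformer paper: scores between queries and keys are pushed through a softmax, and the resulting weights mix the value vectors.

```latex
% Scaled dot-product attention: softmax over query-key scores,
% then a weighted mixture of the value vectors.
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```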

Chris Potts:That's interesting though, because that's like pushing in the other direction. I suppose if I was going to put words in their mouths, I would say the hallmarks of this model are surely (1) only positional encoding to keep track of word order. Really strange, but apparently it works. (2) The attention mechanism is purely this dot-product thing, as dense as you want it, and as many heads as you want. And then despite the name of the paper, (3) those dense feed-forward layers with some regularization and dropout and stuff like that. So if we took those to be the hallmarks -- the core claims of the model -- we're going to say models that really are just powered entirely by those three pieces. Will they continue to be state-of-the-art by 2027? For me that would be a little bit dismaying if the answer was yes, because those are very simple pieces and it would've meant that we were just obsessively reusing them for more than 10 years.

Douwe Kiela:I don't think that there's a huge gap between attention and convolutions. There's this work out of FAIR from a couple years ago on lightweight convolutions, and that also works just fine for a lot of this stuff. So, barring some very specific long-range phenomena, you probably can get by with non-attention things too.

Probably the background for why Sasha set up this bet is because he's very interested in things like S4. I don't know if that qualifies as transformer-like, probably not. But he wrote this illustrated S4 guide, which is very interesting if you want to learn more about this. I think he's just saying like, there's a lot more to do.

But, in the end, it's all about mixing information and interactions between things. If you do a bunch of MLPs and you have some multiplicative component with some information bottlenecks, then it will work, but then yeah, is that Transformer-like or not? I don't know.

Chris Potts:So you're refusing to participate in the bet on the grounds that the terms are still unclear?

Douwe Kiela:Yes, I'm a philosopher by training still.

Chris Potts:I thought of another bet. I don't know that the terms will be clear, but I'm interested in this bet. So for foundation models, or let's say large language models. Actually, let's stick to the language model case. There's a real push, I feel, in the field to use that model, both to process language -- to consume prompts and produce more text (so to speak) -- but also to be the store of knowledge, so that you would store effectively your whole web index, or your database, or whatever, in those same parameters that are trying to speak.

And I understand the purity of the vision because I guess that's the way the human brain is in some level, it's an all purpose device for this, although I outsource a lot of my knowledge. Whereas you could, in the context of neural information retrieval, take a different perspective and separate out the language component from the knowledge store.

Where would you put your bet there? What's going to be the best, say, question answering system over the web, one giant language model or something that has a neural IR component?

Douwe Kiela:I would vote for the neural IR component, but I'm a bit biased here because I was involved in one of the big famous models that does this, RAG, Retrieval-Augmented Generation. What I like about these hybrid or semi-parametric models, as some people call them, is that you have some control over what is true or what you trust. So what you can do there is you can use Wikipedia as your index, and everything that is in Wikipedia in your index is true, and then you can reason over that stuff. And that's very different from just throwing internet-scale data at your language model, where you don't understand anything, and then hoping that it will memorize the stuff that you want it to memorize. So I believe that the problem currently with these big generative models is that we don't really have any way of controlling their hallucination.
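For the curious, RAG is available in the transformers library; a rough sketch of running it follows, using the pretrained checkpoints on the Hub with a small dummy index so it does not pull the full Wikipedia index. Exact argument names may differ slightly across library versions.

```python
# Retrieval-augmented generation with the pretrained RAG checkpoints.
# use_dummy_dataset=True loads a tiny toy index instead of full Wikipedia.
from transformers import RagTokenizer, RagRetriever, RagSequenceForGeneration

tokenizer = RagTokenizer.from_pretrained("facebook/rag-sequence-nq")
retriever = RagRetriever.from_pretrained(
    "facebook/rag-sequence-nq", index_name="exact", use_dummy_dataset=True
)
model = RagSequenceForGeneration.from_pretrained(
    "facebook/rag-sequence-nq", retriever=retriever
)

inputs = tokenizer("who wrote the declaration of independence", return_tensors="pt")
generated = model.generate(input_ids=inputs["input_ids"])
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```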

We also have some work where we show that if you want to control hallucination in dialogue models, one way to do that is to make them use a retriever, and then you can ground what the models say in what they have retrieved, which gives you some interpretability, because now you know why the model is saying this -- it's because it found this particular relevant passage in Wikipedia -- and you can also hopefully have a bit more trust in what it's going to say.

Where we might be going -- it's still a bit unclear -- but I think a good vision would be to have a very lightweight set of readers. What happens in a company like Facebook is you have all of these different teams, and they all have different requirements for the problem they need to solve, so they're going to be training their own classifiers on top of something. These lightweight reader models could be retrieving from this giant index, which is very sophisticated and very high quality. This index, you can just keep adding information to, and then every once in a while, you're going to have to maybe retrain your whole index. You do that maybe once a week or once a month, and your lightweight readers, you can retrain them whenever you want, because they're lightweight. So shifting a lot of the representational power to the index, I think, makes a lot of sense.

Chris Potts:And is Hugging Face going to support these models going forward more fully?

Douwe Kiela:Hopefully. Yeah. I mean, that's definitely on the roadmap. I'm very interested in this stuff, and we also have a bunch of world-class people there, like Nils Reimers, who did the Sentence-BERT paper.

Chris Potts:We have a brand-new homework that involves neural IR thanks to Omar Khattab's work. And it's such a thrill to download BERT from Hugging Face and then get to use it and see what it can do. I get the same thrill when I download Omar's ColBERT parameters, index some data, and then I can search into it. And the results are vastly better than you would get from a traditional search model. And I feel like, oh, I want Hugging Face to help everyone feel empowered in that same way with these indices, a lot of which are underlyingly Transformer models, but serve a very different purpose and bring a new kind of delight to me when I'm putting systems together.
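ColBERT's late-interaction indexing has its own tooling, but to give a feel for the basic index-then-search workflow, here is a hedged sketch of the simpler single-vector flavor using the sentence-transformers library that Nils Reimers maintains. The model name is one of that library's pretrained checkpoints, and the documents are toy examples.

```python
# Toy semantic search: embed documents once, then rank them against a query.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")

docs = [
    "ColBERT scores queries against documents with late interaction over token embeddings.",
    "BM25 is a classic lexical ranking function based on term frequencies.",
    "The Palo Alto office is near California Avenue.",
]
doc_embeddings = model.encode(docs, convert_to_tensor=True)  # the "index"

query = "How does neural retrieval differ from lexical search?"
query_embedding = model.encode(query, convert_to_tensor=True)

hits = util.semantic_search(query_embedding, doc_embeddings, top_k=2)[0]
for hit in hits:
    print(round(hit["score"], 3), docs[hit["corpus_id"]])
```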

Douwe Kiela:Omar is in this call, right? I mean, I would encourage them to submit a PR to make this possible for Hugging Face.

Chris Potts:That is the right answer from a company focussed on open source! That's right!

Maybe if we have time, we could return to how exactly you get involved in the open source effort, because it could feel a little bit daunting, but we have a bunch of questions. Let's see. Dhara, do you want to ask this other question from Gabe, which I will allow on the grounds that Douwe is a card-carrying philosopher. I mentioned his degree from Utrecht, so he's credentialed to answer this question.

Dhara Yu:Yeah. Another super fun question from Gabe. Language models and consciousness have been all up in the news recently, so the question is, "What are your odds that current large-language models have some form of consciousness?"

Douwe Kiela:Yeah, so that's an interesting question. I mean, as a philosopher, my answer is that consciousness is not defined well enough to answer that question. I also think that the debate there, when the tweet went out, about them being slightly conscious, I think it was a bit silly. But, from a philosophical perspective, if you ask people like David Chalmers, who's a world-famous philosopher at NYU, he would tell you that consciousness is a spectrum and that things like snails also have some form of consciousness. And so snails are slightly conscious. And if they are, then maybe language models are too.

Chris Potts:I saw a nice retort to this, which is that maybe a language model is slightly conscious the way a field of wheat is slightly bread. It feels like a category mistake.

Douwe Kiela:But I think I disagree with that. So a field of wheat is also slightly conscious, if consciousness is a spectrum.

Chris Potts:Oh no, but I said slightly bread. I don't know about the consciousness question. It might be more plausible that it's slightly conscious than slightly bread.

Douwe Kiela:Ha! So that's the real philosophical question! I think consciousness is an interesting thing. To give a bit more of a serious answer to this, I think one of the things that I find very interesting about language models is that we, as humans, are very confused about them. The reason we're confused, I think, is because we have a tendency to anthropomorphize everything. As the philosopher Daniel Dennett calls this: taking an intentional stance. So we do this to lots of different things like your microwave or your vacuum cleaner, you can give them a name and you're ascribing intentions to this device. This is a coping or survival mechanism, I think, for humans.

Dennett thinks that we're also taking an intentional stance towards ourselves and that this intentional stance, or as Douglas Hofstadter calls it, a strange loop -- that this is actually what consciousness is. And that's why we're so confused by language models because they're displaying a behavior that is quintessentially human, they're speaking language. So by necessity, it has to have some consciousness or there's some intention going on there. If you actually look at what the language model is doing, it's just predicting the next word.

Chris Potts:I can tell now that you could have been a philosopher and it's there in your background at Utrecht. I guess you were doing philosophy and maybe dreaming of becoming a philosopher.

Douwe Kiela:Yeah.

Chris Potts:What happened?

Douwe Kiela:I went to NYU actually to do a semester. This was when I was already a logic master student. I went to NYU to work with David Chalmers and a bunch of folks there. It was an amazing time. They had this seminar on the philosophy of language and the mind, which was really great. But that's also where I realized that just pure philosophy is probably not very interesting unless you can connect it to the real world.

One of the problems I've seen also with the philosophical debate around GPT-3, for example, has been just very misinformed where even people like David Chalmers don't really understand what they're talking about.

Chris Potts:And relatedly, you told me that you were almost a Symbolic Systems student. What's the story there?

Douwe Kiela:Yeah. After my master's in logic -- that's also actually a funny story, how I ended up in the Masters of Logic. I met someone from Amsterdam at a pub quiz in Hanoi, in Vietnam. He had an "I love Amsterdam" T-shirt. I wasn't sure what I wanted to study for my Masters degree. So I started talking to this guy, and he was there to recruit students from Vietnam to come to Amsterdam to do this Masters. And he told me, "This is the most difficult Masters in the world." So then I said, "Okay, I'll sign up for that," so that's how I ended up doing the Masters of Logic.

After that, I wasn't really sure what I wanted to do again, but I thought, "Okay, if I'm a philosopher who can do logic, that's probably not the best thing. I need to get a real degree in something." So I applied to Stanford for Symbolic Systems and Cambridge for Advanced Computer Science, which sounds very fancy, especially considering that the PhD that comes after it is normal computer science, so it's no longer advanced.

What happened is, I got into both, but Stanford wanted to have a ton of money and Cambridge was going to pay me to come, which I didn't know was possible for a Masters degree, so they were funding my tuition, and housing, and things like that on the grant. I didn't realize, I think, at the time that it would maybe also have been possible for me to get some grant money at Stanford. I was just a very Dutch stingy person, so I went to the place that would pay me.

Chris Potts:That's a bit awkward for Stanford. As for Amsterdam: when I was a grad student, people used to say that the Netherlands has more logicians per capita than anywhere else in the world. And the ILLC is incredibly vibrant. The people there have done really incredible trail-blazing things, mostly grounded in formal logic. But I think, thanks to Johan van Benthem, who's partly here at Stanford, it was always with an eye toward applications and that made it feel especially exciting, because it wasn't just an exercise, but rather an exercise that became a tool that you could trust to do all sorts of really interesting things. What's the scene like there now? Do you know?

Douwe Kiela:It's still very active in logic and all of the related fields, so there's logic and information, and logic and language, and these different groups. And they're all talking to each other. And I think what makes that place special is also what makes Symbolic Systems special: it's highly interdisciplinary. And I really think that -- and I'm trying to apply this in the way I hire at Hugging Face, for example -- that interdisciplinarity is super important for making people well-rounded researchers. So I would much rather have someone with a non-traditional background who maybe did a bit of philosophy and some weird stuff and then became a machine learner, rather than someone who's just been studying machine learning from their undergrad onwards.

A lot of the famous logicians have now retired, and there's been a bit of a gap after they left, but they also have a good NLP group, so I think their NLP group is probably now bigger than the formal logicians there, but there's still a lot of interaction and so there's people working on social choice and pure computational theory and pure mathematics and all of that. So it's a special place.

Chris Potts:The younger generation there though, people like Robert van Rooij, have made really striking contributions to our understanding of goal-oriented dialogue and what we're doing when we ask questions. And I feel like in NLP, when we finally get serious about dialogue, it's going to be in part because we finally figured out how to make some connections with that formal work.

Douwe Kiela:Yep.

Chris Potts:We'll see.

Douwe Kiela:Yeah, I agree. I mean, that's a general observation, I think, that at some point in the future, in our field we are going to go back to all of that stuff and realize how much we've missed.

Chris Potts:Exactly. And this is a good chance to talk about this new SymSys affiliation that you have that we're all very excited about. Dhara, do you want to lead that off?

Dhara Yu:Sure. Yeah. One question that I think, as a former SymSys student, I'm very curious about, is to hear your plans for engaging with, I guess, what I might call the other cognitive disciplines of SymSys beyond just the computer science portion. And you've spoken a little bit to your philosophy background, so there's clearly an angle there, but for some of our other programs, how do you see your interaction manifesting?

Douwe Kiela:Yeah, that's a great question. I don't really know the answer to that. I think I'm going to learn by doing. But yeah, as I said, I think interdisciplinarity is really important. So I'm very interested in not just doing pure boring machine learning stuff, but trying to really think about what comes next. And I think a lot of the stuff that comes next is exactly coming from these interdisciplinary approaches. Does that answer your question?

Dhara Yu:Yeah. I mean, there's definitely a lot of unknowns and you've just started in your role. We have a question from a student that might be an interesting segue to talk more about the linguistics angle of the SymSys program. And the question, which is from Adolfo, is, "Could multimodal language models be able to provide evidence for, or against, this classic poverty of stimulus argument in cognitive science and linguistics?"

Douwe Kiela:So what's the poverty of stimulus thing? Just remind me, so I'm not saying something stupid.

Chris Potts:I would say that it's this general view that you find a lot in linguistics, and I think elsewhere in cognitive science, that there are some aspects of human behavior, especially linguistic behavior, that cannot possibly have been learned from experience because experience always underdetermines them. And therefore they must be innate in some sense.

Douwe Kiela:So it's the Chomskyan question!

Chris Potts:It is the key aspect of the Chomsky program. That's right. And the more evidence you could get for the so-called poverty of stimulus relative to what things people actually do, the more evidence you would have that language is a unique, innate capability of humans.

Douwe Kiela:Yeah. To go back to that question and multimodality -- I think I've been interested throughout my career in this idea of meaning in machines. And I think if you want to get to meaning in machines, you need to have multiple agents grounded in some environment that is shared between them. Not just two agents, but they should also be interacting with lots of other agents. And then they have compute constraints and then they have an evolutionary prior. And this evolutionary prior for humans, we've had, I don't know how many millions of years to develop this evolutionary prior. This is what you could consider this sort of linguistic innateness, so I am maybe less convinced than some Chomskyans that the innateness there is syntactic. A lot of the innateness might just come from how we organize concepts. So I would argue that it's not syntax, but semantics that is innate there, but I guess that's a very long debate.

Chris Potts:One example that Adolfo might have in mind is kind of like this: you say, we observe as linguists that across all the world's languages, there are only a few strategies that are employed for forming questions, that they obey a few general syntactic principles. And the evidence that we get as children always underdetermines those strategies, and children never find really unusual ones that are outside of this kind of constrained space. And therefore they must have been born in a state that primed them to consider all and only those hypotheses, and then the signal that helped them figure out which one their language employs.

In that context, though, when you look at what BERT and models like that have induced about syntax with no guidance of that sort, purely from distributional data, no supervision about syntactic structures, your confidence in this Chomskyan reasoning starts to waver, because it looks like a lot of that stuff was induced by this artifact, from the symbols it was trained on, and now the poverty of stimulus is much harder to push through as an argument.

Douwe Kiela:So I would like to see a formal, information-theoretical proof that the stimulus is indeed poor. What you just said about distributional semantics is very true, and that's just a single model. What happens with real humans is that, if you think of models as agents, they're all interacting with lots of other agents and they're getting lots of feedback from this. If we were to take a similar approach, and I think this is actually already happening, where we're learning from human feedback and things like that, then yeah, we're going to have even more of this syntactic information just emerge.

Chris Potts:It's interesting that you keep emphasizing the interactional component, because I would say the multimodal part for these models we're developing is really important, but maybe even more important is interaction, with maybe some notion of reward that's going to guide it: was it a good interaction? Was it a bad one? That might actually be really essential.

Douwe Kiela:I think it's both of those things. If you have the interaction in a shared world with lots of different agents, then I think that's how we do it. And we can easily do this with machines.

I have some weird quirky papers where we do this with a translation game that also has images. You have your input in German and then you translate to English and then the goal for the listener agent is to reproduce it in French. What you can do with this communication channel is you can ground it in images, because it's a multimodal machine translation data set. And if you do the grounding in images, you actually make sure that the communication drifts less. That's what you want to have. So, if you just have two agents talking to each other and you have to guess what the mapping is of the symbols, it's very inefficient, but if it's grounded in some observation that's shared, then it becomes much more efficient.

Chris Potts:Oh, interesting. Cool. Cool.

Dhara Yu:I think that's an interesting segue to another great question we have from Sylvia. The question is, "Do you think that logic and symbolic representations in general are important for NLP and ML?" And the follow-up question is, "How can we balance the massive amounts of data that we get versus the inherently costly process of building symbolic representations?"

Douwe Kiela:Yeah, I don't know. It is definitely important, but I guess the underlying question is: when you think about neuro-symbolic AI, where does the interaction happen? People used to think it was possible -- back in the early 1990s, even, you have these amazing connectionist logic networks where you can design your neural net to do certain kinds of logic. It's really brilliant work, but that clearly doesn't scale, I think. And I don't think we've found the right alternative way of implementing logic in a neural net, even though that is something humans can do. If you think about the System 1 / System 2 distinction from Daniel Kahneman in Thinking, Fast and Slow, I think neural nets are always about thinking fast, and that's why they have a lot of these biases and things like that, which are very similar to human heuristics. But we as humans are able to override those System 1 responses, and neural nets can't do that yet. I don't really know what it's going to look like, but this is definitely something we still have to try to tackle. And I think when you look at people like Gary Marcus, this is essentially what he's saying: that we need to improve the symbolic reasoning in these systems. But how -- that's the million-dollar question.

Chris Potts:Let me bring some of these themes together, Douwe, because we've talked about SymSys, and this course, where all the students are going to develop original projects, and you've talked about how you're guiding projects at Hugging Face. And so I'm curious if you could just share general thoughts or advice for students who are trying to develop a project and hoping to be ambitious, but also wanting it to converge, in the context of Symbolic Systems advising you might do, or in the context of this course, or if they want to go on to publish a paper at ACL. What might you say to help them out?

Douwe Kiela:Yeah. So, first of all, if you are looking for a project, I have a long list of very cool ideas. And I'm willing also to share that if there's a way to do that with everyone, so people can take a look, because I'm generally, again, quite open with my ideas. I think ideas are cheap, but the execution is really what counts. And that's also the advice I would give students: try to be very ambitious. When you're at a university like this one, you can also afford to take some risk, but the way you take that risk is by taking a calculated risk where you have a fallback plan. So you can really try to frame a very ambitious hypothesis, but you need to make sure that even if you can't really answer that question, you can still fall back to an interesting insight -- some new knowledge that didn't exist yet in the world.

This could very easily be something descriptive, like if I compare this model to this model on this data set, what happens? Or things like that. And you could try to do those comparisons to answer a bigger question, but even if you can't answer the bigger question in full, then you still have something that you can hopefully publish somewhere, that you can talk about in your job talk or your interviews, and so that's what you should be aiming for.

Chris Potts:I've seen the list that you're referring to and it is very impressive. Could we calibrate a little bit? What is the safest project on the list in your view?

Douwe Kiela:Well, I wasn't prepared for this question.

Chris Potts:I don't have it in front of me, but I thought you might have a sense of like, "This one," if we turn some smart people loose on it.

Douwe Kiela:I can give an example of something that I think is interesting. So: few-shot learning and zero-shot learning are very interesting. This is in-context learning, this is what GPT-3 does, and a lot of people are prompt hacking and things like that -- this is all the new hype.

Chris Potts:We're about to do that in this class for a homework.

Douwe Kiela:There's a great paper by my PhD student, Ethan Perez, called True Few-Shot Learning, where we showed that a lot of these few-shot models are actually not few-shot at all, because they're tuned on a giant dev set. If I have one model that we do few-shot with, and you're going to tune your prompts on 25,000 examples anyway, then I think it makes a lot more sense to just train it on those 25,000 examples, and I can guarantee you that it will be better.

Maybe the more interesting application of this few-shot stuff is to do early stopping based on few-shot performance, and then fine-tune on the actual data. So this is a question about how you pre-train models to be optimal for your downstream task, and that connection I think is very interesting.

The paper title would then be "Zero-Shot Stopping", or whatever the term for the thing is. And then you would do some analysis to see if that actually works. The list I mentioned is basically just paper titles, and this is how I like to come up with ideas -- what would the paper title be for this idea? If you can capture your idea in a paper title, then it is probably an interesting idea.
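
To make the early-stopping idea above concrete, here is a minimal sketch, assuming you have a pretraining data stream, a small downstream dev set, and a prompt-based few-shot evaluation; `pretrain_step` and `few_shot_eval` are hypothetical callables standing in for whatever your setup provides:

```python
def pretrain_with_zero_shot_stopping(model, pretrain_batches, pretrain_step,
                                     few_shot_eval, eval_every=1000, patience=3):
    """Run pretraining, but stop early based on few-/zero-shot downstream performance.

    pretrain_step(model, batch) performs one pretraining update (e.g. a masked LM step);
    few_shot_eval(model) returns prompt-based accuracy on a small downstream dev set.
    """
    best_score, evals_since_best = 0.0, 0
    for step, batch in enumerate(pretrain_batches):
        pretrain_step(model, batch)
        if step > 0 and step % eval_every == 0:
            score = few_shot_eval(model)
            if score > best_score:
                best_score, evals_since_best = score, 0
            else:
                evals_since_best += 1
            if evals_since_best >= patience:   # downstream signal has plateaued
                break                          # stop pretraining here
    return model

# The early-stopped checkpoint would then be fine-tuned on the actual task data as usual.
```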

Chris Potts:Oh, cool.

Douwe Kiela:And if you find yourself wavering and writing a very long abstract, then you probably need to go back and make it more crisp.

Chris Potts:So that was an example of a project where you feel with a concerted effort, by a bunch of smart, invested people, something worth reporting is going to come out of it.

Douwe Kiela:Yeah.

Chris Potts:What's at the other end of the spectrum? What's one where someone could spend a year of their lives and at the end just have discovered that they shouldn't have done that?

Douwe Kiela:So maybe...

Chris Potts:It doesn't need to be that bad, but like...

Douwe Kiela:Yeah. So on my list, I think there are some interesting ideas about multimodality. I think you want a multimodal encoder. If you think about a big foundation model, some parts of it are unimodal and some parts are multimodal. I think you only want the multimodal encoder to be good at the things where the unimodal encoders are not good. One way to do this, if you have unimodal loss terms and multimodal loss terms, is to multiply the unimodal ones with the multimodal ones. This is called a product of experts; it's from a Hinton paper from around 2000 or something. It really forces your model to only learn to be good at things in the multimodal case where it can't solve them unimodally. This sounds like a very easy idea, very easy to try, but there are a lot of questions about the architecture and things like that, so this could easily be a year-long effort.
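
One common reading of the product-of-experts idea is to multiply the experts' output distributions (equivalently, add their log-probabilities) and train on the combined prediction, rather than literally multiplying loss terms. A minimal sketch under that reading, assuming each expert already produces logits over the same label set:

```python
import torch
import torch.nn.functional as F

def product_of_experts_loss(text_logits, image_logits, multimodal_logits, labels):
    """Combine unimodal and multimodal experts as a product of experts and
    compute cross-entropy against the labels. The multimodal expert only gets
    credit for information the unimodal experts cannot supply on their own."""
    combined = (
        F.log_softmax(text_logits, dim=-1)
        + F.log_softmax(image_logits, dim=-1)
        + F.log_softmax(multimodal_logits, dim=-1)
    )
    # cross_entropy renormalizes its input with a softmax, which turns the
    # unnormalized sum of log-probabilities into the normalized product of experts.
    return F.cross_entropy(combined, labels)
```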

Chris Potts:It does sound ambitious. That's cool. Dhara, do you want to take us into this area of emergent communication?

Dhara Yu:Yeah. So this is switching gears a little bit here, but one topic that we don't really cover too much in this course, but nonetheless I think is relevant, is this idea of emergent communication, which, for those of you who haven't heard of it, is basically this idea of training agents to develop a language protocol sort of from scratch, from the ground up without any sort of supervision in the form of natural language.

I know, Douwe, that you have done some work in this area, so I'm curious to maybe get your philosophy of this field of work. Do you see it as a complementary or a competing thread to this large language model stuff? So yeah, maybe I'll just leave it there.

Douwe Kiela:Yeah. I don't think they're competing at all. I think they're perfectly complementary. What I would like to see is large language models talking to each other. If you do that, then you get a multi-agent setup, and you could call that emergent communication if you train it from scratch. In practice, you would probably pre-train them first, and then you would give them interaction with each other. This goes back to the point we just talked about, where a lot of meaning in natural language is derived from interactions. If you can initialize with pre-training and then have interactions between models, and maybe also some humans, then that would really be the way forward.

You can think about the Turing test and things like that, where you have models just talk to humans and see if you can tell the difference. And I think if we go to a future where we have lots of models in the loop with humans, interacting with each other and learning from each other, that would be the next big step in NLP.

Dhara Yu:So I guess maybe as a quick follow-up to that: what do you see as some of the biggest challenges or impediments to achieving that grounded multi-agent setting? Because emergent communication, I think, is a great idea, but it hasn't necessarily been shown to develop very natural, language-like protocols. So maybe taking that into account, what are some roadblocks?

Douwe Kiela:But that's not very surprising. Depending on how you set up your communication channel, I don't think anyone should expect natural language to just emerge naturally. If you just have two systems doing stochastic gradient descent over a Gumbel-softmax communication channel, and you put constraints on the communication channel, you're going to end up at some information-theoretic optimum, like a Huffman code, and so obviously that's not very natural -- I don't think that's surprising at all. The way you would get to naturalness is either to pre-train on natural language or to also put models in the loop that were trained on some natural language, including humans, I guess. And then if you do that and you train everything together, then you will get something that might look like natural language, but even that is not really clear.
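
For concreteness, here is a minimal sketch of the kind of Gumbel-softmax channel being described: a sender maps an observation to a single discrete symbol, a receiver decodes it, and the whole pipeline trains end to end with gradient descent. Vocabulary size, dimensions, and the task are all illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, HIDDEN, N_CLASSES = 16, 32, 10

sender = nn.Linear(HIDDEN, VOCAB)                              # observation -> symbol logits
receiver = nn.Sequential(nn.Linear(VOCAB, HIDDEN), nn.ReLU(),
                         nn.Linear(HIDDEN, N_CLASSES))         # symbol -> prediction

def communication_loss(observation, labels, tau=1.0):
    symbol_logits = sender(observation)
    # Differentiable sample of a (near) one-hot symbol; hard=True keeps the
    # forward pass discrete while letting gradients flow through the soft sample.
    symbol = F.gumbel_softmax(symbol_logits, tau=tau, hard=True)
    prediction = receiver(symbol)
    return F.cross_entropy(prediction, labels)

# Under channel constraints like these, SGD tends toward an information-theoretically
# efficient code (Huffman-like), not anything resembling natural language.
```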

Dhara Yu:I see. So the idea is sort of using the foundational kind of principle of emergent communication, but not necessarily the optimization techniques that are currently employed?

Douwe Kiela:Yeah. My frustration, and why I sort of stopped working on this line of work, is that I think, if you want to do pure emergent communication, then it really just becomes an optimization question. The way to solve those problems is through information theory. And I think a lot of the stuff that people have been writing papers about is very well known in information theory, if you just pick up any textbook. And again, maybe some of this is the anthropomorphizing that we do as linguists: when we look at some of the communication protocols that emerge, we go, "Oh, it's amazing, it does this and that," but I think there are also just very good information-theoretic explanations for why this sort of stuff emerges.

But I still think this is definitely the way forward, probably in a way where you inject natural language and other evolutionary priors in some way, like our understanding of reality. Brenden Lake's paper about building machines that learn and think like people gets at this -- our intuitive psychology and things like that also play a big role in how we treat each other in the world.

Dhara Yu:Right. Yeah. It seems like everything just comes back down to multimodality in some sense.

Douwe Kiela:Multimodality and multiagent, yeah.

Chris Potts:And interaction. Right.

Douwe Kiela:Yeah.

Chris Potts:Very clear theme. That's cool. For the group: I have one more serious question to ask Douwe, and then a few that are just about him. And if you have questions that are about Douwe in particular, feel free to get them into the queue.

My one serious question is one that I'm going to, again, ask everyone I'm doing this series with. It seems to me that, throughout AI now, we constantly tell ourselves and the world how successful we are. And I suppose that's good for morale, and it does generate excitement, but if you had to convince a smart but non-technical skeptic that we were as amazing as we claim to be, what would you use as evidence, specifically for NLP?

Douwe Kiela:Yeah, so I don't know if I'm the right person to ask that question because I've been accused of under-hyping.

Chris Potts:Under-hyping!

Douwe Kiela:Yeah. So, I still think that we have a lot to do, and we need to be a lot more careful with deploying these systems than we currently are, especially when it comes to fairness, and bias, and robustness, and all this stuff, where we have a very poor understanding of what these models are actually doing -- even if on the surface they look to be doing the right thing, they might be doing the right thing for the wrong reasons. But if I were to overhype our field... I mean, specifically in NLP it's a bit different, but I've been very impressed, for example, by DALL-E 2. I think some of the stuff that it generates is really incredible. So I think that's interesting.

One of the projects I'm interested in, by the way, is trying to see if you can do some of these diffusion methods for NLP, which is a bit different because you're in a discrete space. So if anyone is interested in that, reach out. In NLP specifically, I think I would tell people to go to Dynabench.org and talk to the QA model. SQuAD, which I think a lot of people here know about -- I really think that SQuAD-like tasks have basically been solved on Wikipedia passages. You really need to come up with very contrived, not very interesting examples if you want to fool the model there. In a way that's great, because we've solved this, but if you talk to the VQA model on Dynabench, you'll see that this doesn't extend to other tasks at all. So we're really very good at this one particular thing, maybe because it's a little bit too easy, and we're still very far from general things.

Chris Potts:It is so striking to me that we all think we're utilitarian, trying to contribute to the nuts and bolts of the technological parts of our society, but all the really compelling examples are essentially about helping people with creative self-expression, where there's no quantitative evaluation and cherry-picking might be part of the point, because maybe an artist is curating and tweaking, and the human component is very real there.

I'm not sure whether you saw this, but one of the pieces in the 2022 Best American Essays anthology was written in partnership between a human and GPT-3. And the hook of the essay is that this person was having a hard time writing about the death of someone in their family, and they turned to the model to help them get past those obstacles, and the result is really compelling. And I share your intuition about the DALL-E pictures: they just look original, and fresh, and interesting, especially when they are curated.

Douwe Kiela:Yeah, that's beautiful. It reminds me of, I Am A Strange Loop, actually, the Douglas Hofstadter book where he talks to his dead wife through AI in a way.

Chris Potts:But for this hard-nosed skeptic, who's thinking about like, "Well, this other field cured COVID, and this other field launched a huge space telescope, and you're telling me that you can create stock images or help people write creative essays, and that's why you're congratulating yourselves all the time? Okay."

Douwe Kiela:Yeah. But there are also less interesting things. I mean, it depends on where you draw the line between artificial intelligence and machine learning, but I think machine learning has seen tremendous progress. It's used in situations that we are so used to that we don't even see them anymore, like your TikTok ranking algorithm, and your Facebook hate-speech detector, and that sort of stuff. This is something that just improves Facebook's bottom line. So a lot of these companies are very, very rich because mundane machine learning works very well, which means that you don't need to hire people to do all this stuff. You can just do it with machines.

Chris Potts:Right. I guess I did focus on NLP, and that's totally fair about ML. And then for the cases that you mentioned -- maybe you disagree, but I always feel like they cut both ways, because they didn't solve the problem. In fact, we know that they disadvantaged a lot of people, even as they were potentially getting some incremental gain in some direction, and sometimes that gain was actually exploitative of people. So it's not the cure for COVID that we often tell ourselves we've achieved, in my analogy. Yeah.

Douwe Kiela:Yeah. Yeah. But I think one of the problems there is that we don't know how to define the cure for COVID. There used to be the Turing test, or Winograd schemas, things like that. We don't really have an equivalent of that now, I think. Maybe something like a very useful assistant, like in the movie Her, which I think a lot of companies are actually trying to build now -- which is kind of creepy, honestly.

Chris Potts:Relatedly, on this point of creative self-expression, a question about you, Douwe, in particular: would you read an autobiography that was written by a large language model? And if you would, what could it teach you?

Douwe Kiela:Ha! Yeah, so I don't like reading autobiographies. I like reading biographies -- somebody else's take on a life -- which I think is more interesting than somebody making themselves look as good as possible. But yeah, I don't know. I think if you cherry-pick all of the outputs of these models, then it could be interesting, but I think we're still quite far away from having GPT-3-style models write a book that everybody would want to read.

Chris Potts:So you would just learn a bunch of boring stuff about the model.

Douwe Kiela:You still have to fix the hallucination problem first. So even if it was an autobiography and it was able to talk about itself, I don't know if I would trust it.

Chris Potts:But you could say that of people too. I feel like your answer is cutting both ways for people and for models all the time.

Douwe Kiela:Yeah. Yeah. That's why I don't like reading autobiographies.

Chris Potts:Here's a related question. Do you like video games or text adventure games?

Douwe Kiela:Mm-hmm (affirmative).

Chris Potts:Have you tried AI Dungeon?

Douwe Kiela:When it came out, yeah. It looked cool. I hear it's gotten a lot better now, so maybe I should try it again.

Chris Potts:I played it pretty recently -- well, not super recently -- but I felt kind of bored by it, because I didn't really feel like there was a world I was exploring. It felt like someone was changing the rules constantly as a result of my interactions, and I never trusted that there was a world model. If it were a pure foundation model that had constructed a world for me to explore, I would find that incredibly compelling, but since I felt like there was no coherent world model, I kind of felt hopeless as I was exploring it. It was like there was no creative intelligence there that I was uncovering.

Douwe Kiela:It's interesting you say that. So we did this project called LIGHT, which was a text-based game where you can do dialogue and there actually is a grounded reality where you're in this sort of dungeon and there's like a sword and whatever, and it's in this particular room. And I think if you have that ground truth, then it becomes much more interesting, but it also immediately becomes a lot harder to scale.

So yeah, the problem we had there is that we tried to hire someone to generate some stories for us, which we would then use to get annotators to have conversations in that world, and just having this one writer write this one story was already super expensive, and in the end it didn't work at all.

But this would be interesting. I think if we can do game design with AI -- and there are some people working on this, like Mark Riedl at Georgia Tech, who is doing some cool stuff there, and Julian Togelius at NYU -- then yeah, there is potential there.

Chris Potts:So unless there are student questions, I want to move, I think, to my final question for Douwe. I want to end on a positive note. Maybe we didn't convince my imagined skeptic from before, but I do feel we're in a moment where we're empowered in ways that we never were before. And we can try experiments that were unthinkable even 10 years ago. And part of that, frankly, is the role that Hugging Face has played in democratizing a lot of these models and helping everyone see code, and learn from code, and contribute to code. And so just, by way of wrapping up here, do you have tips for people -- students especially -- who might want to get involved with this incredible open source effort that's pushing us all forward?

Douwe Kiela:Yeah, so I mean, there's a boring answer there, which is: if you want to get started with this stuff but you don't know how, go to the transformers repo on GitHub. There are specific tags called "good first issue" and "good second issue", and those are just small things to get you started working on a code base like that. You can just start with a couple of those, and they're very easy to do, I think.

Another good thing to try to do is to add data sets or models to those repositories, because by doing that, you're enabling other people's research. So I think it's just a noble thing to do and to make everything accessible.

And a more high-level answer, I think, is that as a student, you should always be thinking about openness. And as a scientist in the future, I would encourage you to embrace openness and not go the closed way, where maybe you can make more money, but where you will be hampering scientific progress by keeping knowledge for yourself. The scientific endeavor is successful if we are all open about the progress that we're making and also open about the limitations in the work. So stay open.

Chris Potts:Stay open. I love it. Yes. And with luck, some of the work for this course that students produce will lead to some PRs to Hugging Face repos. That'd be wonderful. Well, thank you so much, Douwe. This was really wonderful.

Douwe Kiela:Thanks for having me.

Chris Potts:And thanks to everyone who participated! Yeah, this was great!