My First CVPR - Christopher Manning, Professor & Director @ Stanford AI Lab
Interview with Professor and Director of Stanford AI Lab, Christopher Manning during CVPR 2019. Manning goes over his journey in deep learning and shares his perspective on the future of NLP.
Dr. Christopher Manning is the Thomas M. Siebel Professor in Machine Learning in the Departments of Computer Science and Linguistics at Stanford University. He is also the Director of the Stanford Artificial Intelligence Lab and founder of the Stanford NLP group. He is a leader in applying Deep Learning to Natural Language Processing, including exploring Tree Recursive Neural Networks, neural network dependency parsing, the GloVe model of word vectors, neural machine translation, question answering, and deep language understanding. His research focuses on exploring software that can intelligently process, understand, and generate human language material.
During the CVPR 2019 conference, Professor Manning was an invited speaker at the workshop “Visual Question Answering and Dialog”. A paper he advised, “GQA, a new dataset for compositional question answering over real-world images”, was also accepted at CVPR 2019. This episode is a live recording of our interview with Professor Manning at the conference. He shared inspiring thoughts on the research trends and challenges in computer vision and natural language processing and the progression to commercialization.
Full Interview Transcript:
Host: Chris, what a pleasure to have you with us here at CVPR! How's your day been?
Chris Manning: Thank you. I've had a great day attending the visual question answering workshop.
Host: Excellent. How was the response from the audience?
Chris Manning: I think that's a really strong community in that area. And so there are lots of interesting talks and lots of interaction. And so it's been a fun community to be part of.
Host: Great, a lot of learning, I'm sure as well.
Chris Manning: Sure.
Host: Great. So we're going to run through a number of questions here. You're a very accomplished scientist in NLP and machine learning. You started as a computational linguistics researcher. Can you briefly walk us through your journey towards deep learning?
Chris Manning: Sure. So I'm old enough that I was around and saw a little bit of the second time that neural networks gained popularity, the parallel distributed processing or connectionist era of the late 80s - early 90s. And I guess at that point, while I was a grad student at Stanford, David Rumelhart was at Stanford and I took his neural networks course. And so I sort of saw a little bit then, but it's not really something that became my research area.
So in terms of what happened in the late 2000s, going into the early 2010s, really the proximate cause of my getting into deep learning research was that, at that point, my office was next door to Andrew Ng's, and Andrew became really enthusiastic about this being the way to make progress in general cognition. And I guess I got caught up in his excitement.
Host: And what did you see at that time as being the biggest challenge from a research community perspective?
Chris Manning: So I'm not sure if this was the big challenge from the research community's perspective. But very clearly, for me, the starting-off point was remembering, going back to the 1980s, that there had been a lot of controversy about how effective neural networks were as a model for human language. And one of the big things that seemed to be missing was that human languages have a compositional structure where words go into phrases, which go into clauses and sentences that can embed in bigger sentences. And the kind of flat neural network architectures of the 80s, with fully connected layers, which indeed started to be used again around the late 2000s, didn't seem to give any way of doing a good job of modeling this hierarchical, recursive structure of human language. So the thing I was initially really enthusiastic about was how to make progress on that. And so for the work that was done at Stanford from around 2009 to 2013, a lot of which was done with Richard Socher, really the dominant idea was: how can we start building tree-structured recursive neural networks and explore those ideas?
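[Editor's illustration] The idea of composing words into phrases and phrases into sentences can be sketched in a few lines. This is only a minimal sketch of a generic tree-structured recursive network, not the models built at Stanford: the dimension `DIM`, the random word vectors, the single shared weight matrix `W`, and the example tree are all assumptions chosen for illustration.

```python
import math
import random

DIM = 4  # hypothetical embedding size, chosen only for illustration


def compose(left, right, weights, bias):
    """Combine two child vectors into one parent vector:
    parent = tanh(W [left; right] + b)."""
    children = left + right  # list concatenation, length 2 * DIM
    return [
        math.tanh(sum(w * x for w, x in zip(row, children)) + b)
        for row, b in zip(weights, bias)
    ]


def encode(tree, embeddings, weights, bias):
    """Encode a binary tree bottom-up. A leaf is a word (str);
    an internal node is a (left, right) pair of subtrees."""
    if isinstance(tree, str):
        return embeddings[tree]
    left, right = tree
    return compose(
        encode(left, embeddings, weights, bias),
        encode(right, embeddings, weights, bias),
        weights,
        bias,
    )


rng = random.Random(0)
vocab = ["the", "cat", "sat", "on", "mat"]
# Random, untrained parameters: the point is the data flow, not the values.
embeddings = {w: [rng.uniform(-1, 1) for _ in range(DIM)] for w in vocab}
W = [[rng.uniform(-0.5, 0.5) for _ in range(2 * DIM)] for _ in range(DIM)]
b = [0.0] * DIM

# ((the cat) (sat (on (the mat)))): words -> phrases -> sentence
tree = (("the", "cat"), ("sat", ("on", ("the", "mat"))))
sentence_vector = encode(tree, embeddings, W, b)
print(sentence_vector)
```

Because the same `compose` function is applied at every node, the bracketing of the sentence determines the result, which is the property a flat fully connected layer lacks.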
Host: And at the same time, as we looked at how industry was developing, and we were starting to get introduced to Alexa and so forth, what did you think about that transition from what was happening in the academic world and what was being reflected in the consumer world?
Chris Manning: So certainly something that's having a huge impact on NLP and NLP thinking has been the rise of these dialogue agents, things like Alexa and Siri. I mean, in the first instance, that didn't really have anything much to do with deep learning and NLP because, essentially, all the work was these very hand-scripted dialogue agents. So really, that was a reemergence of rule-based NLP, more than anything else. It wasn't even really the machine learning, probabilistic NLP of the kind that I'd been mainly doing from around 1995 till 2010. As the years have gone by, there's been a lot of interest. And I've worked a little myself on looking at how to build neural dialogue agents. And I think that's an interesting area to push further. It's been a hard area. I guess last weekend at Long Beach, there was the ICML machine learning conference, and there was a tutorial by a couple of Microsoft researchers on building conversational dialogue agents. And I mean, one of the central points that they wanted to make is that the reality still isn't that people can just train complete end-to-end neural dialogue agents and expect them to work, that all of the deployed systems are some kind of mixture of machine learning and neural components and lots of stuff that's still quite hand-coded.
Host: Yeah, absolutely. We still have a long way to go. A long way to go. So we look at where you are: you are currently leading the Stanford AI Lab. Given the breadth and depth of AI, can you tell us a bit about your current research focus?
Chris Manning: You mean for the research focus of Stanford AI Lab in general, not me?
Host: Yeah. For the Stanford AI Lab.
Chris Manning: So I'll answer that. The truth is that Stanford AI Lab is a fairly loose structure. It’s just not the kind of place where there's a director whose job is to tell other people what research to do. Really, individual faculty decide for themselves what they want to do. And basically, that's what happens.
But nevertheless, I mean, you can see clear trends. So the influence of deep learning in particular, and machine learning in general, is just pervasive; it's now sort of affecting almost all areas of what goes on. There are a few places where it doesn’t: there's still quite a bit of robotics work done with hand-crafted control systems. But the vast majority of work is machine learning and deep learning. There's been a huge growth in emphasis in areas like NLP and vision, so that I think they're now sort of almost dominant areas of the Stanford AI Lab, where that didn't use to be the case a decade ago.
Something else that's exciting that happened in the last few years is that we've hired a couple of great new roboticists. And so there's a lot of new robotics life around the Stanford AI Lab. And then there are many other areas. So there's also a lot of work in machine learning. Some of it’s applied to particular problems like computational sustainability, some of it’s looking at reinforcement learning and its applications to education, things like that. So lots of different threads of work.
Host: Right. Let's focus on computer vision for a moment. When you think about the developments and the progress and maturity there, what do you still see as being the trends and challenges around computer vision?
Chris Manning: So yeah, there's just been enormous progress in computer vision. I mean, it's a field that has moved in fairly short order from sort of being able to do almost nothing (what computer vision could really do a decade or so ago was put a box around the human face, and nothing much else worked) to now, where there are all sorts of applications of computer vision where it can do lots of useful things. But most of them, in some sense, are still pretty low level. There are enormous commercial opportunities now. A lot of those are in the medical area. So really, for any kind of medical imaging, we can now collect data and build deep learning systems which can do a job as good as or better than human doctors at noticing things in imaging, and that's great.
But the tasks we can do are still very low-level tasks, as soon as it requires more interpretive tasks of what you might call higher level computer vision, the kind of things that people do all the time when they're looking at a scene and understanding what's happening. If someone walks by here, they'll say: Oh, someone's being interviewed; whereas if you put this same scene into a computer vision system today, you'll get two people, flood lights, camera, you'll recognize some objects, and you might be able to get out that the two people are looking at each other, but you're not going to get out sort of an understanding of what's happening. So pushing into sort of actually understanding what's happening at a higher semantic level, I think that's one of the big frontiers of vision. And it's actually a frontier in which NLP and vision get closer to each other, because this gets us into areas of knowledge and understanding the world and interpreting what's going on, and that they become more similar regardless of whether you start with text or images.
Another big area is moving more back into, I guess, 3D vision of understanding the world and what's happening in the world and sort of connecting between just the 2D scenes that we see, and having models of 3D world which allows us to predict and understand what's going to be happening.
Host: I just saw something recently from one of the museums in Italy, where they had a whole 3D interaction from prehistoric times, and it was pretty incredible to see how far we've come. But when we think about the applications in computer vision as well, we look at areas like autonomous driving, and we can see again that the progress that has been made there over the last number of years has been pretty phenomenal, too. So what are your thoughts on that?
Chris Manning: Yeah, so that's obviously been a huge area of progress and an area where people see huge commercial utility. I think, to some extent, we still need to be cautious. On the one hand, there's been a ton of progress and things almost work, and there are the Waymo cars driving around the Bay Area all the time. On the other hand, it's also been a proof of just sort of how many different special cases there are, which human beings are good at interpreting because they have so much knowledge of the world and how it works, and common sense; whereas we still have difficulty getting that same kind of flexibility into autonomous driving systems. They're great at staying between lanes and driving down a road or a highway, but it's not clear that you want to be the flag person waving the flag and directing cars when the autonomous vehicles are coming at you, because they might not interpret things right.
Host: It could be a different story or a different ending. Yeah, absolutely. I mean, we see there's still a long way to go there as well, when we think about level five driving and so forth. But I want to move back to NLP for a moment as well. We had a number of questions from our Robin.ly community when they heard that we were interviewing you today. So I have some more technical, in-depth questions. So as you know, syntactic parsing used to be used in many feature-based methods. Now the trend is that it’s being used less, as work moves from feature engineering to end-to-end systems. So how do you see the future of research in syntactic parsing?
Chris Manning: So that's a good question. In some sense, I can feel a little awkward about it because, yeah, absolutely, what the question says is right. For really nearly all of NLP history, it's been seen as foundational that something you need to be able to do is work out the structure of the sentence as a syntactic parse. And that would be a basis on which to understand and interpret the sentence, and having that would help you do other things like machine translation. And lots of NLP researchers, including me, have spent lots of time working on better ways of parsing sentences. And the truth is that, looking forward, for many tasks it's not clear that that's going to be directly useful. We've now seen this generation of deep learning systems where people have some task, whether it's question answering or machine translation, and you're training large neural network models with no explicit training on syntactic structure, and they work great, better than anything we had before. So you could feel that all of that research on doing syntactic parsing perhaps was misguided.
And here are a couple of thoughts on that. I mean, one thought is that if you have a task where there's a huge amount of data, then I think it is true now that you can train a model end to end with no explicit syntactic structure, and it will do very well. But there are two things that go in the opposite direction. The first one is, why does it do very well? And actually something I've been working on recently with a new student, John Hewitt, is looking at some of these deep contextual language models like ELMo (“Embeddings from Language Models”) and BERT, which are trained on humongous amounts of textual data with no knowledge of syntactic structure. And the truth is that we've been able to show quite convincingly, that actually models like ELMo and BERT are learning syntactic structure, that they're trained on enough billions of words of text that they start to see the patterns and understand the utility of the patterns, and actually have syntactic structure in the models that they're learning that they're just inducing automatically.
So in some sense, these models are proving that what linguists say about syntax is approximately right, that recognizing the kind of structure of sentences and understanding what's a relative clause is actually important to be able to predict with language, and these models learn that syntactic structure. So there will still be syntactic structure. It might just be that we're having the models learn it autonomously. And in some sense, that's progress, because surely it's better if we can just do this with machine learning, and we'll probably end up with richer representations than humans who are sort of hand-designing symbolic structures.
But the flip side is that that only works when you are training on enormous, enormous amounts of text. And so I think there are going to remain lots of other places where you're not in the situation where you can train on a billion words of text end to end. And then having syntactic structure is an extremely good prior that gives you a very good scaffolding for understanding things. So even today, in one of the visual question answering talks that I was listening to, they're aligning a scene graph of the visual scene with a dependency parse of sentences and showing how that gives value for doing visual question answering tasks, because you’re doing this aligning between sentence structure and scene graph structure.
And in almost any place where you have a limited amount of training data, you get good value from making use of extra information about what the structure of sentences is and how words relate to each other. So I think there are still very many places where explicit parsing and syntactic structure will continue to be used.
Host: And it sounds like there's also great opportunity for diversity of thought, and for how to bring these together, that is welcomed and helps us mature the research. So in terms of neural-network-based NLP systems, how should we incorporate knowledge bases?
Chris Manning: That is also a good question. Again, I think it’s something that's not fully solved. I mean, the easy answer, and perhaps the best answer at the moment, is to say that, as well as having textual data that we can learn from and refer to in doing other tasks, we can have a knowledge base that we can make use of in doing other tasks. And at the moment, the easiest way to realize that is to have a model that puts attention over elements of a knowledge base. Attention has been a very successful technique in NLP in general; it's also used in things like neural machine translation. But for things like reasoning and accessing knowledge, attention has been a great idea. And so there's been a lot of work on having things like key-value neural networks, where you can look into a knowledge base using a key to look up information, get a value out of the knowledge base, and bring it back in. And that's a very successful technique.
At the end of the day though, it still feels like there should be more, because it seems like you should more directly be able to take knowledge and put it inside your neural network. And at the moment, I think we still don't have very good ways of doing that in the sense of saying: here is prior stuff we want to load the neural network with. So effectively, having this externalized knowledge that we have the neural network learn from or refer to has been the most successful method.
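[Editor's illustration] The key-value lookup with attention described in this answer can be sketched as follows. This is a minimal sketch under stated assumptions, not any particular published architecture: the tiny knowledge base, the hand-written vectors, and the function names `softmax` and `attend` are all hypothetical, and a real system would learn the key, value, and query representations.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Soft lookup: score each key against the query (dot product),
    then return the attention weights and the weighted sum of values."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    read = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, read


# Hypothetical knowledge base with three entries. Each entry has a key
# vector (what it is about) and a value vector (what it contributes).
keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

# A query that matches the second key most strongly.
query = [0.0, 4.0, 0.0]

weights, read = attend(query, keys, values)
print(weights)  # most of the weight falls on the second entry
print(read)     # vector brought back into the network
```

The lookup is soft: every entry contributes in proportion to its weight, which keeps the operation differentiable, while the knowledge itself stays outside the network's parameters.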
Host: I think we could talk for a long time and explore further, but I'm going to finish on this question for you. In terms of commercialization, where do you see the low-hanging fruit in natural language processing?
Chris Manning: I guess a lot of that depends on what area you're in and what you want to do, and things like that. There are particular applications, obviously, like machine translation, where NLP, and recently neural MT in particular, has been super successful and works super well, but that area is only of interest to a limited range of companies. And for more general use cases, I think there's no doubt at all that the most popular use case has been in the area of dialogue agents. For many companies, interacting with customers is a huge cost, or a huge opportunity: in areas like trying to line up new customers, there's a huge opportunity that is insufficiently realized because there aren't enough human beings doing it. And if some of this work can be done with dialogue agents, for everything from lining up sales leads through to, at the other end, dealing with customer support issues, that's an enormous opportunity.
And it's a place that's actually starting to succeed. Building a successful knowledge-rich dialogue agent is still quite difficult. So there's nothing that has the expertise of a good human being. But on the other hand, there are lots of repeat questions and easy questions. And so if you try to get to the point where you have a dialogue agent that can handle the easy 80% of questions, or can do the first round of sort of customer acquisition, then that's an enormously appealing area, which is applicable across a wide range of companies.
Host: Right. And then if you were to add the computer vision aspect into that, is there an integration of computer vision and NLP in real-life applications that you're seeing now that excites you about what's to come?
Chris Manning: If I'm honest, I think not so much in commercial applications, because it feels like the clear commercial applications at the moment tend to be more disjoint: the clear commercial applications for vision are things like doing any kind of imaging analysis to find things in it, and things like that. I mean, there are clearly places where vision and language go together, whether it's sort of looking at things with your phone camera and then getting descriptions of what you see (there's been discussion of work to help blind people in ways like that), or it can just be used to help tourists by telling them about what's going on. There are sort of nice applications that combine vision and language. I'm not sure there's really kind of a clear commercial killer application that's emerged at the moment.
Host: Not yet anyway. Chris, thank you so much for being with us here today at CVPR. Thank you so much for your time. We really enjoyed this discussion.
Chris Manning: No problem. It’s been fun talking to you.