10 Must-Read AI Papers

From AlexNet to GPT-3, we curate a list of 10 papers that mark significant research advances in machine learning, deep learning, computer vision, NLP, and reinforcement learning over the past 10 years. Author presentations and detailed paper reviews are also included.

February 12, 2021
Research Spotlights

We have put together a list of the 10 most cited and discussed research papers in machine learning published over the past 10 years, from AlexNet to GPT-3. They make great first readings for newcomers to the field and refreshers for experienced researchers. For each paper, we provide links to a short overview, author presentations, and a detailed walkthrough, for readers with different levels of expertise.

1. ImageNet Classification with Deep Convolutional Neural Networks (AlexNet)  

Authors: Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton (University of Toronto)

Published in 2012 (NIPS 2012)

We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called “dropout” that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
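
To make the architecture concrete, here is a minimal PyTorch sketch of the layer pattern the abstract describes: five convolutional layers (some followed by max-pooling), ReLU ("non-saturating") activations, dropout in the fully connected layers, and a final 1000-way classifier. The paper used a custom two-GPU implementation; the padding and stride values below are common reconstructions, not taken from the text.

```python
import torch
import torch.nn as nn

class AlexNetSketch(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        # Five conv layers with the paper's filter counts (96/256/384/384/256).
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2), nn.ReLU(),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(3, stride=2),
        )
        # Three fully connected layers with dropout, as in the paper.
        self.classifier = nn.Sequential(
            nn.Dropout(0.5), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),
            nn.Dropout(0.5), nn.Linear(4096, 4096), nn.ReLU(),
            nn.Linear(4096, num_classes),  # softmax is applied inside the loss
        )

    def forward(self, x):
        x = self.features(x)             # (N, 256, 6, 6) for 224x224 inputs
        return self.classifier(torch.flatten(x, 1))

logits = AlexNetSketch()(torch.randn(1, 3, 224, 224))  # -> (1, 1000)
```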


Watch the paper explanatory video:

ImageNet Classification with Deep Convolutional Neural Networks (AlexNet)  

2. Distributed Representations of Words and Phrases and their Compositionality (Word2Vec)

Authors: Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean (Google)

Published in 2013 (NIPS 2013)


The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several extensions that improve both the quality of the vectors and the training speed. By subsampling of the frequent words we obtain significant speedup and also learn more regular word representations. We also describe a simple alternative to the hierarchical softmax called negative sampling. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of "Canada" and "Air" cannot be easily combined to obtain "Air Canada". Motivated by this example, we present a simple method for finding phrases in text, and show that learning good vector representations for millions of phrases is possible.
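
The negative-sampling objective the abstract mentions reduces to a very small loss function. Below is a minimal PyTorch sketch of skip-gram with negative sampling; the vocabulary size, embedding dimension, and number of negatives are illustrative, and frequent-word subsampling (a preprocessing step) is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SGNS(nn.Module):
    def __init__(self, vocab_size=10_000, dim=300):
        super().__init__()
        self.in_emb = nn.Embedding(vocab_size, dim)    # vectors for center words
        self.out_emb = nn.Embedding(vocab_size, dim)   # vectors for context words

    def forward(self, center, context, negatives):
        v = self.in_emb(center)                        # (B, d)
        u_pos = self.out_emb(context)                  # (B, d)
        u_neg = self.out_emb(negatives)                # (B, k, d)
        # Maximize log sigma(v.u+) for true pairs, log sigma(-v.u-) for sampled negatives.
        pos = F.logsigmoid((v * u_pos).sum(-1))
        neg = F.logsigmoid(-(u_neg @ v.unsqueeze(-1)).squeeze(-1)).sum(-1)
        return -(pos + neg).mean()

model = SGNS()
loss = model(torch.tensor([3]), torch.tensor([7]), torch.randint(0, 10_000, (1, 5)))
```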

[Paper] [5-min Overview Video]

Watch the paper explanatory video:

Distributed Representations of Words and Phrases and their Compositionality (Word2Vec)

3. Playing Atari with Deep Reinforcement Learning (Deep Q Networks)

Authors: Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, Martin Riedmiller (DeepMind)

Published in 2013 (NIPS 2013)


We present the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning. The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards. We apply our method to seven Atari 2600 games from the Arcade Learning Environment, with no adjustment of the architecture or learning algorithm. We find that it outperforms all previous approaches on six of the games and surpasses a human expert on three of them.
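
The core of the method is an ordinary Q-learning update applied to a convolutional network. The sketch below (in PyTorch, assuming preprocessed 84x84 frame stacks and a 6-action game) mirrors the network sizes reported in this 2013 paper; the replay buffer and epsilon-greedy exploration are omitted for brevity.

```python
import torch
import torch.nn as nn

# Conv net mapping a stack of 4 grayscale frames to one Q-value per action.
q_net = nn.Sequential(
    nn.Conv2d(4, 16, kernel_size=8, stride=4), nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
    nn.Flatten(), nn.Linear(32 * 9 * 9, 256), nn.ReLU(),
    nn.Linear(256, 6),
)

def td_loss(s, a, r, s_next, done, gamma=0.99):
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s, a) for taken actions
    with torch.no_grad():
        # Bootstrapped target r + gamma * max_a' Q(s', a'), zeroed at episode end.
        target = r + gamma * (1 - done) * q_net(s_next).max(1).values
    return nn.functional.mse_loss(q, target)
```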

[Paper] [5-min Overview Video]

Watch the paper explanatory video:

Playing Atari with Deep Reinforcement Learning (Deep Q Networks)

4. Generative Adversarial Networks (GANs)

Authors: Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio (Universite de Montreal)

Published in 2014 (NIPS 2014)


We propose a new framework for estimating generative models via an adversarial process, in which we simultaneously train two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G. The training procedure for G is to maximize the probability of D making a mistake. This framework corresponds to a minimax two-player game. In the space of arbitrary functions G and D, a unique solution exists, with G recovering the training data distribution and D equal to 1/2 everywhere. In the case where G and D are defined by multilayer perceptrons, the entire system can be trained with backpropagation. There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples. Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated samples.
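
The minimax game fits in a few lines. Here is a minimal PyTorch sketch of one training step on 1-D toy data; the network shapes, optimizer settings, and use of a binary cross-entropy loss are illustrative choices, not the paper's experimental setup.

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))               # generator
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())  # discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def gan_step(real):                            # real: (B, 1) samples from the data
    fake = G(torch.randn(real.size(0), 16))
    # Discriminator: push D(real) toward 1 and D(fake) toward 0.
    d_loss = bce(D(real), torch.ones_like(real)) + \
             bce(D(fake.detach()), torch.zeros_like(real))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator: "maximize the probability of D making a mistake".
    g_loss = bce(D(fake), torch.ones_like(real))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```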

[Paper] [NIPS 2016 Talk by Ian Goodfellow]

Watch the paper explanatory video:

Generative Adversarial Networks (GANs)

5. Deep Residual Learning for Image Recognition (ResNet)

Authors: Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun (Microsoft Research)

Published in 2015 (CVPR 2016)


Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers---8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers.
The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
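
The reformulation is easiest to see in a single block: the stacked layers learn a residual F(x), and the block outputs F(x) + x through an identity shortcut. Below is a minimal PyTorch sketch of the basic two-layer block (assuming equal input and output channel counts, so no projection shortcut is needed).

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))
        return self.relu(out + x)   # learn the residual F(x); output F(x) + x

y = BasicBlock(64)(torch.randn(1, 64, 56, 56))   # shape is preserved
```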

[Paper] [CVPR 2016 Talk by Kaiming He]

Watch the paper explanatory video:

Deep Residual Learning for Image Recognition (ResNet)

6. Dynamic Routing Between Capsules (CapsNet)

Authors: Sara Sabour, Nicholas Frosst, Geoffrey E Hinton (Google Brain Toronto)

Published in 2017 (NIPS 2017)


A capsule is a group of neurons whose activity vector represents the instantiation parameters of a specific type of entity such as an object or an object part. We use the length of the activity vector to represent the probability that the entity exists and its orientation to represent the instantiation parameters. Active capsules at one level make predictions, via transformation matrices, for the instantiation parameters of higher-level capsules. When multiple predictions agree, a higher level capsule becomes active. We show that a discriminatively trained, multi-layer capsule system achieves state-of-the-art performance on MNIST and is considerably better than a convolutional net at recognizing highly overlapping digits. To achieve these results we use an iterative routing-by-agreement mechanism: A lower-level capsule prefers to send its output to higher level capsules whose activity vectors have a big scalar product with the prediction coming from the lower-level capsule.
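
The routing-by-agreement loop itself is compact. This PyTorch sketch takes the lower capsules' predictions (already multiplied by the transformation matrices) and iteratively raises the coupling to whichever upper capsule each prediction agrees with; the three routing iterations match the paper, while the tensor shapes are illustrative.

```python
import torch

def squash(s, dim=-1):
    # Nonlinearity that keeps vector length in (0, 1) to encode existence probability.
    n2 = (s ** 2).sum(dim, keepdim=True)
    return (n2 / (1 + n2)) * s / (n2.sqrt() + 1e-8)

def route(u_hat, iters=3):
    # u_hat: (B, n_lower, n_upper, d) predictions from lower-level capsules.
    b = torch.zeros(u_hat.shape[:3])                  # coupling logits
    for _ in range(iters):
        c = b.softmax(dim=2)                          # couplings per lower capsule
        v = squash((c.unsqueeze(-1) * u_hat).sum(1))  # (B, n_upper, d) outputs
        b = b + (u_hat * v.unsqueeze(1)).sum(-1)      # big scalar product => more routing
    return v

v = route(torch.randn(2, 32, 10, 16))   # e.g. 32 lower capsules -> 10 digit capsules
```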

[Paper] [3-min Overview by Sara Sabour]

Watch the 1-hour Talk by Geoffrey Hinton:

Dynamic Routing Between Capsules (CapsNet)

7. Attention Is All You Need (Transformer)

Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin (Google Brain, University of Toronto)

Published in 2017 (NIPS 2017)


The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
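
The building block the whole architecture rests on is scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ/√d_k)V. A minimal PyTorch sketch follows (single head with illustrative shapes; the paper composes this into multi-head attention).

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    # softmax(Q K^T / sqrt(d_k)) V
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    return scores.softmax(dim=-1) @ v

q = k = v = torch.randn(2, 5, 64)             # batch 2, length 5, dim 64
out = scaled_dot_product_attention(q, k, v)   # (2, 5, 64)
```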

[Paper] [1-hour Talk by Łukasz Kaiser]

Watch the paper explanatory video:

Attention Is All You Need (Transformer)

8. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Authors: Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova (Google AI Language)

Published in 2018 (NAACL 2019)


We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.
BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).
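
The "just one additional output layer" recipe is what makes BERT so practical. Here is a minimal sketch using the Hugging Face transformers library (not used in the paper itself) for a two-class sentence task; the example sentences and labels are made up.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tok = BertTokenizer.from_pretrained("bert-base-uncased")
# Pre-trained encoder plus a freshly initialized classification layer.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

batch = tok(["a delightful read", "dull and overlong"], padding=True, return_tensors="pt")
out = model(**batch, labels=torch.tensor([1, 0]))
out.loss.backward()   # fine-tuning updates the encoder and the new layer jointly
```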

[Paper] [ACL Talk by Ming-Wei Chang]

Watch the paper explanatory video:

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

9. Language Models are Few-Shot Learners (GPT-3)

Authors: Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei (OpenAI)

Published in 2020 (NeurIPS 2020)


Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.
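
"Specified purely via text interaction" means the few-shot demonstrations live inside the prompt itself, with no gradient updates. Below is a sketch of the paper's translation setup (the sea otter demonstration appears in the paper; the completion call is hypothetical, standing in for any text-completion API).

```python
# k-shot prompt: a task description, k demonstrations, then the query to complete.
prompt = """Translate English to French.

sea otter => loutre de mer
peppermint => menthe poivrée
cheese =>"""

# completion = language_model.complete(prompt)   # hypothetical API; no fine-tuning occurs
```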

[Paper] [7-min Overview]

Watch the paper explanatory video: 

Language Models are Few-Shot Learners (GPT-3)

10. Reformer: The Efficient Transformer

Authors: Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya (UC Berkeley & Google Brain)

Published in 2020 (ICLR 2020)


Large Transformer models routinely achieve state-of-the-art results on a number of tasks but training these models can be prohibitively costly, especially on long sequences. We introduce two techniques to improve the efficiency of Transformers. For one, we replace dot-product attention by one that uses locality-sensitive hashing, changing its complexity from O(L²) to O(L log L), where L is the length of the sequence. Furthermore, we use reversible residual layers instead of the standard residuals, which allows storing activations only once in the training process instead of N times, where N is the number of layers. The resulting model, the Reformer, performs on par with Transformer models while being much more memory-efficient and much faster on long sequences.

[Paper]
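
The locality-sensitive hashing step is the heart of the first technique: random rotations hash similar query/key vectors into the same bucket, so attention is computed only within buckets rather than over all L x L pairs. A minimal PyTorch sketch of the angular LSH scheme (one hash round; the paper uses several, and the bucket count is illustrative):

```python
import torch
import torch.nn.functional as F

def lsh_buckets(x, n_buckets=16, seed=0):
    # x: (L, d) unit vectors. Project with a random matrix and take the argmax
    # over [xR; -xR]; nearby vectors tend to land in the same bucket.
    g = torch.Generator().manual_seed(seed)
    r = torch.randn(x.size(-1), n_buckets // 2, generator=g)
    h = x @ r                                          # (L, n_buckets // 2)
    return torch.cat([h, -h], dim=-1).argmax(dim=-1)   # bucket id per position

x = F.normalize(torch.randn(128, 64), dim=-1)
buckets = lsh_buckets(x)   # attention would then be restricted within each bucket
```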

Watch the 17-min Talk by Łukasz Kaiser:

Reformer: The Efficient Transformer

Check out our "Must-Read AI Papers" video collection for all talks and paper reviews.

Install "Crossminds Papers with Video" Chrome extension to instantly find research videos for AI papers on arXiv:

Crossminds Papers with Videos: Find research videos for AI research papers on arXiv. While browsing arXiv.org, the Crossminds Papers with Videos extension instantly helps you find research videos related to each paper. Powered by the Crossminds.ai research video platform, this free extension currently covers over 6,000 videos in artificial intelligence, machine learning, deep learning, computer vision, natural language processing (NLP), robotics, and many other topics in the computer science domain.

Sign up with Crossminds.ai to get personalized recommendations of the latest tech research videos!


Crossminds.ai is a personalized research video platform for tech professionals. We aim to empower your growth with the latest and most relevant research, industry, and career updates.