Our researchers enjoyed attending the 5th International Conference on Learning Representations (ICLR) in Toulon, France. Criteo Research was a Gold Sponsor of the conference this year. ICLR ran for three days, covering a variety of topics, including deep learning, representation learning, optimization, language modeling, and more.
A number of papers and talks stood out to us at the conference.
One of the hot topics at the conference was generalization in deep learning. Ben Recht gave an invited talk on this topic, while Understanding Deep Learning Requires Rethinking Generalization won a Best Paper award (slides from Chiyuan Zhang’s talk). Jorge Nocedal also touched on this topic in his contributed talk for Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. Taken together, this work suggests that, among other things, deep nets may generalize well due in part to the optimization method used to train them, namely stochastic gradient descent (SGD). SGD tends to converge to flat local optima rather than sharp local optima, and flat optima seem to result in better generalization.
As expected, (deep) reinforcement learning gathered a lot of interest. There was a lot of focus on methods to reduce the noise of the policy gradient method, such as combining the policy gradient with action-values (see PGQ: Combining policy gradient and Q-learning or Q-Prop: Sample-Efficient Policy Gradient with An Off-Policy Critic). Another option envisioned for addressing the variance issue is to use proxy rewards coming from auxiliary tasks to improve the estimation of the main task (Reinforcement Learning with Unsupervised Auxiliary Tasks). The most interesting approach to addressing the variance of policy-gradient-based methods involves choosing preferential directions of exploration while learning the policy, as in Improving Policy Gradient by Exploring Under-appreciated Rewards.
At Criteo, we have a big interest in language modeling. Indeed, this task is very similar to the task of modeling user-product interactions. A sequence of user interactions can be seen as a sentence, where each product that the user viewed or bought is a word. Based on this analogy, a lot of tools from language modeling can be applied directly to improve recommendation systems. A straightforward example is Prod2vec (where word2vec is applied to sequences of user products), and we are currently trying to apply state-of-the-art language modeling techniques to determine how they can improve our recommendation engine.
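To make the analogy concrete, here is a minimal sketch (product ids, function names, and the window size are our own, not from any published pipeline) of the first step of a Prod2vec-style approach: treat each browsing session as a sentence, and extract the (target, context) pairs that a skip-gram model would train on.

```python
def skipgram_pairs(sequence, window=2):
    """Extract (target, context) pairs from one interaction sequence,
    exactly as word2vec would from a sentence of words."""
    pairs = []
    for i, target in enumerate(sequence):
        lo, hi = max(0, i - window), min(len(sequence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, sequence[j]))
    return pairs

# A user's browsing session, with product ids standing in for words.
session = ["p42", "p7", "p42", "p13"]
print(skipgram_pairs(session, window=1))
# [('p42', 'p7'), ('p7', 'p42'), ('p7', 'p42'), ('p42', 'p7'), ('p42', 'p13'), ('p13', 'p42')]
```

Once sessions are in this form, any off-the-shelf word2vec trainer can consume them unchanged.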
At ICLR, there were several papers that involved improving/simplifying LSTM/RNN models for language modeling.
In Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling, Hakan Inan et al. augment the usual cross-entropy loss of their RNN model with an additional term that takes into account similarities between words (based on a word2vec model they previously trained on the same dataset). When the predictor makes a small error (predicting a word similar to the true word), the new term reduces the impact on the loss. We would be interested in seeing the impact of such a technique on our Prod2vec product similarities.
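The paper's exact formulation differs, but the flavor of the idea can be sketched as follows (the mixing weight `alpha` and the `temperature` are hypothetical knobs of ours, not the authors' parameters): blend the usual cross entropy with a term computed against soft targets derived from embedding similarities, so that predicting a near-synonym of the true word is penalized less.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def soft_targets(true_idx, embeddings, temperature=1.0):
    """Distribution over the vocabulary that spreads mass onto words
    similar (by dot product) to the true word, via a softmax."""
    sims = [dot(embeddings[true_idx], e) / temperature for e in embeddings]
    m = max(sims)
    exps = [math.exp(s - m) for s in sims]
    z = sum(exps)
    return [e / z for e in exps]

def augmented_loss(pred_probs, true_idx, embeddings, alpha=0.5):
    """Cross entropy blended with a soft-target term: errors on words
    close to the true word in embedding space hurt less."""
    ce = -math.log(pred_probs[true_idx])
    soft = soft_targets(true_idx, embeddings)
    soft_ce = -sum(t * math.log(p) for t, p in zip(soft, pred_probs))
    return (1 - alpha) * ce + alpha * soft_ce
```

With embeddings where two words coincide, putting probability mass on the twin of the true word yields a lower loss than putting the same mass on an unrelated word, which is exactly the behavior we would want from product similarities.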
In another interesting paper (Frustratingly Short Attention Spans In Neural Language Modeling), Michal Daniluk et al. try to better understand attention mechanisms and see how they can be simplified to build good language models. Attention mechanisms were introduced to better capture long-term dependencies. By separating the roles played by the attention mechanism, they show that on their Wikipedia dataset, only the last five tokens are useful for the long-term representation.
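A toy version of the resulting mechanism, stripped down to the paper's headline finding (plain dot-product attention, hard-limited to the last five hidden states; all names here are ours, not the authors'):

```python
import math

def attend(history, query, span=5):
    """Dot-product attention restricted to the last `span` hidden states,
    mirroring the finding that a short window is enough."""
    window = history[-span:]
    scores = [sum(q * h for q, h in zip(query, state)) for state in window]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    weights = [w / z for w in weights]
    # Context vector = weighted sum of the windowed states.
    dim = len(query)
    return [sum(w * state[d] for w, state in zip(weights, window))
            for d in range(dim)]
```

Hard-truncating the history like this keeps the per-step cost constant instead of growing with sequence length, which is part of what makes the simplification attractive.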
We hope that these new tools for language modeling will also help to improve recommendation systems, and that we will be able to build on some of these contributions.
We were delighted to see our former colleague N. Le Roux presenting his paper Tighter bounds lead to improved classifiers. In this paper, he questions the use of the log likelihood to learn classifiers that are good in terms of accuracy. He shows that the log likelihood is a good proxy for the accuracy metric early on in the optimization process (since it is a tight, convex bound there), but that it can be refined in the later stages, when the bound becomes looser. Hence, he proposes new convex upper bounds: they are tighter than the log likelihood and can be used in the late stages of the optimization. He proves that this leads to better classifiers in terms of accuracy. He also shows, based on these new bounds, how to directly optimize the precision at a given recall, a metric we care about when evaluating our models. We would be curious to see whether this work could be extended to other metrics such as the utility loss (http://olivier.chapelle.cc/pub/utility.pdf), since it is one of the main metrics we use to evaluate our bidding models.
On the subject of Domain Adaptation, the paper on Central Moment Discrepancy (CMD) for Domain-Invariant Representation Learning introduces a new way to penalize the differences in representing the source and target domains by means of order-wise moment differences of the associated probability distributions. The method is conceptually simple, while being faster than the state of the art (O(N(n + m)) for CMD vs O(N(n^2 + nm + m^2)) for Maximum Mean Discrepancy (MMD)) and leading to better results on two Domain Adaptation datasets, namely Office and Amazon reviews.
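A minimal sketch of the CMD distance itself, for samples of d-dimensional points with coordinates assumed bounded in [a, b] (function and variable names are ours): it compares the per-coordinate means, then the central moments of orders 2 through K, with each order damped by the corresponding power of the interval width.

```python
def central_moment(xs, k, mean):
    """k-th order central moment of a list of scalars."""
    return sum((x - mean) ** k for x in xs) / len(xs)

def cmd(X, Y, K=5, a=0.0, b=1.0):
    """Central Moment Discrepancy between two samples of d-dimensional
    points whose coordinates lie in [a, b]."""
    d = len(X[0])
    span = abs(b - a)

    def l2(u, v):
        return sum((ui - vi) ** 2 for ui, vi in zip(u, v)) ** 0.5

    # First-order term: distance between the per-coordinate means.
    mx = [sum(x[j] for x in X) / len(X) for j in range(d)]
    my = [sum(y[j] for y in Y) / len(Y) for j in range(d)]
    total = l2(mx, my) / span

    # Higher-order terms: central moments of orders 2..K, each damped
    # by the corresponding power of the interval width.
    for k in range(2, K + 1):
        cx = [central_moment([x[j] for x in X], k, mx[j]) for j in range(d)]
        cy = [central_moment([y[j] for y in Y], k, my[j]) for j in range(d)]
        total += l2(cx, cy) / span ** k
    return total
```

Note that each term is a single pass over the samples, which is where the O(N(n + m)) cost comes from, in contrast to the pairwise kernel evaluations MMD requires.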
At the intersection of sequence modeling and domain adaptation, we noticed an interesting paper entitled Variational Recurrent Adversarial Deep Domain Adaptation, by Purushotham and coauthors. They propose adapting the Domain-Adversarial Neural Networks model of Ganin et al. (2016) to RNNs. They use it on top of a variational RNN, and show they can do transfer learning while capturing complex temporal relationships on healthcare datasets.
In Deep Probabilistic Programming, Dustin Tran and his co-authors present Edward, a probabilistic programming language. Edward is built on TensorFlow, and provides functionality for probabilistic modeling, inference, and model evaluation. In addition to describing the design of the language, the paper includes a number of examples of how to implement several standard models, such as a variational auto-encoder, Bayesian RNN, and a Gaussian Mixture Model. We have recently experimented with using Edward for small-scale model prototyping and inference tasks, and look forward to using Edward in larger scale experiments.
The paper Unsupervised and Scalable Algorithm for Learning Node Representations shows how to use the Skip-gram model to learn node vector representations that can later be used for link prediction and related tasks. It’s closely related to the Node2Vec algorithm published last year, but reports better results. Although the title says that it’s scalable, the experiments in both papers were conducted on graphs with at most 20,000 nodes and 200,000 edges, which is modest compared to the graphs that we usually have to deal with. Today we work with graphs of more than 2 billion nodes, but algorithms for link prediction that could scale to data of that size are limited in their ability to use the graph structure to make better predictions, since they generally consider only node-to-node relationships. It would be interesting to see further work on scalable algorithms that can efficiently leverage the internal structure of the graph.
One of our colleagues at Criteo Research, Minmin Chen, presented the paper Efficient Vector Representation for Documents through Corruption, which introduces a new model architecture to generate vector representations for text. It uses a corruption model that acts as a data-dependent regularizer, which favors low-frequency, highly discriminative words, while forcing the embeddings of frequent and non-informative words to be close to zero. It produces word embeddings that significantly improve on the embeddings learned by Word2Vec. The simplicity of the model enables training on billions of words per hour on a single machine. At the same time, it is very efficient in generating representations of unseen documents at test time, in contrast to existing methods such as paragraph vectors or skip-thought vectors. The model yields high-quality document representations for a wide variety of tasks such as sentiment analysis, document classification, and semantic relatedness.
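The corruption step can be sketched like this (a simplification of the paper's model; the names and the default corruption rate are ours): the document embedding is the average of its word vectors over a corrupted copy, with each word dropped independently and survivors rescaled so the estimate stays unbiased.

```python
import random

def corrupted_doc_embedding(doc, embeddings, q=0.5, rng=random):
    """Average word embeddings over a corrupted copy of the document:
    each word is dropped with probability q, and survivors are rescaled
    by 1/(1-q) so the representation stays unbiased in expectation."""
    dim = len(next(iter(embeddings.values())))
    acc = [0.0] * dim
    for word in doc:
        if rng.random() >= q:  # keep the word
            vec = embeddings[word]
            for d in range(dim):
                acc[d] += vec[d] / (1.0 - q)
    n = len(doc)
    return [a / n for a in acc]
```

With q = 0 this reduces to the plain average of word vectors, which is also how unseen documents are embedded at test time, hence the cheap inference compared to paragraph vectors.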
Criteo Cocktail Party
On the first night of the conference, Criteo hosted a cocktail party at the National Maritime Museum in Toulon. A number of prominent individuals from academia and industry gathered to hear about research at Criteo, discuss machine learning and topics related to the conference, and get to know one another in an informal setting. Attendees also enjoyed some great food and cocktails, as well as impressive exhibits on French naval history. Hats off to our events team for organizing a great event.
We look forward to participating in ICLR 2018!