Criteo Research is pleased to announce the release of a new dataset to serve as a large-scale standardized test-bed for the evaluation of counterfactual learning methods. At Criteo Research, we have access to several large-scale, real-world datasets that we would like to share with the external research community, both to advance research and to make it easier to exchange ideas. The dataset we are releasing has been prepared in partnership with Cornell University (Thorsten Joachims’s group) and the University of Amsterdam (Maarten de Rijke’s group).
Effective learning methods for optimizing policies based on logged user-interaction data have the potential to revolutionize the process of building better interactive systems. Unlike the industry standard of using expert judgments for training, such learning methods could directly optimize user-centric performance measures. They would not require interactive experimental control like online algorithms, and they would not be subject to the data bottlenecks and latency inherent in A/B testing.
Recent approaches for off-policy evaluation and learning in these settings appear promising [1,2], but highlight the need for accurately logging the propensities of the logged actions. With this release, we provide the first public dataset that contains accurately logged propensities for the problem of Batch Learning from Bandit Feedback (BLBF). Since past research on BLBF was limited by the lack of an appropriate dataset, we hope that our test-bed will spur research on several aspects of BLBF and off-policy evaluation, including the following:
- New training objectives, learning algorithms, and regularization mechanisms for BLBF;
- Improved model selection procedures (analogous to cross-validation for supervised learning);
- Effective and tractable policy classes for the specified task; and
- Algorithms that can scale to massive amounts of data.
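To illustrate why accurately logged propensities matter for off-policy evaluation, here is a minimal sketch of two standard estimators: inverse propensity scoring (IPS) and its self-normalized variant. It is an illustration under stated assumptions, not the evaluation code used for this dataset. The function names, argument names, and input layout (one reward, one logged propensity, and one target-policy probability per logged example) are hypothetical.

```python
from typing import Sequence


def importance_weights(
    logged_propensities: Sequence[float],
    target_probabilities: Sequence[float],
) -> list:
    """Per-example weights: target-policy probability / logging-policy propensity.

    Both sequences refer to the action that was actually logged.
    Names and layout are hypothetical, chosen for illustration only.
    """
    if len(logged_propensities) != len(target_probabilities):
        raise ValueError("inputs must have the same length")
    weights = []
    for p_log, p_target in zip(logged_propensities, target_probabilities):
        if p_log <= 0.0:
            # A logged action with zero propensity cannot be reweighted.
            raise ValueError("logged propensities must be strictly positive")
        weights.append(p_target / p_log)
    return weights


def ips_estimate(
    rewards: Sequence[float],
    logged_propensities: Sequence[float],
    target_probabilities: Sequence[float],
) -> float:
    """IPS estimate of the target policy's expected reward."""
    weights = importance_weights(logged_propensities, target_probabilities)
    if len(rewards) != len(weights) or not weights:
        raise ValueError("need a non-empty, equally sized set of inputs")
    return sum(w * r for w, r in zip(weights, rewards)) / len(weights)


def snips_estimate(
    rewards: Sequence[float],
    logged_propensities: Sequence[float],
    target_probabilities: Sequence[float],
) -> float:
    """Self-normalized IPS: divide by the sum of weights instead of n."""
    weights = importance_weights(logged_propensities, target_probabilities)
    if len(rewards) != len(weights) or not weights:
        raise ValueError("need a non-empty, equally sized set of inputs")
    total_weight = sum(weights)
    if total_weight == 0.0:
        raise ValueError("target policy puts no mass on any logged action")
    return sum(w * r for w, r in zip(weights, rewards)) / total_weight
```

Both estimators divide by the logged propensity, so any error in that value is carried directly into the estimate. That is why the propensities need to be recorded at logging time rather than reconstructed afterwards.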
We also wrote a paper [4] to provide further insight into this dataset. In this paper, which will be presented at the What If workshop at NIPS 2016, we propose an evaluation methodology for running BLBF learning experiments and a standardized test-bed that allows the research community to systematically investigate BLBF algorithms. We also show results comparing state-of-the-art off-policy learning methods such as doubly robust optimization [3], POEM [2], and reductions to supervised learning using regression baselines. Our results provide, for the first time, experimental evidence that recent off-policy learning methods can improve upon state-of-the-art supervised learning techniques on a large-scale, real-world dataset.
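As a rough illustration of the doubly robust idea mentioned above, the sketch below combines a reward model with an importance-weighted correction to estimate the value of a target policy. It shows the estimator only, not the learning procedure or the experimental setup from the paper. All names and the input layout are hypothetical, and the reward-model predictions are assumed to be supplied by the caller.

```python
from typing import Sequence


def doubly_robust_estimate(
    rewards: Sequence[float],
    logged_propensities: Sequence[float],
    target_probabilities: Sequence[float],
    predicted_logged: Sequence[float],
    predicted_target: Sequence[float],
) -> float:
    """Doubly robust estimate of a target policy's expected reward.

    Per logged example (all names are hypothetical):
      rewards[i]              observed reward for the logged action
      logged_propensities[i]  logging-policy probability of that action
      target_probabilities[i] target-policy probability of that action
      predicted_logged[i]     reward-model prediction for the logged action
      predicted_target[i]     reward-model prediction averaged over the
                              target policy's action distribution
    """
    n = len(rewards)
    columns = (
        logged_propensities,
        target_probabilities,
        predicted_logged,
        predicted_target,
    )
    if n == 0 or any(len(column) != n for column in columns):
        raise ValueError("need a non-empty, equally sized set of inputs")

    total = 0.0
    for r, p_log, p_target, q_logged, q_target in zip(rewards, *columns):
        if p_log <= 0.0:
            raise ValueError("logged propensities must be strictly positive")
        weight = p_target / p_log
        # Model-based term plus an importance-weighted correction of the
        # model's error on the action that was actually logged.
        total += q_target + weight * (r - q_logged)
    return total / n
```

If the reward model always predicts zero, this reduces to the plain inverse propensity scoring estimate. If the model predicts the logged rewards exactly, the correction term vanishes and only the model-based term remains.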
This dataset, which is 250GB in size and contains 100 million examples, is hosted on Amazon AWS and is available to the public. Please read the paper [4] before using the dataset, as it is easy to misuse. If you use the dataset for your research, please cite the source and drop us a note on your research at firstname.lastname@example.org.
[1] Counterfactual reasoning and learning systems: the example of computational advertising. L. Bottou et al., JMLR 2013.
[2] Batch learning from logged bandit feedback through counterfactual risk minimization. A. Swaminathan et al., JMLR 2015.
[3] Doubly Robust Policy Evaluation and Learning. M. Dudík et al., ICML 2011.
[4] Large-scale Validation of Counterfactual Learning Methods: A Test-Bed. D. Lefortier et al., NIPS What If Workshop on Inference and Learning of Hypothetical and Counterfactual Interventions in Complex Systems, 2016.