Click Through Rate prediction at scale using Open Technologies

By: Amin Mantrach / 29 Mar 2018

With the recent advances in machine learning platforms, a lot of attention is now on learning at scale. In this context, the public Criteo dataset known as the Terabyte Click Logs [1] has become the reference dataset for assessing the scalability of machine learning platforms and algorithms.

In early 2017, Google showcased the Google Cloud Platform by training a click-through rate (CTR) prediction model on the Criteo Terabyte Click Logs [2]. Their solution relied on proprietary cloud technology as well as the open-source TensorFlow framework. More recently, IBM benchmarked its proprietary machine learning platform SnapML [3] against TensorFlow on Google Cloud. To demonstrate scalability, that benchmark also used the Terabyte Click Logs.

While the proposed solutions are scalable and reach state-of-the-art performance, they rely on proprietary cloud platforms. In this post, we propose an alternative solution using the open-source TensorFlowOnSpark [4]. We walk the reader through a solution that relies only on open-source technology and that can be tested directly on their own cluster or on any available platform (Azure, AWS, GCP, or a private grid). Using TensorFlowOnSpark, we show that we can reach the same level of prediction performance. Note that the emphasis of our solution is on using open-source technology to match the predictive performance benchmark, not on beating the training time.

We are also releasing our code, which implements a CTR prediction model trained on the Criteo Terabyte Click Logs. We use the same setup as in [2], wherein the first 23 days serve as training data and the last day serves as validation data. Similarly, we implement a logistic regression model with features extracted in a similar fashion (i.e. bucketization, hashing and crosses). We show that we can reach similar performance, i.e. a cross-entropy loss of 0.1293 on the test period.
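To make the feature pipeline more concrete, here is a minimal sketch in plain Python of bucketization, hashing, feature crosses, a logistic regression prediction, and the cross-entropy loss used as the metric. It is not the released TensorFlowOnSpark code: the function names, the bucket boundaries, the hash space size, and the use of CRC32 as the hash function are all assumptions chosen for illustration.

```python
import math
import zlib
from bisect import bisect_right

NUM_HASH_BUCKETS = 2 ** 20  # assumed size of the hashed feature space


def bucketize(value, boundaries):
    """Return the index of the bucket a numeric value falls into."""
    return bisect_right(boundaries, value)


def hash_feature(name, value, num_buckets=NUM_HASH_BUCKETS):
    """Map a (feature name, value) pair to a stable index in [0, num_buckets)."""
    key = f"{name}={value}".encode("utf-8")
    return zlib.crc32(key) % num_buckets


def extract_features(ints, cats, boundaries, crosses):
    """Turn one raw example into a sorted list of active binary feature indices."""
    active = set()
    # Integer features: bucketize first, then hash the bucket id.
    for i, value in enumerate(ints):
        active.add(hash_feature(f"int{i}", bucketize(value, boundaries)))
    # Categorical features: hash the raw value directly.
    for j, value in enumerate(cats):
        active.add(hash_feature(f"cat{j}", value))
    # Crosses: hash the combination of two categorical values.
    for a, b in crosses:
        active.add(hash_feature(f"cat{a}_x_cat{b}", f"{cats[a]}|{cats[b]}"))
    return sorted(active)


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(weights, bias, active):
    """Logistic regression over binary features: sigmoid of the summed weights."""
    return sigmoid(bias + sum(weights.get(i, 0.0) for i in active))


def log_loss(labels, probs, eps=1e-12):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


def sgd_step(weights, bias, active, label, lr=0.1):
    """One stochastic gradient step on the log loss; returns the updated bias."""
    grad = predict(weights, bias, active) - label
    for i in active:
        weights[i] = weights.get(i, 0.0) - lr * grad
    return bias - lr * grad
```

Hashing keeps the weight vector at a fixed size regardless of how many distinct categorical values appear in the logs, at the cost of occasional collisions. For example, `extract_features([5, 250], ["a9f3", "77c1"], [1, 10, 100], [(0, 1)])` yields the indices of two bucketized integers, two hashed categoricals and one cross, and an untrained model (`predict({}, 0.0, ...)`) assigns a click probability of 0.5 to any example.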

Code available here: https://github.com/criteo/CriteoDisplayCTR-TFOnSpark

Acknowledgements to Oleksandr Pryimiak for reviewing the code.

[1] http://labs.criteo.com/2013/12/download-terabyte-click-logs/

[2] https://cloud.google.com/blog/big-data/2017/02/using-google-cloud-machine-learning-to-predict-clicks-at-scale

[3] https://www.ibm.com/blogs/research/2018/03/machine-learning-benchmark/

[4] https://github.com/yahoo/TensorFlowOnSpark