Apache Spark Machine Learning for CTR Prediction


AD CLICK PREDICTION USING APACHE SPARK MACHINE LEARNING

Private and confidential. Copyright (C) 2017, Imaginea Technologies Inc. All rights reserved.

Introduction: Trends in Machine Learning

▪ Advanced Machine Learning

▪ Democratization of Machine Learning

▪ Future of Machine Learning

▪ Data Flywheels

▪ The Algorithm Economy

▪ Cloud-hosted intelligence

▪ Machine Learning Platforms

▪ ML/AI at Center Stage

▪ ML Application Development: systems that understand, learn, predict, adapt and potentially operate autonomously

▪ Industries: Preventive Healthcare, Banking, Finance, Media, Supply Chain


TABLE OF CONTENTS

Context

Business problem

Challenges

Solution

Summary


Context: Industry Challenge

“Predicting ad click–through rates (CTR) is a massive-scale learning problem that is central to the multi-billion dollar online advertising industry” ~ Google

▪ Ad platforms collect huge volumes of data to help them predict ad clicks

▪ A good predictive model is essential to serve ads efficiently and optimize overall economic value

▪ Sponsored search advertising, contextual advertising, display advertising, and real-time bidding auctions have all relied heavily on the ability of learned models to predict ad click–through rates accurately, quickly, and reliably


Context: Motivation

Publisher   Advertiser   Bid ($)   Predicted CTR   Expected Bid
ESPN        Nike         1         0.6             1 x 0.6 = 0.6
ESPN        Gucci        2         0.1             2 x 0.1 = 0.2

Pay-per-click policy: an advertiser pays only to the extent that their ads are clicked by users
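The arithmetic in the table can be sketched in a few lines of Python. This is only an illustration of ranking candidate ads by bid x predicted CTR under a pay-per-click policy; the `Candidate` class and the `pick_winner` helper are hypothetical names, not part of any ad platform API.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    publisher: str
    advertiser: str
    bid: float            # dollars paid by the advertiser per click
    predicted_ctr: float  # model's estimate of the click probability


def expected_value(c):
    """Expected revenue per impression under pay-per-click: bid x CTR."""
    return c.bid * c.predicted_ctr


def pick_winner(candidates):
    """Serve the ad with the highest expected value, not the highest bid."""
    return max(candidates, key=expected_value)


candidates = [
    Candidate("ESPN", "Nike", 1.0, 0.6),
    Candidate("ESPN", "Gucci", 2.0, 0.1),
]
winner = pick_winner(candidates)
print(winner.advertiser, expected_value(winner))
```

Although Gucci bids twice as much, Nike's higher predicted CTR gives it the larger expected value, which is why the accuracy of the CTR estimate matters so much.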


Problem: Ad-click prediction challenges

How to build a predictive model that:

▪ Can deal with huge data volumes

▪ Has high predictive power

▪ Is conducive to incremental learning

▪ Deals with high-dimensional sparse data


Challenges: Big Data

Need for a distributed data processing engine to handle the volume, variety and velocity of data

▪ Billions of ad impressions served per day

▪ Millions of users and their history

▪ For any reasonably sized ad exchange, this data is on the order of hundreds of GBs

▪ Almost impossible to crunch that much data on a single machine with a shared-memory model

The volume, variety and velocity of the incoming data make distributed data processing essential


Challenges: Data sparsity

How to efficiently handle sparse datasets?

▪ Millions of categorical features

▪ Huge number (in the millions) of potential features for each input vector

▪ But only a limited set of actual (non-zero) features per vector

▪ Raw data is generally stored in a dense format, but algorithms expect categorical data to be encoded (e.g., one-hot), which makes the vectors sparse
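As a rough illustration of the sparsity described above, the sketch below one-hot encodes categorical rows but stores only the indices of the non-zero columns. The field names and rows are made up for the example, and `build_vocabulary` and `one_hot_sparse` are hypothetical helpers rather than Spark functions.

```python
def build_vocabulary(rows):
    """Assign one column index to every distinct (field, value) pair."""
    vocab = {}
    for row in rows:
        for field, value in row.items():
            vocab.setdefault((field, value), len(vocab))
    return vocab


def one_hot_sparse(row, vocab):
    """Return the sorted indices of the columns that are 1; all others are 0."""
    return sorted(vocab[(f, v)] for f, v in row.items() if (f, v) in vocab)


# Hypothetical raw rows in dense, categorical form.
rows = [
    {"publisher": "ESPN", "advertiser": "Nike", "device": "mobile"},
    {"publisher": "ESPN", "advertiser": "Gucci", "device": "desktop"},
]
vocab = build_vocabulary(rows)
encoded = [one_hot_sparse(r, vocab) for r in rows]
print(len(vocab), encoded)
```

Each row activates only as many columns as it has fields, however large the vocabulary grows, so storing indices instead of full vectors keeps memory proportional to the number of active features.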


Challenges: Predictive models

Scalable and effective predictive models

▪ Logistic regression models have historically been the workhorse for such tasks

▪ However, a number of studies in the last few years have noted that the effects of feature conjunctions are important

▪ So we need scalable non-linear models that can take feature interactions into account

Empirically, what models work best for this particular domain?
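To make "feature conjunction" concrete, here is a minimal sketch of explicitly crossing categorical features so that even a linear model can assign a weight to a combination such as publisher and advertiser together. The `add_conjunctions` helper and the feature naming scheme are assumptions made for this example.

```python
from itertools import combinations


def add_conjunctions(row):
    """Return the original features plus one crossed feature per pair of fields."""
    features = [f"{field}={value}" for field, value in sorted(row.items())]
    crossed = [f"{a}&{b}" for a, b in combinations(features, 2)]
    return features + crossed


row = {"publisher": "ESPN", "advertiser": "Nike"}
features = add_conjunctions(row)
print(features)
```

With n fields this adds n(n-1)/2 crossed features per row, and the number of distinct crossed values grows much faster, which is one reason models that learn interactions implicitly are attractive here.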


Challenges: Online learning

Piecemeal learning

▪ Update model weights by looking at one input vector at a time

▪ Ideally, avoid loading the whole input dataset into memory

Can the predictive model be trained on streaming data?
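One way to picture piecemeal learning is logistic regression trained with stochastic gradient descent, where each instance is seen once and then discarded. The sketch below is an assumption-laden toy: the class name, the learning rate and the synthetic `click_stream` generator are all hypothetical, and a generator stands in for a real stream.

```python
import math


class OnlineLogisticRegression:
    def __init__(self, learning_rate=0.1):
        self.learning_rate = learning_rate
        self.weights = {}  # sparse: only features seen so far are stored

    def predict(self, active):
        """Click probability for a list of active (value 1) feature names."""
        score = sum(self.weights.get(f, 0.0) for f in active)
        score = max(-35.0, min(35.0, score))  # keep exp() in a safe range
        return 1.0 / (1.0 + math.exp(-score))

    def update(self, active, label):
        """One SGD step on the log loss for a single instance (label 0 or 1)."""
        error = self.predict(active) - label
        for f in active:
            self.weights[f] = self.weights.get(f, 0.0) - self.learning_rate * error


def click_stream():
    """Hypothetical stream: Nike ads get clicked, Gucci ads do not."""
    for _ in range(200):
        yield ["ad=nike"], 1
        yield ["ad=gucci"], 0


model = OnlineLogisticRegression()
for active, label in click_stream():
    model.update(active, label)  # the instance is not kept after the update
print(model.predict(["ad=nike"]), model.predict(["ad=gucci"]))
```

Memory use depends on the number of distinct features seen, not on the number of instances, which is the property the slide asks for.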


Solution: Handle big & sparse data using Apache Spark

▪ Distributed data processing

▪ Use the Apache Spark DataFrame API to build the data processing pipeline

▪ The DataFrame API supports SQL queries and is highly optimized under the hood

▪ Flexibility to write your own custom transformers and UDFs

▪ Tip: Use feature hashing to constrain the model size
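The feature hashing tip can be sketched without Spark. The example below maps each field=value string to one of a fixed number of buckets, so the model size is bounded no matter how many distinct values appear. The bucket count, the choice of MD5 and the helper names are assumptions for illustration; Spark ML ships its own hashing transformers (for example `HashingTF`) that would normally be used inside a pipeline.

```python
import hashlib


def hash_feature(field, value, num_buckets):
    """Map a categorical field=value pair to a stable bucket index."""
    key = f"{field}={value}".encode("utf-8")
    # hashlib is used instead of the built-in hash(), which is salted per process.
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def hash_row(row, num_buckets):
    """Encode a row as a sparse {bucket index: value} mapping."""
    vector = {}
    for field, value in row.items():
        idx = hash_feature(field, value, num_buckets)
        vector[idx] = vector.get(idx, 0.0) + 1.0  # colliding features add up
    return vector


NUM_BUCKETS = 2 ** 20  # assumed size; trades memory against collision rate
row = {"publisher": "ESPN", "advertiser": "Nike", "device": "mobile"}
hashed = hash_row(row, NUM_BUCKETS)
print(hashed)
```

No vocabulary has to be built or shipped to the workers, and unseen values at prediction time still map to a valid index, at the cost of occasional collisions.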


Solution: Online predictive models

What we tried and what worked?

                       Logistic Regression       XGBoost on Spark   FFM
Availability           Spark ML                  Spark package      Separate C++ library (libFFM)
Online updates         Possible                  No                 Yes
Features               One-hot encoded vectors   Count-based        Custom encoded vectors
Distributed learning   Yes                       Yes                No
Outbrain score         0.63                      0.64               0.68


Solution: Field-aware Factorization Machine (FFM)

▪ For each unique feature, it learns a field-aware vector representation

▪ Needs to see each input vector only once: weights are updated one instance at a time

▪ Learns feature interactions very effectively

▪ Uses AdaGrad to learn the factorized weights

▪ Hyperparameters: k (latent vector length), η (learning rate), λ (regularization parameter)

▪ But: it is a shared-memory algorithm, so a distributed version is hard to implement
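The points above can be made concrete with a toy FFM in pure Python. This is a minimal sketch under stated assumptions, not libFFM itself: it handles only binary features (value 1), labels are -1 or +1, the score is the sum over feature pairs of the dot product between feature j1's vector for the field of j2 and feature j2's vector for the field of j1, and the `TinyFFM` class, its defaults and the two training rows are hypothetical.

```python
import math
import random


class TinyFFM:
    def __init__(self, k=4, eta=0.2, lam=2e-5, seed=0):
        self.k, self.eta, self.lam = k, eta, lam
        self.rng = random.Random(seed)
        self.w = {}  # (feature, other field) -> latent vector of length k
        self.g = {}  # same keys -> AdaGrad sums of squared gradients

    def _key(self, feature, field):
        """Create the latent vector lazily and return its key."""
        key = (feature, field)
        if key not in self.w:
            scale = 1.0 / math.sqrt(self.k)
            self.w[key] = [self.rng.uniform(0.0, scale) for _ in range(self.k)]
            self.g[key] = [1.0] * self.k
        return key

    def _pairs(self, x):
        """x is a list of (field, feature); yield the two vector keys per pair."""
        for a in range(len(x)):
            for b in range(a + 1, len(x)):
                (f1, j1), (f2, j2) = x[a], x[b]
                yield self._key(j1, f2), self._key(j2, f1)

    def score(self, x):
        return sum(
            sum(p * q for p, q in zip(self.w[k1], self.w[k2]))
            for k1, k2 in self._pairs(x)
        )

    def predict(self, x):
        return 1.0 / (1.0 + math.exp(-self.score(x)))

    def update(self, x, y):
        """One AdaGrad step on the logistic loss for one instance, y in {-1, +1}."""
        kappa = -y / (1.0 + math.exp(y * self.score(x)))
        for k1, k2 in self._pairs(x):
            w1, w2, g1, g2 = self.w[k1], self.w[k2], self.g[k1], self.g[k2]
            for d in range(self.k):
                grad1 = self.lam * w1[d] + kappa * w2[d]
                grad2 = self.lam * w2[d] + kappa * w1[d]
                g1[d] += grad1 * grad1
                g2[d] += grad2 * grad2
                w1[d] -= self.eta / math.sqrt(g1[d]) * grad1
                w2[d] -= self.eta / math.sqrt(g2[d]) * grad2


# Hypothetical data: the same publisher, clicked with one advertiser only.
clicked = [("publisher", "pub=ESPN"), ("advertiser", "adv=Nike")]
skipped = [("publisher", "pub=ESPN"), ("advertiser", "adv=Gucci")]
model = TinyFFM()
for _ in range(300):
    model.update(clicked, +1)
    model.update(skipped, -1)
print(model.predict(clicked), model.predict(skipped))
```

Every update reads and writes the shared dictionaries of latent vectors, which hints at why the algorithm suits a single shared-memory machine and is awkward to distribute.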


Solution: Apache Spark + FFM

Best of both worlds

▪ Use Apache Spark for data joining, cleaning, transformation and feature engineering: the DataFrame API is fast and easy to use for the task

▪ Use the transformed data to train FFMs on a single machine

▪ Alternatively, build a streaming pipeline to transform each incoming input into a feature vector and send it to the FFM for model updates

▪ Use the trained FFM model for real-time ad-click probability prediction
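The hand-off between the two systems is a text file: libFFM reads lines of the form `label field:feature:value`, with integer field and feature indices. The sketch below shows one possible per-row conversion; the field list, the hashing scheme and the helper names are assumptions for illustration, and in a Spark job the same function would typically be applied to each row of the transformed DataFrame before writing it out.

```python
import hashlib

# Hypothetical field order; the position in this list is the libFFM field index.
FIELDS = ["publisher", "advertiser", "device"]


def feature_index(field, value, num_buckets=2 ** 20):
    """Hash a field=value pair to an integer feature index."""
    key = f"{field}={value}".encode("utf-8")
    return int.from_bytes(hashlib.md5(key).digest()[:8], "big") % num_buckets


def to_libffm_line(label, row, num_buckets=2 ** 20):
    """Format one labelled row as 'label field:feature:value ...'."""
    parts = [str(label)]
    for field_id, field in enumerate(FIELDS):
        if field in row:  # missing fields are simply left out of the line
            idx = feature_index(field, row[field], num_buckets)
            parts.append(f"{field_id}:{idx}:1")
    return " ".join(parts)


row = {"publisher": "ESPN", "advertiser": "Nike", "device": "mobile"}
line = to_libffm_line(1, row)
print(line)
```

Keeping the conversion a pure function of one row makes it equally usable in a batch export and in the streaming variant mentioned above.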


Summary

▪ Apache Spark is ideal for handling and exploring large datasets

▪ But for specific cases like ad-click prediction, where the data is very high-dimensional (millions of features) and sparse (each instance has only tens of features), the current algorithms in Spark ML/MLlib may not be up to the mark

▪ Some external Apache Spark packages, like XGBoost, are available and should be used where needed

▪ Some highly effective algorithms, like FFM, are not yet available on Apache Spark

▪ But they can be easily integrated into the overall Apache Spark workflow to take advantage of cluster resources, e.g., for parameter tuning


IMAGINEA TECHNOLOGIES: CORPORATE OVERVIEW


Pramati’s M&As of leading products

Serving from 5 Global Locations

Innovation Enablement

Over 200 Product Companies

Unique Products & Services

Agile Methodology

User-centric Design

Open Source Contributions

Products built from conception-code-cash

Imaginea: Agile Engineering Culture


Our credentials

Building products on Spark since 2014

Contributions to Spark code: Spark Scala, packaging Spark, compilation for Scala, API

Part of the Spark team since 2013, while it was a Berkeley project; worked commercially with Databricks on developing it

Over 20 patches to the Apache Hadoop big data platform; worked commercially with TubeMogul on video analytics

Contributions to Zeppelin code: JDBC interpreter, wildcard parsing, integration


Our expertise

Data management

Data augmentation

Data ingestion optimization

Data filtering

Operations

Cluster management

Storage optimization

Solutions

Predictive search

Predictive Analytics

Interactive data exploration


HAVE ANY QUESTIONS?

Just tweet your question with the hashtag #AskImaginea

FOLLOW US ON

https://www.linkedin.com/company/imaginea

https://twitter.com/ImagineaTech

https://www.slideshare.net/Imaginea

For more details about Imaginea, visit www.imaginea.com or write to [email protected]

Disclaimer

This document may contain forward-looking statements concerning products and strategies. These statements are based on management's current expectations and actual results may differ materially from those projected, as a result of certain risks, uncertainties and assumptions, including but not limited to: the growth of the markets addressed by our products and our customers' products, the demand for and market acceptance of our products; our ability to successfully compete in the markets in which we do business; our ability to successfully address the cost structure of our offerings; the ability to develop and implement new technologies and to obtain protection for the related intellectual property; and our ability to realize financial and strategic benefits of past and future transactions. These forward-looking statements are made only as of the date indicated, and the company disclaims any obligation to update or revise the information contained in any forward-looking statements, whether as a result of new information, future events or otherwise.

All Trademarks and other registered marks belong to their respective owners.

Copyright © 2017, Imaginea Technologies, Inc. and/or its affiliates. All rights reserved.

Credits

Images under Creative Commons Zero license.
