Apache Spark machine learning for CTR prediction
AD CLICK PREDICTION USING APACHE SPARK MACHINE LEARNING
Private and confidential. Copyright (C) 2017, Imaginea Technologies Inc. All rights reserved.
Advanced Machine Learning
Democratization of Machine Learning
Future of Machine Learning
Data Flywheels
The Algorithm Economy
Cloud-hosted intelligence
Machine Learning Platforms
ML/AI at Center Stage
Introduction
Trends in Machine Learning
ML Application Development: Systems that understand, learn, predict, adapt & potentially operate autonomously
Industries: Preventive Healthcare, Banking, Finance, Media, Supply Chain
TABLE OF CONTENTS
Context
Business problem
Challenges
Solution
Summary
Industry Challenge
“Predicting ad click–through rates (CTR) is a massive-scale learning problem that is central to the multi-billion dollar online advertising industry” ~ Google
Context
▪ Ad platforms collect huge volumes of data to help them predict ad clicks
▪ A good predictive model is essential to serve ads efficiently and optimize overall economic value
▪ Sponsored search advertising, contextual advertising, display advertising, and real-time bidding auctions have all relied heavily on the ability of learned models to predict ad click–through rates accurately, quickly, and reliably
Motivation
Context
Publisher  Advertiser  Bid ($)  Predicted CTR  Expected Bid ($)
ESPN       Nike        1        0.6            1 x 0.6 = 0.6
ESPN       Gucci       2        0.1            2 x 0.1 = 0.2
Pay-per-click policy: an advertiser pays only to the extent that their ads are clicked by users
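The arithmetic in the table can be sketched in a few lines of Python. This is only an illustration of the expected-bid calculation: the `Candidate` class and the `pick_ad` helper are hypothetical names, and a real ad exchange would apply further auction rules that the slides do not describe.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    advertiser: str
    bid: float            # dollars paid per click
    predicted_ctr: float  # model's estimated click probability


def expected_bid(c: Candidate) -> float:
    """Expected revenue per impression under pay-per-click: bid x CTR."""
    return c.bid * c.predicted_ctr


def pick_ad(candidates):
    """Return the candidate with the highest expected bid."""
    return max(candidates, key=expected_bid)


# The two rows from the table above.
candidates = [Candidate("Nike", 1.0, 0.6), Candidate("Gucci", 2.0, 0.1)]
winner = pick_ad(candidates)
print(winner.advertiser, round(expected_bid(winner), 2))
```

Although Gucci bids twice as much per click, Nike wins the slot because its predicted CTR is six times higher, which is why the accuracy of the CTR estimate matters so much.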
Ad-click prediction challenges
Problem
How to build a predictive model that …
▪ Can deal with huge data volume
▪ Has high predictive power
▪ Is conducive to incremental learning
▪ Deals with high-dimensional sparse data
Big Data
Need for a distributed data processing engine to handle volume, variety and velocity of data
Challenges
▪ Billions of ad-impressions served per day
▪ Millions of users and their history
▪ For any sizeable ad exchange, this data runs to hundreds of GBs
▪ Almost impossible to crunch that much data on a single-machine, shared-memory model
The volume, variety and velocity of the incoming data make distributed data processing essential
Data sparsity
How to efficiently handle sparse datasets?
Challenges
▪ Millions of categorical features
▪ Huge number (in millions) of potential features for each input vector
▪ But only a limited set of features is actually present in each vector
▪ Generally stored in a dense format, but algorithms expect categorical data to be encoded
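To make the sparsity point concrete, here is a minimal sketch of one-hot encoding that stores only the active column indices instead of a full dense vector. The field names (`publisher`, `advertiser`, `device`) and both helper functions are assumptions made for illustration; they are not taken from the original deck.

```python
def build_vocabulary(rows):
    """Assign one column index to every distinct (field, value) pair."""
    vocab = {}
    for row in rows:
        for field, value in row.items():
            vocab.setdefault((field, value), len(vocab))
    return vocab


def one_hot_sparse(row, vocab):
    """Return only the indices of the non-zero columns for one record."""
    return sorted(vocab[(f, v)] for f, v in row.items() if (f, v) in vocab)


# Hypothetical raw records, stored densely as field -> categorical value.
rows = [
    {"publisher": "ESPN", "advertiser": "Nike", "device": "mobile"},
    {"publisher": "ESPN", "advertiser": "Gucci", "device": "desktop"},
]
vocab = build_vocabulary(rows)
encoded = [one_hot_sparse(r, vocab) for r in rows]
print(len(vocab), encoded)
```

Each record activates exactly one column per field, so with millions of distinct values the encoded vector is millions of columns wide yet holds only a handful of non-zero entries.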
Predictive models
Scalable and effective predictive models
Challenges
▪ Logistic regression models have historically been the workhorse for such tasks
▪ However, a number of studies in the last few years have noted that the effects of feature conjunctions are important
▪ So we need scalable non-linear models that can take feature interactions into account
Empirically, what models work best for this particular domain?
Online learning
Piecemeal learning
Challenges
▪ Update model weights by looking at one input vector at a time
▪ Ideally, avoid loading the whole input dataset into memory
Can the predictive model be trained on streaming data?
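One way to picture piecemeal learning is a logistic regression updated by stochastic gradient descent, one example at a time. The sketch below is an illustration under assumptions, not the authors' implementation: the class name, the learning rate, the feature strings, and the toy `impressions` generator are all hypothetical.

```python
import math


class OnlineLogisticRegression:
    """Logistic regression over binary features, trained one example at a time."""

    def __init__(self, learning_rate=0.1):
        self.learning_rate = learning_rate
        self.weights = {}  # only features seen so far are stored

    def predict(self, active_features):
        z = sum(self.weights.get(f, 0.0) for f in active_features)
        z = max(-35.0, min(35.0, z))  # keep exp() in a safe range
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, active_features, clicked):
        """One SGD step on the log loss for a single (features, label) pair."""
        error = self.predict(active_features) - clicked
        for f in active_features:
            self.weights[f] = self.weights.get(f, 0.0) - self.learning_rate * error


def impressions():
    """Hypothetical stream: yields one impression at a time, never a full dataset."""
    for _ in range(200):
        yield ["pub=ESPN", "adv=Nike"], 1
        yield ["pub=ESPN", "adv=Gucci"], 0


model = OnlineLogisticRegression()
for features, clicked in impressions():
    model.update(features, clicked)

print(model.predict(["pub=ESPN", "adv=Nike"]), model.predict(["pub=ESPN", "adv=Gucci"]))
```

Because the model only ever holds the current example and its weight table, memory use depends on the number of distinct features seen, not on the number of impressions processed.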
Handle big & sparse data using Apache Spark
Solution
▪ Distributed data processing
▪ Use the Apache Spark DataFrames API to build the data processing pipeline
▪ The DataFrames API supports SQL queries and is highly optimized under the hood
▪ Flexibility to write your own custom transformers and UDFs
▪ Tip: Use feature hashing to constrain the model size
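The feature hashing tip can be illustrated without Spark. In this sketch every (field, value) pair is hashed into a fixed number of buckets, so the model size is bounded by the bucket count no matter how many distinct values appear. The function names, the MD5-based hash, and the default of 2**18 buckets are assumptions chosen for the example.

```python
import hashlib


def hashed_index(field, value, num_buckets):
    """Map a (field, value) pair to a stable bucket in [0, num_buckets)."""
    key = f"{field}={value}".encode("utf-8")
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def hash_features(row, num_buckets=2**18):
    """Return the sorted, de-duplicated bucket indices for one record."""
    return sorted({hashed_index(f, v, num_buckets) for f, v in row.items()})


# Hypothetical record; no vocabulary needs to be built or stored.
row = {"publisher": "ESPN", "advertiser": "Nike", "device": "mobile"}
print(hash_features(row))
```

The trade-off is collisions: two different features can land in the same bucket, so the bucket count is a tuning knob between memory footprint and accuracy.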
Online predictive models
What we tried and what worked?
Solution
                      Logistic Regression      XGBoost on Spark   FFM
Availability          Spark ML                 Spark package      Separate C++ library (libFFM)
Online updates        Possible                 No                 Yes
Features              One-hot encoded vectors  Count-based        Custom encoded vectors
Distributed learning  Yes                      Yes                No
Outbrain score        0.63                     0.64               0.68
Field-aware Factorization Machine (FFM)
Solution
▪ For each unique feature, it learns a field aware vector representation
▪ Needs to see an input vector only once: weights are updated one instance at a time
▪ Learns feature interactions very effectively
▪ Uses AdaGrad for matrix factorization
▪ Hyperparameters: k (weight vector length), η (learning rate), λ (regularization parameter)
▪ But: shared memory algorithm, hard to implement distributed version
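The bullets above can be sketched as a toy, single-machine FFM in pure Python. This is a simplified illustration, not libFFM: the class `TinyFFM`, the default hyperparameter values, the uniform initialization in [0, 1/sqrt(k)], and starting the AdaGrad accumulators at 1 are all assumptions, and every feature is treated as binary with value 1.

```python
import math
import random


class TinyFFM:
    """Toy FFM: one latent vector per (feature, other field) pair, AdaGrad updates."""

    def __init__(self, k=4, eta=0.2, lam=2e-5, seed=0):
        self.k, self.eta, self.lam = k, eta, lam
        self.rng = random.Random(seed)
        self.w = {}  # (feature, field) -> latent vector of length k
        self.g = {}  # (feature, field) -> accumulated squared gradients

    def _vec(self, feature, field):
        key = (feature, field)
        if key not in self.w:
            scale = 1.0 / math.sqrt(self.k)
            self.w[key] = [self.rng.uniform(0.0, scale) for _ in range(self.k)]
            self.g[key] = [1.0] * self.k
        return self.w[key], self.g[key]

    def _pairs(self, x):
        # x is a list of (field, feature); each pair of features interacts through
        # the vector each feature keeps for the *other* feature's field.
        for a in range(len(x)):
            for b in range(a + 1, len(x)):
                (f1, j1), (f2, j2) = x[a], x[b]
                yield self._vec(j1, f2), self._vec(j2, f1)

    def score(self, x):
        return sum(sum(p * q for p, q in zip(w1, w2))
                   for (w1, _), (w2, _) in self._pairs(x))

    def predict(self, x):
        return 1.0 / (1.0 + math.exp(-self.score(x)))

    def update(self, x, y):
        """One AdaGrad step on the logistic loss; y must be +1 (click) or -1."""
        kappa = -y / (1.0 + math.exp(y * self.score(x)))
        for (w1, g1), (w2, g2) in self._pairs(x):
            for d in range(self.k):
                grad1 = self.lam * w1[d] + kappa * w2[d]
                grad2 = self.lam * w2[d] + kappa * w1[d]
                g1[d] += grad1 * grad1
                g2[d] += grad2 * grad2
                w1[d] -= self.eta / math.sqrt(g1[d]) * grad1
                w2[d] -= self.eta / math.sqrt(g2[d]) * grad2


# Hypothetical training data: (field, feature) pairs and a +1/-1 label.
nike = [("publisher", "ESPN"), ("advertiser", "Nike")]
gucci = [("publisher", "ESPN"), ("advertiser", "Gucci")]
ffm = TinyFFM()
for _ in range(300):
    ffm.update(nike, +1)
    ffm.update(gucci, -1)
print(ffm.predict(nike), ffm.predict(gucci))
```

Each call to `update` touches only the vectors of the features present in that one instance, which is what makes the method suitable for instance-at-a-time learning; the shared dictionaries `w` and `g` also show why a distributed version is awkward.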
Apache Spark + FFM
Best of both worlds
Solution
Use Apache Spark for data joining, cleaning, transformation and feature engineering: the DataFrames API is fast and easy to use for the task
Use transformed data to train FFMs on a single machine
Alternatively, build a streaming pipeline to transform each incoming input into feature vector and send to FFM for model updates
Use trained FFM model for real time ad-click probability
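The hand-off between the two systems is typically a text file. The sketch below writes records in the `label field:feature:value` layout that libFFM reads; the helper name `to_libffm_line`, the field numbering, and the sample rows are assumptions for illustration, and in practice the transformed rows would come out of the Spark pipeline rather than a Python list.

```python
def to_libffm_line(label, row, field_ids, feature_ids):
    """Format one record as '<label> <field>:<feature>:<value> ...'."""
    parts = [str(label)]
    for field, value in row.items():
        # Assign a new integer index the first time a (field, value) pair is seen.
        index = feature_ids.setdefault((field, value), len(feature_ids))
        parts.append(f"{field_ids[field]}:{index}:1")
    return " ".join(parts)


# Hypothetical field numbering and records.
field_ids = {"publisher": 0, "advertiser": 1}
feature_ids = {}
records = [
    (1, {"publisher": "ESPN", "advertiser": "Nike"}),
    (0, {"publisher": "ESPN", "advertiser": "Gucci"}),
]
lines = [to_libffm_line(label, row, field_ids, feature_ids) for label, row in records]
print("\n".join(lines))
```

For a bounded model size, the `feature_ids` dictionary could be replaced by a hashed index so that no global vocabulary has to be collected across the cluster.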
Summary
Solution
Apache Spark is ideal for handling and exploring large datasets
But for specific cases like ad-click prediction, where the data is very high dimensional (millions of features) and sparse (each instance has only tens of features), the current algorithms in Spark ML/MLlib may not be up to the mark
Some external Apache Spark packages like XGBoost have been made available, so use them where needed
Some highly effective algorithms like FFM are not yet on Apache Spark
But they can be easily integrated into the overall Apache Spark workflow to take advantage of cluster resources, e.g., for parameter tuning
IMAGINEA TECHNOLOGIES: CORPORATE OVERVIEW
Pramati’s M&As of leading products
Serving from 5 Global Locations
Innovation Enablement
Over 200 Product Companies
Unique Products & Services
Agile Methodology
User-centric Design
Open Source Contributions
Products built from conception-code-cash
Imaginea: Agile Engineering Culture
Our credentials
Building products on Spark since 2014
Contribution to Spark code: Spark Scala, Packaging Spark, Compilation for Scala, API
Part of the Spark team since 2013, while it was a Berkeley project; worked commercially with Databricks on developing it
Over 20 patches to the Apache Hadoop big data platform; worked commercially with TubeMogul on video analytics
Contribution to Zeppelin code: JDBC Interpreter, Wildcard Parsing, Integration
Our expertise
Data management
Data augmentation
Data ingestion optimization
Data filtering
Operations
Cluster management
Storage optimization
Solutions
Predictive search
Predictive Analytics
Interactive data exploration
FOLLOW US ON
HAVE ANY QUESTIONS?
Just tweet your question with the hashtag
#AskImaginea
https://www.linkedin.com/company/imaginea
https://twitter.com/ImagineaTech
https://www.slideshare.net/Imaginea
For more details about Imaginea, visit www.imaginea.com or write to [email protected]
Disclaimer
This document may contain forward-looking statements concerning products and strategies. These statements are based on management's current expectations and actual results may differ materially from those projected, as a result of certain risks, uncertainties and assumptions, including but not limited to: the growth of the markets addressed by our products and our customers' products, the demand for and market acceptance of our products; our ability to successfully compete in the markets in which we do business; our ability to successfully address the cost structure of our offerings; the ability to develop and implement new technologies and to obtain protection for the related intellectual property; and our ability to realize financial and strategic benefits of past and future transactions. These forward-looking statements are made only as of the date indicated, and the company disclaims any obligation to update or revise the information contained in any forward-looking statements, whether as a result of new information, future events or otherwise.
All Trademarks and other registered marks belong to their respective owners.
Copyright © 2017, Imaginea Technologies, Inc. and/or its affiliates. All rights reserved.
Credits
Images under Creative Commons Zero license.