© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Jonathan Fritz, Sr. Product Manager, Amazon EMR
Jasjeet Thind, Sr. Director, Data Science & Engineering, Zillow Group
November 29, 2016
MAC303
Zillow Group: Developing Classification and Recommendation Engines with Amazon EMR and Apache Spark
What to Expect from the Session
• Apache Spark and Spark ML overview
• Running Spark ML on Amazon EMR
• Interactive notebook options
• Building recommendation engines at Zillow Group
Spark for fast processing
[Figure: DAG of RDDs A through F connected by map, filter, groupBy, and join operations across Stages 1-3; the legend distinguishes RDDs from cached partitions]
• Massively parallel
• Uses DAGs instead of MapReduce for execution
• Minimizes I/O by storing data in DataFrames in memory
• Partitioning-aware to avoid network-intensive shuffle
Spark components to match your use case
Spark ML addresses the full ML pipeline
- Built on top of DataFrame API
- Extract, transform, and select features
- Distributed algorithms
  - Classification and Regression
  - Clustering
  - Collaborative Filtering
- Model selection tools
- Pipelines
Pipeline stages: Process Data → Feature Extraction → Model Training → Model Testing → Model Validation
Extracting features in DataFrames
- Feature Extractors
  - CountVectorizer
- Feature Transformers
  - Tokenizer
  - Binarizer
  - StandardScaler
- Feature Selectors
  - VectorSlicer
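To make the extractors and transformers listed above concrete, here is a minimal plain-Python sketch of what Tokenizer, CountVectorizer, Binarizer, and StandardScaler compute. It illustrates the ideas only and is not the Spark ML API: all function names are hypothetical, the vocabulary is sorted alphabetically for simplicity, and the scaler is assumed to both center and scale.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()

def count_vectorize(docs):
    """Build a vocabulary from tokenized docs and return per-document term counts."""
    vocab = sorted({tok for doc in docs for tok in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts.get(tok, 0) for tok in vocab])
    return vocab, vectors

def binarize(values, threshold):
    """Map values strictly above the threshold to 1.0 and all others to 0.0."""
    return [1.0 if v > threshold else 0.0 for v in values]

def standard_scale(values):
    """Center to zero mean and scale to unit sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    std = math.sqrt(variance)
    return [(v - mean) / std if std else 0.0 for v in values]

# Hypothetical listing descriptions, used only to exercise the functions.
docs = [tokenize("Sunny condo near park"), tokenize("Condo with park view")]
vocab, vectors = count_vectorize(docs)
```

In Spark ML the same steps are DataFrame transformations that can be chained as pipeline stages.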
Many storage layers to choose from
Storage layers shown with Amazon EMR: Amazon DynamoDB, Amazon RDS, Amazon Kinesis, Amazon Redshift, Amazon S3
Example: training data → bank loan write-off predictions
Classification algorithms in Spark ML
- Logistic regression
- Decision tree classifier
- Random forest classifier
- Gradient-boosted tree classifier
- Multilayer perceptron classifier
- One-vs-Rest classifier
- Naive Bayes
What is logistic regression?
What are decision trees?
Weather predictors for Golf
Decision trees: tree induction
Decision trees: partition data with hyperplanes
Spark ML pipelines - training
Spark ML pipelines - testing
Creating a Spark ML pipeline
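import org.apache.spark.ml.Pipeline
// assembler, indexer, and dt are previously defined pipeline stages
val pipeline = new Pipeline().setStages(Array(assembler, indexer, dt))
val model = pipeline.fit(df)               // fit every stage in order
val predictions = model.transform(df)      // apply the fitted PipelineModel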
Save and load machine learning models and full Pipelines
Tools to pick the right model
- CrossValidator and TrainValidationSplit select the Model produced by the best-performing set of parameters
- Split the input data into separate training and test datasets
- For each (training, test) pair, iterate through the set of ParamMaps
- Fit the Estimator using those parameters, get the fitted Model, and evaluate the Model's performance using the Evaluator
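The selection loop described above can be sketched in plain Python. This is not the Spark ML API: `param_grid`, `k_folds`, `cross_validate`, the toy threshold estimator, and the sample rows are all hypothetical, chosen only to show how each parameter combination is fit and evaluated on every (training, test) pair before the best one is refit on all the data.

```python
from itertools import product

def param_grid(grid):
    """Expand {'name': [values]} into a list of parameter dicts (the ParamMaps)."""
    names = sorted(grid)
    return [dict(zip(names, combo)) for combo in product(*(grid[n] for n in names))]

def k_folds(rows, k):
    """Yield (training, test) pairs; fold i holds out every k-th row starting at i."""
    for i in range(k):
        test = rows[i::k]
        train = [r for j, r in enumerate(rows) if j % k != i]
        yield train, test

def cross_validate(rows, fit, evaluate, grid, k=3):
    """Return (best_params, best_score, model refit on all rows)."""
    best_params, best_score = None, float("-inf")
    for params in param_grid(grid):
        scores = [evaluate(fit(train, params), test) for train, test in k_folds(rows, k)]
        avg = sum(scores) / len(scores)
        if avg > best_score:
            best_params, best_score = params, avg
    return best_params, best_score, fit(rows, best_params)

def fit_threshold(train, params):
    """Toy estimator: cut at the midpoint of the class means, shifted by 'offset'."""
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == 0]
    cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2 + params["offset"]
    return lambda x: 1 if x > cut else 0

def accuracy(model, test):
    """Evaluator: fraction of test rows the model labels correctly."""
    return sum(1 for x, y in test if model(x) == y) / len(test)

rows = [(1, 0), (2, 0), (3, 0), (7, 1), (8, 1), (9, 1)]
best, score, model = cross_validate(rows, fit_threshold, accuracy, {"offset": [-5, 0, 5]})
```

A TrainValidationSplit-style variant would evaluate each parameter set on a single split instead of k folds, trading reliability for speed.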
Why Amazon EMR?
Easy to Use: Launch a cluster in minutes
Low Cost: Pay an hourly rate
Open-Source Variety: Latest versions of software
Managed: Spend less time monitoring
Secure: Easy to manage options
Flexible: Customize the cluster
Develop fast using notebooks and IDEs
Spark on YARN
• Run the Spark driver in client or cluster mode
• A Spark application runs as a YARN application
• SparkContext runs as a library in your program, one instance per Spark application
• Spark executors run in YARN containers on NodeManagers in your cluster
• Access the Spark UI through the ResourceManager or Spark History Server
Spark UI
Monitor your Spark jobs
Auto Scaling for data science on-demand
YARN metrics
Coming soon: advanced Spot provisioning
Master Node | Core Instance Fleet | Task Instance Fleet
• Provision from a list of instance types with Spot and On-Demand
• Launch in the optimal Availability Zone based on capacity and price
• Spot Block support
Productionizing your pipeline
• Amazon EMR Step API: submit a Spark application
• AWS Data Pipeline, or Airflow, Luigi, or other schedulers on EC2: create a pipeline to schedule job submission or create complex workflows
• AWS Lambda: submit applications to the EMR Step API or directly to Spark on your cluster
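As a rough sketch of the Step API option above, the snippet below builds a step definition that runs `spark-submit` through `command-runner.jar`. The step name, S3 path, arguments, and cluster id are hypothetical placeholders, and the actual submission call is left commented out because it needs AWS credentials and a running cluster.

```python
def spark_step(name, app_uri, app_args=(), deploy_mode="cluster"):
    """Build one step definition in the shape the EMR AddJobFlowSteps API expects."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", deploy_mode, app_uri, *app_args],
        },
    }

# Hypothetical job location and arguments.
step = spark_step(
    "score-recommendations",
    "s3://example-bucket/jobs/score.py",
    ["--date", "2016-11-29"],
)

# Submission (not executed here):
# import boto3
# emr = boto3.client("emr")
# emr.add_job_flow_steps(JobFlowId="j-XXXXXXXXXXXXX", Steps=[step])
```

The same call could be made from an AWS Lambda handler or from a scheduler task, which is how the options on this slide relate to one another.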
Recommendation Systems @ Zillow Group
Jasjeet Thind
Sr. Director, Data Science & Engineering
Agenda
Intro to Zillow Group
Recommendation Use Cases
Architecture
Algorithms
Training & Scoring Pipeline
Metrics
Zillow Group
Build the world's largest, most trusted, and vibrant home-related marketplace.
Recommendation use cases
Email - homes for sale / for rent
Home Details - homes for sale / homes like this
Personalized Search
Mobile - smart SMS and push notifications
Home owner / pre-seller predictions
Lender selection algorithm
Similar photos / video
Architecture
Recommendation API (Python, R, Flask)
Zillow Group Data Lake (S3 / Kinesis)
Property Featurization (Spark EMR)
User Profiles (Spark EMR)
Ranking (Spark EMR)
Wedge Counting Collaborative Filtering (Spark EMR)
Property Aggregate Features (Spark EMR)
Data Collection Systems (Java/Python/SQL)
Like vs. dislike
Predict homes per user using behavior of similar users
Like = user actively engaged with property
Dislike = user viewed property but engaged only weakly
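[Figure: bipartite graph linking users Spencer and Stan to homes priced $22M, $19M, and $664K, with like (+), dislike (-), and unknown (?) edges]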
Feature        Description
uid            Unique id of user
pid            Property id
first_visit    Timestamp or 0
num_views      sigmoid(#views)
time_spent     Time on page
num_contacts   # leads sent
num_saves      # saves on zpid
num_shares     # shares on zpid
num_photos     # photos viewed
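The table above lists num_views as sigmoid(#views). A minimal sketch of that squashing step follows, assuming the standard logistic function applied directly to the raw count; the slide does not say whether the count is shifted or scaled first, and the helper names and sample values are hypothetical.

```python
import math

def sigmoid(x):
    """Standard logistic function, mapping any real x into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def engagement_row(uid, pid, views, **other_counts):
    """Build one (uid, pid) feature row; only num_views is squashed here."""
    row = {"uid": uid, "pid": pid, "num_views": sigmoid(views)}
    row.update(other_counts)
    return row

# Hypothetical user and property ids.
row = engagement_row("u1", "p1", views=3, num_saves=1, num_shares=0)
```

Squashing keeps a handful of very heavy viewers from dominating the feature: the value saturates toward 1 as the view count grows.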
Wedge count
To form a prediction for each user and property pair, compute wedge counts
- http://www.jmlr.org/proceedings/papers/v18/kong12a/kong12a.pdf
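Does Stan like the $19M home? Example wedges: wedge #3 (wedge03_cnt) and wedge #5 (wedge05_cnt)
[Figure: two example wedges connecting Stan to the $19M home through Spencer, one via the $22M home and one via the $664K home, with like (+) and dislike (-) edge labels]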
Classifier
Gradient Boosting Classifier (sklearn)
Popular users / properties:
- Divide wedge counts by degree product ju * ki
Prediction for all user / property pairs, limit candidate set by
- Top 10 zip codes
- 300 properties per user
features
wedge00_cnt
wedge01_cnt
wedge02_cnt
wedge03_cnt
wedge04_cnt
wedge05_cnt
wedge06_cnt
wedge07_cnt
wedge00_norm_cnt
wedge01_norm_cnt
wedge02_norm_cnt
wedge03_norm_cnt
wedge04_norm_cnt
wedge05_norm_cnt
wedge06_norm_cnt
wedge07_norm_cnt
Example row for "Does Stan like the $19M home?": key (uid: Stan, pid: $19M), features: the 16 wedge count columns listed above
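A minimal sketch of wedge counting for one (user, property) pair follows. A wedge is taken here to be a three-edge path user → other property → other user → target property, and the three edge signs give 2^3 = 8 wedge types. The bit encoding of sign patterns into wedge numbers is an assumption (the slides do not define the numbering), the edge signs in the example are illustrative rather than read off the figure, and normalization divides by the product of the user's and the property's degrees.

```python
from collections import defaultdict

def wedge_features(edges, user, prop):
    """edges maps (user, property) -> +1 (like) or -1 (dislike)."""
    by_user = defaultdict(dict)
    by_prop = defaultdict(dict)
    for (u, p), sign in edges.items():
        by_user[u][p] = sign
        by_prop[p][u] = sign

    counts = [0] * 8
    for mid_prop, s1 in by_user[user].items():
        if mid_prop == prop:
            continue
        for mid_user, s2 in by_prop[mid_prop].items():
            if mid_user == user:
                continue
            s3 = by_user[mid_user].get(prop)
            if s3 is None:
                continue
            # Assumed encoding: one bit per edge, like = 1, dislike = 0.
            idx = 4 * (s1 > 0) + 2 * (s2 > 0) + (s3 > 0)
            counts[idx] += 1

    degree_product = len(by_user[user]) * len(by_prop[prop])
    features = {}
    for i, c in enumerate(counts):
        features[f"wedge{i:02d}_cnt"] = c
        features[f"wedge{i:02d}_norm_cnt"] = c / degree_product if degree_product else 0.0
    return features

# Illustrative signed edges; the (Stan, $19M) edge is the one to predict.
edges = {
    ("Stan", "$22M"): +1, ("Spencer", "$22M"): -1,
    ("Stan", "$664K"): -1, ("Spencer", "$664K"): +1,
    ("Spencer", "$19M"): +1,
}
features = wedge_features(edges, "Stan", "$19M")
```

The resulting 16 values form the feature vector for one candidate pair, which a classifier such as the gradient boosting model named on the slide can then score.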
User profile
Signals - website, mobile app, and search queries
Binary classification
- labels (like/dislike) same as collab filtering model
User profile model determines preference scores
Feature (categorical variable)   Values
Bath                             0_bath, 0.5_bath, 1_bath, 1.5_bath, 2_bath, 2.5_bath, 3_bath
Bed                              0_bed, 1_bed, 2_bed, 3_bed, 4_bed, 5_bed
Price                            100_125_price, 125_150_price, 150_175_price
Use Code                         condo, single_family, farm_land
Zipcode                          zip_98109
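Training rows: (pid, uid, features, label), where each feature is a 0 or 1 indicator for one of the categorical values listed above and the label is 0 or 1
Example preference scores: 0_bed: 0, 1_bed: 0.01, 2_bed: 0.8, 3_bed: 0.6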
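One way to produce the 0/1 indicator features above is to turn each property attribute into a categorical token and one-hot encode it against a fixed vocabulary. This is a sketch under assumptions: the price bucket names are read as thousands of dollars in 25K-wide bins, and the helper names are hypothetical.

```python
def price_bucket(price, width=25_000):
    """Name the price bin, e.g. 110000 -> '100_125_price' (assumed 25K-wide bins)."""
    lo = int(price // width) * width // 1000
    hi = lo + width // 1000
    return f"{lo}_{hi}_price"

def property_tokens(bath, bed, price, use_code, zipcode):
    """Convert raw property attributes into categorical tokens."""
    return [f"{bath:g}_bath", f"{bed}_bed", price_bucket(price), use_code, f"zip_{zipcode}"]

def one_hot(tokens, vocabulary):
    """Return a 0/1 indicator per vocabulary entry."""
    present = set(tokens)
    return [1 if v in present else 0 for v in vocabulary]

# Small vocabulary for illustration; the real feature space covers every category value.
vocabulary = ["0_bed", "1_bed", "2_bed", "3_bed", "condo", "single_family"]
tokens = property_tokens(1.5, 2, 110_000, "condo", "98109")
indicators = one_hot(tokens, vocabulary)
```

Because user profiles and properties share this feature space, a learned preference score per token can be applied directly to any property's indicator vector.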
Ranking
Property matrix - feature space same as user profile
Dot product of property matrix with user profile vector
Age decay for older listings
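(uid, pid)                                  score
{"uId":"10307499", "pId":"1044183744"}      0.3364

Example with feature columns 0_bed, 1_bed, 2_bed, 3_bed, property rows pid_0 to pid_3, and user profile vector uid_0:
\[
\begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} 0 \\ 0.01 \\ 0.8 \\ 0.6 \end{pmatrix}
=
\begin{pmatrix} 0 \\ 0.8 \\ 0 \\ 0.6 \end{pmatrix}
\]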
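The ranking step can be sketched as a dot product per property followed by an optional age decay. The matrix and profile values below are the ones from the slide's example; the exponential half-life form of the decay and the `half_life_days` parameter are assumptions, since the slide says only that older listings are decayed.

```python
def rank(property_rows, profile, ages_days=None, half_life_days=30.0):
    """Score each property as row . profile, optionally decayed by listing age."""
    scores = {}
    for pid, row in property_rows.items():
        score = sum(x * w for x, w in zip(row, profile))
        if ages_days is not None:
            # Assumed decay: the score halves every half_life_days.
            score *= 0.5 ** (ages_days[pid] / half_life_days)
        scores[pid] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Columns: 0_bed, 1_bed, 2_bed, 3_bed
property_rows = {
    "pid_0": [1, 0, 0, 0],
    "pid_1": [0, 0, 1, 0],
    "pid_2": [1, 0, 0, 0],
    "pid_3": [0, 0, 0, 1],
}
profile_uid_0 = [0, 0.01, 0.8, 0.6]

ranked = rank(property_rows, profile_uid_0)
ranked_with_decay = rank(
    property_rows, profile_uid_0,
    ages_days={"pid_0": 0, "pid_1": 60, "pid_2": 0, "pid_3": 0},
)
```

Without decay the two-bedroom listing pid_1 ranks first; once it is treated as 60 days old, the fresher three-bedroom listing pid_3 overtakes it.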
Training & scoring
Collect user behavior and real-estate data, train the various models, generate the
candidate set, and make predictions.
[Figure: training and scoring pipeline. Components as labeled on the slide:]
- Sources: User Behavior (Kinesis / S3), Public Record (Kinesis / S3), Active Listings (Kinesis / S3)
- Ingestion: Event API (Java), Producer (Python)
- Filter (Spark) and User Store (Hive / S3): Spark job creates Hive table with user events (uid, pid) partitioned by date
- Training Data (Spark) and Training Set (Hive / S3): pid -> uid reverse index; past and current user events
- Train Models (Spark) and Models (Python): Collaborative Filtering / User Profile Models
- Score (Spark), Property Data, and Hashmap (Redis): wedge features or property features (user profile)
- Output: Recommendations
Offline evaluation
Hyperparameter tuning with validation set
Training/test data sets for model evaluation
Offline Metric   Description
Precision@k      r_k / k, where r_k = # recommended properties in the test set in the top k
Recall@k         r_k / n, where n = total # properties in the test set
Freshness        # listings recommended in the top k with modified date < y days old
Coverage         # unique listings recommended across all users / total # unique listings
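A minimal sketch of the offline metrics in the table above, assuming the standard definitions precision@k = r_k / k and recall@k = r_k / n. Function names and the sample ids are hypothetical.

```python
def precision_recall_at_k(recommended, relevant, k):
    """recommended: ranked list of pids; relevant: set of pids in the user's test set."""
    r_k = sum(1 for pid in recommended[:k] if pid in relevant)
    precision = r_k / k
    recall = r_k / len(relevant) if relevant else 0.0
    return precision, recall

def freshness(recommended, modified_age_days, k, max_age_days):
    """Count top-k recommended listings modified less than max_age_days ago."""
    return sum(1 for pid in recommended[:k] if modified_age_days[pid] < max_age_days)

def coverage(recs_by_user, all_listings):
    """Fraction of unique listings that appear in at least one user's recommendations."""
    catalog = set(all_listings)
    recommended = set()
    for recs in recs_by_user.values():
        recommended.update(recs)
    return len(recommended & catalog) / len(catalog)

precision, recall = precision_recall_at_k(["a", "b", "c", "d"], {"b", "d", "e"}, k=2)
fresh = freshness(["a", "b", "c"], {"a": 1, "b": 10, "c": 2}, k=2, max_age_days=7)
cov = coverage({"u1": ["a", "b"], "u2": ["b", "c"]}, ["a", "b", "c", "d"])
```

Precision and recall are computed per user and would typically be averaged across users, while coverage is computed once over the whole recommendation set.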
Future work
Classifiers for listing descriptions
Deep learning on listing images
Structured streaming on Spark 2.0
Cross-brand user signals - Zillow, Trulia, Hotpads, & StreetEasy
Real-time scoring
Thank you!
aws.amazon.com/emr/
aws.amazon.com/blogs/big-data/
http://www.zillow.com/data-science/
Come join us @ Zillow Group!
Hiring:
- SDE, ML, Data Scientist
- Big Data Engineer
- Analytic Engineer
- Product Management
Remember to complete
your evaluations!