AWS re:Invent 2016: Zillow Group: Developing Classification and Recommendation Engines with Amazon EMR and Apache Spark (MAC303)

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Jonathan Fritz, Sr. Product Manager, Amazon EMR

Jasjeet Thind, Sr. Director, Data Science & Engineering, Zillow Group

November 29, 2016

MAC303

Zillow Group: Developing Classification and Recommendation Engines with Amazon EMR and Apache Spark

What to Expect from the Session

• Apache Spark and Spark ML overview

• Running Spark ML on Amazon EMR

• Interactive notebook options

• Building recommendation engines at Zillow Group

Spark for fast processing

[Figure: Spark DAG in which RDDs A through F flow through map, filter, groupBy, and join operations across Stages 1 to 3; the legend marks RDDs and cached partitions]

• Massively parallel

• Uses DAGs instead of map-reduce for execution

• Minimizes I/O by storing data in DataFrames in memory

• Partitioning-aware to avoid network-intensive shuffle

Spark components to match your use case

Spark ML addresses the full ML pipeline

- Built on top of DataFrame API

- Extract, transform, and select features

- Distributed algorithms

- Classification and Regression

- Clustering

- Collaborative Filtering

- Model selection tools

- Pipelines

Process Data

Feature Extraction

Model Training

Model Testing

Model Validation

Extracting features in DataFrames

- Feature Extractors

- CountVectorizer

- Feature Transformers

- Tokenizer

- Binarizer

- StandardScaler

- Feature Selectors

- VectorSlicer
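To make the names above concrete, here is a minimal plain-Python sketch of what each of these extractors, transformers, and selectors computes. It does not use the Spark ML API: the function names and sample data are illustrative assumptions, and the Spark versions operate on DataFrame columns and offer more options.

```python
import math
from collections import Counter

def tokenize(text):
    # Tokenizer: lowercase the text and split it on whitespace.
    return text.lower().split()

def count_vectorize(docs):
    # CountVectorizer: build a vocabulary, then count each term per document.
    # The vocabulary is sorted alphabetically here for simplicity.
    vocab = sorted({tok for doc in docs for tok in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors

def binarize(values, threshold):
    # Binarizer: 1.0 if the value is strictly above the threshold, else 0.0.
    return [1.0 if v > threshold else 0.0 for v in values]

def standard_scale(values):
    # StandardScaler-style scaling with centering enabled:
    # subtract the mean, then divide by the sample standard deviation.
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / std for v in values]

def vector_slice(vector, indices):
    # VectorSlicer: keep only the selected positions of a feature vector.
    return [vector[i] for i in indices]

docs = [tokenize("Spark ML on EMR"), tokenize("spark on YARN")]
vocab, vectors = count_vectorize(docs)
```

In Spark ML the same steps are expressed as stages that read an input column and write an output column, which is what lets them be chained in a Pipeline.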

Many storage layers to choose from

Amazon DynamoDB

Amazon RDS

Amazon Kinesis

Amazon Redshift

Amazon S3

Amazon EMR

Training data

Bank loan write-off predictions

Classification algorithms in Spark ML

- Logistic regression

- Decision tree classifier

- Random forest classifier

- Gradient-boosted tree classifier

- Multilayer perceptron classifier

- One-vs-Rest classifier

- Naive Bayes

What is logistic regression?

What are decision trees?

Weather predictors for Golf

Decision trees: tree induction

Decision trees: partition data with hyperplanes

Spark ML pipelines - training

Spark ML pipelines - testing

Creating a Spark ML pipeline

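import org.apache.spark.ml.Pipeline

// assembler, indexer, and dt are pipeline stages defined elsewhere (not shown on the slide); df is the input DataFrame
val pipeline = new Pipeline().setStages(Array(assembler, indexer, dt))

val model = pipeline.fit(df)  // fits the stages in order and returns a PipelineModel

val predictions = model.transform(df)  // applies every fitted stage to produce predictions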

Save and load machine learning models and full Pipelines

Tools to pick the right model

- CrossValidator and TrainValidationSplit select the Model produced by the best-performing set of parameters

- Split the input data into separate training and test datasets

- For each (training, test) pair, iterate through the set of ParamMaps

- Fit the Estimator using those parameters, get the fitted Model, and evaluate the Model’s performance using the Evaluator
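The selection loop described in these bullets can be sketched in plain Python. This is not Spark's CrossValidator: the toy one-dimensional k-nearest-neighbors estimator, the accuracy evaluator, and the sample rows are all assumptions chosen to keep the example self-contained.

```python
def k_fold_splits(rows, num_folds):
    # Yield (training, test) pairs; fold i tests on every num_folds-th row starting at i.
    for i in range(num_folds):
        test = rows[i::num_folds]
        train = [r for j, r in enumerate(rows) if j % num_folds != i]
        yield train, test

def fit_knn(train, params):
    # Toy "Estimator": returns a model that votes among the k nearest training rows.
    k = params["k"]
    def predict(x):
        nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
        votes = sum(label for _, label in nearest)
        return 1 if 2 * votes > len(nearest) else 0
    return predict

def accuracy(model, test):
    # Toy "Evaluator": fraction of test rows predicted correctly.
    return sum(1 for x, label in test if model(x) == label) / len(test)

def cross_validate(rows, param_maps, num_folds=3):
    # Average the metric over all folds for each parameter map and keep the best one.
    best_params, best_score = None, float("-inf")
    for params in param_maps:
        scores = [accuracy(fit_knn(train, params), test)
                  for train, test in k_fold_splits(rows, num_folds)]
        avg = sum(scores) / len(scores)
        if avg > best_score:
            best_params, best_score = params, avg
    return best_params, best_score

# Each row is (feature value, label).
rows = [(0.1, 0), (0.9, 1), (0.2, 0), (0.8, 1), (0.3, 0),
        (0.7, 1), (0.35, 0), (0.65, 1), (0.4, 0)]
param_maps = [{"k": 1}, {"k": 3}, {"k": 5}]
best_params, best_score = cross_validate(rows, param_maps)
```

TrainValidationSplit follows the same idea but evaluates each parameter map on a single (training, test) split instead of several folds, which is cheaper but less robust.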

Why Amazon EMR?

Easy to Use: Launch a cluster in minutes

Low Cost: Pay an hourly rate

Open-Source Variety: Latest versions of software

Managed: Spend less time monitoring

Secure: Easy to manage options

Flexible: Customize the cluster

Develop fast using notebooks and IDEs

Spark on YARN

• Run Spark Driver in Client or Cluster mode

• Spark application runs as a YARN application

• SparkContext runs as a library in your program, one instance per Spark application

• Spark Executors run in YARN Containers on NodeManagers in your cluster

• Access Spark UI through the Resource Manager or Spark History Server

Spark UI

Monitor your Spark jobs

Auto Scaling for data science on-demand

YARN metrics

Coming soon: advanced Spot provisioning

Master Node, Core Instance Fleet, Task Instance Fleet

• Provision from a list of instance types with Spot and On-Demand

• Launch in the most optimal AZ based on capacity/price

• Spot Block support

Productionizing your pipeline

Amazon EMR Step API: Submit a Spark application

AWS Data Pipeline; Airflow, Luigi, or other schedulers on EC2: Create a pipeline to schedule job submission or create complex workflows

AWS Lambda: Use AWS Lambda to submit applications to EMR Step API or directly to Spark on your cluster
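As a rough illustration of the Step API option, the sketch below builds the step definition that runs spark-submit through command-runner.jar. The step name, S3 path, arguments, and cluster ID are hypothetical placeholders, and the actual submission is left commented out because it needs boto3 and AWS credentials.

```python
def spark_step(name, app_path, app_args=(), deploy_mode="cluster"):
    # Build an EMR step that invokes spark-submit via command-runner.jar.
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", deploy_mode, app_path, *app_args],
        },
    }

# Hypothetical application path and arguments, for illustration only.
step = spark_step(
    "score-recommendations",
    "s3://example-bucket/jobs/score.py",
    ["--date", "2016-11-29"],
)

# Submitting the step (requires boto3 and credentials; the cluster ID is a placeholder):
# import boto3
# emr = boto3.client("emr")
# emr.add_job_flow_steps(JobFlowId="j-XXXXXXXXXXXXX", Steps=[step])
```

The same step dictionary could be submitted from an AWS Lambda function or from a scheduler task, which is one way the three options on this slide fit together.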

Recommendation Systems @ Zillow Group

Jasjeet Thind

Sr. Director, Data Science & Engineering

Agenda

Intro to Zillow Group

Recommendation Use Cases

Architecture

Algorithms

Training & Scoring Pipeline

Metrics

Zillow Group

Build the world's largest, most trusted, and vibrant home-related marketplace.

Recommendation use cases

Email - homes for sale / for rent

Home Details - homes for sale / homes like this

Personalized Search

Mobile - smart SMS and push notifications

Home owner / pre-seller predictions

Lender selection algorithm

Similar photos / video

Architecture

RECOMMENDATION API (Python, R, Flask)

Zillow Group Data Lake (S3 / Kinesis)

Property Featurization (Spark EMR)

User Profiles (Spark EMR)

Ranking (Spark EMR)

Wedge Counting Collaborative Filtering (Spark EMR)

Property Aggregate Features (Spark EMR)

Data Collection Systems (Java/Python/SQL)

Like vs. dislike

Predict homes per user using behavior of similar users

Like = user actively engaged with property

Dislike = user viewed property but weak engagement

[Figure: graph linking users Spencer and Stan to the $22M, $19M, and $664K homes with like (+) and dislike (-) edges; one edge is marked ? as the one to predict]

Feature        Description
uid            unique ID of user
pid            property ID
first_visit    timestamp or 0
num_views      sigmoid(# views)
time_spent     time on page
num_contacts   # leads sent
num_saves      # saves on zpid
num_shares     # shares on zpid
num_photos     # photos viewed

Wedge count

To form a prediction for every user and property pair, compute wedge counts

- http://www.jmlr.org/proceedings/papers/v18/kong12a/kong12a.pdf

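Does Stan like $19M?

Wedge #3 (wedge03_cnt): [Figure: wedge over Stan, Spencer, the $22M home, and the $19M home, with + and - edge labels; the Stan to $19M edge is marked ?]

Wedge #5 (wedge05_cnt): [Figure: wedge over Stan, Spencer, the $664K home, and the $19M home, with + and - edge labels; the Stan to $19M edge is marked ?]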
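A minimal sketch of wedge counting, under stated assumptions: a wedge for a target pair (u, p) is taken to be a three-edge path u, p', u', p in the signed user-property graph, and each edge is a like (+1) or a dislike (-1). That gives 2^3 = 8 sign patterns, which lines up with eight wedge count features, but the pattern-to-number encoding below and the edge signs in the sample graph are assumptions, not values recovered from the slides or the paper.

```python
from collections import defaultdict

def wedge_counts(edges, user, prop):
    # edges maps (user, property) -> +1 (like) or -1 (dislike).
    by_user = defaultdict(dict)
    by_prop = defaultdict(dict)
    for (u, p), sign in edges.items():
        by_user[u][p] = sign
        by_prop[p][u] = sign

    counts = [0] * 8
    # Walk every path: user -> other_prop -> other_user -> prop.
    for other_prop, s1 in by_user[user].items():
        if other_prop == prop:
            continue
        for other_user, s2 in by_prop[other_prop].items():
            if other_user == user or prop not in by_user[other_user]:
                continue
            s3 = by_user[other_user][prop]
            # Assumed encoding: one bit per edge, 1 for like and 0 for dislike.
            index = (s1 > 0) * 4 + (s2 > 0) * 2 + (s3 > 0)
            counts[index] += 1
    return counts

# Hypothetical edge signs for the users and homes named on the slide.
edges = {
    ("Stan", "$22M"): +1,
    ("Spencer", "$22M"): -1,
    ("Stan", "$664K"): -1,
    ("Spencer", "$664K"): +1,
    ("Spencer", "$19M"): +1,
}
counts = wedge_counts(edges, "Stan", "$19M")
```

With these assumed signs, the pair (Stan, $19M) has two wedges, one through each of the other homes, and they fall into patterns 3 and 5 of the assumed encoding.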

Classifier

Gradient Boosting Classifier (sklearn)

Popular users / properties:

- Divide wedge counts by degree product j_u * k_i

Prediction for all user / property pairs, limit candidate set by

- Top 10 zip codes

- 300 properties per user
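The normalization and candidate limits above could look roughly like the following sketch. It assumes j_u is the number of properties user u has interacted with and k_i is the number of users who interacted with property i, and it assumes the top zip codes are chosen by how often the user interacted with listings there; the slides do not spell out either detail, and the function names are hypothetical.

```python
from collections import Counter

def normalize_wedge_counts(counts, user_degree, prop_degree):
    # Divide each raw wedge count by the degree product j_u * k_i so that
    # very active users and very popular properties do not dominate.
    product = user_degree * prop_degree
    if product == 0:
        return [0.0] * len(counts)
    return [c / product for c in counts]

def candidate_set(viewed_zips, listings, max_zips=10, max_props=300):
    # viewed_zips: zip code of each property the user interacted with (one per event).
    # listings: (pid, zip_code) pairs for active listings.
    top_zips = {z for z, _ in Counter(viewed_zips).most_common(max_zips)}
    return [pid for pid, z in listings if z in top_zips][:max_props]

norm = normalize_wedge_counts([0, 0, 0, 1, 0, 1, 0, 0], user_degree=2, prop_degree=1)

viewed_zips = ["98109"] * 3 + ["98101"] * 2 + ["98052"]
listings = [("p1", "98109"), ("p2", "98052"), ("p3", "98101"), ("p4", "98109")]
candidates = candidate_set(viewed_zips, listings, max_zips=2, max_props=2)
```

Limiting the candidate set this way keeps scoring tractable, since predicting for every user and property pair would otherwise grow with the product of the two populations.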

features

wedge00_cnt

wedge01_cnt

wedge02_cnt

wedge03_cnt

wedge04_cnt

wedge05_cnt

wedge06_cnt

wedge07_cnt

wedge00_norm_cnt

wedge01_norm_cnt

wedge02_norm_cnt

wedge03_norm_cnt

wedge04_norm_cnt

wedge05_norm_cnt

wedge06_norm_cnt

wedge07_norm_cnt

Does Stan like the $19M home? The input row is (uid: Stan, pid: $19M) with the wedge count features listed above.

User profile

Signals - website, mobile app, and search queries

Binary classification

- labels (like/dislike) same as collab filtering model

User profile model determines preference scores

Features (categorical variables)

Bath       0_bath, 0.5_bath, 1_bath, 1.5_bath, 2_bath, 2.5_bath, 3_bath
Bed        0_bed, 1_bed, 2_bed, 3_bed, 4_bed, 5_bed
Price      100_125_price, 125_150_price, 150_175_price
Use Code   condo, single_family, farm_land
Zipcode    zip_98109

Training rows: pid, uid, features (each 0 or 1, one per categorical value), label (0 or 1)

Example preference scores: 0_bed: 0, 1_bed: 0.01, 2_bed: 0.8, 3_bed: 0.6
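As an illustration of the categorical encoding, the sketch below turns one property into 0/1 indicator features using the naming style from the slide. The helper names, the short vocabulary, and the reading of price buckets as 25K-wide ranges in thousands of dollars are assumptions.

```python
def property_features(beds, baths, price, use_code, zipcode):
    # Bucket the price into 25K-wide ranges, e.g. 100_125_price
    # (assumed to mean $100K up to $125K).
    low = int(price // 25_000) * 25
    return {
        f"{beds}_bed",
        f"{baths:g}_bath",
        f"{low}_{low + 25}_price",
        use_code,
        f"zip_{zipcode}",
    }

def one_hot(active, vocabulary):
    # 1 where the property has the feature, else 0, in vocabulary order.
    return [1 if name in active else 0 for name in vocabulary]

# A shortened, hypothetical vocabulary; the real feature space would be much larger.
vocabulary = ["0_bed", "1_bed", "2_bed", "3_bed", "1.5_bath", "2_bath",
              "condo", "single_family", "100_125_price", "zip_98109"]

features = property_features(beds=2, baths=1.5, price=110_000,
                             use_code="condo", zipcode="98109")
row = one_hot(features, vocabulary)
```

A binary classifier trained on rows like this, with like/dislike labels, yields one preference score per feature, which is the user profile used for ranking.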

Ranking

Property matrix - feature space same as user profile

Dot product of property matrix with user profile vector

Age decay for older listings

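Example score: (uid, pid) = {"uId":"10307499", "pId":"1044183744"}, score = 0.3364

Property matrix (rows pid_0 to pid_3, columns 0_bed, 1_bed, 2_bed, 3_bed) times the user profile vector for uid_0:

$$
\begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 \\
1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1
\end{bmatrix}
\begin{bmatrix} 0 \\ 0.01 \\ 0.8 \\ 0.6 \end{bmatrix}
=
\begin{bmatrix} 0 \\ 0.8 \\ 0 \\ 0.6 \end{bmatrix}
$$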
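The ranking step can be sketched in a few lines of plain Python using the numbers from this slide. The dot product follows the slide directly; the age decay is only named there, so the exponential form and the 30-day half-life below are assumptions.

```python
def rank_scores(property_matrix, user_profile):
    # Dot product of each property row with the user's preference vector.
    return [sum(f * w for f, w in zip(row, user_profile)) for row in property_matrix]

def age_decay(score, age_days, half_life_days=30.0):
    # Assumed exponential decay: the score halves every half_life_days.
    return score * 0.5 ** (age_days / half_life_days)

# Rows are pid_0..pid_3; columns are 0_bed, 1_bed, 2_bed, 3_bed.
property_matrix = [
    [1, 0, 0, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
]
user_profile = [0, 0.01, 0.8, 0.6]  # preference scores for uid_0

scores = rank_scores(property_matrix, user_profile)
decayed = age_decay(scores[1], age_days=30)
```

Because each property row is one-hot within a feature group, the score for a property is simply the user's preference for the categories that property falls into.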

Training & scoring

Collect user behavior and real-estate data, train the various models, generate the candidate set, and make predictions.

[Figure: training and scoring pipeline diagram, with the following component labels]

User Behavior (Kinesis / S3)

Public Record (Kinesis / S3)

Event API (Java)

Producer (Python)

Filter (Spark)

User Store (Hive / S3)

Spark job creates Hive table with user events (uid, pid) partitioned by date

Active Listings (Kinesis / S3)

Producer (Python)

Training Data (Spark)

Training Set (Hive / S3)

pid -> uid reverse index

Past and current user events

Models (Python)

Train Models (Spark)

Score (Spark)

Recommendations

Property Data

Collaborative Filtering / User Profile Models

Hashmap (Redis)

Wedge features or property features (user profile)
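The pid -> uid reverse index named in the diagram can be pictured with a small sketch. The event tuples and function name are hypothetical, and the slides do not say how the index is actually built or stored.

```python
from collections import defaultdict

def build_reverse_index(events):
    # events: iterable of (uid, pid) pairs from the user-event table.
    # Returns a mapping from each property to the set of users who touched it.
    index = defaultdict(set)
    for uid, pid in events:
        index[pid].add(uid)
    return index

events = [("u1", "p1"), ("u2", "p1"), ("u1", "p2")]
index = build_reverse_index(events)
```

An index like this makes it cheap to find the other users connected to a property, which is the lookup that wedge counting repeats for every candidate pair.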

Offline evaluation

Hyperparameter tuning with validation set

Training/test data sets for model evaluation

Offline Metrics   Description
Precision         r_k = # recommended properties in the test set in the top k
Recall            n = total properties in the test set
Freshness         # listings recommended with modified date < y days old in the top k
Coverage          # unique listings recommended across all users / total # unique listings
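A minimal sketch of these offline metrics follows. The slide text defines r_k and n but the formulas themselves did not survive, so the standard top-k forms, precision = r_k / k and recall = r_k / n, are assumed here; the function names and sample data are hypothetical.

```python
def precision_recall_at_k(recommended, test_set, k):
    # r_k: how many of the top-k recommendations appear in the user's test set.
    r_k = sum(1 for pid in recommended[:k] if pid in test_set)
    precision = r_k / k
    recall = r_k / len(test_set) if test_set else 0.0
    return precision, recall

def freshness_at_k(recommended, age_days, k, y):
    # Number of top-k recommendations modified less than y days ago.
    return sum(1 for pid in recommended[:k] if age_days[pid] < y)

def coverage(recs_by_user, all_listings):
    # Unique listings recommended across all users over total unique listings.
    recommended = set()
    for recs in recs_by_user.values():
        recommended.update(recs)
    return len(recommended) / len(all_listings)

recommended = ["p1", "p2", "p3", "p4"]
precision, recall = precision_recall_at_k(recommended, {"p2", "p4", "p9"}, k=2)
fresh = freshness_at_k(recommended, {"p1": 1, "p2": 10, "p3": 3, "p4": 40}, k=2, y=7)
cov = coverage({"u1": ["p1", "p2"], "u2": ["p2", "p3"]},
               {"p1", "p2", "p3", "p4", "p5", "p6"})
```

Precision and recall measure relevance for each user, while freshness and coverage guard against a model that keeps recommending the same stale or popular listings to everyone.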

Future work

Classifiers for listing descriptions

Deep learning on listing images

Structured streaming on Spark 2.0

Cross-brand user signals - Zillow, Trulia, Hotpads, & StreetEasy

Real-time scoring

Thank you!


aws.amazon.com/emr/

aws.amazon.com/blogs/big-data/

http://www.zillow.com/data-science/

Come join us @ Zillow Group!

Hiring:

- SDE, ML, Data Scientist

- Big Data Engineer

- Analytic Engineer

- Product Management

