AWS re:Invent 2016: Leveraging Amazon Machine Learning, Amazon Redshift, and an Amazon Simple...

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Daryan Dehghanpisheh, SVP Digital Strategy, The Howard Hughes Corporation

Mick Bass, CEO, 47Lining

November 2016

MAC302

Leveraging Amazon Machine Learning,

Amazon Redshift, and an Amazon S3 Data

Lake for Strategic Advantage in Real Estate

Speakers

Daryan Dehghanpisheh, SVP, Digital Strategy

The Howard Hughes Corporation

Mick Bass, CEO

47Lining

What to Expect from the Session

How to use machine learning to

improve business results

How to architect a data lake atop S3

that fuses on-premises, 3rd party

and public data sets

How to commission Lakeshore

Analytics in Amazon Redshift

Strategies for development of

summaries and aggregates for

Amazon Machine Learning

Training and running Amazon

Machine Learning to attain

predictive accuracy

Project Alamo

Trump’s Data Lake

220M+ Voter Profiles

100K Hyper Targeted Ads

$200M+ Donations

< 120 Days

Challenge

About

Seaport District Ward Village The Woodlands

Downtown Summerlin Summerlin Downtown Columbia

New York, NY Honolulu, HI The Woodlands, TX

Summerlin, NV Summerlin, NV Columbia, MD

Capital intensive Lots of human touch Micro/macro exposure

Long product cycles Commodity offering Fragmented market

About Real Estate

“Big Data” problems we want to solve.

• Can we more accurately predict trends?

• Can we better forecast product demand?

• Can we speed up our sales cycles & time to money?

• Can we more accurately assess value & price?

• Can we use non-traditional data to find causality & correlation?

The Team’s Task: Design a scalable solution that’s cost effective.

1. Combine large public & private data sources.

2. Simply perform lots of complex joins.

3. Improve the company’s data hygiene.

4. Enrich pricing & valuation models.

5. Build proprietary models & sources.

And… do all of this without adding labor costs

or exploding our infrastructure & license costs.

The big mental shifts…

Talent in the group is a profit center, not a cost center.

Our data is a key asset, worth a true dollar figure.

func(digital != “IT”);

Case study:

Propensity to Buy Luxury Property

Predictive Analytics Example

Luxury Leads Business Requirements:

The test:

Can we accurately identify potential buyers using data?

The conditions:

• New luxury product in an untested market.

• Need new leads beyond in-bound requests.

• Must drive down our cost per lead.

• Build a machine to provide continuous insight.

Luxury Leads High-Level Process

Target Market Whole US Market

Transactions B

Transactions A

“Union View”

Combined Data

Sources

Data Augmented Features / Signatures

Generate

Clustering & ML

Features

Data Augmented Features / Signatures

Clusters

Segments

Personas

Machine Learning

Propensity Predictor

Sourced Data

Refined

Engagement

Mechanisms

(US)

Refined

Engagement

Mechanisms

(Target)

Machine Learning

Propensity Predictor

Leads Database

Proactive

Call

List

Lead

Scoring

(Historic

TAM)
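The “Union View” step in the process above combines the two transaction sources into one relation that later steps can query as a whole. The following is a minimal sketch of that idea, using Python's built-in sqlite3 as a stand-in for Amazon Redshift. The table names, columns, and sample rows are hypothetical and do not come from the talk.

```python
import sqlite3

def build_union_view(conn):
    """Create a view that stacks two transaction sources, tagging each row's origin."""
    conn.executescript("""
        CREATE TABLE transactions_a (buyer TEXT, amount REAL, sold_on TEXT);
        CREATE TABLE transactions_b (buyer TEXT, amount REAL, sold_on TEXT);
        CREATE VIEW union_view AS
            SELECT 'A' AS source, buyer, amount, sold_on FROM transactions_a
            UNION ALL
            SELECT 'B' AS source, buyer, amount, sold_on FROM transactions_b;
    """)

conn = sqlite3.connect(":memory:")
build_union_view(conn)

# Hypothetical sample rows for each source.
conn.executemany(
    "INSERT INTO transactions_a VALUES (?, ?, ?)",
    [("alice", 1200000.0, "2014-03-01"), ("bob", 800000.0, "2015-07-15")],
)
conn.executemany(
    "INSERT INTO transactions_b VALUES (?, ?, ?)",
    [("alice", 2500000.0, "2016-01-10")],
)

# Per-buyer totals computed across both sources at once.
totals = dict(conn.execute(
    "SELECT buyer, SUM(amount) FROM union_view GROUP BY buyer"
))
```

UNION ALL keeps every row from both sources, so downstream aggregates see the full history; the `source` column preserves where each row came from.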

Solution

What is a Data Lake?

A “Data Lake” is a repository that holds raw data in

its native format until it is needed by down stream

analytics processes.

Why is S3 a Natural Fit for Data Lakes?

No need to build a complicated stack

Simplicity = Freedom

• Inherent redundancy at low cost

• Massively parallel IO

• Separate storage from compute

• Integration with Amazon Redshift, EMR, etc.

Can I Just Put Data in S3 and Call it a Lake?

Unmanaged Lake = Swamp

• How do you find things?

• Does everyone just have access to

everything?

• Does all data stay there forever?

What Separates a Data Lake from a Data Swamp?

Intelligent use of storage

conventions in S3

Fine-grained permissions to contribute, discover, transform and consume data

Data ingest standards

Defining and enforcing data governance processes

Metadata to enable search and discovery
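One way to picture “storage conventions in S3” and searchable metadata is a single, fixed key layout that every submission follows. The sketch below is an illustration only: the layout `<zone>/<dataset>/dt=<date>/<filename>`, the zone names, and the helper names `make_key` and `parse_key` are assumptions, not the convention used in the talk.

```python
from datetime import date

# Hypothetical zone names; a real lake would define its own.
ZONES = {"submissions", "content", "wip"}

def make_key(zone, dataset, day, filename):
    """Build an S3 object key that follows one fixed, searchable layout."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone!r}")
    if not dataset or "/" in dataset or "/" in filename:
        raise ValueError("dataset and filename must not contain '/'")
    return f"{zone}/{dataset}/dt={day.isoformat()}/{filename}"

def parse_key(key):
    """Recover the metadata encoded in a key built by make_key."""
    zone, dataset, partition, filename = key.split("/")
    return {
        "zone": zone,
        "dataset": dataset,
        "date": date.fromisoformat(partition[len("dt="):]),
        "filename": filename,
    }

# Hypothetical dataset and file names.
key = make_key("submissions", "county_deeds", date(2016, 11, 30), "part-0001.csv")
```

Because the key itself carries zone, dataset, and date, listing by prefix answers “how do you find things?” without opening any object.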

Data Lake Reference Architecture

[Figure: Data Lake Reference Architecture diagram. Recoverable labels follow.]

Managed Enterprise Data Lake

Actors: Data Contributors; Data Lake Governors (Governance, Entitlements); Data Consumers (B2E | B2B | B2C Direct Users); External Systems; Business Processes; BI Tools; Data Ecosystem API Users

Sources: Owned | On-prem; 3rd Party; Partners | Vendors; Customers

Raw Submissions: Untransformed, Batch | Stream

Managed Datasets: Data Managed by Lake, Supporting Schema-on-Read, Usage Data | Metadata

Published Data: Indexed, consumable via HA Data Lake API; Indices, History

Operations: Contribute | Manage | Consume | Search

Rules, Policies & Entitlements: Contribute | Manage | Transform | Access

Rule-Driven Incremental Loads, Transforms, Cataloging/Indexing, Publishing

Layers: Identity & Security; Indexing & Search; Ingest Workers, Loaders; Data Lake API; Agile Lakeshore Analytics; Data Mgmt & Orchestration; Data Lake UI; Monitoring

Services and tools shown: AWS Directory Service; Roles; AWS IAM Perms; AWS KMS; Single Sign-On, Unified Policy-Based Entitlements; Data Lake Web UIs; Elastic Beanstalk (Search, Manage, Consume); S3 | Submissions; Amazon Kinesis | Submissions; Amazon S3 | Content; S3 | Work in Progress; SQS Queue; Lambda; Worker Tier; RDS (UI, App & API State); Amazon DynamoDB (Discovery Views); Amazon CloudSearch (Facets | Indices | Views); Amazon DynamoDB (HA Published Results); RStudio; Amazon Machine Learning; Hadoop/Spark On-demand (Elastic MapReduce, Qubole); Amazon Redshift (On-Demand Warehouses); BI/Visualization (Tableau Server, Amazon QuickSight); AWS Data Pipeline; Airflow; Amazon CloudWatch; CloudCheckr; AWS CloudTrail; DataDog


High Level Data Flow & Ops Model

Data contributors

• Submit data using defined submission mechanisms

Data lake governors

• Define datasets managed within data lake

• Define submission mechanisms for each dataset

• Manage submission & access entitlements

• Govern costs associated with datasets & Lakeshore Analytics

• Work with business owners to define required Lakeshore Analytics

Lakeshore Analytics

• Consume datasets from data lake

• Use analysis WIP

• Manufacture published results

[Diagram labels: Owned data, Sourced data, Defined submission mechanisms, Submission, Dataset, “Union View”, Analysis WIP, Generate clustering & ML features; numbered callouts 1 to 4; services shown: Amazon Kinesis, S3, EMR, Amazon Redshift]


Generate Clustering & ML Features

Extracted Features

Clustering

Dimensions

Distance

Heuristic

R Cluster Analysis: hierarchical | model-based

Profiling

Dimensions

Segment

Analysis

Feature

Definitions

N distinct buyer personas emerged

“Union View”

Sourced Data

Owned Data

Leads Cluster Analysis
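The slide above names hierarchical cluster analysis in R over a set of clustering dimensions and a distance heuristic. As a rough illustration of the hierarchical variant only, here is a pure-Python sketch of agglomerative clustering with single linkage and Euclidean distance. It is not the speakers' R code; the linkage choice, the distance function, and the sample feature vectors are all assumptions made for the example.

```python
import math
from itertools import combinations

def euclidean(a, b):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerate(points, k, dist=euclidean):
    """Merge the two closest clusters (single linkage) until only k remain.

    Returns clusters as sorted lists of indices into `points`.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    clusters = [[i] for i in range(len(points))]

    def gap(pair):
        # Single linkage: cluster distance is the closest pair of members.
        a, b = pair
        return min(dist(points[i], points[j])
                   for i in clusters[a] for j in clusters[b])

    while len(clusters) > k:
        a, b = min(combinations(range(len(clusters)), 2), key=gap)
        clusters[a].extend(clusters[b])  # a < b, so deleting b keeps a valid
        del clusters[b]
    return [sorted(c) for c in clusters]

# Hypothetical feature vectors: (purchases per decade, median price in $M).
buyers = [(1, 0.9), (1, 1.1), (2, 1.0), (8, 4.0), (9, 4.2), (8, 3.8)]
clusters = agglomerate(buyers, k=2)
```

In practice the dimensions would usually be scaled first so that no single feature dominates the distance; that step is omitted here for brevity.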


Buyer Personas for Marketing

Descriptive analytics on buyer personas = ability

to refine engagement models

Cluster 9


Lead Scoring

1. Train the Model

Qualified

Candidates

ML Training Inputs (per candidate, “rewound” history)

Transaction History

Buy/Sell Quantities

Property Types

Time

3rd Party Data

For each candidate, model predicts:

Total Amount of Future Real Estate Purchases

US Real Estate Activity: all buyers & sellers, all transaction types, past 30 years

Per-Candidate Statistics: number & size of purchases/sales, locations, …

Bought Nothing

Percentile Rank

Bought Most

Bought Little

0%

15%

40%

100%

model predicts rank of candidate

to +/- 20% of actual rank 70% of

the time

Training process detects complex patterns in training inputs that the model uses to

make predictions. Patterns are not available externally.
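The accuracy statement above (predicted rank within +/- 20% of actual rank, 70% of the time) can be checked with two small functions: one that turns purchase totals into percentile ranks, and one that measures how often a prediction lands inside the tolerance band. The sketch below reads the 20% as percentile points and lets tied candidates share the lower rank. Both readings are assumptions, as are the function names and the sample numbers.

```python
from bisect import bisect_left

def percentile_ranks(values):
    """Rank each value from 0.0 (lowest) to 1.0 (highest); ties share the lower rank."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    ordered = sorted(values)
    return [bisect_left(ordered, v) / (len(values) - 1) for v in values]

def hit_rate(actual_ranks, predicted_ranks, tolerance=0.20):
    """Fraction of candidates whose predicted rank is within +/- tolerance of actual."""
    hits = sum(
        abs(a - p) <= tolerance
        for a, p in zip(actual_ranks, predicted_ranks)
    )
    return hits / len(actual_ranks)

# Hypothetical totals of later purchases per candidate (in $K) and model outputs.
actual_totals = [0, 0, 150, 400, 900, 2500]
predicted_ranks = [0.1, 0.5, 0.3, 0.6, 0.7, 0.95]

actual = percentile_ranks(actual_totals)
rate = hit_rate(actual, predicted_ranks)
```

With these sample numbers five of the six predictions fall inside the band. A real evaluation would run on held-out candidates whose history was “rewound” as described above.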

Train Model Generate Predictions

Lead Scoring

2. Use the Model

Qualified

Candidates

Will Buy Nothing

Percentile Rank

Will Buy Most

Will Buy Some

0%

60% Sales Focus

Current Data (per candidate)

Transaction History

Buy/Sell Quantities

Property Types

Time

3rd Party Data

Predicted Rank of

Candidates

Generic Luxury Leads High-Level Process

100%

Current Data and

Rank Predictions

Re-Generated Each Night

Scored Leads for Sales Team

Scored call list of real people who have bought

high-end real estate – tied to 9 personas
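A scored call list of this kind can be pictured as a filter and a sort over the nightly predictions. The sketch below is hypothetical throughout: the `ScoredLead` fields, the candidate names, and the persona numbers are invented, and treating the “60% Sales Focus” label as a rank cutoff is an assumption about what that label means.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ScoredLead:
    name: str
    persona: int            # hypothetical persona / cluster id
    predicted_rank: float   # 0.0 = will buy nothing, 1.0 = will buy most

def build_call_list(leads, cutoff=0.60):
    """Keep leads at or above the cutoff, highest predicted rank first."""
    kept = [lead for lead in leads if lead.predicted_rank >= cutoff]
    return sorted(kept, key=lambda lead: lead.predicted_rank, reverse=True)

# Hypothetical nightly predictions.
leads = [
    ScoredLead("Candidate 1", persona=3, predicted_rank=0.92),
    ScoredLead("Candidate 2", persona=9, predicted_rank=0.41),
    ScoredLead("Candidate 3", persona=9, predicted_rank=0.75),
]
call_list = build_call_list(leads)
```

Keeping the persona on each lead lets the sales team pair the score with the engagement approach defined for that persona.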


Leads Cluster Analysis


Lead Scoring

3. Review / Refine Model Performance

Improving ML predictive accuracy: use Amazon Redshift to extract use-case-specific features

Examples:

• Aggregate computation, e.g., average consumption per month / year

• Periodic behavior frequency extraction

• Volatility analysis & extraction

• Time-series difference analysis

(e.g. average time between A and B, time-adjusted values)
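The feature types listed above can be made concrete with a few small functions. The talk computes such features in Amazon Redshift; the sketch below uses plain Python as a stand-in so the arithmetic is easy to follow. The event layout `(date, kind, amount)`, the function names, and the sample history are assumptions for illustration.

```python
from datetime import date
from statistics import mean, pstdev

def monthly_average(events):
    """Average of monthly totals, over calendar months that have activity."""
    by_month = {}
    for day, _kind, amount in events:
        by_month.setdefault((day.year, day.month), []).append(amount)
    return mean(sum(amounts) for amounts in by_month.values())

def volatility(events):
    """Population standard deviation of transaction amounts."""
    return pstdev([amount for _day, _kind, amount in events])

def avg_days_between(events, first, second):
    """Average days from a `first` event to the next `second` event."""
    gaps, pending = [], None
    for day, kind, _amount in sorted(events):
        if kind == first and pending is None:
            pending = day                      # start timing at the first A
        elif kind == second and pending is not None:
            gaps.append((day - pending).days)  # close the A-to-B interval
            pending = None
    return mean(gaps) if gaps else None

# Hypothetical history for one candidate, amounts in $K.
events = [
    (date(2010, 1, 15), "buy", 500.0),
    (date(2010, 1, 20), "buy", 300.0),
    (date(2012, 1, 15), "sell", 650.0),
    (date(2013, 6, 1), "buy", 900.0),
    (date(2014, 6, 1), "sell", 1000.0),
]
features = {
    "monthly_average": monthly_average(events),
    "volatility": volatility(events),
    "avg_days_buy_to_sell": avg_days_between(events, "buy", "sell"),
}
```

Each function yields one column per candidate, which is the shape a training table for a model such as Amazon Machine Learning expects.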

Amazon Redshift + Amazon Machine Learning

…better together

Time-series difference analysis example

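[Figure: two event timelines, each with events A, B, C and D. Behaviors 1 runs from the recent past to today; Behaviors 2 runs from a long time ago to today. Comparison labels for Behaviors 1 and Behaviors 2: Net Value, Frequency, Sell to Buy, Hold Time. Services shown: Amazon Redshift, Amazon ML, Amazon Redshift]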

Technical Benefits of Approach:

Managed services that “just work”

providing speed, agility and scale

Amazon ML delivered

higher predictive accuracy

for propensity to buy

Payoff

Business benefits of approach:

• Extensible.

• Adaptive.

• Open standards. Can work with lots of partners.

• On demand.

• Ever growing talent pool.

Robots

Rock!

Thank you!

Remember to complete

your evaluations!
