AWS re:Invent 2016: Leveraging Amazon Machine Learning, Amazon Redshift, and an Amazon Simple...
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Daryan Dehghanpisheh, SVP Digital Strategy, The Howard Hughes Corporation
Mick Bass, CEO, 47Lining
November 2016
MAC302
Leveraging Amazon Machine Learning,
Amazon Redshift, and an Amazon S3 Data
Lake for Strategic Advantage in Real Estate
Speakers
Daryan Dehghanpisheh, SVP, Digital Strategy
The Howard Hughes Corporation
Mick Bass, CEO
47Lining
What to Expect from the Session
• How to use machine learning to improve business results
• How to architect a data lake atop S3 that fuses on-premises, 3rd party and public data sets
• How to commission Lakeshore Analytics in Amazon Redshift
• Strategies for development of summaries and aggregates for Amazon Machine Learning
• Training and running Amazon Machine Learning to attain predictive accuracy
Project Alamo
Trump’s Data Lake
220M+ Voter Profiles
100K Hyper Targeted Ads
$200M+ Donations
< 120 Days
Challenge
About
• Seaport District: New York, NY
• Ward Village: Honolulu, HI
• The Woodlands: The Woodlands, TX
• Downtown Summerlin: Summerlin, NV
• Summerlin: Summerlin, NV
• Downtown Columbia: Columbia, MD
About Real Estate
• Capital intensive
• Lots of human touch
• Micro/macro exposure
• Long product cycles
• Commodity offering
• Fragmented market
“Big Data” problems we want to solve.
• Can we more accurately predict trends?
• Can we better forecast product demand?
• Can we speed up our sales cycles & time to money?
• Can we more accurately assess value & price?
• Can we use non-traditional data to find causality & correlation?
The Team’s Task: Design a scalable solution that’s cost effective.
1. Combine large public & private data sources.
2. Simply perform lots of complex joins.
3. Improve the company’s data hygiene.
4. Enrich pricing & valuation models.
5. Build proprietary models & sources.
And… do all of this without adding labor costs
or exploding our infrastructure & license costs.
The big mental shifts…
Talent in the group is a profit center, not a cost center.
Our data is a key asset, worth a true dollar figure.
func(digital != “IT”);
Case study:
Propensity to Buy Luxury Property
Predictive Analytics Example
Luxury Leads Business Requirements:
The test:
Can we accurately identify potential buyers using data?
The conditions:
• New luxury product in an untested market.
• Need new leads beyond in-bound requests.
• Must drive down our cost per lead.
• Build a machine to provide continuous insight.
Luxury Leads High-Level Process
[Process diagram. Labels, in reading order:]
• Sourced Data: Transactions A, Transactions B
• “Union View”: combined data sources, for the target market and the whole US market
• Data-augmented features / signatures
• Generate clustering & ML features
• Clusters, segments, personas
• Refined engagement mechanisms (US) and (Target)
• Machine learning propensity predictor
• Lead scoring (historic TAM)
• Leads database
• Proactive call list
Solution
What is a Data Lake?
A “Data Lake” is a repository that holds raw data in
its native format until it is needed by downstream
analytics processes.
Why is S3 a Natural Fit for Data Lakes?
No need to build a complicated stack
Simplicity = Freedom
• Inherent redundancy at low cost
• Massively parallel IO
• Separate storage from compute
• Integration with Amazon Redshift, EMR, etc.
Can I Just Put Data in S3 and Call it a Lake?
Unmanaged Lake = Swamp
• How do you find things?
• Does everyone just have access to
everything?
• Does all data stay there forever?
What Separates a Data Lake from a Data Swamp?
• Intelligent use of storage conventions in S3
• Fine-grained permissions to contribute, discover, transform and consume data
• Data ingest standards
• Defining and enforcing data governance processes
• Metadata to enable search and discovery
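To make "storage conventions in S3" and "metadata" concrete, here is a minimal Python sketch. The zone names, the key layout and the metadata fields are all hypothetical illustrations, not the convention used in the talk; the point is only that every object lands under one predictable, searchable path and carries a small descriptive record.

```python
from datetime import date

# Hypothetical zones, loosely mirroring raw submissions, managed datasets
# and published data. These names are assumptions, not from the talk.
ZONES = ("submissions", "datasets", "published")


def build_key(zone: str, dataset: str, day: date, filename: str) -> str:
    """Return an S3 object key that follows one fixed, predictable layout."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone!r}")
    if not dataset or "/" in dataset:
        raise ValueError("dataset must be a non-empty name without '/'")
    return (
        f"{zone}/{dataset}/"
        f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"
        f"{filename}"
    )


def describe(key: str, owner: str, source: str) -> dict:
    """Build a small metadata record that a catalog or search index could store."""
    zone, dataset = key.split("/")[:2]
    return {"key": key, "zone": zone, "dataset": dataset,
            "owner": owner, "source": source}


key = build_key("submissions", "transactions_a", date(2016, 11, 30), "part-0001.csv")
record = describe(key, owner="data-team", source="3rd party")
print(key)
print(record)
```

Because the dataset name and date are encoded in the key, a prefix listing is enough to find one day of one dataset, and permissions or retention rules could be attached per prefix.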
Data Lake Reference Architecture
[Architecture diagram. Recoverable labels; the grouping below is approximate:]
• Data Contributors: Owned | On-prem, 3rd Party, Partners | Vendors, Customers
• Data Lake Governors (Governance, Entitlements)
• Data Consumers: B2E | B2B | B2C Direct Users, Business Processes, BI Tools, External Systems, Data Ecosystem API Users
• Managed Enterprise Data Lake:
  - Raw Submissions: untransformed, batch | stream
  - Managed Datasets: data managed by the lake, supporting schema-on-read, usage data | metadata
  - Published Data: indexed, consumable via HA Data Lake API
  - Indices, History
• Operations: Contribute, Manage, Consume, Search
• Rules, Policies & Entitlements: Contribute | Manage | Transform | Access
• Rule-driven incremental loads, transforms, cataloging/indexing, publishing
• Identity & Security: AWS Directory Service, AWS IAM (roles, perms), AWS KMS; single sign-on, unified policy-based entitlements
• Ingest Workers, Loaders: S3 | Submissions, Amazon Kinesis | Submissions, SQS Queue, Lambda, Worker Tier
• Content and state: Amazon S3 | Content, S3 | Work in Progress, RDS (UI, App & API State)
• Indexing & Search: Amazon DynamoDB (Discovery Views), Amazon CloudSearch (Facets | Indices | Views), Amazon DynamoDB (HA Published Results)
• Data Lake UI and Data Lake API: Data Lake Web UIs on Elastic Beanstalk (Search, Manage, Consume)
• Agile Lakeshore Analytics: RStudio, Amazon Machine Learning, Hadoop/Spark on-demand (Elastic MapReduce, Qubole), Redshift on-demand warehouses, Amazon Redshift
• BI/Visualization: Tableau Server, Amazon QuickSight
• Data Mgmt & Orchestration: AWS Data Pipeline, Airflow
• Monitoring: Amazon CloudWatch, CloudCheckr, AWS CloudTrail, DataDog
High Level Data Flow & Ops Model
Data lake governors
• Define datasets managed within data lake
• Define submission mechanisms for each dataset
• Manage submission & access entitlements
• Govern costs associated with datasets & Lakeshore Analytics
• Work with business owners to define required Lakeshore Analytics
Data contributors
• Submit data using defined submission mechanisms
Lakeshore Analytics
• Consume datasets from data lake
• Use analysis WIP
• Manufacture published results
[Diagram labels, steps numbered 1 to 4: sourced data, owned data, defined submission mechanisms, submission, dataset, “Union View”, analysis WIP, Amazon Kinesis, S3, EMR, Amazon Redshift, generate clustering & ML features]
Generate Clustering & ML Features
[Diagram labels:]
• Inputs: “Union View”, Sourced Data, Owned Data
• Feature Definitions, Extracted Features
• Clustering Dimensions, Distance Heuristic
• R Cluster Analysis: hierarchical | model-based
• Profiling Dimensions, Segment Analysis
• N distinct buyer personas emerged
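The slide names R cluster analysis (hierarchical or model-based) over a set of clustering dimensions and a distance heuristic. As a rough illustration of the hierarchical variant only, here is a small pure-Python stand-in, not the R workflow from the talk. The two features, the sample values, Euclidean distance and average linkage are all assumptions chosen for the example.

```python
from math import dist  # Euclidean distance, Python 3.8+


def agglomerative(points, k):
    """Bottom-up clustering: repeatedly merge the two closest clusters
    (average linkage) until only k clusters remain.
    Returns clusters as sorted lists of point indices."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Average pairwise distance between the members of two clusters.
        total = sum(dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > k:
        best = None  # (distance, index of first cluster, index of second)
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]


# Hypothetical, already-scaled features per buyer:
# (purchases per decade, average purchase price in $M)
buyers = [(1.0, 0.9), (1.2, 1.1), (0.9, 1.0), (6.0, 5.5), (6.2, 5.8), (5.9, 6.1)]
personas = agglomerative(buyers, k=2)
print(personas)
```

In practice the number of clusters would come out of the analysis rather than being fixed up front, and each resulting cluster would then be profiled to describe a persona.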
Leads Cluster Analysis
Buyer Personas for Marketing
Descriptive analytics on buyer personas = ability
to refine engagement models
Cluster 9
Lead Scoring
1. Train the Model
Qualified Candidates
US Real Estate Activity: all buyers & sellers, all transaction types, past 30 years
Per-Candidate Statistics: number & size of purchases/sales, locations, …
ML Training Inputs (per candidate, “rewound” history):
• Transaction History
• Buy/Sell Quantities
• Property Types
• Time
• 3rd Party Data
For each candidate, model predicts:
Total Amount of Future Real Estate Purchases
Percentile Rank scale: Bought Nothing | Bought Little | Bought Most, with marks at 0%, 15%, 40% and 100%
Model predicts rank of candidate to +/- 20% of actual rank 70% of the time.
Training process detects complex patterns in training inputs that the model uses to
make predictions. Patterns are not available externally.
Train Model | Generate Predictions
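The accuracy statement above (predicted rank within +/- 20% of actual rank, 70% of the time) can be checked with a few lines of code. The sketch below is one possible reading, assuming that "20%" means 20 percentile points and that rank is the share of candidates with a smaller amount. The sample amounts and function names are hypothetical.

```python
from bisect import bisect_left


def percentile_ranks(amounts):
    """Rank each candidate from 0 (lowest amount) to 100 (highest amount).
    Ties share the lowest rank of their group."""
    if len(amounts) < 2:
        raise ValueError("need at least two candidates to rank")
    ordered = sorted(amounts)
    return [100.0 * bisect_left(ordered, a) / (len(amounts) - 1) for a in amounts]


def share_within(predicted_ranks, actual_ranks, tolerance=20.0):
    """Fraction of candidates whose predicted rank is within `tolerance`
    percentile points of the actual rank."""
    hits = sum(1 for p, a in zip(predicted_ranks, actual_ranks)
               if abs(p - a) <= tolerance)
    return hits / len(actual_ranks)


# Hypothetical purchase totals (in $K) for six candidates.
actual_amounts = [0, 100, 200, 300, 400, 500]
predicted_amounts = [0, 50, 600, 300, 400, 500]

actual = percentile_ranks(actual_amounts)
predicted = percentile_ranks(predicted_amounts)
score = share_within(predicted, actual)
print(f"{score:.0%} of candidates ranked within 20 points of their actual rank")
```

Running the same check on held-out, "rewound" history is what would tell you whether a retrained model still meets the 70% bar.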
Lead Scoring
2. Use the Model
Qualified Candidates
Current Data (per candidate):
• Transaction History
• Buy/Sell Quantities
• Property Types
• Time
• 3rd Party Data
Predicted Rank of Candidates
Percentile Rank scale: Will Buy Nothing | Will Buy Some | Will Buy Most, with marks at 0%, 60% (Sales Focus) and 100%
Current Data and Rank Predictions Re-Generated Each Night
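A nightly job could turn the predicted ranks into a call list along these lines. This is a sketch under one assumption that the slide does not spell out: that the 60% mark is the lower bound of the "Sales Focus" band. The candidate IDs and rank values are made up.

```python
def sales_focus(predicted_ranks, threshold=60.0):
    """Return candidate IDs whose predicted percentile rank is at or above
    the threshold, highest rank first (ties broken by ID)."""
    chosen = [(rank, cid) for cid, rank in predicted_ranks.items()
              if rank >= threshold]
    ordered = sorted(chosen, key=lambda pair: (-pair[0], pair[1]))
    return [cid for _, cid in ordered]


# Hypothetical output of last night's prediction run.
nightly_ranks = {
    "cand-001": 12.5,
    "cand-002": 88.0,
    "cand-003": 60.0,
    "cand-004": 59.9,
}
call_list = sales_focus(nightly_ranks)
print(call_list)
```

Since the ranks are regenerated each night, the list is rebuilt from scratch on every run rather than patched.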
Scored Leads for Sales Team
Scored call list of real people who have bought
high-end real estate – tied to 9 personas
Lead Scoring
3. Review / Refine Model Performance
Improving ML predictive accuracy: use Amazon Redshift to extract use-case-specific
features
Examples:
• Aggregate computation, e.g., average consumption per month / year
• Periodic behavior frequency extraction
• Volatility analysis & extraction
• Time-series difference analysis
(e.g. average time between A and B, time-adjusted values)
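The feature types listed above are computed in Amazon Redshift in the talk. The Python sketch below is only a small stand-in that shows the arithmetic on one candidate's history; the record layout, the feature names and the sample transactions are hypothetical.

```python
from datetime import date
from statistics import mean, pstdev


def extract_features(transactions):
    """transactions: list of (date, kind, amount), kind being "buy" or "sell".
    Returns a dict of per-candidate features."""
    txs = sorted(transactions)  # tuples sort by date first
    buys = [(d, amt) for d, kind, amt in txs if kind == "buy"]
    if not buys:
        raise ValueError("candidate has no purchases")
    amounts = [amt for _, amt in buys]
    # Calendar years covered from first to last purchase, inclusive.
    span_years = buys[-1][0].year - buys[0][0].year + 1

    # Time-series difference: days from each sale to the next purchase.
    gaps, last_sell = [], None
    for d, kind, _ in txs:
        if kind == "sell":
            last_sell = d
        elif last_sell is not None:
            gaps.append((d - last_sell).days)
            last_sell = None

    return {
        "avg_buy_amount_per_year": sum(amounts) / span_years,  # aggregate
        "buys_per_year": len(buys) / span_years,               # frequency
        "buy_amount_volatility": pstdev(amounts),              # volatility
        "avg_days_sell_to_buy": mean(gaps) if gaps else None,  # time difference
    }


# Hypothetical transaction history for one candidate.
history = [
    (date(2010, 3, 1), "buy", 1_000_000),
    (date(2012, 6, 1), "sell", 1_400_000),
    (date(2012, 7, 1), "buy", 2_000_000),
    (date(2014, 1, 10), "sell", 2_500_000),
    (date(2014, 1, 20), "buy", 3_000_000),
]
features = extract_features(history)
print(features)
```

Each key becomes one input column for model training, which is why it pays to compute these summaries in the warehouse, close to the full transaction history.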
Amazon Redshift + Amazon Machine Learning
…better together
Time-series difference analysis example
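[Figure: two behavior timelines with events A, B, C and D. Behaviors 1 runs from the recent past to today; Behaviors 2 runs from a long time ago to today. Charts compare Behaviors 1 and Behaviors 2; axis labels: Net Value, Frequency, Sell to Buy, Hold Time. Pipeline shown: Amazon Redshift, Amazon ML, Amazon Redshift.]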
Technical Benefits of Approach:
Managed services that “just work”
providing speed, agility and scale
Amazon ML delivered
higher predictive accuracy
for propensity to buy
Payoff
Business benefits of approach:
• Extensible.
• Adaptive.
• Open standards. Can work with lots of partners.
• On demand.
• Ever growing talent pool.
Robots
Rock!
Thank you!
Remember to complete
your evaluations!