Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013
Transcript of Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013
![Page 1: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/1.jpg)
© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
SVC201 - Automate Your Big Data Workflows
Jinesh Varia, Technology Evangelist
@jinman
November 14, 2013
![Page 2: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/2.jpg)
[Diagram: Automating Big Data Workflows. Two halves: Automating Compute (Amazon SWF) and Automating Data (AWS Data Pipeline). Labels: Decider, Worker, Activity, Data Node.]
![Page 3: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/3.jpg)
[Diagram labels: Decider, Worker, Starters, Activity Worker, Amazon SWF, AWS Management Console, History.]
Amazon SWF – Your Distributed State Machine in the Cloud
SWF helps you scale your business logic
![Page 4: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/4.jpg)
Tim James - Data/science Architect, Manager
Vijay Ramesh - Data/science Engineer
![Page 5: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/5.jpg)
the world's largest petition platform
![Page 6: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/6.jpg)
![Page 7: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/7.jpg)
![Page 8: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/8.jpg)
At Change.org in the last year
• 120M+ signatures (15% on victories)
• 4000 declared victories
![Page 9: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/9.jpg)
![Page 10: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/10.jpg)
![Page 11: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/11.jpg)
![Page 12: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/12.jpg)
This works.
![Page 13: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/13.jpg)
How?
![Page 14: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/14.jpg)
60-90% of signatures at Change.org are driven by email
![Page 15: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/15.jpg)
![Page 16: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/16.jpg)
This works.
![Page 17: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/17.jpg)
This works.*
* up to a point!
![Page 18: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/18.jpg)
Manual Targeting doesn’t
scale.
![Page 19: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/19.jpg)
Manual Targeting doesn’t scale
cognitively.
![Page 20: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/20.jpg)
Manual Targeting doesn’t scale
in personnel.
![Page 21: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/21.jpg)
Manual Targeting doesn’t scale
into mass customization.
![Page 22: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/22.jpg)
Manual Targeting doesn’t scale
culturally or internationally.
![Page 23: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/23.jpg)
Manual Targeting doesn’t scale
with data size and load.
![Page 24: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/24.jpg)
So what did we do?
![Page 25: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/25.jpg)
We used big-compute machine learning to automatically target our mass emails across each week’s set of campaigns.
![Page 26: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/26.jpg)
We started from here...
![Page 27: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/27.jpg)
And finished here...
![Page 28: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/28.jpg)
First: Incrementally extract (and verify) MySQL data to Amazon S3
![Page 29: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/29.jpg)
Best Practice:
Incrementally extract with high watermarking.
(not wall-clock intervals)
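A minimal sketch of high-watermark extraction, assuming an auto-incrementing `id` column serves as the watermark. The talk extracts from MySQL; `sqlite3` stands in here only so the example runs anywhere, and the table and function names are hypothetical.

```python
import sqlite3


def extract_increment(conn, table, watermark_column, last_watermark):
    """Return rows above the stored high watermark, plus the new watermark."""
    # Identifiers are interpolated directly: this sketch assumes trusted names.
    query = (
        f"SELECT * FROM {table} "
        f"WHERE {watermark_column} > ? ORDER BY {watermark_column}"
    )
    cursor = conn.execute(query, (last_watermark,))
    columns = [description[0] for description in cursor.description]
    rows = cursor.fetchall()
    if not rows:
        # Nothing new: keep the old watermark so the next run starts there.
        return rows, last_watermark
    position = columns.index(watermark_column)
    return rows, rows[-1][position]


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE signatures (id INTEGER PRIMARY KEY, user_id INTEGER)")
    conn.executemany(
        "INSERT INTO signatures (user_id) VALUES (?)", [(10,), (11,), (12,)]
    )
    first, mark = extract_increment(conn, "signatures", "id", 0)
    conn.execute("INSERT INTO signatures (user_id) VALUES (13)")
    second, mark = extract_increment(conn, "signatures", "id", mark)
    return first, second, mark
```

Because each run resumes from the last value actually extracted rather than from a clock time, a late or failed run neither skips rows nor extracts them twice.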
![Page 30: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/30.jpg)
Best Practice:
Verify data continuity after extract.
We used Cascading/Amazon EMR + Amazon SNS.
![Page 31: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/31.jpg)
Transform extracted data on S3 into a “Feature Matrix” using Cascading/Hadoop on Amazon Elastic MapReduce
100-instance EMR cluster
![Page 32: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/32.jpg)
A Feature Matrix is just a text file.
Sparse vector file line format, one line per user.
<user_id>[ <feature_id>:<feature_value>]...
Example:
123 12:0.237 18:1 101:0.578
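The line format above can be read and written with a few lines of Python. This is a sketch that assumes integer user and feature IDs and whitespace-separated pairs; the function names are hypothetical.

```python
def parse_feature_line(line):
    """Parse '<user_id> <feature_id>:<value> ...' into (user_id, {id: value})."""
    fields = line.split()
    user_id = int(fields[0])
    features = {}
    for pair in fields[1:]:
        feature_id, value = pair.split(":", 1)
        features[int(feature_id)] = float(value)
    return user_id, features


def format_feature_line(user_id, features):
    """Write one sparse vector line, with features sorted by feature ID."""
    pairs = " ".join(
        f"{feature_id}:{value:g}" for feature_id, value in sorted(features.items())
    )
    return f"{user_id} {pairs}".rstrip()
```

Only non-zero features appear on a line, which is what keeps the file small enough to fit on a single instance.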
![Page 33: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/33.jpg)
So how do we do
big-compute Machine Learning?
![Page 34: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/34.jpg)
Enter Amazon
• Simple Workflow Service (SWF)
• Elastic Compute Cloud (EC2)
![Page 35: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/35.jpg)
SWF and EC2 allowed us to decouple:
• Control (and error) flow
• Task business logic
• Compute resource provisioning
![Page 36: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/36.jpg)
SWF provides a distributed application model
![Page 37: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/37.jpg)
Decider processes make discrete workflow decisions
Independent task lists (queues) are processed by task list-affined worker processes (thus coupling task types to provisioned resource types)
SWF provides a distributed application model
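The decider and task list roles can be illustrated with a toy in-memory model. This is not the SWF API: the class, the task list names, and the two-step plan are all hypothetical, chosen only to show how routing a task to a named list couples it to the workers (and instance types) polling that list.

```python
from collections import deque


class TaskLists:
    """Toy stand-in for SWF task lists: one FIFO queue per named list."""

    def __init__(self):
        self._queues = {}

    def schedule(self, task_list, task):
        self._queues.setdefault(task_list, deque()).append(task)

    def poll(self, task_list):
        # A worker polls only the list it is affined to.
        queue = self._queues.get(task_list)
        return queue.popleft() if queue else None


def decide(completed_step):
    """Toy decider: map the last completed step to (task_list, next_activity)."""
    plan = {
        None: ("train-multicore", "train_model"),
        "train_model": ("predict-singlecore", "predict_scores"),
    }
    # Returns None when the workflow has no further step.
    return plan.get(completed_step)
```

A worker started on a large instance would poll `train-multicore`, while several workers on other instances poll `predict-singlecore`; the decider never needs to know which machines exist.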
![Page 38: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/38.jpg)
Allows deciders and workers to be implemented in any language.
We used Ruby, with ML calculations done by Python, R, or C.
SWF provides a distributed application model
![Page 39: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/39.jpg)
Rich web interface via the AWS Management Console.
Flexible API for control and monitoring.
SWF provides a distributed application model
![Page 40: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/40.jpg)
Resource Provisioning with EC2
Our EC2 instances each provide service via Simple Workflow Service for a single Feature Matrix file.
![Page 41: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/41.jpg)
Simplifying Assumption:
Full feature matrix file fits on the disk of an m1.medium EC2 instance (although we compute it with a 100-instance EMR cluster)
![Page 42: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/42.jpg)
Best Practice:
Treat compute resources as
hotel rooms, not mansions.
![Page 43: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/43.jpg)
Worker EC2 Instance bootstrap from base Amazon Machine Image (AMI)
EC2 instance tags provide highly visible, searchable configuration.
Update local git repo to configured software version.
![Page 44: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/44.jpg)
EC2 instance tags
![Page 45: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/45.jpg)
Best Practice:
Log bootstrap steps to S3, mapping essential config tags to EC2 instance names and log files
![Page 46: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/46.jpg)
Amazon SWF and EC2 allowed us to build a common reliable scaffold for R&D and production Machine Learning systems.
![Page 47: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/47.jpg)
Provisioning in R&D for Training
• Used 100 small EC2 instances to explore the Support Vector Machine (SVM) algorithm to repeatedly brute-force search a 1000-combination parameter space
• Used a 32-core on-premises box to explore a Random Forest implementation in multithreaded Python
![Page 48: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/48.jpg)
Provisioning in Production
• Train with a single SWF worker using multiple cores (multithreaded Python Random Forest)
• Predict with 8 SWF workers — 1 per core, 4 cores per instance
Start n m3.2xlarge EC2 instances on-demand for each campaign in the sample group
![Page 49: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/49.jpg)
Provisioning in Production
![Page 50: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/50.jpg)
![Page 51: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/51.jpg)
Best Practice:
Use Amazon SWF to decouple and defer crucial provisioning and application design decisions until you’re getting results.
![Page 52: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/52.jpg)
Forward scale
So from here,
how can we expect this system to scale?
![Page 53: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/53.jpg)
Forward scale
• Run more EMR instances to build Feature Matrix
• Run more SWF predict workers per campaign
for 10x users
![Page 54: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/54.jpg)
Forward scale
• already automatically start a SWF worker group per campaign
• “user generated campaigns” require no campaigner time and are targeted automatically
for 10x campaigns
![Page 55: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/55.jpg)
Forward scale
• system eliminates mass email targeting contention, so team can scale
for 2x+ campaigners
![Page 56: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/56.jpg)
Win for our Campaigners... and Users.
Our user base can now be automatically segmented across a wide pool of campaigns, even internationally.
30%+ conversion boost over manual targeting.
![Page 57: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/57.jpg)
![Page 58: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/58.jpg)
Do you build systems like these? Do you want to?
We’d love to talk. (And yes, we’re hiring.)
![Page 59: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/59.jpg)
UNSILO
Dr. Francisco Roque, Co-Founder and CTO
![Page 60: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/60.jpg)
A collaborative search platform that helps you see patterns across Science & Innovation
![Page 61: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/61.jpg)
Mission
UNSILO breaks down silos and makes it easy and fast for you to find relevant knowledge written in domain-specific
terminologies
![Page 62: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/62.jpg)
Describe Discover Analyze & Share
Unsilo
![Page 63: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/63.jpg)
New way of searching
![Page 64: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/64.jpg)
Big Data Challenges
4.5 million USPTO granted patents
12 million scientific articles
Heterogeneous processing pipeline
(multiple steps, variable times)
![Page 65: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/65.jpg)
A small test
1000 documents
20 minutes/doc average
![Page 66: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/66.jpg)
A bigger test
100k documents
3.8 years?
![Page 67: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/67.jpg)
A bigger test
100k documents
8x8 cores
~21 days
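The estimates on these slides follow from the 20 minutes/doc average. A quick check, assuming "8x8 cores" means 64 cores in total and perfectly parallel processing (both are assumptions, not stated in the talk):

```python
MINUTES_PER_DOC = 20
DOCUMENTS = 100_000
CORES = 8 * 8  # assumed reading of "8x8 cores"

total_minutes = MINUTES_PER_DOC * DOCUMENTS

# One core, running around the clock.
serial_years = total_minutes / (60 * 24 * 365)

# Work split evenly across all cores with no coordination overhead.
parallel_days = total_minutes / CORES / (60 * 24)

print(f"serial: {serial_years:.1f} years, parallel: {parallel_days:.1f} days")
```

The serial figure comes out at about 3.8 years and the 64-core figure at a little under 22 days, in line with the slides.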
![Page 68: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/68.jpg)
4.5 million patents?
12 million articles?
![Page 69: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/69.jpg)
Focus on the goal
![Page 70: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/70.jpg)
Amazon SWF to the rescue
• Scaling
• Concurrency
• Reliability
• Flexibility to experiment
• Easily adaptable
![Page 71: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/71.jpg)
SWF makes it very easy to separate algorithmic logic and workflow logic
![Page 72: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/72.jpg)
Easy to get started: First document batch running in just 2 weeks
![Page 73: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/73.jpg)
AWS services
![Page 74: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/74.jpg)
Adding content
![Page 75: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/75.jpg)
Job Loading
• Content loaded by traversing S3 buckets
• Reprocessing by traversing tables on DynamoDB
DynamoDB
![Page 76: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/76.jpg)
Decision Workers
• Crawls Workflow History for Decision Tasks
• Schedules new Activity Tasks
DynamoDB
![Page 77: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/77.jpg)
Activity Workers
• Read/write to S3
• Status in DynamoDB
• SWF task inputs passed between workflow steps
• Specialized workers
DynamoDB
![Page 78: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/78.jpg)
Best practice
Use DynamoDB for content status
Index on different columns (local indexes)
More efficient content status queries
Give me all the items that completed step X
Elastic service!
![Page 79: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/79.jpg)
Key to scalability
File organization on S3 for scalability
– 50 req/s with the naïve approach
– >1500 req/s

Naïve keys:
logs/2013-11-14T23:01:34/...
logs/2013-11-14T23:01:23/...
logs/2013-11-14T23:01:15/...

Reversed-timestamp keys:
43:10:32T41-11-3102/logs/...
32:10:32T41-11-3102/logs/...
51:10:32T41-11-3102/logs/...

http://aws.typepad.com/aws/2012/03/amazon-s3-performance-tips-tricks-seattle-hiring-event.html
http://goo.gl/JnaQZV
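The second set of keys on this slide is the first set with the timestamp reversed and moved to the front, so that consecutive keys no longer share a long common prefix. A small sketch of that transformation (the function name and the `worker.log` suffix are hypothetical):

```python
def reversed_timestamp_key(timestamp, suffix):
    """Build an S3 key that starts with the reversed timestamp."""
    # Reversing puts the fast-changing seconds digits first, which spreads
    # consecutive keys across different leading characters.
    return f"{timestamp[::-1]}/logs/{suffix}"


keys = [
    reversed_timestamp_key(stamp, "worker.log")
    for stamp in (
        "2013-11-14T23:01:34",
        "2013-11-14T23:01:23",
        "2013-11-14T23:01:15",
    )
]
```

Reversing the prefix again recovers the original timestamp, so no information is lost by storing keys this way.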
![Page 80: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/80.jpg)
Gearing
Ratio?
![Page 81: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/81.jpg)
Monitoring
Give me all the workers/instances that have not responded in the past hour
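A monitoring question like this reduces to filtering last-seen timestamps. The sketch below assumes each worker records a heartbeat time somewhere queryable (the talk keeps status in DynamoDB); here a plain dict stands in for that store, and the names are hypothetical.

```python
from datetime import datetime, timedelta


def stale_workers(last_seen, now, max_silence=timedelta(hours=1)):
    """Return the IDs of workers whose last heartbeat is older than max_silence."""
    return sorted(
        worker_id
        for worker_id, seen_at in last_seen.items()
        if now - seen_at > max_silence
    )
```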
![Page 82: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/82.jpg)
Amazon SWF components
DynamoDB
![Page 83: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/83.jpg)
Throttling and eventual consistency
Failed? Try again.
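"Try again" is usually implemented as a retry loop with exponential backoff and jitter. The talk does not say which strategy was used, so the following is only one common way to do it; the function name and parameters are hypothetical.

```python
import random
import time


def retry_with_backoff(operation, is_retryable, max_attempts=5,
                       base_delay=0.1, sleep=time.sleep):
    """Call operation(), retrying retryable failures with jittered backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception as error:
            # Give up on the last attempt or on errors that will not go away.
            if attempt == max_attempts - 1 or not is_retryable(error):
                raise
            # Wait a random time up to base_delay * 2^attempt before retrying.
            sleep(random.uniform(0, base_delay * 2 ** attempt))
```

The `sleep` parameter exists so that callers and tests can substitute their own waiting function.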
![Page 84: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/84.jpg)
Development environment
![Page 85: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/85.jpg)
Huge benefits
100k documents: from 21 days to < 1 hour
4.5 million USPTO patents: ~30 hours
![Page 86: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/86.jpg)
Huge benefits
Focus on our goal, faster time to market
Using Spot instances, 1/10 cost
![Page 87: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/87.jpg)
Key SWF Takeaways
Flexibility – Room for experimentation
Transparency – Easy to adapt
Growing with the system – Not constrained by the framework
Decider
Worker
Worker
Amazon SWF
![Page 88: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/88.jpg)
UNSILO
www.unsilo.com
Sign up to be invited for the Public Beta
![Page 89: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/89.jpg)
[Diagram: Automating Big Data Workflows. Two halves: Automating Compute (Amazon SWF) and Automating Data (AWS Data Pipeline). Labels: Decider, Worker, Activity, Data Node.]
![Page 90: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/90.jpg)
Compute Resources
Data Data
Data Stores Data Stores
AWS Data Pipeline Your ETL in the Cloud
![Page 91: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/91.jpg)
AWS Data Pipeline Patterns (Activity Workers)
[Diagram: three pattern groups (Intra-region ETL, Inter-region ETL, Cloud-On-Prem ETL) with flows between S3, EMR, DynamoDB, RDS, EC2, Redshift, and Hive/Pig.]
![Page 92: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/92.jpg)
Fred Benenson, Data Engineer
![Page 93: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/93.jpg)
A new way to fund creative projects:
All-or-nothing fundraising.
![Page 94: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/94.jpg)
5.1 million people have backed a project
![Page 95: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/95.jpg)
51,000+ successful projects
![Page 96: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/96.jpg)
44% of projects hit their goal
![Page 97: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/97.jpg)
$872 million pledged
![Page 98: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/98.jpg)
78% of projects raise under $10,000
51 projects raised more than $1 million
![Page 99: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/99.jpg)
Project case study: Oculus Rift
![Page 100: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/100.jpg)
Data @
• We have many different data sources
• Some relational data, like MySQL on Amazon RDS
• Other unstructured data like JSON stored in a third-party service like Mixpanel
• What if we want to JOIN between them in Amazon Redshift?
![Page 101: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/101.jpg)
Case study: Find the users that have Page View A but not User Action B
• Page View A is instrumented in Mixpanel, a third-party service whose API we have access to:
{ "Page View A", { user_uid : 1231567, ... } }
• But User Action B is just the existence of a timestamp in a MySQL row:
6975, User Action B, 1231567, 2012-08-31 21:55:46
6976, User Action B, 9123811, NULL
6977, User Action B, 2913811, NULL
![Page 102: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/102.jpg)
Redshift to the Rescue!
SELECT
  users.id,
  COUNT(DISTINCT CASE WHEN user_actions.timestamp IS NOT NULL
                      THEN user_actions.id ELSE NULL END) AS event_b_count
FROM users
INNER JOIN mixpanel_events
  ON mixpanel_events.user_uid = users.uid
  AND mixpanel_events.event = 'Page View A'
LEFT JOIN user_actions
  ON user_actions.user_id = users.id
GROUP BY users.id
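The query on this slide can be tried locally on toy data. The sketch below uses `sqlite3` as a stand-in for Redshift; the table layouts and sample rows are hypothetical, inferred only from the column names the query references.

```python
import sqlite3

QUERY = """
SELECT users.id,
       COUNT(DISTINCT CASE WHEN user_actions.timestamp IS NOT NULL
                           THEN user_actions.id ELSE NULL END) AS event_b_count
FROM users
INNER JOIN mixpanel_events
        ON mixpanel_events.user_uid = users.uid
       AND mixpanel_events.event = 'Page View A'
LEFT JOIN user_actions ON user_actions.user_id = users.id
GROUP BY users.id
"""

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER, uid INTEGER);
CREATE TABLE mixpanel_events (user_uid INTEGER, event TEXT);
CREATE TABLE user_actions (id INTEGER, name TEXT, user_id INTEGER, timestamp TEXT);
INSERT INTO users VALUES (1, 1231567), (2, 9123811);
INSERT INTO mixpanel_events VALUES (1231567, 'Page View A'), (9123811, 'Page View A');
INSERT INTO user_actions VALUES
    (6975, 'User Action B', 1, '2012-08-31 21:55:46'),
    (6976, 'User Action B', 2, NULL);
""")

rows = sorted(conn.execute(QUERY).fetchall())
# Users who saw Page View A but never completed User Action B.
viewed_but_no_action = [user_id for user_id, count in rows if count == 0]
```

With this data, user 1 has one completed action and user 2 has none, so only user 2 ends up in `viewed_but_no_action`.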
![Page 103: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/103.jpg)
How do we automate the data flow to keep it fresh daily?
![Page 104: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/104.jpg)
![Page 105: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/105.jpg)
![Page 106: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/106.jpg)
![Page 107: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/107.jpg)
![Page 108: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/108.jpg)
But how do we get the data to Redshift?
![Page 109: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/109.jpg)
This is where AWS Data Pipeline comes in.
![Page 110: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/110.jpg)
Pipeline 1: RDS to Redshift - Step 1
First, we run sqoop on Elastic MapReduce to extract MySQL tables into CSVs.
AWS
![Page 111: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/111.jpg)
Pipeline 1: RDS to Redshift - Step 2
Then we run another Elastic MapReduce streaming job to convert NULLs into empty strings for Redshift.
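A streaming step like this can be a small line-by-line filter. The sketch below assumes comma-separated input and that nulls appear as the literal token `NULL`, as in the sample rows shown earlier; the actual token depends on how the extract was configured, and the function names are hypothetical.

```python
import csv
import io


def blank_nulls(line, null_token="NULL"):
    """Replace fields equal to null_token with empty strings in one CSV line."""
    row = next(csv.reader([line]))
    # Note: a genuine string value equal to null_token is blanked as well.
    cleaned = ["" if field == null_token else field for field in row]
    out = io.StringIO()
    csv.writer(out, lineterminator="").writerow(cleaned)
    return out.getvalue()


def run_mapper(lines):
    """Apply blank_nulls to each input line, as a streaming mapper would."""
    for line in lines:
        yield blank_nulls(line.rstrip("\n"))
```

In a streaming job, `run_mapper` would be fed standard input and its results written to standard output, one line per record.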
![Page 112: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/112.jpg)
• 150 - 200 gigabytes
• New DB every day, drop old tables
• Using AWS Data Pipeline’s 1-day ‘now’ schedule
Pipeline 1: RDS to Redshift - Transfer to S3
![Page 113: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/113.jpg)
Pipeline 1: RDS to Redshift Again
Run a similar pipeline job in parallel for our other database.
![Page 114: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/114.jpg)
Pipeline 2: Mixpanel to Redshift - Step 1
Spin up an EC2 instance to download the day’s data from Mixpanel.
![Page 115: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/115.jpg)
Use Elastic MapReduce to transform Mixpanel’s unstructured JSON into CSVs.
Pipeline 2: Mixpanel to Redshift - Step 2
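Flattening event JSON into CSV might look like the sketch below. It assumes one JSON object per line with an `event` name and a `properties` object, which is how the earlier Mixpanel example is shaped; the function name and the chosen columns are hypothetical.

```python
import csv
import io
import json


def events_to_csv(json_lines, property_names):
    """Flatten JSON event lines into CSV with a fixed set of property columns."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["event"] + list(property_names))
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the export
        record = json.loads(line)
        properties = record.get("properties", {})
        # Missing properties become empty fields so every row has equal width.
        writer.writerow(
            [record.get("event", "")]
            + [properties.get(name, "") for name in property_names]
        )
    return out.getvalue()
```

Fixing the column list up front is what turns unstructured events into rows that a warehouse table can load.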
![Page 116: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/116.jpg)
• 9-10 GB per day
• Incremental data
• 2.2+ billion events
• Backfilled a year in 7 days
Pipeline 2: Mixpanel to Redshift - Transfer to S3
![Page 117: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/117.jpg)
AWS Data Pipeline Best Practices
• JSON / CLI tools are crucial
• Build scripts to generate JSON
• ShellCommandActivity is powerful
• Really invest time to understand scheduling
• Use S3 as the “transport” layer
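"Build scripts to generate JSON" could be as simple as the sketch below, which emits a small pipeline definition with a schedule, an EC2 resource, and a ShellCommandActivity. The object IDs, default values, and function name are hypothetical, and the field names follow the pipeline definition file syntax as best understood here; check them against the AWS Data Pipeline documentation before use.

```python
import json


def build_pipeline_definition(command, period="1 day",
                              start="2013-11-14T00:00:00"):
    """Return a pipeline definition (JSON text) that runs one shell command."""
    objects = [
        {
            "id": "DailySchedule",
            "type": "Schedule",
            "period": period,
            "startDateTime": start,
        },
        {
            "id": "WorkerInstance",
            "type": "Ec2Resource",
            "schedule": {"ref": "DailySchedule"},
            "terminateAfter": "2 hours",
        },
        {
            "id": "DownloadActivity",
            "type": "ShellCommandActivity",
            "command": command,
            "runsOn": {"ref": "WorkerInstance"},
            "schedule": {"ref": "DailySchedule"},
        },
    ]
    return json.dumps({"objects": objects}, indent=2)
```

Generating the definition from code keeps references such as `runsOn` and `schedule` consistent when the same pipeline shape is reused for several data sources.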
![Page 118: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/118.jpg)
AWS Data Pipeline Takeaways for Kickstarter
15 years ago: $1 million or more
5 years ago: Open source + staff & infrastructure
Now: ~$80 a month on AWS
![Page 119: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/119.jpg)
“It just works”
![Page 120: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/120.jpg)
[Diagram: Automating Big Data Workflows. Two halves: Automating Compute (Amazon SWF) and Automating Data (AWS Data Pipeline). Labels: Decider, Worker, Activity, Data Node.]
![Page 121: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/121.jpg)
![Page 122: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/122.jpg)
Big Thank You to Customer Speakers!
Jinesh Varia
@jinman
![Page 123: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/123.jpg)
More Sessions on SWF and AWS Data Pipeline
SVC101 - 7 Use Cases in 7 Minutes Each : The Power of Workflows and Automation (Next Up in this room)
BDT207 - Orchestrating Big Data Integration and Analytics Data Flows with AWS Data Pipeline (Next Up in Sao Paulo 3406)
![Page 124: Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013](https://reader038.fdocuments.net/reader038/viewer/2022103013/540dd58e8d7f728d7e8b4b23/html5/thumbnails/124.jpg)
Please give us your feedback on this presentation
As a thank you, we will select prize winners daily for completed surveys!
SVC201