Amazon Redshift in Action: Enterprise, Big Data, and SaaS Use Cases (DAT205) | AWS re:Invent 2013
© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
DAT 205 - Amazon Redshift in Action
Enterprise, Big Data, and SaaS Use Cases
November 15, 2013
Amazon Redshift
Fast, simple, petabyte-scale data warehousing for less than $1,000/TB/Year
Amazon Redshift architecture
• Leader Node
  – SQL endpoint
  – Stores metadata
  – Coordinates query execution
• Compute Nodes
  – Local, columnar storage
  – Execute queries in parallel
  – Load, backup, restore via Amazon S3
  – Parallel load from Amazon DynamoDB
• Single node version available
[Diagram labels: JDBC/ODBC, 10 GigE (HPC), Ingestion, Backup, Restore]
Amazon Redshift is priced to let you analyze all your data
Plan                 Price Per Hour for HS1.XL Single Node   Effective Hourly Price per TB   Effective Annual Price per TB
On-Demand            $0.850                                  $0.425                          $3,723
1 Year Reservation   $0.500                                  $0.250                          $2,190
3 Year Reservation   $0.228                                  $0.114                          $999
Simple Pricing
Number of Nodes x Cost per Hour
No charge for Leader Node
No upfront costs
Pay as you go
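The effective per-terabyte figures on the pricing slide follow from the "Number of Nodes x Cost per Hour" rule. A minimal sketch of that arithmetic, assuming 2 TB of storage per HS1.XL node (implied by the slide, where the hourly price per TB is half the node price) and 8,760 hours per year. The function and constant names are illustrative, not part of any AWS API.

```python
HOURS_PER_YEAR = 24 * 365   # 8,760; assumes a non-leap year
TB_PER_HS1_XL_NODE = 2      # assumption implied by the slide (0.850 / 0.425)

# Hourly price per HS1.XL node, as listed on the slide.
HOURLY_NODE_PRICE = {
    "On-Demand": 0.850,
    "1 Year Reservation": 0.500,
    "3 Year Reservation": 0.228,
}


def cluster_cost_per_hour(num_nodes, price_per_node_hour):
    """Simple pricing: number of nodes x cost per hour (no leader node charge)."""
    return num_nodes * price_per_node_hour


def effective_annual_price_per_tb(price_per_node_hour):
    """Spread the hourly node price over the node's storage, then over a year."""
    return price_per_node_hour / TB_PER_HS1_XL_NODE * HOURS_PER_YEAR


for plan, price in HOURLY_NODE_PRICE.items():
    annual = effective_annual_price_per_tb(price)
    print(f"{plan}: ${annual:,.0f} per TB per year")
```

Rounded to whole dollars, the three results match the Effective Annual Price per TB column, which is where the "less than $1,000/TB/Year" headline comes from.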
Data Warehousing for Capital Markets
Jason Timmes, AVP of Software Development, NASDAQ OMX
November 15, 2013
Where innovation meets action
• We list ~3,300 global companies worth $6 trillion in market cap, representing diverse industries and many of the world’s most well-known and innovative brands
• We own and operate 26 markets, including 3 clearinghouses and 5 central securities depositories
• More than 5,500 structured products are tied to our global indexes, with a notional value of at least $1 trillion
• Our technology is used to power more than 70 marketplaces in 50 countries
• Our global platform can handle more than 1 million messages/second at a median speed of sub-55 microseconds
• We power 1 in 10 of the world’s securities transactions
What I do
New data and analytics platforms to store and
serve data to internal and external customers.
The Challenge
• Archiving Market Data – classic “Big Data” problem
• Power Surveillance and Business Intelligence/Analytics
• Minimize cost – Not only infrastructure, but development/IT labor costs too
• Empower the business for self-service
Financial Information Forum. Redistribution without permission from FIF prohibited.
SIP Total Monthly Message Volumes: OPRA, UQDF and CQS
OPRA Annual Increase: 69%
CQS Annual Increase: 10%
UQDF Annual Decrease: 6%

OPRA
Date     Total Monthly Message Volume   Average Daily Volume
Aug-12   80,600,107,361                 3,504,352,494
Sep-12   77,303,404,427                 4,068,600,233
Oct-12   98,407,788,187                 4,686,085,152
Nov-12   104,739,265,089                4,987,584,052
Dec-12   81,363,853,339                 4,068,192,667
Jan-13   82,227,243,377                 3,915,583,018
Feb-13   87,207,025,489                 4,589,843,447
Mar-13   93,573,969,245                 4,678,698,462
Apr-13   123,865,614,055                5,630,255,184
May-13   134,587,099,561                6,117,595,435
Jun-13   162,771,803,250                8,138,590,163
Jul-13   120,920,111,089                5,496,368,686
Aug-13   136,237,441,349                6,192,610,970

UQDF and CQS
Date     UQDF Total Monthly Message Volume   CQS Total Monthly Message Volume   Combined Average Daily Volume
Aug-12   2,317,804,321                       8,241,554,280                      459,102,548
Sep-12   1,948,330,199                       7,452,279,225                      494,768,917
Oct-12   1,016,336,632                       7,452,279,225                      403,267,422
Nov-12   2,148,867,295                       9,552,313,807                      557,199,100
Dec-12   2,017,355,401                       8,052,399,165                      503,487,728
Jan-13   2,099,233,536                       7,474,101,082                      455,873,077
Feb-13   1,969,123,978                       7,531,093,813                      500,011,463
Mar-13   2,010,832,630                       7,896,498,260                      495,366,545
Apr-13   2,447,109,450                       9,805,224,566                      556,924,273
May-13   2,400,946,680                       9,430,865,048                      537,809,624
Jun-13   2,601,863,331                       11,062,086,463                     683,197,490
Jul-13   2,142,134,920                       8,266,215,553                      473,106,840
Aug-13   2,188,338,764                       9,079,813,726                      512,188,750
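The annual change figures quoted on this slide can be reproduced from the monthly totals. A short sketch, assuming the slide compares August 2012 with August 2013 (the slide does not state the comparison period; this pairing is an inference that happens to match the rounded percentages). The helper name is illustrative.

```python
# Total monthly message volumes for Aug-12 and Aug-13, copied from the FIF tables.
VOLUMES = {
    "OPRA": {"Aug-12": 80_600_107_361, "Aug-13": 136_237_441_349},
    "CQS": {"Aug-12": 8_241_554_280, "Aug-13": 9_079_813_726},
    "UQDF": {"Aug-12": 2_317_804_321, "Aug-13": 2_188_338_764},
}


def annual_change_pct(start, end):
    """Year-over-year change as a percentage of the starting volume."""
    return (end - start) / start * 100


for feed, months in VOLUMES.items():
    change = annual_change_pct(months["Aug-12"], months["Aug-13"])
    direction = "increase" if change >= 0 else "decrease"
    print(f"{feed}: {abs(change):.0f}% annual {direction}")
```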
Market Data Is Big Data
[Chart: NASDAQ Exchange Daily Peak Messages, Jan-13 through Sep-13; y-axis from 0 to 600,000,000]
Charts courtesy of the Financial Information Forum
Our legacy solution
• On-premises MPP DB
  – Relatively expensive, finite storage
  – Required periodic additional expenses to add more storage
  – Ongoing IT (administrative) human costs
• Legacy BI tool
  – Requires developer involvement for new data sources, reports, dashboards, etc.
New Solution: Amazon Redshift
• Cost Effective
  – Redshift is 43% of the cost of legacy
    • Assuming equal storage capacities
  – Doesn’t include IT ongoing costs!
• Performance
  – Easily outperforms our legacy BI/DB solution
  – Insert 550K rows/second on a 2 node 8XL cluster
• Elastic
  – Add additional capacity on demand, easy to grow our cluster
New Solution: Pentaho BI/ETL
• Amazon Redshift partner
  – http://aws.amazon.com/redshift/partners/pentaho/
• Self Service
  – Tools empower BI users to integrate new data sources, create their own analytics, dashboards, and reports without requiring development involvement
• Cost effective
Net Result
• New solution is cheaper, faster, and offers capabilities that our business didn’t have before
  – Empowers our business users to explore data like they never could before
  – Reduces IT and development as bottlenecks
  – Margin improvement (expense reduction and supports business decisions to grow revenue)
HauteLook + Amazon Redshift
A Case Study
Kevin Diamond, HauteLook
November 15, 2013
Who am I? Kevin Diamond
• CTO of HauteLook, a Nordstrom Company
• Oversee all technology, infrastructure, data,
engineering, etc.
• Major focus on great customer experience and
the analytics to provide it
What is HauteLook?
• Private sale, members-only limited-time sale events
• Premium fashion and lifestyle brands at exclusive prices of
50-75% off
• Over 20 new sale events begin each morning at 8am PST
• Over 14 million members
• Acquired by Nordstrom in 2011
Why a Data Warehouse?
• Centralized storage of multiple data sources
• Singular reporting consistency for all departments
• Data model that supports analytics not transactions
• Operational reports vs. analytical reports – Real-time vs. previous day
Why Amazon Redshift?
• Looked at some competitors:
  – Ranged from $ to $$$
  – All required Software, Implementation and BIG Hardware
• Skipped the RFP
• Jumped into the Public Beta of Amazon Redshift and never looked back
How We Implemented Amazon Redshift
• ETL from MySQL and Microsoft SQL Server into AWS across a Direct Connect line storing on S3
• Also used S3 to dump flat files (iTunes Connect Data, Web Analytics dumps, log files, etc.)
• Used AWS Data Pipeline for executing Sqoop and Hadoop running on EC2 to load data into Amazon Redshift
• Redshift Data Model based on Star Schema which looks something like …
Example of Star Schema
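The star schema slide was an image in the original deck and did not survive extraction. As a rough illustration of the idea only (not HauteLook's actual model), the sketch below builds one fact table surrounded by two dimension tables and runs a typical analytical join. It uses Python's built-in sqlite3 as a local stand-in for Amazon Redshift, and every table and column name is hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for the warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables describe the "who" and "what" (hypothetical names).
CREATE TABLE dim_member (member_key INTEGER PRIMARY KEY, state TEXT);
CREATE TABLE dim_event  (event_key  INTEGER PRIMARY KEY, brand TEXT);

-- The fact table holds measures plus a foreign key to each dimension.
CREATE TABLE fact_sales (
    member_key INTEGER REFERENCES dim_member(member_key),
    event_key  INTEGER REFERENCES dim_event(event_key),
    quantity   INTEGER,
    revenue    REAL
);

INSERT INTO dim_member VALUES (1, 'CA'), (2, 'NY');
INSERT INTO dim_event  VALUES (10, 'Brand A'), (11, 'Brand B');
INSERT INTO fact_sales VALUES (1, 10, 2, 80.0), (2, 10, 1, 40.0), (1, 11, 1, 25.0);
""")

# Analytical query: aggregate the fact table, grouped by a dimension attribute.
revenue_by_brand = conn.execute("""
    SELECT e.brand, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_event AS e ON e.event_key = f.event_key
    GROUP BY e.brand
    ORDER BY e.brand
""").fetchall()

print(revenue_by_brand)
```

The shape is what matters here: reports join the central fact table to whichever dimensions they need, which suits analytics better than a transactional model.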
Usage with Business Intelligence
• Already selected a BI Tool
• Had difficulty deploying in the cloud
• But worked great on-premises
• Easily tied into Amazon Redshift using ODBC Drivers
• BUT, metadata for reports had to live in MSSQL
• Ported many SSIS/SSRS reports over
– But only the analytical reports!
And it all looks like this
Amazon Redshift Instances
• We use a little under 2TB
• Thought to use 2 BIG 8XL instances to get great performance (in passive failover mode)
• Cost us $$$
• Then we tested using 6 XL instances in a cluster
• Performed better and allowed for more concurrency of queries in all but a handful of cases that really needed the 8XL power
• Cost us $
• Duh! That’s why we do distributed everything else!!
Some First Hand Experience
• ETL was hardest part
• Amazon Redshift performs awesome
• Someone needs to make a great client SQL tool
• MicroStrategy works great on it (just wished it loved running in EC2)
• Saving a ton, thanks to:
– No hardware costs
– No maintenance/overhead (rack + power)
– Annual costs are equivalent to just the annual maintenance of some of the cheaper DW on-premises options
Conclusion/Last Advice
• Only use 8XL instances if you need >2TB of space
  – Otherwise distribute on a bunch of XL nodes
• Buy reserved instances (we still need to do this!) since you likely will have this always on
• Although we haven’t yet, the idea of a flexible scale-up/down DW is crazy awesome – maybe during Holiday we will
• Probably could have used Elastic MapReduce instead of Hadoop – wasn’t sure how it would play with Sqoop
• Almost all BI tools play with Amazon Redshift now, so choose what is right for your business, and make sure it works in EC2 before just putting it there
• Communication between AWS and your DC is easy and fast, but I recommend a Direct Connect
• Passed our rigorous information security standards, but used in a VPC
Amazon Redshift in Action: Enterprise, Big Data, and SaaS Use Cases
Parag Thakker – VP, Roundarch Isobar
Colin McGuigan – Architect, Roundarch Isobar
November 15th, 2013
OUR SERVICES ACROSS BOUGHT, OWNED AND EARNED MEDIA
• Strategies: We digitally transform business processes and disrupt industries
  – Business planning: competitive & industry analysis, business cases, maturity models, roadmaps
  – Strategies: brand, interactive, multi-channel, social, content
• Campaigns: We create, measure and optimize digitally-focused campaigns
  – Audience insight
  – Communications planning
  – Creative: advertising, visual design, content creation, studio production
  – Optimization: analytics, monitoring, SEO, MVT, media ROI analysis
• Experiences: We produce joyful experiences that inspire consumer interaction
  – Research: competitive, segmentation, persona development, heuristics
  – Requirements and specifications: content analysis and specs, functional requirements, functional specifications
  – User experience design: information architecture, taxonomy and meta data, interaction design, mobile
• Platforms: We design and build flexible and scalable technology solutions
  – Platforms: content management, search, portals, mobile, front-end technology, internet-enabled devices/wearables, social apps, web services, security, big data, hosting
• Products: We invent digital products that generate new revenue streams
  – Digital products
  – Digital product extensions
  – Brand as a service
U.S. Air Force
We have served the U.S. Air Force since 2001, building their enterprise portal and many mission-critical applications
Key metrics for our USAF work include:
• 900,000+ registered users
• 700,000+ PK-E users
• Response time worldwide: 3 seconds for 80% of all pages
• Over 1.2 million logins/week
• 124,000 unique daily users
• 4-5+ million pages daily (40-70 Mbit/sec)
• Portal availability over 99.9% of time
• 28 production enterprise services
• Over 300 applications available
• Public-facing and secure private instances (NIPR & SIPR)
• Portal support for over 5,000 “Communities of Interest”
New York Jets
Transforming in-stadium operations through a touch-screen command center
Our executive touch-screen environment provides real-time stadium and game data, allowing the Jets owner, Woody Johnson, to monitor the fan experience during game time and make operational decisions that help maximize sales. The command center provides summary-level and drill-down views of stadium operations such as tickets, parking and concessions. It also creates predictive algorithms that help identify pinch points and open revenue opportunities.
“We brought the big picture close enough to identify new, better ways to do business.”
William Blair | Investment Research Management System
Through a joint venture with Copia Capital, we created a new product offering for William Blair
• Facilitates collaboration between portfolio managers and analysts
• Provides a holistic view of a company/stock
  – What is everything our organization knows about AAPL?
• Digitizes PDF/Excel tools and reports to enable rich, dynamic interactions
• Simplifies content creation; e.g., comments, recommendation reports, document upload
• Rich charting and visualization of analytics
Technology:
• JavaScript, HTML5, CSS3
• Uses jQuery, JavaScriptMVC, LESS
• JSON Web Services
• Java, Spring, JPA, MongoDB
• User comment: “We love how fast it is!”
What is the focus of your CMO today?
Optimize marketing spend across all channels (Bought, Earned and Owned)
domain
• Marketing spend: billions
• Media channels: dozens
• Data sources: hundreds
• Data size: multiple terabytes
• Clients: multiple
• Channels: Search, Display Ads, Affiliate, Social, Mobile, Sales, TV, Radio, Web
marketing effectiveness stages
Stages: Analyze, Learn, Optimize
• Centralized cross channel Big Data Platform
• Standardized cross channel reporting tools
• Discovery tools to identify channel optimization opportunities
• Modeling solutions
• Channel experience enhancements
• Improved media buying, planning & reporting functions
• Real time integration into DSP
• A/B testing based micro segment adjustments
[Diagram labels: DLP, AMNET, Scorecard, Compass, Sonar, Real-Time and Non-Real-Time]
So what have we accomplished?
Built Radar, a marketing analytics platform that enables in-time analytics, reporting and optimization for multiple clients with customized metrics. It takes in 200+ feeds (1TB/week) of varying frequency, granularity and classification, and runs as a scalable multi-tenant SaaS platform on Amazon, with the first launch in 3 months.
scorecard dashboard
scorecard logical architecture
Outputs: Detailed Analytic Reports, Scorecard App
Users: Media Team, Client Stakeholders, Media Team Planners, Client Team
Data sources by channel:
• TV: DDS
• Paid Search: Google, Bing, Marin
• Organic Search: Google, Bing
• Sales: TBD
• Digital Video: Custom
• Site Metrics: Google, Omniture
• Display: Google DFA
• Radio: DDS
• Paid Social: Facebook
• Print, OOH: DDS
• Earned Social: Facebook
• Competitive: Custom
data sources
Voluminous Data
[Chart axes: DATA VOLUME vs. VARIETY and GRANULARITY]
Categories: Digital, CRM, Research
Data types: Surveys, Demographics, Campaigns, Search, Mobile, Attribution, Site, Social, Display, Cookie Level, UGC, Geospatial, Weather, Sales, Competitive
tech architecture
[Architecture diagram labels]
• Feeds: Radio, Display Ads, Search, Social
• AWS components: Hadoop EMR, S3, Redshift, MySQL RDS, EC2 Beanstalk
• Consumers: Analysts (BI Tools), Clients (SaaS Reporting Platform, WWW)
ETL
• Extract: Files loaded on Amazon S3/Amazon Glacier
• Transform: Utilize Pig on Amazon EMR to cleanse, standardize and validate the data
• Load: Use COPY to load Pig output
[Diagram labels: Feeds (Radio, Display Ads, Search, Social), S3, Glacier, Hadoop EMR, Redshift]
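The load step of this ETL slide can be sketched as building a Redshift COPY statement that points at the S3 prefix where Pig wrote its output. This is a minimal illustration, not the presenters' actual job: the table name, bucket, prefix, and IAM role ARN are hypothetical, and the tab delimiter assumes Pig's default PigStorage output format. The function only builds the SQL text; running it would need a database connection, which is left out.

```python
def build_copy_statement(table, s3_prefix, iam_role_arn, delimiter="\\t"):
    """Build a COPY statement that loads every file under an S3 prefix.

    Inputs are assumed to be trusted configuration values, not user input,
    because they are interpolated directly into the SQL text.
    """
    return (
        f"COPY {table}\n"
        f"FROM '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"DELIMITER '{delimiter}';"
    )


# Hypothetical values for illustration only.
statement = build_copy_statement(
    table="display_ads_daily",
    s3_prefix="s3://example-bucket/pig-output/display/",
    iam_role_arn="arn:aws:iam::123456789012:role/example-redshift-load",
)
print(statement)
```

Pointing COPY at a prefix rather than a single file lets the cluster read the Pig part files in parallel across compute nodes.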
data warehouse
• ODBC and JDBC access for BI / ad hoc analysis
• Performance: cheap, fast, easily scalable
• Handles humongous aggregation quickly
[Diagram labels: Redshift; Tableau, BI Tools; Analysts]
aggregation
• Multi-step aggregation in Amazon Redshift using SQL
• Mapping: join performance data with metadata
• Load aggregates in MySQL for sub-second web response
[Diagram labels: Radio, Display Ads, Search, Social; Redshift (SQL); Views, Clicks, CTR, CPC, etc.; Product, Campaign; Aggregates; MySQL RDS]
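The aggregation flow on this slide (map raw performance rows to metadata, then roll up metrics such as CTR and CPC per product and campaign) can be sketched in plain Python. The deck does this in Redshift with SQL; the version below only illustrates the logic. The sample rows, placement IDs, and field names are hypothetical, and it uses the common definitions CTR = clicks / views and CPC = cost / clicks.

```python
from collections import defaultdict

# Hypothetical metadata: maps a raw placement ID to (product, campaign).
PLACEMENT_METADATA = {
    "p1": ("Product A", "Spring"),
    "p2": ("Product A", "Spring"),
    "p3": ("Product B", "Fall"),
}

# Hypothetical performance rows: (placement_id, views, clicks, cost_usd).
PERFORMANCE = [
    ("p1", 1000, 20, 10.0),
    ("p2", 3000, 30, 15.0),
    ("p3", 500, 5, 5.0),
]


def aggregate(performance, metadata):
    """Join performance rows to metadata, then sum and derive ratios."""
    totals = defaultdict(lambda: {"views": 0, "clicks": 0, "cost": 0.0})
    for placement_id, views, clicks, cost in performance:
        key = metadata[placement_id]  # mapping step: the "join"
        totals[key]["views"] += views
        totals[key]["clicks"] += clicks
        totals[key]["cost"] += cost

    result = {}
    for key, t in totals.items():
        # Ratios are derived from the summed totals, never averaged per row.
        ctr = t["clicks"] / t["views"] if t["views"] else 0.0
        cpc = t["cost"] / t["clicks"] if t["clicks"] else 0.0
        result[key] = {**t, "ctr": ctr, "cpc": cpc}
    return result


aggregates = aggregate(PERFORMANCE, PLACEMENT_METADATA)
for (product, campaign), row in sorted(aggregates.items()):
    print(product, campaign, row)
```

Computing the ratios after summing matters once several granularities are involved: averaging per-row CTR values would weight a 500-view placement the same as a 3,000-view one.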
data workflow
• Jenkins for client+channel ETL: job control dashboard
• Amazon DynamoDB for state management: data intake/extract
• Ruby for provisioning, job flow
• On demand, job-initiated Amazon EMR clusters
[Diagram labels: Jenkins, Ruby, DynamoDB, S3, Hadoop EMR, Redshift, MySQL RDS]
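The slide mentions per-client, per-channel ETL jobs with state kept in Amazon DynamoDB, but gives no detail on how that state is modeled. The sketch below is therefore purely hypothetical: it shows one way to track which stage each client+channel run has reached, using an in-memory dictionary as a stand-in for a DynamoDB table. The stage names and class are invented for illustration.

```python
# Hypothetical pipeline stages, in the order a run moves through them.
STAGES = ("extracted", "transformed", "loaded", "aggregated")


class JobStateStore:
    """In-memory stand-in for a state table keyed by client, channel, date."""

    def __init__(self):
        self._state = {}

    def advance(self, client, channel, run_date):
        """Move a run to its next stage and return the new stage name."""
        key = (client, channel, run_date)
        current = self._state.get(key)
        next_index = 0 if current is None else STAGES.index(current) + 1
        if next_index >= len(STAGES):
            raise ValueError(f"{key} has already completed every stage")
        self._state[key] = STAGES[next_index]
        return self._state[key]

    def unfinished(self):
        """Runs that have not reached the final stage, in sorted order."""
        return sorted(k for k, stage in self._state.items() if stage != STAGES[-1])


store = JobStateStore()
store.advance("client1", "search", "2013-11-15")  # now "extracted"
store.advance("client1", "search", "2013-11-15")  # now "transformed"
store.advance("client2", "radio", "2013-11-15")   # now "extracted"
print(store.unfinished())
```

Keeping the state outside the job runner means a job control dashboard can show which client+channel runs are stuck, and a restarted job can resume from its last recorded stage.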
SaaS dashboard
• Designed for redundancy: hardware and location
• Multi-Tenant: managed services
• Automated stack provisioning for clients
[Diagram labels: DNS (Client1.com, Client2.com), Load Balancing, EC2 Beanstalk, ElastiCache, MySQL RDS]
AWS advantages
• Highly scalable
• Innovate quickly with reduced risk
• Time to market
• Lower operational overhead
[Diagram labels: US, AMAZON, DevOps, Developers, AWS Ops, Ruby, Python, Java]
learnings
• Metadata is more important than the data
• Design for scalability upfront
• Always explore better ways to aggregate
• Cost management is very important
• Build Agile: Perform early end-to-end validation on smaller dataset
• Separate data visualization, data cleansing, storage & data aggregation
• Be smart about implementing data aggregation routines across multiple granularities