Introduction to Amazon Elastic MapReduce (English)
Uploaded by: ken-tamagawa-amazon-web-services
Amazon Elastic MapReduce
MY BACKGROUND
• Based in Seattle, WA
• Education:
– BS in Computer Science, The American University, 1985
– Graduate student in Digital Media, University of Washington, 2010
• Background:
– Microsoft Visual Studio team
– Consulting to startups and VCs
– Amazon employee since 2002
• Evangelist:
– Speak
– Write
– Tweet
• Author, “Host Your Web Site in the Cloud”
• Email: [email protected]
• Twitter: @jeffbarr
• What is Big Data
• Elastic MapReduce Overview
• Example Use Cases
• Ecosystem and Tools
• Upcoming Features
• Discussion
AGENDA
• Doesn’t refer just to volume
– You can benefit from Big Data infrastructure without having a ton of data
– Many existing technologies have little problem physically handling large volumes
• Challenges result from the combination of data volume, data structure, and usage demands from that data, usually tied to timeliness
• Big Data Tools are needed to provide a holistic view of enterprise data and systematically harness it for insights and trends
WHAT IS BIG DATA?
• Enables customers to easily, securely and cost-effectively process vast amounts of data:
– Spin up hundreds of instances
– Process hundreds of terabytes of data
• Hosted Hadoop framework running on Amazon’s web-scale infrastructure
WHAT IS AMAZON ELASTIC MAPREDUCE
• Launch and monitor job flows
• AWS Management Console
• Command line interface
• REST API
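As a rough illustration of what launching a job flow through the API involves, the sketch below only builds the request body; it does not call AWS. The field names (Name, Instances, Steps, HadoopJarStep) follow the RunJobFlow action as commonly documented, but the helper name, step name, bucket paths and instance defaults are hypothetical, and a real request would likely need additional fields (for example roles and a Hadoop or release version).

```python
def build_job_flow_request(name, jar_uri, args,
                           instance_type="m1.large", instance_count=4):
    """Build a RunJobFlow-style request body (illustrative, not exhaustive).

    All argument values used below are placeholders, not values from the deck.
    """
    return {
        "Name": name,
        "Instances": {
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": instance_count,
            # Shut the cluster down once the last step finishes.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [
            {
                "Name": "main-step",  # hypothetical step name
                "ActionOnFailure": "TERMINATE_JOB_FLOW",
                "HadoopJarStep": {"Jar": jar_uri, "Args": list(args)},
            }
        ],
    }


if __name__ == "__main__":
    request = build_job_flow_request(
        "example-job-flow",
        "s3://example-bucket/jobs/wordcount.jar",   # hypothetical path
        ["s3://example-bucket/input/", "s3://example-bucket/output/"],
    )
    # With an SDK such as boto3 this dict could be passed as keyword
    # arguments, e.g. boto3.client("emr").run_job_flow(**request),
    # assuming credentials and the remaining required fields are supplied.
    print(request["Steps"][0]["HadoopJarStep"]["Jar"])
```

Keeping the request as plain data makes it easy to inspect or log before handing it to whichever client (console, CLI or API) actually launches the job flow.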
WHY USE AMAZON ELASTIC MAPREDUCE
• Elastic MapReduce removes “MUCK” from Big Data processing
– Hard to manage compute clusters
– Hard to tune Hadoop
– Hard to monitor running Job Flows
– Hard to debug Hadoop jobs
– Hadoop issues prevent smooth operation in the cloud
PROBLEMS CUSTOMERS SOLVE WITH ELASTIC MAPREDUCE
• Targeted advertising / Clickstream analysis
• Data warehousing applications
• Bio-informatics (Genome analysis)
• Financial simulation (Monte Carlo simulation)
• File processing (resize JPEGs)
• Web indexing
• Data mining and BI
• Data or I/O Intensive (m1/m2 instances)
– Data Warehouse
– Data Mining
• Click stream, logs, events, etc.
• Compute or I/O Intensive (c1, cc1/HPC instances)
– Credit Ratings
– Fraud Models
– Portfolio analysis
– VaR calculation
HARDWARE REQUIREMENTS FOR USE CASES
CLICKSTREAM ANALYSIS – RAZORFISH AND BEST BUY
• Best Buy came to Razorfish
– 3.5 billion records, 71 million unique cookies, 1.7 million targeted ads required per day
[Figure: targeted ad example. A user who recently purchased a home theater system and is searching for video games (1.7 million per day)]
• Leveraged AWS and Elastic MapReduce
– 100 node cluster on demand
– Processing time dropped from 2+ days to 8 hours
– Increased ROAS (Return on Advertising Spend) by 500%
CLICKSTREAM ANALYSIS – ARCHITECTURE
• Invented by Google
• New processing model
• Highly scalable
• Easy to understand
• Industry standard
• Something worth knowing
WHAT IS MAPREDUCE?
• Take input data
• Break into sub-problems
• Distribute to worker nodes
• Worker nodes process sub-problems in parallel
• Take output of worker nodes and reduce to answer
ELASTIC MAPREDUCE MODEL – OVERVIEW
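The five steps on this slide can be sketched on a single machine. This is only a toy model under stated assumptions: threads stand in for worker nodes, the "problem" (total bytes across log lines) is made up for illustration, and the function names are hypothetical rather than part of Hadoop or Elastic MapReduce.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split(records, n_chunks):
    """Break the input into n_chunks sub-problems (round-robin)."""
    return [records[i::n_chunks] for i in range(n_chunks)]


def process_chunk(chunk):
    """Work done by one worker: total characters in its share of lines."""
    return sum(len(line) for line in chunk)


def run_job(records, n_workers=3):
    # Take input data and break it into sub-problems.
    chunks = split(records, n_workers)
    # Distribute to workers, which process sub-problems in parallel.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(process_chunk, chunks))
    # Reduce the per-worker outputs to a single answer.
    return reduce(lambda a, b: a + b, partials, 0)


if __name__ == "__main__":
    lines = ["GET /a", "GET /bb", "POST /c", "GET /dddd"]
    print(run_job(lines))  # prints 29
```

Because each line is processed independently, changing n_workers changes how the work is spread out but not the final answer.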
MAPREDUCE EXAMPLE – WORD COUNT
• Input
• Map Phase (three Mappers)
– “This”, Doc1
– “Word”, Doc1
– “This”, Doc2
– “This”, Doc3
• Sort
– “This”, Doc1
– “Word”, Doc1
– “This”, Doc2
– “This”, Doc3
– “Word”, Doc3
• Reduce Phase (two Reducers)
• Output
– “This”, 3
– “Word”, 2
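The word count flow can be reproduced with a short in-memory sketch. The slide does not show the input documents, so the document texts below are an assumption chosen to match the pairs listed; the function names are hypothetical, and a real job would run the mapper and reducer on separate nodes rather than in one process.

```python
from itertools import groupby
from operator import itemgetter


def mapper(doc_id, text):
    """Map phase: emit one (word, doc_id) pair per word occurrence."""
    for word in text.split():
        yield (word, doc_id)


def sort_phase(pairs):
    """Sort phase: order pairs by word so equal keys are adjacent."""
    return sorted(pairs, key=itemgetter(0))


def reducer(word, doc_ids):
    """Reduce phase: count how many pairs were emitted for this word."""
    return (word, len(list(doc_ids)))


def word_count(docs):
    mapped = [pair for doc_id, text in docs.items()
              for pair in mapper(doc_id, text)]
    ordered = sort_phase(mapped)
    return dict(
        reducer(word, (doc for _, doc in group))
        for word, group in groupby(ordered, key=itemgetter(0))
    )


if __name__ == "__main__":
    # Assumed document contents, consistent with the pairs on the slide.
    docs = {"Doc1": "This Word", "Doc2": "This", "Doc3": "This Word"}
    print(word_count(docs))  # prints {'This': 3, 'Word': 2}
```

The sort step is what lets each reducer see all pairs for a given word together, which is why groupby is applied only after sorting.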
ELASTIC MAPREDUCE MODEL – DETAILED
ELASTIC MAPREDUCE IN ACTION – S3 LOG FILE
ELASTIC MAPREDUCE IN ACTION – STEP 1
ELASTIC MAPREDUCE IN ACTION – STEP 2
ELASTIC MAPREDUCE IN ACTION – STEP 3
ELASTIC MAPREDUCE IN ACTION – STEP 4
ELASTIC MAPREDUCE IN ACTION – STEP 5
ELASTIC MAPREDUCE IN ACTION – STEP 6
ELASTIC MAPREDUCE IN ACTION – STEP 7
ELASTIC MAPREDUCE IN ACTION – RESULTS
• Mapper and Reducer in Java JAR files
• Scale as large as needed
– Data
– Processing
– Add nodes (even while running) to speed up
• No need to manage intermediate data
• Suitable for certain types of problems
– Record-oriented input
– No dependencies between records
• No more MUCK – focus on your problem
NOTES / ATTRIBUTES
HADOOP + R
Thank You