Big Data on AWS in Korea by Abhishek Sinha (Lunch and Learn)


Big Data Analytics

Abhishek Sinha

Business Development Manager,

AWS

@abysinha

sinhaar@amazon.com

An engineer’s definition

When your data sets become so large that you have to start innovating how to collect, store, organize, analyze, and share them

What does big data look like?

Volume

Velocity

Variety

3Vs

Where is this data coming from?

Human generated

Machine generated

Tweet

Surf the internet

Buy and sell products

Upload images and videos

Play games

Check in at restaurants

Search for cafes

Find deals

Watch content online

Look for directions

Use social media

Human generated

Machine generated

Networks and security devices

Mobile phones

Cell phone towers

Smart grids

Smart meters

Telematics from cars

Sensors on machines

Videos from traffic and security cameras

What are people using this for?

Big Data Verticals and Use cases

Media/Advertising

Targeted Advertising

Image and Video

Processing

Oil & Gas

Seismic Analysis

Retail

Recommendations

Transactions Analysis

Life Sciences

Genome Analysis

Financial Services

Monte Carlo Simulations

Risk Analysis

Security

Anti-virus

Fraud Detection

Image Recognition

Social Network/Gaming

User Demographics

Usage analysis

In-game metrics

Why is big data hard?

Generation

Collection & storage

Analytics & computation

Collaboration & sharing

Generation

Collection & storage

Analytics & computation

Collaboration & sharing

Lower cost, higher throughput

Generation

Collection & storage

Analytics & computation

Collaboration & sharing

Highly constrained

Lower cost, higher throughput

Big gap in turning data into actionable information

Amazon Web Services helps remove constraints

Big Data + Cloud = Awesome Combination

Big data:
• Potentially massive datasets
• Iterative, experimental style of data manipulation and analysis
• Frequently not a steady-state workload; peaks and valleys
• Data is a combination of structured and unstructured data in many formats

AWS Cloud:
• Massive, virtually unlimited capacity
• Iterative, experimental style of infrastructure deployment/usage
• At its most efficient with highly variable workloads
• Tools for managing structured and unstructured data

Generation

Collection & storage

Analytics & computation

Collaboration & sharing

Data size

• Global reach

• Native app for almost every smartphone, SMS, web, mobile-web

• 10M+ users, 15M+ venues, ~1B check-ins

• Terabytes of log data

Stack

Application stack: Scala/Liftweb API machines, WWW machines, and batch jobs running Scala application code against Mongo/Postgres/flat files (databases and logs).

Data stack: Amazon S3 holding database dumps (via mongoexport and postgres dump) and log files (via Flume); Hadoop on Elastic MapReduce; Hive/Ruby/Mahout driving an analytics dashboard and MapReduce jobs.

Stack – Front-end application

The application stack: Scala/Liftweb API machines, WWW machines, and batch jobs running Scala application code against Mongo/Postgres/flat files.

Stack – Collection and storage

The data stack's collection layer: database dumps (mongoexport, postgres dump) and log files (Flume) land in Amazon S3.

Stack – Analysis and sharing

Hadoop on Elastic MapReduce processes the data in S3; Hive/Ruby/Mahout jobs feed MapReduce jobs and an analytics dashboard.

Users over time

“Who is using our service?”

Identified early mobile usage

Invested heavily in mobile development

Finding signal in the noise of logs

In January 2013, 9,432,061 unique mobile devices used the Yelp mobile app: 4 million+ calls and 5 million+ directions.

Autocomplete Search

Recommendations

Automatic spelling corrections

“What kind of movies do people like?”

More than 25 Million Streaming Members

50 Billion Events Per Day

30 Million plays every day

2 billion hours of video in 3 months

4 million ratings per day

3 million searches

Device location, time, day, week, etc.

Social data

10 TB of streaming data per day

Data consumed in multiple ways

[Diagram: Amazon S3 feeds EMR; a prod cluster (EMR) drives the recommendation engine, ad-hoc analysis, and personalization]

[Diagram: clickstream data from 500+ websites and a VoD platform moves from the corporate data center via AWS Import/Export into Amazon Simple Storage Service (S3), is processed with Amazon Elastic MapReduce, and is used by BI users]

“Who buys video games?”

Who is Razorfish?

• Full-service digital agency
• Developed an ad-serving platform compatible with most browsers
• Clickstream analysis of data, current and historical trends, and segmentation of users
• Segmentation is used to serve ads and cross-sell
• 45 TB of log data
• Problems at scale:
– Giant datasets
– Building infrastructure requires large, continuous investment
– Must build for the peak holiday season
– Traditional data stores are not scaling

Per day: 3.5 billion records, 13 TB of clickstream logs, 71 million unique cookies

Previously (2009) vs. today: this now happens in 8 hours, every day

Why AWS + EMR?

• Perfect clarity of cost
• No upfront infrastructure investment
• No client processing contention
• Without EMR/Hadoop it takes 3 days; with EMR, 8 hours
– Scalability: 1 node x 100 hours = 100 nodes x 1 hour
• Meet SLAs

Playfish improves the in-game experience for its users through data mining

Challenge: must understand player usage trends across 50M monthly users, multiple platforms, and tens of games, in the face of rapid growth. This drives both in-game improvements and defines what games to target next.

Solution: EMR gives Playfish the flexibility to experiment and rapidly ask new questions. All usage data is stored in S3, and analysts run ad-hoc Hive queries that can slice the data by time, game, and user.

Data-Driven Game Design

Data is used to understand what gamers are doing inside the game (behavioral analysis):
- What features people like (rely on data instead of forum posts)
- What features are abandoned
- A/B testing
- Monetization – in-game analytics

Building a big data architecture

Design Patterns

Generation

Collection & storage

Analytics & computation

Collaboration & sharing

Generation

Collection & storage

Analytics & computation

Collaboration & sharing

Getting your Data into AWS

Amazon S3

Corporate Data Center

• Console Upload

• FTP

• AWS Import Export

• S3 API

• Direct Connect

• Storage Gateway

• 3rd Party Commercial Apps

• Tsunami UDP
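
Of the routes above, the S3 API is the one you would typically script. A minimal sketch with the Python SDK (boto3) — the bucket, key, and local file name are placeholders, not values from the talk:

    import boto3

    s3 = boto3.client("s3")

    # Upload one compressed log file to S3 (bucket and key are hypothetical).
    s3.upload_file(
        Filename="/var/log/app/events-2013-01-15.log.gz",
        Bucket="myawsbucket",
        Key="logs/2013/01/15/events.log.gz",
    )

For large objects, boto3's transfer manager automatically splits the upload into parallel multipart chunks.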

1. Write directly to a data source

Your application (running on Amazon EC2) writes directly to Amazon S3, DynamoDB, or any other data store.
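
A minimal sketch of this first pattern, assuming hypothetical bucket, key, and table names: the application writes each event straight to S3 and to DynamoDB, with no intermediate queue.

    import json
    import boto3

    s3 = boto3.client("s3")
    dynamodb = boto3.resource("dynamodb")

    event = {"id": "event-0001", "userCookie": "ajhlasH6JASLHbas8", "searchPhrase": "digital cameras"}

    # Write the raw event straight to S3...
    s3.put_object(
        Bucket="myawsbucket",
        Key="events/raw/event-0001.json",
        Body=json.dumps(event).encode("utf-8"),
    )

    # ...and the same record straight to a DynamoDB table
    # (the "events" table, keyed on "id", is assumed to exist).
    dynamodb.Table("events").put_item(Item=event)

Simple, but every producer now depends directly on the data stores being available and fast.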

2. Queue, pre-process, and then write to a data source

The application sends events to Amazon Simple Queue Service (SQS); workers pre-process them and then write to Amazon S3, DynamoDB, or any other data store.
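
The producer side of this second pattern might look like the following sketch (the queue name is a placeholder); events are buffered in SQS, so the producers are decoupled from the data stores.

    import json
    import boto3

    sqs = boto3.resource("sqs")

    # Enqueue a raw event for later pre-processing (the queue is assumed to exist).
    queue = sqs.get_queue_by_name(QueueName="raw-events")
    queue.send_message(MessageBody=json.dumps(
        {"adId": "jalhdahu789asashja", "referrer": "http://recipes.com/"}
    ))

A worker process then drains the queue, pre-processes each message, and writes the result to the data store; a sketch of such a worker follows the next pattern.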

3. Agency customer: video analytics on AWS

[Diagram: an Elastic Load Balancer in front of edge servers on EC2; events flow through Amazon Simple Queue Service (SQS) to workers on EC2; logs land in Amazon Simple Storage Service (S3); Amazon Elastic MapReduce with an HDFS cluster produces the reports]
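
Continuing the SQS example above, one of the workers on EC2 in this kind of architecture might run a loop like the following sketch (queue and bucket names are placeholders):

    import json
    import boto3

    sqs = boto3.resource("sqs")
    s3 = boto3.client("s3")
    queue = sqs.get_queue_by_name(QueueName="raw-events")

    while True:
        # Long-poll for up to 10 messages at a time.
        for msg in queue.receive_messages(MaxNumberOfMessages=10, WaitTimeSeconds=20):
            event = json.loads(msg.body)
            event["processed"] = True  # stand-in for real pre-processing
            s3.put_object(
                Bucket="myawsbucket",
                Key="events/processed/{}.json".format(msg.message_id),
                Body=json.dumps(event).encode("utf-8"),
            )
            msg.delete()  # delete only after the result is safely stored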

4. Aggregate and write to a data source

Flume running on EC2 aggregates events and writes them to Amazon S3, HDFS, or any other data store.

What is Flume

• Collection, Aggregation of streaming Event Data

– Typically used for log data, sensor data, GPS data, etc.

• Significant advantages over ad-hoc solutions

– Reliable, Scalable, Manageable, Customizable and High Performance

– Declarative, Dynamic Configuration

– Contextual Routing

– Feature rich

– Fully extensible

Typical Aggregation Flow

[Client]+ → Agent → [Agent]* → Destination

Flume uses a multi-tier approach where multiple agents can send data to another agent that acts as an aggregator. For each agent, data can come from either an agent or a client, and can be sent on to another agent or to a sink.

Log aggregation tools can feed Amazon SQS, Amazon S3, DynamoDB, or any SQL or NoSQL store — choose depending upon your design.

Choice of storage systems (structure and volume)

Which store fits depends on how structured the data is (low to high) and how large it is (small to large): Amazon S3, RDS, DynamoDB, or NoSQL on EBS.

Generation

Collection & storage

Analytics & computation

Collaboration & sharing

Hadoop based Analysis

Log aggregation tools feed Amazon SQS, Amazon S3, DynamoDB, or any SQL or NoSQL store; from there the data is analyzed with Amazon EMR.

EMR is Hadoop in the Cloud

What is Amazon Elastic MapReduce (EMR)?

A framework for distributed computing: it splits data into pieces, lets processing occur across machines, and gathers the results.

[Chart: difficulty vs. number of machines — going from 1 machine to ~10^6 machines is where the difficulty explodes]

distributed computing is hard

distributed computing requires god-like engineers

Innovation #1:

Hadoop is… The MapReduce computational paradigm

Hadoop is… The MapReduce computational paradigm … implemented as an Open-source, Scalable, Fault-tolerant, Distributed System

Person   Start     End
Bob      00:44:48  00:45:11
Charlie  02:16:02  02:16:18
Charlie  11:16:59  11:17:17
Charlie  11:17:24  11:17:38
Bob      11:23:10  11:23:25
Alice    16:26:46  16:26:54
David    17:20:28  17:20:45
Alice    18:16:53  18:17:00
Charlie  19:33:44  19:33:59
Bob      21:13:32  21:13:43
David    22:36:22  22:36:34
Alice    23:42:01  23:42:11

map — works on one record at a time (here it computes "end time minus start time"), in parallel over all the records:

Person   Duration (seconds)
Bob      23
Charlie  16
Charlie  18
Charlie  14
Bob      15
Alice    8
David    17
Alice    7
Charlie  15
Bob      11
David    12
Alice    10

reduce — groups together common records (e.g. all of Alice's, all of Bob's) and adds up the results:

Person   Total
Alice    25
Bob      49
Charlie  63
David    29
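
The same computation written out as a tiny Python sketch of the map and reduce steps (assuming all times fall within a single day):

    from collections import defaultdict
    from datetime import datetime

    records = [
        ("Bob",     "00:44:48", "00:45:11"),
        ("Charlie", "02:16:02", "02:16:18"),
        ("Alice",   "16:26:46", "16:26:54"),
        ("David",   "17:20:28", "17:20:45"),
        # ... the remaining (Person, Start, End) rows
    ]

    def map_record(record):
        """Works on one record: end time minus start time, in seconds."""
        person, start, end = record
        fmt = "%H:%M:%S"
        return person, (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).seconds

    def reduce_durations(pairs):
        """Groups common keys (e.g. all of Alice's rows) and adds up the results."""
        totals = defaultdict(int)
        for person, duration in pairs:
            totals[person] += duration
        return dict(totals)

    print(reduce_durations(map(map_record, records)))
    # with the four rows above: {'Bob': 23, 'Charlie': 16, 'Alice': 8, 'David': 17}

Hadoop runs the map function in parallel across machines, and the shuffle/sort brings each person's records to the same reducer.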

Hadoop is… The MapReduce computational paradigm

Hadoop is… The MapReduce computational paradigm … implemented as an Open-source, Scalable, Fault-tolerant, Distributed System

distributed computing requires god-like engineers

distributed computing (with Hadoop) requires merely talented engineers, not god-like ones

Launch a Hadoop cluster from the CLI:

elastic-mapreduce --create --alive \
  --instance-type m1.xlarge \
  --num-instances 5
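
The elastic-mapreduce Ruby CLI shown above has since been retired; a roughly equivalent sketch with the Python SDK (boto3) — the release label, IAM roles, instance types, and log bucket are assumptions, not values from the talk:

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    response = emr.run_job_flow(
        Name="lunch-and-learn-cluster",
        ReleaseLabel="emr-6.15.0",                 # pick a current EMR release
        Applications=[{"Name": "Hadoop"}, {"Name": "Hive"}],
        Instances={
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 5,
            "KeepJobFlowAliveWhenNoSteps": True,   # the old --alive flag
        },
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
        LogUri="s3://my-bucket/emr-logs/",
    )
    print(response["JobFlowId"])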

The Hadoop Ecosystem

EMR makes it easy to use Hive and Pig

Pig:
• High-level programming language (Pig Latin)
• Supports UDFs
• Ideal for data flow/ETL

Hive:
• Data warehouse for Hadoop
• SQL-like query language (HiveQL)

R:
• Language and software environment for statistical computing and graphics
• Open source

EMR makes it easy to use other tools and applications

Mahout:
• Machine learning library
• Supports recommendation mining, clustering, classification, and frequent itemset mining

Hive: schema on read

Launch a Hive cluster from the CLI (step 1/1)

./elastic-mapreduce --create --alive \

--name "Test Hive" \

--hadoop-version 0.20 \

--num-instances 5 \

--instance-type m1.large \

--hive-interactive \

--hive-versions 0.7.1

SQL Interface for working with data

Simple way to use Hadoop

Create Table statement references data location on S3

Language called HiveQL, similar to SQL

An example of a query could be: SELECT COUNT(1) FROM sometable;

Requires setting up a mapping to the input data

Uses SerDes to make different input formats queryable

Powerful data types (Array, Map, ...)

                        SQL                     HiveQL
Updates                 UPDATE, INSERT, DELETE  INSERT OVERWRITE TABLE
Transactions            Supported               Not supported
Indexes                 Supported               Not supported
Latency                 Sub-second              Minutes
Functions               Hundreds                Dozens
Multi-table inserts     Not supported           Supported
Create table as select  Not valid SQL-92        Supported

./elastic-mapreduce --create \
  --name "Hive job flow" \
  --hive-script \
  --args s3://myawsbucket/myquery.q \
  --args -d,INPUT=s3://myawsbucket/input,-d,OUTPUT=s3://myawsbucket/output

HiveQL to execute
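
The same Hive script can also be submitted programmatically as a step on a running cluster. A sketch with boto3, reusing the bucket paths above — the cluster ID is a placeholder, and the hive-script/command-runner argument style follows the EMR step conventions:

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXXX",               # an already-running cluster
        Steps=[{
            "Name": "Hive job flow",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "hive-script", "--run-hive-script", "--args",
                    "-f", "s3://myawsbucket/myquery.q",
                    "-d", "INPUT=s3://myawsbucket/input",
                    "-d", "OUTPUT=s3://myawsbucket/output",
                ],
            },
        }],
    )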

./elastic-mapreduce --create --alive \
  --name "Hive job flow" \
  --num-instances 5 --instance-type m1.large \
  --hive-interactive

Interactive Hive session


{
  "requestBeginTime": "19191901901",
  "requestEndTime": "19089012890",
  "browserCookie": "xFHJK21AS6HLASLHAS",
  "userCookie": "ajhlasH6JASLHbas8",
  "searchPhrase": "digital cameras",
  "adId": "jalhdahu789asashja",
  "impressionId": "hjakhlasuhiouasd897asdh",
  "referrer": "http://cooking.com/recipe?id=10231",
  "hostname": "ec2-12-12-12-12.ec2.amazonaws.com",
  "modelId": "asdjhklasd7812hjkasdhl",
  "processId": "12901",
  "threadId": "112121",
  "timers": { "requestTime": "1910121", "modelLookup": "1129101" },
  "counters": { "heapSpace": "1010120912012" }
}


{
  "requestBeginTime": "19191901901",
  "requestEndTime": "19089012890",
  "browserCookie": "xFHJK21AS6HLASLHAS",
  "userCookie": "ajhlasH6JASLHbas8",
  "adId": "jalhdahu789asashja",
  "impressionId": "hjakhlasuhiouasd897asdh",
  "clickId": "ashda8ah8asdp1uahipsd",
  "referrer": "http://recipes.com/",
  "directedTo": "http://cooking.com/"
}

CREATE EXTERNAL TABLE impressions (
  requestBeginTime string,
  adId string,
  impressionId string,
  referrer string,
  userAgent string,
  userCookie string,
  ip string
)
PARTITIONED BY (dt string)
ROW FORMAT
  serde 'com.amazon.elasticmapreduce.JsonSerde'
  with serdeproperties ( 'paths'='requestBeginTime, adId, impressionId, referrer, userAgent, userCookie, ip' )
LOCATION 's3://mybucketsource/tables/impressions';

The table structure to create — this happens fast, as it is just a mapping to the source (the same CREATE EXTERNAL TABLE statement as above).

The LOCATION clause in that same statement points at the source data in S3.

Hadoop lowers the cost of developing a distributed system.

hive> select * from impressions limit 5;

Selecting from source data directly via Hadoop

What about the cost of operating a distributed system?

[Chart: November traffic at amazon.com, shown across three builds, highlighting a 76% / 24% split]

Innovation #2:

EMR is Hadoop in the Cloud

What is Amazon Elastic MapReduce (EMR)?

1 instance x 100 hours = 100 instances x 1 hour

How does EMR work?

1. Put the data into S3.
2. Choose: Hadoop distribution, number of nodes, types of nodes, custom configs, Hive/Pig/etc.
3. Launch the cluster using the EMR console, CLI, SDK, or APIs.
4. Get the output from S3 (you can also store everything in HDFS).

What can you run on EMR…

Resize nodes: you can easily add and remove nodes in a running EMR cluster.
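
Programmatically, a resize is a single API call. A boto3 sketch, assuming a hypothetical running cluster whose CORE instance group you want to grow to 10 nodes:

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")
    cluster_id = "j-XXXXXXXXXXXXX"   # placeholder cluster id

    # Find the CORE instance group, then change its size.
    groups = emr.list_instance_groups(ClusterId=cluster_id)["InstanceGroups"]
    core = next(g for g in groups if g["InstanceGroupType"] == "CORE")

    emr.modify_instance_groups(
        ClusterId=cluster_id,
        InstanceGroups=[{"InstanceGroupId": core["Id"], "InstanceCount": 10}],
    )

Growing a cluster (or adding TASK nodes) is quick; shrinking CORE nodes is slower because HDFS data has to move off the departing nodes first.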

Workload patterns: on and off, fast growth, predictable peaks, variable peaks — sizing fixed infrastructure for the peak means waste the rest of the time.

Your choice of tools on Hadoop/EMR

Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, and log aggregation tools all feed Amazon EMR.

SQL based processing

Amazon SQS, Amazon S3, DynamoDB, or any SQL or NoSQL store (fed by log aggregation tools) → Amazon EMR as a pre-processing framework → Amazon Redshift, a petabyte-scale columnar data warehouse.

Massively Parallel Columnar Datawarehouses

• Columnar Data stores

• MPP

– Parallel Ingest

– Parallel Query

– Scale Out

– Parallel Backup

Columnar data stores

• Data alignment and block size in row stores vs. column stores

• Compression based on each column

An MPP data warehouse parallelizes and distributes everything:
• Query
• Load
• Backup
• Restore
• Resize

[Diagram: leader and compute nodes connected over 10 GigE (HPC); ingestion, backup, restore, and JDBC/ODBC access]
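
In practice, parallel ingestion usually means connecting over the standard PostgreSQL protocol and issuing a COPY from S3, which the compute nodes load in parallel. A sketch using psycopg2 — the cluster endpoint, credentials, table, bucket, and IAM role are all placeholders:

    import psycopg2

    conn = psycopg2.connect(
        host="my-cluster.abc123xyz.us-east-1.redshift.amazonaws.com",
        port=5439,
        dbname="analytics",
        user="admin",
        password="...",
    )
    conn.autocommit = True
    cur = conn.cursor()

    # COPY pulls the files from S3 in parallel across all compute nodes
    # (the impressions table is assumed to have been created already).
    cur.execute("""
        COPY impressions
        FROM 's3://mybucketsource/tables/impressions/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        JSON 'auto';
    """)

    cur.execute("SELECT COUNT(*) FROM impressions;")
    print(cur.fetchone()[0])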

But traditional data warehouses are:
• Hard to manage
• Very expensive
• Difficult to scale
• Difficult to get performance from

Amazon Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the AWS cloud

Amazon Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the AWS cloud

Parallelize and distribute everything (MPP): load, query, resize, backup, restore

Amazon Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the AWS cloud

Parallelize and distribute everything (MPP): load, query, resize, backup, restore

Dramatically reduce I/O: direct-attached storage, large data block sizes, column data store, data compression, zone maps

Amazon Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the AWS cloud

Protect operations: Redshift data is encrypted, continuously backed up to S3, with automatic node recovery and transparent handling of disk failures

Amazon Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the AWS cloud

Protect operations: Redshift data is encrypted, continuously backed up to S3, with automatic node recovery and transparent handling of disk failures

Simplify provisioning: create a cluster in minutes, automatic OS and software patching, scale up to 1.6 PB with a few clicks and no downtime

Amazon Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the AWS cloud

Start Small and Grow Big

Extra Large Node (XL)

3 spindles, 2TB, 15GiB RAM

2 virtual cores, 10GigE

1 node (2TB) 2-32 node cluster (64TB)

8 Extra Large Node (8XL)

24 spindles, 16TB, 120GiB RAM

16 virtual cores, 10GigE

2-100 node cluster (1.6PB)

Amazon Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the AWS cloud

Easy to provision and scale

No upfront costs, pay as you go

High performance at a low price

Open and flexible with support for popular BI tools

Amazon Redshift is priced to let you analyze all your data

                     Price per hour,        Effective hourly   Effective annual
                     HS1.XL single node     price per TB       price per TB
On-Demand            $0.850                 $0.425             $3,723
1-Year Reservation   $0.500                 $0.250             $2,190
3-Year Reservation   $0.228                 $0.114             $999

Simple Pricing

Number of Nodes x Cost per Hour

No charge for Leader Node

No upfront costs

Pay as you go

Your choice of BI Tools on the cloud

The same pipeline — Amazon SQS, Amazon S3, DynamoDB, or any SQL or NoSQL store fed by log aggregation tools, Amazon EMR as a pre-processing framework, and Amazon Redshift — then feeds your choice of BI tools.

Generation

Collection & storage

Analytics & computation

Collaboration & sharing

Collaboration and Sharing insights

Amazon SQS, Amazon S3, DynamoDB, or any SQL or NoSQL store (via log aggregation tools) → Amazon EMR → Amazon Redshift.

Sharing results and visualizations

Results flow out of Amazon EMR and Amazon Redshift to a web app server and visualization tools; the upstream sources (Amazon SQS, Amazon S3, DynamoDB, any SQL or NoSQL store, log aggregation tools) stay the same.

Sharing results and visualizations at scale

The same pipeline can feed a fleet of web app servers and visualization tools as demand grows.

Sharing results and visualizations

Amazon EMR and Amazon Redshift can also feed business intelligence tools directly.

Geospatial Visualizations

GIS tools on Hadoop, standalone GIS tools, and visualization tools sit alongside the business intelligence tools on top of Amazon EMR and Amazon Redshift.

Rinse and repeat every day or hour

Amazon Data Pipeline drives the whole flow — log aggregation into Amazon SQS, Amazon S3, DynamoDB, or any SQL or NoSQL store; Amazon EMR and Amazon Redshift for processing; visualization, business intelligence, and GIS tools on top — so it can be rerun on a schedule.

The complete architecture

Log aggregation tools → Amazon SQS, Amazon S3, DynamoDB, or any SQL or NoSQL store → Amazon EMR and Amazon Redshift → visualization tools, business intelligence tools, and GIS tools (including GIS tools on Hadoop), orchestrated by Amazon Data Pipeline.

How do you start?

Where do you start?

• Where is your data? (S3, SQL, NoSQL?)
– Are you collecting all your data?
– What is the format (structured or unstructured)?
– How much is this data going to grow?
• How do you want to process it?
– SQL (Hive), scripts (Python/Ruby/Node.js) on Hadoop?
• How do you want to use this data?
– Visualization tools
• Do it yourself, or engage an AWS partner
• Write to me: sinhaar@amazon.com

Thank You

sinhaar@amazon.com