2014 Import.io Data Summit - Including Hadoop/Impala Getting Started Demo

LARGE SCALE DATA ANALYSIS WITH AWS
Ian Massingham – Technical Evangelist, @IanMmmm

THE MORE DATA YOU COLLECT, THE MORE VALUE YOU CAN DERIVE FROM IT!

THE COST OF DATA GENERATION IS FALLING!

We are constantly producing more data

From all types of industries

GENERATE ➔ STORE ➔ ANALYZE ➔ SHARE!

Lower cost, higher throughput

Highly constrained

+ ELASTIC AND HIGHLY SCALABLE + NO UPFRONT CAPITAL EXPENSE + ONLY PAY FOR WHAT YOU USE + AVAILABLE ON-DEMAND = REMOVE CONSTRAINTS

GENERATE ➔ STORE ➔ ANALYZE ➔ SHARE!

AWS Import/Export, AWS Direct Connect

•  Inbound data transfer is free
•  Multipart upload to S3
•  Physical media
•  AWS Direct Connect

GENERATE ➔ STORE ➔ ANALYZE ➔ SHARE!

Amazon S3, Amazon Glacier, Amazon DynamoDB, Amazon RDS, Amazon Redshift, AWS Storage Gateway, Data on Amazon EC2

GENERATE ➔ STORE ➔ ANALYZE ➔ SHARE!

Amazon EC2, Amazon Elastic MapReduce

AMAZON ELASTIC MAPREDUCE

HADOOP AS A SERVICE!

•  SPLITS DATA INTO PIECES
•  LETS PROCESSING OCCUR
•  GATHERS THE RESULTS
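
The three steps above can be sketched in plain Python. This is a single-process illustration of the split / process / gather pattern, not how Hadoop or Amazon EMR is implemented. The function names and the sample records are made up for the example.

```python
from collections import Counter
from functools import reduce

def split_into_pieces(records, n_pieces):
    """Split the input into roughly equal pieces (one per hypothetical worker)."""
    return [records[i::n_pieces] for i in range(n_pieces)]

def process_piece(piece):
    """The 'map' side: count the records in one piece by status."""
    return Counter(status for _flight, status in piece)

def gather(partials):
    """The 'reduce' side: merge the per-piece counts into one result."""
    return reduce(lambda a, b: a + b, partials, Counter())

# Made-up (flight, status) records for illustration only.
records = [
    ("AA1", "LANDED"), ("BB2", "DELAYED"), ("CC3", "LANDED"),
    ("DD4", "CANCELLED"), ("EE5", "LANDED"), ("FF6", "DELAYED"),
]

pieces = split_into_pieces(records, n_pieces=3)
partials = [process_piece(p) for p in pieces]  # on a cluster these would run in parallel
totals = gather(partials)
print(dict(totals))
```

Each piece is processed independently, so adding workers only changes `n_pieces`; the gathered totals stay the same.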

[Diagram: HDFS, EMR, Kinesis, S3, DynamoDB, RDS, Redshift, AWS Data Pipeline, and Pig, grouped under "Data management" and "Analytics languages/engines"]

EMR + IMPALA DEMO

STARTING AN EMR CLUSTER WITH HADOOP ECOSYSTEM TOOLS PRE-INSTALLED

COPY & LOAD OUR DATASET

$ scp -i EMRKeyPair.pem ~/aws/hadoop/LHRarrivals*.csv hadoop@ec2-54-76-242-238.eu-west-1.compute.amazonaws.com:
$ ssh -i EMRKeyPair.pem hadoop@ec2-54-76-242-238.eu-west-1.compute.amazonaws.com
$ hadoop fs -mkdir /data/
$ hadoop fs -put <uploaded_files> /data/
$ hadoop fs -ls -h -R /data/

Or, at scale, use distributed copy with S3DistCp to load from S3 in parallel:

$ . /home/hadoop/impala/conf/impala.conf
$ hadoop jar /home/hadoop/lib/emr-s3distcp-1.0.jar -Dmapreduce.job.reduces=30 --src s3://s3bucketname/ --dest hdfs://$HADOOP_NAMENODE_HOST:$HADOOP_NAMENODE_PORT/data/ --outputCodec 'none'

** Run on a cluster master node

CREATE EXTERNAL TABLE

$ # check the size of our data set
$ wc -l LHRarrivals*.csv
   850 LHRarrivals2.csv
  1526 LHRarrivals.csv
  2376 total

$ impala-shell
Welcome to the Impala shell.
> create EXTERNAL TABLE flights (
    input STRING,
    id BIGINT,
    widget STRING,
    source STRING,
    resultnum BIGINT,
    pageurl STRING,
    scheduled STRING,
    flightnumber STRING,
    airport STRING,
    status STRING,
    terminal STRING
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION '/data/';
> select count(*) from flights;

Should return count(*) 2376, reflecting the size of the data set.
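
As a rough illustration of what the comma-delimited table definition above implies, the Python sketch below maps one line of text onto the eleven columns by position. It is only an approximation of how a delimited text table is read, not Impala's implementation. The sample row is invented for the example and is not taken from the LHR arrivals files.

```python
# Column names and types copied from the CREATE EXTERNAL TABLE statement
# (STRING -> str, BIGINT -> int).
FLIGHTS_SCHEMA = [
    ("input", str), ("id", int), ("widget", str), ("source", str),
    ("resultnum", int), ("pageurl", str), ("scheduled", str),
    ("flightnumber", str), ("airport", str), ("status", str),
    ("terminal", str),
]

def parse_row(line):
    """Split one comma-delimited line and map its fields onto the schema by position."""
    fields = line.rstrip("\n").split(",")
    if len(fields) != len(FLIGHTS_SCHEMA):
        raise ValueError(f"expected {len(FLIGHTS_SCHEMA)} fields, got {len(fields)}")
    return {name: cast(value) for (name, cast), value in zip(FLIGHTS_SCHEMA, fields)}

# Invented sample row for illustration only.
sample = "in1,42,w1,src,7,http://example.com/arrivals,10:05,XX123,EXAMPLE,LANDED,5"
row = parse_row(sample)
print(row["flightnumber"], row["status"], row["id"])
```

Because fields are matched purely by position, a stray comma inside a value would shift every later column, which is worth checking in the CSV files before loading them.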

DEMO OF ODBC ACCESS

This part runs on Amazon WorkSpaces, using the Simba Cloudera Impala ODBC Driver.

Set up an SSH tunnel to the master node so that we can connect to port 25010 from the WorkSpaces desktop and reach the Impala ODBC port.

A previously configured system DSN lets us work with the data from our EMR/Impala cluster directly within Microsoft Excel.

GENERATE ➔ STORE ➔ ANALYZE ➔ SHARE!

Amazon S3, Amazon DynamoDB, Amazon RDS, Amazon Redshift, Data on Amazon EC2

GENERATE ➔ STORE ➔ ANALYZE ➔ SHARE!

BATCH PROCESSING

GENERATE ➔ STREAM PROCESSING ➔ SHARE!

AMAZON KINESIS

REAL-TIME DATA STREAM PROCESSING!

Real-time response to content in semi-structured data streams

Relatively simple computations on data (aggregates, filters, sliding window, etc.)
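
To make "aggregates, filters, sliding window" concrete, here is a minimal Python sketch of those three computations over an in-memory stream. It does not use the Amazon Kinesis API. The event fields, the filter threshold, and the window size are assumptions chosen for the example.

```python
from collections import deque

def sliding_window_average(events, window_size, min_value=0.0):
    """Yield (event, average of the last `window_size` kept values).

    Events whose value is below `min_value` are filtered out first.
    """
    window = deque(maxlen=window_size)  # the oldest value drops out automatically
    for event in events:
        if event["value"] < min_value:  # filter
            continue
        window.append(event["value"])
        yield event, sum(window) / len(window)  # aggregate over the window

# Made-up readings standing in for a data stream; -1.0 plays the role of a bad record.
stream = [{"id": i, "value": v} for i, v in enumerate([10.0, 20.0, -1.0, 30.0, 40.0])]

averages = [avg for _event, avg in sliding_window_average(stream, window_size=3)]
print(averages)
```

The generator handles one event at a time and keeps only the current window in memory, which is the property that lets this style of computation keep up with an unbounded stream.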

Hourly server logs: how your systems went wrong an hour ago

Weekly / monthly bill: what you spent this past billing cycle

Daily customer report from your website: tells you what deal or ad to try next time

Daily fraud reports: tells you if there was fraud yesterday

Daily business reports: tells me how customers used AWS services yesterday

Real-time metrics: what just went wrong now

Real-time spending alerts/caps: guaranteeing you can’t overspend

Real-time analysis: what to offer the current customer now

Real-time detection: blocks fraudulent use now

Fast ETL into Amazon Redshift: how are customers using services now

GENERATE ➔ STORE ➔ ANALYZE ➔ SHARE!

Amazon S3, Amazon DynamoDB, Amazon RDS, Amazon Redshift, Data on Amazon EC2

Amazon EC2, Amazon Elastic MapReduce

Amazon S3, Amazon Glacier, Amazon DynamoDB, Amazon RDS, Amazon Redshift, AWS Storage Gateway, Data on Amazon EC2

AWS Import/Export, AWS Direct Connect

GENERATE ➔ STREAM PROCESSING ➔ SHARE!

Amazon S3, Amazon DynamoDB, Amazon RDS, Amazon Redshift, Data on Amazon EC2

Amazon Kinesis, Stream Processing on Amazon EC2

WANT TO KNOW MORE?

aws.amazon.com/solutions/case-studies/big-data/
