2014 Import.io Data Summit - Including Hadoop/Impala Getting Started Demo

@IanMmmm

LARGE SCALE DATA ANALYSIS WITH AWS

Ian Massingham – Technical Evangelist

THE MORE DATA YOU COLLECT THE MORE VALUE YOU CAN DERIVE FROM IT!

THE COST OF DATA GENERATION IS FALLING!

We are constantly producing more data

From all types of industries

GENERATE ➔ STORE ➔ ANALYZE ➔ SHARE!


Lower cost, higher throughput


Lower cost, higher throughput

Highly constrained

+ ELASTIC AND HIGHLY SCALABLE
+ NO UPFRONT CAPITAL EXPENSE
+ ONLY PAY FOR WHAT YOU USE
+ AVAILABLE ON-DEMAND
= REMOVE CONSTRAINTS


AWS Import / Export AWS Direct Connect

•  Inbound data transfer is free
•  Multipart upload to S3
•  Physical media
•  AWS Direct Connect
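Multipart upload splits a large object into parts that can be sent in parallel and retried one at a time. A minimal sketch of how the byte ranges for those parts might be planned is shown here. The helper `plan_parts` and its 8 MiB default are illustrative assumptions and not part of any AWS SDK; the 5 MiB minimum part size and the 10,000-part ceiling are the limits S3 documents for multipart uploads.

```python
MIN_PART_SIZE = 5 * 1024 * 1024   # S3 minimum part size (every part except the last)
MAX_PARTS = 10_000                # S3 maximum number of parts in one upload


def plan_parts(total_size, part_size=8 * 1024 * 1024):
    """Return (part_number, start_byte, end_byte_exclusive) tuples.

    Hypothetical helper for illustration only; a real upload would hand
    each byte range to an SDK call.
    """
    if total_size <= 0:
        raise ValueError("total_size must be positive")
    if part_size < MIN_PART_SIZE:
        raise ValueError("part_size is below the S3 minimum of 5 MiB")
    # Double the part size until the object fits within the part limit.
    while -(-total_size // part_size) > MAX_PARTS:
        part_size *= 2
    parts = []
    start = 0
    number = 1
    while start < total_size:
        end = min(start + part_size, total_size)
        parts.append((number, start, end))
        start = end
        number += 1
    return parts
```

For a 20 MiB object this yields two 8 MiB parts followed by one 4 MiB part; only the final part is allowed to be smaller than the minimum.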


Amazon S3, Amazon Glacier, Amazon DynamoDB, Amazon RDS, Amazon Redshift, AWS Storage Gateway, Data on Amazon EC2


Amazon EC2, Amazon Elastic MapReduce

AMAZON ELASTIC MAPREDUCE

HADOOP AS A SERVICE!

•  SPLITS DATA INTO PIECES
•  LETS PROCESSING OCCUR
•  GATHERS THE RESULTS!
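The three steps above (split, process, gather) can be sketched in a few lines of Python. This is only an in-process illustration of the pattern, assuming a thread pool stands in for the cluster's worker nodes; it is not how Hadoop or EMR is implemented, and the function names and sample records are made up for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split(records, pieces):
    """Split the input into roughly equal pieces, one per worker."""
    return [records[i::pieces] for i in range(pieces)]


def process_piece(piece):
    """Process one piece independently: count records per status."""
    return Counter(status for _flight, status in piece)


def gather(partials):
    """Combine the partial results into a single answer."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def count_by_status(records, pieces=3):
    with ThreadPoolExecutor(max_workers=pieces) as pool:
        partials = list(pool.map(process_piece, split(records, pieces)))
    return gather(partials)


# Made-up (flight number, status) records, for illustration only.
SAMPLE = [
    ("XX001", "Landed"), ("XX002", "Delayed"), ("XX003", "Landed"),
    ("XX004", "Cancelled"), ("XX005", "Landed"), ("XX006", "Delayed"),
]

result = count_by_status(SAMPLE)
```

Because each piece is processed without reference to the others, adding more workers (or more nodes, in a real cluster) shortens the processing step without changing the gathered result.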

HDFS

EMR Kinesis

S3 DynamoDB

Data management

Pig

Analytics languages/engines

RDS

Redshift AWS Data Pipeline

EMR + IMPALA DEMO

STARTING AN EMR CLUSTER WITH HADOOP ECOSYSTEM TOOLS PRE-INSTALLED

COPY & LOAD OUR DATASET

$ scp -i EMRKeyPair.pem ~/aws/hadoop/LHRarrivals*.csv hadoop@ec2-54-76-242-238.eu-west-1.compute.amazonaws.com:
$ ssh -i EMRKeyPair.pem hadoop@ec2-54-76-242-238.eu-west-1.compute.amazonaws.com
$ hadoop fs -mkdir /data/
$ hadoop fs -put <uploaded_files> /data/
$ hadoop fs -ls -h -R /data/

or at scale, Distributed Copy using S3DistCp to parallel load from S3

$ . /home/hadoop/impala/conf/impala.conf
$ hadoop jar /home/hadoop/lib/emr-s3distcp-1.0.jar -Dmapreduce.job.reduces=30 --src s3://s3bucketname/ --dest hdfs://$HADOOP_NAMENODE_HOST:$HADOOP_NAMENODE_PORT/data/ --outputCodec 'none'

** Run on a cluster master node

CREATE EXTERNAL TABLE

$ #check the size of our data set
$ wc -l LHRarrivals*.csv
 850 LHRarrivals2.csv
1526 LHRarrivals.csv
2376 total

$ impala-shell
Welcome to the Impala shell.
> create EXTERNAL TABLE flights (
    input STRING,
    id BIGINT,
    widget STRING,
    source STRING,
    resultnum BIGINT,
    pageurl STRING,
    scheduled STRING,
    flightnumber STRING,
    airport STRING,
    status STRING,
    terminal STRING
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION '/data/';
> select count (*) from flights;

Should return count(*) 2376, reflecting the size of the data set
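To make the relationship between the files and the table concrete, here is a small Python sketch that reads comma-delimited text using the same column names as the `flights` table and counts the rows, which is what `select count(*)` reports. The two sample rows are invented for illustration and are not taken from the LHR arrivals files; `read_flights` and `count_rows` are hypothetical helper names.

```python
import csv
import io

# Column names copied from the CREATE EXTERNAL TABLE statement.
COLUMNS = [
    "input", "id", "widget", "source", "resultnum", "pageurl",
    "scheduled", "flightnumber", "airport", "status", "terminal",
]


def read_flights(text):
    """Parse comma-delimited text into dicts keyed by the table's columns."""
    reader = csv.reader(io.StringIO(text))
    return [dict(zip(COLUMNS, row)) for row in reader if row]


def count_rows(text):
    """Local stand-in for `select count(*) from flights`."""
    return len(read_flights(text))


# Two made-up rows for illustration; not from the real data set.
SAMPLE = (
    "in1,1,w1,src,1,http://example.com/a,10:00,XX001,AAA,Landed,T1\n"
    "in2,2,w1,src,2,http://example.com/a,10:05,XX002,BBB,Delayed,T2\n"
)

print(count_rows(SAMPLE))
```

An external table is only a schema laid over files that already exist under `/data/`, so every line in those files becomes one row. That is why the row count matches the `wc -l` total.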

DEMO OF ODBC ACCESS

Doing this part on Amazon WorkSpaces using the Simba Cloudera Impala ODBC Driver.

Set up an SSH tunnel to the master node so that the WorkSpaces desktop can connect to the Impala ODBC port (25010 here; Impala's default port for ODBC and JDBC clients is 21050).

A previously configured system DSN allows us to work with the data from our EMR/Impala cluster directly within Microsoft Excel.
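The SSH tunnel used in this part of the demo can be expressed as a local port forward. The sketch below builds the `ssh` argument list in Python; the key file and host name are the ones used earlier in the demo, the port is the one quoted above, and the helper name `tunnel_command` is an assumption made for this example. The `-i`, `-N` and `-L` options are standard OpenSSH client flags.

```python
import shlex


def tunnel_command(key_file, master_host, port, user="hadoop"):
    """Build an ssh argument list that forwards a local port to the master node.

    -i selects the identity (key) file, -N skips running a remote command,
    and -L local_port:host:remote_port forwards the local port over SSH.
    """
    return [
        "ssh", "-i", key_file, "-N",
        "-L", f"{port}:localhost:{port}",
        f"{user}@{master_host}",
    ]


cmd = tunnel_command(
    "EMRKeyPair.pem",
    "ec2-54-76-242-238.eu-west-1.compute.amazonaws.com",
    25010,  # port quoted in the demo; adjust to the port your Impala daemon listens on
)
print(shlex.join(cmd))
```

With the tunnel running, an ODBC DSN on the desktop would point at `localhost` and the forwarded port rather than at the cluster directly.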


Amazon S3, Amazon DynamoDB, Amazon RDS, Amazon Redshift, Data on Amazon EC2


BATCH PROCESSING

GENERATE ➔ ➔ SHARE!

STREAM PROCESSING

AMAZON KINESIS

REAL-TIME DATA STREAM PROCESSING!

Real-time response to content in semi-structured data streams

Relatively simple computations on data (aggregates, filters, sliding window, etc.)
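The kinds of computation listed above are simple enough to sketch directly. The following Python example shows a filter and a sliding-window average over timestamped events. It is a generic illustration that does not use the Kinesis API; the class and function names are made up, and it assumes events arrive in timestamp order.

```python
from collections import deque


class SlidingWindowAverage:
    """Running average over the events seen in the last `window_seconds`."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()   # (timestamp, value) pairs, oldest first
        self.total = 0.0

    def add(self, timestamp, value):
        """Record one event and return the average over the current window.

        Assumes timestamps are non-decreasing.
        """
        self.events.append((timestamp, value))
        self.total += value
        # Evict events that have fallen out of the window.
        while self.events[0][0] <= timestamp - self.window_seconds:
            _, old_value = self.events.popleft()
            self.total -= old_value
        return self.total / len(self.events)


def errors_only(records):
    """Filter step: keep only records whose status code is 500 or above."""
    return [r for r in records if r["status"] >= 500]
```

Each event is added once and evicted once, so the cost per event stays constant however long the stream runs. That is what makes this style of computation practical in real time.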

Hourly server logs: how your systems went wrong an hour ago

Weekly / monthly bill: what you spent this past billing cycle

Daily customer report from your website: tells you what deal or ad to try next time

Daily fraud reports: tells you if there was fraud yesterday

Daily business reports: tells you how customers used AWS services yesterday

Real-time metrics: what just went wrong now

Real-time spending alerts/caps: guaranteeing you can’t overspend

Real-time analysis: what to offer the current customer now

Real-time detection: blocks fraudulent use now

Fast ETL into Amazon Redshift: how are customers using services now




Data on Amazon EC2

Amazon EC2, Amazon Elastic MapReduce

Amazon S3, Amazon Glacier, Amazon DynamoDB, Amazon RDS, Amazon Redshift, AWS Storage Gateway, Data on Amazon EC2

AWS Import / Export, AWS Direct Connect


STREAM PROCESSING


STREAM PROCESSING



Data on Amazon EC2

Amazon Kinesis, Stream Processing on Amazon EC2

WANT TO KNOW MORE?

aws.amazon.com/solutions/case-studies/big-data/
