2014 Import.io Data Summit - Including Hadoop/Impala Getting Started Demo
LARGE SCALE DATA ANALYSIS WITH AWS
Ian Massingham – Technical Evangelist
@IanMmmm
THE MORE DATA YOU COLLECT THE MORE VALUE YOU CAN DERIVE FROM IT!
THE COST OF DATA GENERATION IS FALLING!
We are constantly producing more data
From all types of industries
GENERATE ➔ STORE ➔ ANALYZE ➔ SHARE!
Lower cost, higher throughput
Highly constrained
+ ELASTIC AND HIGHLY SCALABLE
+ NO UPFRONT CAPITAL EXPENSE
+ ONLY PAY FOR WHAT YOU USE
+ AVAILABLE ON-DEMAND
= REMOVE CONSTRAINTS
GENERATE ➔ STORE ➔ ANALYZE ➔ SHARE!
AWS Import / Export, AWS Direct Connect
• Inbound data transfer is free
• Multipart upload to S3
• Physical media
• AWS Direct Connect
GENERATE ➔ STORE ➔ ANALYZE ➔ SHARE!
Amazon S3, Amazon Glacier,
Amazon DynamoDB, Amazon RDS,
Amazon Redshift, AWS Storage Gateway, Data on Amazon EC2
GENERATE ➔ STORE ➔ ANALYZE ➔ SHARE!
Amazon EC2, Amazon Elastic MapReduce
AMAZON ELASTIC MAPREDUCE
HADOOP AS A SERVICE!
• SPLITS DATA INTO PIECES
• LETS PROCESSING OCCUR
• GATHERS THE RESULTS!
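The three steps on this slide can be sketched in a few lines of plain Python. This is only an illustration of the split / process / gather idea on one machine, not how EMR or Hadoop is implemented; the function names and the sample status values are made up for the example.

```python
from collections import Counter
from functools import reduce


def split(records, pieces):
    """Split the input into `pieces` roughly equal parts (one per worker)."""
    return [records[i::pieces] for i in range(pieces)]


def process(piece):
    """The 'map' step: each piece is counted independently of the others."""
    return Counter(piece)


def gather(partials):
    """The 'reduce' step: merge the partial counts into a single result."""
    return reduce(lambda left, right: left + right, partials, Counter())


def count_values(records, pieces=3):
    """Run split -> process -> gather over a list of values."""
    return gather(process(piece) for piece in split(records, pieces))


if __name__ == "__main__":
    # Hypothetical flight status values, used only to have something to count.
    statuses = ["LANDED", "EXPECTED", "LANDED", "CANCELLED", "LANDED"]
    print(count_values(statuses))
```

On a real cluster each piece would be processed on a different node; here the pieces are handled one after another so the sketch stays runnable anywhere.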
[Diagram: EMR and HDFS shown with Kinesis, S3, DynamoDB, RDS, Redshift and AWS Data Pipeline. Group labels: "Data management"; "Analytics languages/engines" (Pig).]
EMR + IMPALA DEMO
STARTING AN EMR CLUSTER WITH HADOOP ECOSYSTEM TOOLS PRE-INSTALLED
COPY & LOAD OUR DATASET
$ scp -i EMRKeyPair.pem ~/aws/hadoop/LHRarrivals*.csv hadoop@ec2-54-76-242-238.eu-west-1.compute.amazonaws.com:
$ ssh -i EMRKeyPair.pem hadoop@ec2-54-76-242-238.eu-west-1.compute.amazonaws.com
$ hadoop fs -mkdir /data/
$ hadoop fs -put <uploaded_files> /data/
$ hadoop fs -ls -h -R /data/
or at scale, Distributed Copy using S3DistCp to parallel load from S3
$ . /home/hadoop/impala/conf/impala.conf
$ hadoop jar /home/hadoop/lib/emr-s3distcp-1.0.jar -Dmapreduce.job.reduces=30 --src s3://s3bucketname/ --dest hdfs://$HADOOP_NAMENODE_HOST:$HADOOP_NAMENODE_PORT/data/ --outputCodec 'none'
** Run on a cluster master node
CREATE EXTERNAL TABLE
$ # check the size of our data set
$ wc -l LHRarrivals*.csv
   850 LHRarrivals2.csv
  1526 LHRarrivals.csv
  2376 total
$ impala-shell
Welcome to the Impala shell.
> create EXTERNAL TABLE flights (
    input STRING,
    id BIGINT,
    widget STRING,
    source STRING,
    resultnum BIGINT,
    pageurl STRING,
    scheduled STRING,
    flightnumber STRING,
    airport STRING,
    status STRING,
    terminal STRING
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION '/data/';
> select count(*) from flights;
Should return count(*) 2376, reflecting the size of the data set
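The wc check works because the table is comma-delimited text, so each line in the files becomes one row. A slightly stricter pre-flight check, sketched below in Python, also counts lines whose number of comma-separated fields differs from the 11 columns in the table definition. The column names are copied from the CREATE EXTERNAL TABLE statement; the function name and the idea of running this check before loading are assumptions for illustration, not part of the original demo.

```python
import glob

# Column names copied from the CREATE EXTERNAL TABLE statement.
COLUMNS = ["input", "id", "widget", "source", "resultnum", "pageurl",
           "scheduled", "flightnumber", "airport", "status", "terminal"]


def summarise(pattern):
    """Return (total lines, lines whose comma-split width is not len(COLUMNS))."""
    total = 0
    mismatched = 0
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                total += 1
                # A plain split mirrors FIELDS TERMINATED BY ',' with no quoting.
                if len(line.rstrip("\n").split(",")) != len(COLUMNS):
                    mismatched += 1
    return total, mismatched


if __name__ == "__main__":
    total, mismatched = summarise("LHRarrivals*.csv")
    print(f"{total} lines, {mismatched} with an unexpected field count")
```

A non-zero second number would suggest that some values contain commas of their own, in which case columns in those rows would probably not line up as intended once queried.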
DEMO OF ODBC ACCESS
Doing this part on Amazon WorkSpaces using the Simba Cloudera Impala ODBC Driver.
Set up an SSH tunnel to the master node so that the WorkSpaces desktop can connect to the Impala ODBC port (port 25010).
A previously configured system DSN allows us to work with the data from our EMR/Impala cluster directly within Microsoft Excel.
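The slides do not show the tunnel command itself. One plausible form, sketched here in Python, builds a standard OpenSSH local port forward (-L) with no remote command (-N). The key file, user and host are reused from the COPY & LOAD slide and the port is the one named above; the helper name and the choice to forward the same port number on both ends are assumptions.

```python
import shlex


def tunnel_command(key_file, master_host, port, user="hadoop"):
    """Build an ssh command forwarding localhost:port to the same port on the master."""
    return [
        "ssh",
        "-i", key_file,                     # private key for the cluster
        "-N",                               # no remote command, forwarding only
        "-L", f"{port}:localhost:{port}",   # local port -> port on the master node
        f"{user}@{master_host}",
    ]


if __name__ == "__main__":
    command = tunnel_command(
        "EMRKeyPair.pem",
        "ec2-54-76-242-238.eu-west-1.compute.amazonaws.com",
        25010,
    )
    # Print the command rather than running it, so it can be reviewed first.
    print(shlex.join(command))
```

With the tunnel open, the ODBC DSN on the desktop would point at localhost and the forwarded port rather than at the cluster's public address.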
GENERATE ➔ STORE ➔ ANALYZE ➔ SHARE!
Amazon S3, Amazon DynamoDB,
Amazon RDS, Amazon Redshift,
Data on Amazon EC2
GENERATE ➔ STORE ➔ ANALYZE ➔ SHARE!
BATCH PROCESSING
GENERATE ➔ STREAM PROCESSING ➔ SHARE!
AMAZON KINESIS
REAL-TIME DATA STREAM PROCESSING!
Real-time response to content in semi-structured data streams
Relatively simple computations on data (aggregates, filters, sliding window, etc.)
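As an illustration of the kind of computation meant here, the following Python sketch keeps a sliding-window average over a stream of (timestamp, value) events. It is plain Python and does not use the Kinesis API; the class name is invented for the example, and it assumes events arrive in non-decreasing timestamp order.

```python
from collections import deque


class SlidingWindowAverage:
    """Average of the values seen in the last `window_seconds` seconds."""

    def __init__(self, window_seconds):
        if window_seconds <= 0:
            raise ValueError("window_seconds must be positive")
        self.window_seconds = window_seconds
        self.events = deque()   # (timestamp, value) pairs, oldest first
        self.total = 0.0        # running sum of the values in the window

    def add(self, timestamp, value):
        """Record one event and return the average over the current window."""
        self.events.append((timestamp, value))
        self.total += value
        # Evict events that fell out of the window (timestamp - window, timestamp].
        cutoff = timestamp - self.window_seconds
        while self.events[0][0] <= cutoff:
            _, old_value = self.events.popleft()
            self.total -= old_value
        return self.total / len(self.events)


if __name__ == "__main__":
    window = SlidingWindowAverage(window_seconds=10)
    for timestamp, value in [(0, 10), (5, 20), (12, 30)]:
        print(timestamp, window.add(timestamp, value))
```

Each event is appended once and evicted at most once, so the per-event cost stays small regardless of how long the stream runs, which is what makes this style of aggregate practical on a continuous feed.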
Hourly server logs: how your systems went wrong an hour ago
Weekly / monthly bill: what you spent this past billing cycle
Daily customer report from your website: tells you what deal or ad to try next time
Daily fraud reports: tells you if there was fraud yesterday
Daily business reports: tells me how customers used AWS services yesterday
Real-time metrics: what just went wrong now
Real-time spending alerts/caps: guaranteeing you can’t overspend
Real-time analysis: what to offer the current customer now
Real-time detection: blocks fraudulent use now
Fast ETL into Amazon Redshift: how are customers using services now
GENERATE ➔ STORE ➔ ANALYZE ➔ SHARE!
Amazon S3, Amazon DynamoDB,
Amazon RDS, Amazon Redshift,
Data on Amazon EC2
Amazon EC2, Amazon Elastic MapReduce
Amazon S3, Amazon Glacier,
Amazon DynamoDB, Amazon RDS,
Amazon Redshift, AWS Storage Gateway, Data on Amazon EC2
AWS Import / Export, AWS Direct Connect
GENERATE ➔ STREAM PROCESSING ➔ SHARE!
Amazon S3, Amazon DynamoDB,
Amazon RDS, Amazon Redshift,
Data on Amazon EC2
Amazon Kinesis, Stream Processing on Amazon EC2
WANT TO KNOW MORE?
aws.amazon.com/solutions/case-studies/big-data/