Chattanooga Hadoop Meetup - Hadoop 101 - November 2014


Transcript of Chattanooga Hadoop Meetup - Hadoop 101 - November 2014

Page 1: Chattanooga Hadoop Meetup - Hadoop 101 - November 2014
Page 2: Chattanooga Hadoop Meetup - Hadoop 101 - November 2014

Josh Patterson

Email:

[email protected]

Twitter:

@jpatanooga

Github:

https://github.com/jpatanooga

Slideshare:

http://www.slideshare.net/jpatanooga/

Past

Published in IAAI-09:

“TinyTermite: A Secure Routing Algorithm”

Grad work in meta-heuristics and ant algorithms

Tennessee Valley Authority (TVA)

Hadoop and the Smartgrid

Cloudera

Principal Solution Architect

Today: Patterson Consulting

Page 3: Chattanooga Hadoop Meetup - Hadoop 101 - November 2014

Overview

• What is Hadoop?

• Hadoop and Industry

• Is Hadoop for Me?

Page 4: Chattanooga Hadoop Meetup - Hadoop 101 - November 2014
Page 5: Chattanooga Hadoop Meetup - Hadoop 101 - November 2014

Hadoop Distributed File

System (HDFS)

MapReduce

Apache Hadoop

• Consolidates Mixed Data: move complex and relational data into a single repository

• Stores Inexpensively: keep raw data always available, on industry-standard hardware

• Processes at the Source: eliminate ETL bottlenecks; mine data first, govern later


Page 6: Chattanooga Hadoop Meetup - Hadoop 101 - November 2014

Why is it Called Hadoop?

Doug Cutting created Hadoop.

He named it after his son’s toy elephant, which the boy called “Ha-Doop”.

Page 7: Chattanooga Hadoop Meetup - Hadoop 101 - November 2014

What Hadoop Does

Uses commodity hardware / servers

Scales into Petabytes without hardware changes

Manages fault tolerance and replication with its distributed file system

Scalable processing engine handles all types of data

Text, logs, documents

Binary, images, video

Page 8: Chattanooga Hadoop Meetup - Hadoop 101 - November 2014

Hadoop Distributed File System (HDFS)

Based on the design of Google’s GFS

Built for the sustained high throughput that MapReduce parallel processing jobs require

Data is stored in large files, split into large blocks (64MB, 128MB, 256MB, etc.)
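
To make the block sizing concrete, here is a small sketch (not from the slides) of how many blocks, and replicated copies, a file occupies. The 128MB block size and 3x replication used here are the common defaults; real clusters configure both (dfs.blocksize, dfs.replication).

```python
# Sketch: how many HDFS blocks (and replicas) a file occupies.
import math

def hdfs_blocks(file_size_bytes, block_size=128 * 1024 * 1024, replication=3):
    """Return (block_count, total_replicas) for a file of the given size."""
    blocks = math.ceil(file_size_bytes / block_size)
    return blocks, blocks * replication

# A 1 GB file with 128 MB blocks and 3x replication:
blocks, replicas = hdfs_blocks(1024 * 1024 * 1024)
print(blocks, replicas)  # 8 blocks, 24 block replicas stored cluster-wide
```

The large block size is why HDFS favors big files: each block is tracked by the NameNode, so many tiny files cost far more metadata than a few large ones.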

Page 9: Chattanooga Hadoop Meetup - Hadoop 101 - November 2014

MapReduce: Distributed Processing
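
The slide’s diagram doesn’t survive in this transcript, but the idea can be sketched in miniature: a map step emits (key, value) pairs, a shuffle groups them by key, and a reduce step aggregates each group. A toy single-process word count, using only the standard library (real Hadoop runs the map and reduce tasks in parallel across the cluster):

```python
# Toy MapReduce word count, run in one process to show the data flow.
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["hadoop stores data", "hadoop processes data"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```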

Page 10: Chattanooga Hadoop Meetup - Hadoop 101 - November 2014

Hadoop Analysis Tools

Java Map Reduce

Hive and Impala

SQL-like language for Hadoop

Declarative higher level language

Pig

Procedural higher level language

Filters, joins, UDFs (user-defined functions)


Page 11: Chattanooga Hadoop Meetup - Hadoop 101 - November 2014
Page 12: Chattanooga Hadoop Meetup - Hadoop 101 - November 2014

Starting Out: 2008

Got going at Facebook and Yahoo

Became the backbone of many California startups

In 2009 we did a POC with Hadoop @ TVA

http://openpdc.codeplex.com/

Page 13: Chattanooga Hadoop Meetup - Hadoop 101 - November 2014

Source: IDC White Paper - sponsored by EMC.

As the Economy Contracts, the Digital Universe Expands. May 2009.

Unstructured Data Explosion

• 2,500 exabytes of new information in 2012 with Internet as primary driver

Relational

Complex, Unstructured

(You)

Page 14: Chattanooga Hadoop Meetup - Hadoop 101 - November 2014

Financial Services Got Interested in Hadoop

Banks saw the potential to analyze large volumes of transactions for things like

Fraud

Money Laundering Detection

Comprehensive Credit Reports

Teradata had become very expensive

Started using Hadoop to augment mass data transforms

Page 15: Chattanooga Hadoop Meetup - Hadoop 101 - November 2014

Genomics

A genome is 2.8GB

There are lots of genomes

Genomics groups (ISB, Novartis, etc.) became very interested in Hadoop around 2010 and 2011

There are many CPU bound processes in genomics

But a lot of it is also disk bound – great for Hadoop

Page 16: Chattanooga Hadoop Meetup - Hadoop 101 - November 2014

Other Verticals Jumped In

Telecoms

Lots of call histories to look at, billing records, etc.

Media

Recommend content to watch based on viewing history

Manufacturing

Sensor data on devices works well as timeseries in Hadoop

Insurance

Lots of data can build better models of how people live

Can give insurers better ways to model policies

Page 17: Chattanooga Hadoop Meetup - Hadoop 101 - November 2014
Page 18: Chattanooga Hadoop Meetup - Hadoop 101 - November 2014

Data is Not Always Big

But the world is becoming progressively more interested in data

Big and small

Data analysis is driving how we build new products, manage our lives, and consume content

Focus on producing a result that is relevant to the industry

And not on how much data you have or if it qualifies as “big”

Page 19: Chattanooga Hadoop Meetup - Hadoop 101 - November 2014

ETL Pipelines

Many early Hadoop use cases involve porting a data transform pipeline into Hive or MapReduce

Allows for linear scalability of throughput

Processing web logs has always been the MapReduce base case

Many times the result data is sent back to an RDBMS store
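
As a sketch of the web-log base case, here is the mapper half of a Hadoop Streaming-style job: it emits one (status code, 1) pair per request line, and a companion reducer would sum the counts per status. The field position assumed here is Apache common log format; adjust the index for other log layouts.

```python
# Sketch: mapper for counting HTTP status codes in web logs.
# A reducer would sum the 1s per status, exactly like word count.
def map_line(line):
    parts = line.split()
    if len(parts) > 8:
        status = parts[8]   # status code field in Apache common log format
        return status + "\t1"
    return None             # skip malformed lines

log = '127.0.0.1 - - [10/Oct/2014:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
print(map_line(log))  # emits the tab-separated pair: 200, 1
```

In a real Streaming job this function would loop over stdin and the framework would handle the shuffle; this is illustrative only, not a drop-in script.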

Page 20: Chattanooga Hadoop Meetup - Hadoop 101 - November 2014

Recommend Products

Ever used Facebook’s “People You May Know”?

Ever used Amazon’s “People Also Bought”?

These are recommendation systems

Hadoop powers both of these

Deep Dive into Recommenders on Hadoop

http://www.slideshare.net/jpatanooga/la-hug-dec-2011-recommendation-talk
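
To make “people also bought” concrete, here is a minimal item-to-item co-occurrence sketch (one simple approach; production recommenders on Hadoop compute this at scale, e.g. with Mahout): count how often pairs of items appear in the same basket, then recommend the items that most often co-occur with a given item.

```python
# Minimal item-to-item co-occurrence recommender sketch.
from collections import Counter
from itertools import combinations

def cooccurrence(baskets):
    # Count how often each ordered pair of items appears in the same basket.
    counts = Counter()
    for basket in baskets:
        for a, b in combinations(sorted(set(basket)), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts

def recommend(item, counts, n=2):
    # Rank other items by how often they co-occur with `item`.
    ranked = [(other, c) for (a, other), c in counts.items() if a == item]
    ranked.sort(key=lambda pair: -pair[1])
    return [other for other, _ in ranked[:n]]

baskets = [["book", "lamp"], ["book", "lamp"], ["book", "desk"]]
counts = cooccurrence(baskets)
print(recommend("book", counts))  # ['lamp', 'desk'] — lamp co-occurs twice
```

The co-occurrence counting maps naturally onto MapReduce: the map step emits item pairs per basket, and the reduce step sums them, which is why this workload fit early Hadoop so well.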

Page 21: Chattanooga Hadoop Meetup - Hadoop 101 - November 2014

Hadoop as Next-Gen Data Warehouse

Easier to work with data where it lives

Less dependent on DBAs and schemas

Hadoop is Community driven

Is spiritually very similar to Linux

Open source core

With a distro model (think Red Hat and Ubuntu)

Page 22: Chattanooga Hadoop Meetup - Hadoop 101 - November 2014

Is Hadoop For My Use Case?

Am I doing a lot of table / disk based transforms?

The process is disk bound and batch-oriented, so yes

Do I need to run ad-hoc queries over large amounts of data?

Hive or Impala make sense here

Am I dealing with a lot of incoming transactional data that I’d like to analyze (logs, typically)?

Hadoop is great for cheap storage and scalable processing

Page 23: Chattanooga Hadoop Meetup - Hadoop 101 - November 2014

Questions?

Thanks for coming out to hear about Hadoop!