20140324 big data_101_v7

Confidential and proprietary materials for authorized Verizon personnel and outside agencies only. Use, disclosure or distribution of this material is not permitted to any unauthorized persons or third parties except by written agreement.

description

A basic primer on what big data is, what drives it, the technologies that support it, and industry examples.

Transcript of 20140324 big data_101_v7

Page 1: 20140324 big data_101_v7


Big Data 101

Pradeep Varadan

Enterprise Architecture

Mar 2014

Page 2: 20140324 big data_101_v7


Agenda

• What is Big Data?
  – Hype
  – Facts
  – Definition
• Why the upsurge?
  – Rethinking data
  – Rethinking processes
• Technology
  – Current constraints
  – RDBMS vs. Hadoop
  – Hadoop
  – No SQL
• Use Cases
  – Cross-industry examples
  – Netflix

Page 3: 20140324 big data_101_v7


data

Big Data

What is Big Data ?

Page 4: 20140324 big data_101_v7


What is Big Data?

Sources: social media, server logs, web clickstream, machine/sensor data, geo-location

Era:    Hobbyist  Desktop  Internet  Big Data
Scale:  KB        GB       PB        ZB

Page 5: 20140324 big data_101_v7


“high-volume, -velocity and -variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making” - Gartner

What is Big Data ?


Page 7: 20140324 big data_101_v7


Rethinking data #1: Move from samples to populations

TRADITIONAL APPROACH: Analyze small subsets of information. (Diagram: the analyzed information is a small part of all available information.)

BIG DATA APPROACH: Analyze all information. (Diagram: all available information is analyzed.)
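The contrast between analyzing a sample and analyzing the whole population can be sketched in a few lines of Python. This is a minimal illustration, not something from the original deck: the response-time data, the outlier values, and the sample size are all made-up assumptions.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical "population": one response time (ms) per request,
# plus a handful of rare, very slow requests.
population = [random.gauss(100, 15) for _ in range(100_000)]
population += [5_000.0] * 20

# Traditional approach: analyze a small subset.
sample = random.sample(population, 500)


def summarize(values):
    """Return (mean, max) for a list of numbers."""
    return statistics.fmean(values), max(values)


sample_mean, sample_max = summarize(sample)
population_mean, population_max = summarize(population)

print(f"sample:     mean={sample_mean:.1f}  max={sample_max:.1f}")
print(f"population: mean={population_mean:.1f}  max={population_max:.1f}")
```

The means will usually be close, but a 500-row sample can easily miss the 20 rare events entirely, while the full scan always sees them.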

Page 8: 20140324 big data_101_v7


Rethinking data #2: Let data do the talking

TRADITIONAL APPROACH: Start with a hypothesis and test it against selected data. (Diagram labels: Hypothesis, Question, Data, Answer)

BIG DATA APPROACH: Explore all data and identify correlations. (Diagram labels: Data, Exploration, Correlation, Insight)
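As a rough illustration of "explore all data and identify correlations", the sketch below computes the Pearson correlation for every pair of columns and ranks the pairs, rather than testing one pre-chosen hypothesis. The metric names and values are invented for the example.

```python
from itertools import combinations
from math import sqrt

# Hypothetical daily metrics; names and numbers are illustrative only.
metrics = {
    "page_views": [120, 150, 170, 200, 260, 300],
    "support_calls": [30, 28, 27, 22, 18, 15],
    "signups": [12, 14, 18, 21, 25, 31],
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def rank_correlations(columns):
    """Correlate every pair of columns, strongest relationship first."""
    pairs = [(a, b, pearson(columns[a], columns[b]))
             for a, b in combinations(columns, 2)]
    return sorted(pairs, key=lambda p: abs(p[2]), reverse=True)


for a, b, r in rank_correlations(metrics):
    print(f"{a} vs {b}: r={r:+.2f}")
```

A strong correlation found this way is a lead to investigate, not proof of cause.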

Page 9: 20140324 big data_101_v7


Rethinking processes #1: Fail fast or progress iteratively

TRADITIONAL APPROACH: Carefully cleanse information before any analysis. (Diagram: small amount of carefully organized information)

BIG DATA APPROACH: Analyze information as is, cleanse as needed. (Diagram: large amount of messy information)
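One way to read "analyze as is, cleanse as needed" is to parse only the fields a given question requires and to skip, rather than reject, records that cannot be used. The record layout below is a hypothetical example, not taken from the deck.

```python
# Hypothetical raw event lines: date,user,value. Some are messy on purpose.
raw_events = [
    "2014-03-01,alice,42",
    "2014-03-01,bob,not_a_number",
    "2014-03-02,,17",          # missing user, but the value is still usable
    "garbage line",
    "2014-03-02,carol,8",
]


def total_value(lines):
    """Sum the third field, cleansing only what this question needs."""
    total, skipped = 0, 0
    for line in lines:
        fields = line.split(",")
        try:
            total += int(fields[2])
        except (IndexError, ValueError):
            skipped += 1  # messy record: count it and move on
    return total, skipped


total, skipped = total_value(raw_events)
print(f"total={total} skipped={skipped}")
```

The record with the missing user still contributes to the total, because this particular question never looks at the user field.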

Page 10: 20140324 big data_101_v7


Rethinking processes #2: Provide insight in real time

TRADITIONAL APPROACH: Analyze data after it’s been processed and landed in a warehouse or mart. (Diagram labels: Data, Repository, Analysis, Insight)

BIG DATA APPROACH: Analyze data in motion as it’s generated, in real time. (Diagram labels: Data, Analysis, Insight)
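Analyzing data in motion can be sketched with a Python generator that updates its answer after every reading instead of waiting for the data to land in a repository. The `sensor_readings` source is a hypothetical stand-in for a live feed.

```python
def running_average(stream):
    """Yield the average so far after each new reading (data in motion)."""
    count, total = 0, 0.0
    for value in stream:
        count += 1
        total += value
        yield total / count


def sensor_readings():
    """Hypothetical source; stands in for a live sensor or log feed."""
    yield from [10.0, 20.0, 30.0, 40.0]


for average in running_average(sensor_readings()):
    print(average)
```

Only a count and a running total are kept in memory, so the same function works on an unbounded stream.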


Page 12: 20140324 big data_101_v7


Constraints of the current environment

Category          Existing optimization    Ask
Data type         Structured               Unstructured
H/W scalability   Vertical                 Horizontal
Reliability       Pricey H/W               Free S/W
Interoperability  Closed by vendor         Open source
IO                Write less, read more    Write more, read less
Insight           Newspaper/daily          Near real time
Data retention    Filtered/limited         Unfiltered/unlimited

Page 13: 20140324 big data_101_v7


Big Data Technologies

• Hadoop

• NO SQL

• Analytics/Visualization (Out of Scope)

Page 14: 20140324 big data_101_v7


How did Hadoop come about?

Year  Google
2004  GFS, MapReduce
2005  Sawzall
2006  Bigtable
2010  Dremel/F1
…     …
2012  Spanner

Year  Open source
2006  HDFS
2008  Pig, Hive
2008  HBase
2013  Impala
…     …
?     ?

Page 15: 20140324 big data_101_v7


HDFS: Distributed compute and storage

[Diagram: a Name Node and a Job Tracker coordinate many worker nodes, each running a DataNode (DN) and a TaskTracker (TT). Legend: DFS message path; MapReduce processing messages.]

Page 16: 20140324 big data_101_v7


MapReduce: visual example

[Diagram stages: Distribute, Map, Shuffle, Reduce]
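The four stages named on this slide can be imitated in plain Python with a word count. This is a single-process sketch of the idea only; it does not use Hadoop, and the function names and the two-worker split are assumptions made for illustration.

```python
from collections import defaultdict


def distribute(lines, n_workers):
    """Split the input into one chunk per (simulated) worker."""
    return [lines[i::n_workers] for i in range(n_workers)]


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in the chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(mapped_chunks):
    """Group all emitted values by key across every worker's output."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each key's values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines, n_workers=2):
    chunks = distribute(lines, n_workers)
    mapped = [map_phase(chunk) for chunk in chunks]  # parallel in a real cluster
    return reduce_phase(shuffle(mapped))


counts = word_count(["big data big insight", "data in motion"])
print(counts)
```

In a real cluster the map calls run on the nodes that hold each chunk, and the shuffle moves data over the network; here both are ordinary function calls.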

Page 17: 20140324 big data_101_v7


Hadoop Reference architecture

Page 18: 20140324 big data_101_v7

Welcome to the zoo!

hadoop - HDFS, MapReduce
sqoop - database to HDFS
flume - logs to HDFS
hbase - column-family store modeled on Bigtable (key, value)
pig - data-flow scripting (Pig Latin), with UDFs in languages such as Python and Ruby
hive - SQL queries
oozie - workflow coordination, XML based, scheduler/job orchestration
zookeeper - coordinator; misc admin functions: locking, messaging, mailboxes, leader election
fuse-dfs - mount HDFS volumes in Linux
avro - data serialization/RPC
mahout - machine learning
dumbo - Python library for streaming
vaidya - performance benchmarking framework
chukwa - cluster monitor
lucene - text search
scribe - log collection
storm - real-time processing

Page 19: 20140324 big data_101_v7


Hadoop companies

Page 20: 20140324 big data_101_v7


Interfaces to Hadoop

[Diagram labels: Analytics, Data Prep, CRM]

Page 21: 20140324 big data_101_v7


Hadoop vs. Relational Databases

• Hadoop: schema-on-read (write first, think later)

• RDBMS: schema-on-write (think first, write next)
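The difference between the two models can be sketched as follows. SQLite stands in for a relational database and a list of JSON lines stands in for raw files in HDFS; both stand-ins, and the table and field names, are assumptions made for this example.

```python
import json
import sqlite3

# Schema-on-write: the structure is declared first and enforced on every insert.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE clicks (user TEXT NOT NULL, page TEXT NOT NULL)")
db.execute("INSERT INTO clicks VALUES (?, ?)", ("alice", "/home"))
try:
    db.execute("INSERT INTO clicks VALUES (?, ?)", ("bob", None))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True  # the row does not fit the schema, so the write fails

# Schema-on-read: anything can be stored; structure is applied by the query.
raw_lines = [
    '{"user": "alice", "page": "/home"}',
    '{"user": "bob", "device": "phone"}',  # different shape, stored anyway
]


def pages_viewed(lines):
    """Interpret each record at read time and keep those that have a page."""
    return [rec["page"] for rec in map(json.loads, lines) if "page" in rec]


print("rejected at write time:", rejected)
print("pages found at read time:", pages_viewed(raw_lines))
```

The trade-off is where the effort goes: schema-on-write pays it up front at load time, schema-on-read pays it in every query.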

Page 22: 20140324 big data_101_v7


NO SQL – Not Only SQL

Page 23: 20140324 big data_101_v7


NO SQL Types

• Column family:
– Aggregate/OLAP oriented; the primary key is the data, mapped back to row IDs
– HBase, Accumulo, Cassandra
– NSA uses Accumulo with cell-level security for PRISM

• Document store:
– Object-oriented encapsulation; encodings include XML, YAML, JSON, and BSON
– MarkLogic, MongoDB, Couchbase
– MetLife uses MongoDB for “The Wall” / Customer 360 View CRM

• Key-value:
– (key, value) based lookups; associative array with hash table
– Dynamo, Riak, Voldemort
– LinkedIn used Voldemort behind ‘Who viewed my profile?’

• Graph:
– Graph structures with nodes, edges, and properties; index-free adjacency
– Neo4j, AllegroGraph, Virtuoso
– TwitLogic: semantic web using Twitter data
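To make the four data models concrete, the sketch below mimics each one with ordinary Python structures. It shows only the shape of the data; none of the listed products is used, and every key, field, and value is a hypothetical example.

```python
# Key-value: one opaque value per key, looked up through a hash table.
kv_store = {"profile_views:alice": 42}

# Document: a self-describing, nested record stored under one id.
document_store = {
    "customer:1": {
        "name": "Alice",
        "policies": [{"type": "life", "active": True}],
    },
}

# Column family: row key -> column family -> column -> value.
column_family_store = {
    "row1": {
        "contact": {"email": "alice@example.com"},
        "usage": {"2014-03": 12},
    },
}

# Graph: nodes with properties, plus labeled edges between them.
graph_nodes = {"alice": {"kind": "person"}, "bob": {"kind": "person"}}
graph_edges = [("alice", "follows", "bob")]


def neighbors(node, relation):
    """Follow edges of one type out of a node."""
    return [dst for src, rel, dst in graph_edges
            if src == node and rel == relation]


print(kv_store["profile_views:alice"])
print(document_store["customer:1"]["policies"][0]["type"])
print(column_family_store["row1"]["usage"]["2014-03"])
print(neighbors("alice", "follows"))
```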

Page 24: 20140324 big data_101_v7


SQL vs. NO SQL

SQL                            NO SQL
Relational                     Distributed/Hierarchical
Tables                         Key-value pairs, documents, graphs, column families
Pre-defined schema             Dynamic schema
Vertically scalable            Horizontally scalable
SQL                            UnQL (more programming)
Complex queries on small data  Simple queries on large data
ACID                           BASE
Defined data model             Model inside application
Cumbersome setup (DBA)         Ease of setup
Simple data                    Complex data

Page 25: 20140324 big data_101_v7


Eventually consistent

“CAP Theorem is a set of basic requirements that describe any distributed system, not just storage or database”

“You cannot have a clustered system that supports all of the following three qualities: consistency, availability, partition tolerance” - CAP Theorem by Prof. Eric Brewer (Berkeley)
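Eventual consistency can be illustrated with two in-memory replicas that accept writes independently and reconcile later. This is a toy sketch under simplifying assumptions (a caller-supplied version number, newest version wins); the `Replica` class and `sync` function are hypothetical and do not reflect how any particular product works.

```python
class Replica:
    """A toy replica that stores key -> (version, value)."""

    def __init__(self):
        self.data = {}

    def write(self, key, value, version):
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None


def sync(a, b):
    """Exchange entries in both directions; the newest version wins."""
    for src, dst in ((a, b), (b, a)):
        for key, (version, value) in list(src.data.items()):
            dst.write(key, value, version)


r1, r2 = Replica(), Replica()

# The write is accepted by r1 immediately (available), even though
# r2 has not seen it yet (not yet consistent).
r1.write("cart", ["book"], version=1)
stale = r2.read("cart")

sync(r1, r2)  # replicas converge once they can communicate again
fresh = r2.read("cart")

print("before sync:", stale)
print("after sync: ", fresh)
```

Between the write and the sync, a reader of `r2` sees stale data; that window is the price paid for staying available while the replicas are out of touch.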


Page 27: 20140324 big data_101_v7


Big Data Use Cases

Page 28: 20140324 big data_101_v7


Use Cases

“House of Cards” is one of the first major test cases of this Big Data-driven creative strategy. Detailed knowledge of Netflix subscriber viewing preferences clinched their decision to license a remake of the popular and critically well regarded 1990 BBC miniseries. Netflix’s data indicated that the same subscribers who loved the original BBC production also gobbled down movies starring Kevin Spacey or directed by David Fincher. Therefore, concluded Netflix executives, a remake of the BBC drama with Spacey and Fincher attached was a no-brainer, to the point that the company committed $100 million for two 13-episode seasons.

Page 29: 20140324 big data_101_v7


Where are we headed?

• H/W
– COUCH: cluster of unreliable commodity hardware
– Software-defined storage reliability

• S/W
– HDFS will be the new UNIX (distributed FS)
– Open source software

• Data Ingestion
– Online transactions + Batch file + Streaming torrents

• Technical Architecture
– Shared nothing
– Data centric (process will move to data)

• Backup and recovery?

• Scalability
– Horizontal
– Vertical

• Mixed workloads

Page 31: 20140324 big data_101_v7


Thank you

pradeepvaradan [email protected]