Petascale Analytics - The World of Big Data Requires Big Analytics

© 2011 IBM Corporation Petascale Analytics The World of Big Data Requires Big Analytics October 2011 H. J. Schick IBM Germany Research & Development GmbH


Transcript of Petascale Analytics - The World of Big Data Requires Big Analytics

Page 1: Petascale Analytics - The World of Big Data Requires Big Analytics

© 2011 IBM Corporation

Petascale Analytics

The World of Big Data Requires Big Analytics

October 2011

H. J. Schick

IBM Germany Research & Development GmbH

Page 2: Petascale Analytics - The World of Big Data Requires Big Analytics


Source: The Evolution of Live in 60 Seconds

Page 3: Petascale Analytics - The World of Big Data Requires Big Analytics


Page 4: Petascale Analytics - The World of Big Data Requires Big Analytics

Source: Realtime Apache Hadoop at Facebook

Page 5: Petascale Analytics - The World of Big Data Requires Big Analytics


Page 6: Petascale Analytics - The World of Big Data Requires Big Analytics


Page 7: Petascale Analytics - The World of Big Data Requires Big Analytics


Quiz: What comes after zettabyte?

1 yottabyte = 1,000,000,000,000,000,000,000,000 bytes
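The quiz answer can be checked with a few lines of arithmetic. This is a minimal sketch; the helper names `bytes_per_unit` and `next_prefix` are made up for illustration, and it uses decimal (power-of-1000) prefixes, which is what the figure on the slide implies.

```python
# Decimal (SI) prefixes for bytes, in increasing order of magnitude.
SI_PREFIXES = ["kilo", "mega", "giga", "tera", "peta", "exa", "zetta", "yotta"]


def bytes_per_unit(prefix: str) -> int:
    """Return the number of bytes in one <prefix>byte, using powers of 1000."""
    return 1000 ** (SI_PREFIXES.index(prefix) + 1)


def next_prefix(prefix: str) -> str:
    """Return the prefix one step (a factor of 1000) above the given one."""
    return SI_PREFIXES[SI_PREFIXES.index(prefix) + 1]


print(next_prefix("zetta"))
print(f"{bytes_per_unit('yotta'):,}")
```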

Page 8: Petascale Analytics - The World of Big Data Requires Big Analytics

Experiment:

Page 9: Petascale Analytics - The World of Big Data Requires Big Analytics

Source: IDC Digital Universe Study, sponsored by EMC, May 2011

Page 10: Petascale Analytics - The World of Big Data Requires Big Analytics

Google’s Server Design

Source: cnet News, Google Uncloaks Once-Secret Server, April 2009

Page 11: Petascale Analytics - The World of Big Data Requires Big Analytics

The Digital Universe is a Perpetual Tsunami

1. How will we find the information we need when we need it?

2. How will we know what information we need to keep, and how will we keep it?

3. How will we follow the growing number of government and industry rules about retaining records, tracking transactions, and ensuring information privacy?

4. How will we protect the information we need to protect?

» Solution:

• New search and discovery tools
• Ways to add structure to unstructured data
• New storage and information management techniques
• More compliance tools
• Better security

Page 12: Petascale Analytics - The World of Big Data Requires Big Analytics

Source: McKinsey Global Institute, Big data: The next frontier for innovation, competition and productivity, May 2011

Page 13: Petascale Analytics - The World of Big Data Requires Big Analytics

The Evolution of Science

Figure: timeline (horizontal axis: Time; vertical axis: Evolution of Science)

Eras shown: Era of Natural Philosophy, Era of Modern Science, Industrial Revolution, Learning Systems (XXI Century)

Milestones:

• Astronomy (Babylon, 1900 BC)
• Platonic Academy (387 BC)
• Mathematics (India, 499 BC)
• Scientific Revolution (1543 AD)
• Newton’s Laws (1687 AD)
• Relativity (1905 AD)
• Quantum Physics (1925 AD)
• Computing (1946 AD)
• DNA (1953 AD)

Page 14: Petascale Analytics - The World of Big Data Requires Big Analytics

Today’s Systems – The Calculating Paradigm

• Algorithms and Applications (static programming)
• Archives
• Structured Data and Text
• People hypothesize, determine “what it means”, run other applications…

Page 15: Petascale Analytics - The World of Big Data Requires Big Analytics

Future Systems – The Learning Paradigm

• Training and Learning Engines: to build models and define insight
• Hypothesis Engines: to understand and plan actions
• Policy Engine: business, legal and ethical rules
• Verification Engines (e.g. simulations)
• Active Learning (natural interfaces)
• Outcome Engine: actuation and validation

Society, Nature, Institutions, Archives

Page 16: Petascale Analytics - The World of Big Data Requires Big Analytics

New “Big Data” Brings New Opportunities, Requires New Analytics

Figure: Data Scale (Kilo, Mega, Giga, Tera, Peta, Exa) versus Decision Frequency (yr, mo, wk, day, hr, min, sec, …, ms, µs; Occasional, Frequent, Real-time)

Labels: Traditional Data Warehouse and Business Intelligence; Data at Rest; Data in Motion; Up to 10,000 times larger; Up to 10,000 times faster

Examples:

• Telco Promotions: 100,000 records/sec, 6B/day; 10 ms/decision; 270 TB for Deep Analytics
• DeepQA: 100s GB for Deep Analytics; 3 sec/decision
• Smart Traffic: 250K GPS probes/sec; 630K segments/sec; 2 ms/decision, 4K vehicles
• Homeland Security: 600,000 records/sec, 50B/day; 1-2 ms/decision; 320 TB for Deep Analytics
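The per-second and per-day figures on this slide can be related with simple arithmetic. The sketch below is only illustrative: the slide does not say whether the per-second rates are peak or sustained, so the conversion assumes a constant rate, and the function names are made up for this example.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def records_per_day(records_per_sec: float) -> float:
    """Daily volume if the given rate were sustained for a full day."""
    return records_per_sec * SECONDS_PER_DAY


def average_rate(daily_records: float) -> float:
    """Average records per second implied by a daily total."""
    return daily_records / SECONDS_PER_DAY


# (records/sec, records/day) as quoted on the slide.
workloads = {
    "Telco Promotions": (100_000, 6e9),
    "Homeland Security": (600_000, 50e9),
}

for name, (rate, daily) in workloads.items():
    print(
        f"{name}: {rate:,} rec/s sustained = {records_per_day(rate) / 1e9:.2f}B/day; "
        f"{daily / 1e9:.0f}B/day averages {average_rate(daily):,.0f} rec/s"
    )
```

A sustained 100,000 records/sec would give 8.64B records per day, so the quoted 6B/day corresponds to an average of roughly 69,000 records/sec. For Homeland Security, 600,000 records/sec sustained gives 51.84B/day, close to the quoted 50B/day.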

Page 17: Petascale Analytics - The World of Big Data Requires Big Analytics

Enabling Multiple Solutions & Appliances to Achieve a Smarter Planet

• Big Analytics Ecosystem (Algorithms, Big Data, Skills)
• Reactive + Deep Analytics Platform
• Peta2 Analytics Appliance
• Peta2 Data-centric System

Solutions:

• DeepEyes: Webcam Fusion
• DeepCurrent: Power Delivery
• DeepSafety: Police/Security
• DeepTraffic: Area Traffic Prediction
• DeepFriends: Social Network Monitor
• DeepWater: Water Management
• DeepBasket: Food Market Prediction
• DeepBreath: Air Quality Control
• DeepPulse: Political Polling
• DeepResponse: Emergency Coordination
• DeepThunder: Local Weather Prediction
• DeepSoil: Farm Prediction

Page 18: Petascale Analytics - The World of Big Data Requires Big Analytics

Watson Today: Processes Unstructured Text & 200 Hypotheses/3 Seconds

Evidence-Based Decision Support System

• Answer: A large country in the Western Hemisphere whose capital has a similar name.
• Hypotheses generated from the “Answer”: guess questions Q1, Q2 … Qi
• Static data corpus
• Statistical ensemble of 600 to 800 scoring engines (S1, S2, S3, . . . SN)
• ~30 machine learning models weigh scores, produce confidence for each question, 0<P<1
• Hypothetical question with greatest confidence is chosen
• Question: What is Brazil?

Element                    Refresh Time
Data Corpus                2 Weeks
Hypothesis Engines         Weeks to Months
Scoring Engines            Weeks to Months
Decision Support Engine    4 Days

Watson: 3,000 cores; 100 TFlops; 2 TB memory; ~200 KW
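The flow on this slide (many scoring engines, models that weigh the scores into a confidence between 0 and 1, highest confidence wins) can be sketched in a few lines. This is not Watson's implementation: the logistic combination, the three toy scorers, the weights, the bias and every score value below are invented purely for illustration.

```python
import math


def confidence(scores, weights, bias=0.0):
    """Combine scoring-engine outputs into a confidence strictly between 0 and 1.

    A logistic model is assumed here only as an example of a learned weighting.
    """
    z = bias + sum(w * s for w, s in zip(weights, scores))
    return 1.0 / (1.0 + math.exp(-z))


def best_hypothesis(candidates, weights, bias=0.0):
    """Return (confidence, question) for the candidate with the highest confidence."""
    ranked = [
        (confidence(scores, weights, bias), question)
        for question, scores in candidates.items()
    ]
    return max(ranked)


# Hypothetical candidate questions with made-up scores from three scoring engines.
candidates = {
    "What is Brazil?": [0.9, 0.8, 0.7],
    "What is Canada?": [0.6, 0.1, 0.2],
    "What is Argentina?": [0.5, 0.3, 0.1],
}
weights = [1.5, 2.0, 1.0]  # made-up weights standing in for a trained model

conf, question = best_hypothesis(candidates, weights, bias=-2.0)
print(f"{question} (confidence {conf:.2f})")
```

With these toy numbers the candidate "What is Brazil?" receives the highest confidence and is selected, mirroring the example on the slide.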

Page 19: Petascale Analytics - The World of Big Data Requires Big Analytics


Page 20: Petascale Analytics - The World of Big Data Requires Big Analytics

Exascale Research and Development

Source: Exascale Research and Development – Request for Information, July 2011

Page 21: Petascale Analytics - The World of Big Data Requires Big Analytics

Big Data Systems Require a Data-centric Architecture for Performance

Old Compute-centric Model

• Data lives on disk and tape
• Move data to CPU as needed
• Deep storage hierarchy

New Data-centric Model

• Data lives in persistent memory
• Many CPUs surround and use it
• Shallow/flat storage hierarchy

Enablers:

• Massive Parallelism: Manycore, FPGA
• Persistent Memory: Flash, Phase Change

• Largest change in system architecture since the System/360
• Huge impact on hardware, systems software, and application design

Page 22: Petascale Analytics - The World of Big Data Requires Big Analytics

Scale-in is the New Systems Battlefield

Figure: System Capacity (capability) versus System Density (1/latency end-to-end). Capacity axis: Single Device to Device Clusters (10, 100, 1K, 10K, 100K; Low, Med, High). Density axis: Low, Med, High, Extreme, bounded by Physical Limits. Labels on the chart: Cloud Computing, Peta2, Exascale.

• Scale-down: maximize feature density (atom transistor, atom storage)
• Scale-up: maximize device capacity (terabyte HDD, POWER7)
• Scale-out: maximize system capacity (NAS, blade server)
• Scale-in: maximize system density, minimize end-to-end latency (Flash SSD, 3D chips, FPGA, manycore, BPRAM/SCM, interconnect, in-memory DB, DAS)

Page 23: Petascale Analytics - The World of Big Data Requires Big Analytics

Storage Class Memory - The Tipping Point for Data-centric Systems

• HDD cost advantage continues, 1/10 SCM cost, but
• SCM dominates in performance, 10,000x faster than HDD

           Relative Cost    Relative Latency
DRAM       100              1
SCM        1                10
FLASH      15               1000
HDD        0.1              100000

Source: Chung Lam, IBM

SCM in 2015: $0.05 per GB, $50K per PB

Other chart labels: FLASH; (Phase Change); $0.10 / GB; $0.01 / GB
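The two headline claims on this slide follow directly from the relative cost and latency table, and the per-petabyte price follows from the per-gigabyte price. A minimal sketch of that arithmetic, assuming decimal units (1 PB = 1,000,000 GB); the variable names are chosen for this example only:

```python
# Relative cost and relative latency per technology, as listed on the slide.
table = {
    "DRAM": {"cost": 100.0, "latency": 1},
    "SCM": {"cost": 1.0, "latency": 10},
    "FLASH": {"cost": 15.0, "latency": 1000},
    "HDD": {"cost": 0.1, "latency": 100000},
}

# "HDD cost advantage continues, 1/10 SCM cost"
hdd_vs_scm_cost = table["HDD"]["cost"] / table["SCM"]["cost"]

# "SCM dominates in performance, 10,000x faster than HDD"
scm_speedup_over_hdd = table["HDD"]["latency"] / table["SCM"]["latency"]

# "$0.05 per GB" expressed per petabyte (decimal units assumed).
GB_PER_PB = 1_000_000
scm_cost_per_pb = 0.05 * GB_PER_PB

print(f"HDD cost relative to SCM: {hdd_vs_scm_cost}")
print(f"SCM speedup over HDD: {scm_speedup_over_hdd:,.0f}x")
print(f"SCM cost per PB: ${scm_cost_per_pb:,.0f}")
```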

Page 24: Petascale Analytics - The World of Big Data Requires Big Analytics

Background: 3 Styles of Massively Parallel Systems

Streaming (Streams)

• Reactive Analytics, Extreme Ingestion
• Data in Motion: High Velocity, Mixed Variety, High Volume* (*over time)
• SPL, C, Java

Hadoop/MapReduce (BigInsights)

• Deep Analytics, Extreme Scale-out
• Data at Rest*: High Volume, Mixed Variety, Low Velocity (*pre-partitioned)
• JAQL, Java
• Input Data (on disk), Mappers, Reducers, Output Data

Simulation (BlueGene)

• Generative Modeling, Extreme Physics
• Long Running, Small Input, Massive Output
• C/C++, Fortran, MPI, OpenMP

Legend: = compute node

Page 25: Petascale Analytics - The World of Big Data Requires Big Analytics


Page 26: Petascale Analytics - The World of Big Data Requires Big Analytics

Fault-tolerant Hadoop Distributed File System (HDFS)

Source: Hadoop Overview, http://www.cloudera.com

Page 27: Petascale Analytics - The World of Big Data Requires Big Analytics

Map Reduce Logical Data Flow

Source: O’Reilly, Hadoop – The Definitive Guide

Step 1: raw input records, one per line

0067011990999991950051507004...9999999N9+00001+99999999999...
0043011990999991950051512004...9999999N9+00221+99999999999...
0043011990999991950051518004...9999999N9-00111+99999999999...
0043012650999991949032412004...0500001N9+01111+99999999999...
0043012650999991949032418004...0500001N9+00781+99999999999...

Page 28: Petascale Analytics - The World of Big Data Requires Big Analytics

Map Reduce Logical Data Flow

Source: O’Reilly, Hadoop – The Definitive Guide

Step 2: map input, each line keyed by its byte offset in the file

(0, 0067011990999991950051507004...9999999N9+00001+99999999999...)
(106, 0043011990999991950051512004...9999999N9+00221+99999999999...)
(212, 0043011990999991950051518004...9999999N9-00111+99999999999...)
(318, 0043012650999991949032412004...0500001N9+01111+99999999999...)
(424, 0043012650999991949032418004...0500001N9+00781+99999999999...)

Page 29: Petascale Analytics - The World of Big Data Requires Big Analytics

Map Reduce Logical Data Flow

Source: O’Reilly, Hadoop – The Definitive Guide

Step 3: map output, (year, temperature) pairs

(1950, 0)
(1950, 22)
(1950, −11)
(1949, 111)
(1949, 78)

Page 30: Petascale Analytics - The World of Big Data Requires Big Analytics

Map Reduce Logical Data Flow

Source: O’Reilly, Hadoop – The Definitive Guide

Step 4: reduce input, values sorted and grouped by key

(1949, [111, 78])
(1950, [0, 22, −11])

Page 31: Petascale Analytics - The World of Big Data Requires Big Analytics

Map Reduce Logical Data Flow

Source: O’Reilly, Hadoop – The Definitive Guide

Step 5: reduce output, maximum temperature per year

(1949, 111)
(1950, 22)
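The last three steps of this data flow can be reproduced in plain Python. This is a single-process sketch of the logic only, not Hadoop code: it starts from the map output shown on the slides (the input records are truncated there, so they are not parsed here), and the function names `shuffle` and `reduce_max` are chosen for this example.

```python
from collections import defaultdict

# Map output as shown on the slides: (year, temperature) pairs.
map_output = [(1950, 0), (1950, 22), (1950, -11), (1949, 111), (1949, 78)]


def shuffle(pairs):
    """Group values by key and sort by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_max(groups):
    """Reduce each group of temperatures to its maximum."""
    return {year: max(temps) for year, temps in groups.items()}


grouped = shuffle(map_output)
result = reduce_max(grouped)
print(grouped)  # {1949: [111, 78], 1950: [0, 22, -11]}
print(result)   # {1949: 111, 1950: 22}
```

In a real Hadoop job the grouping is done by the framework across machines; only the map and reduce functions are written by the user.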

Page 32: Petascale Analytics - The World of Big Data Requires Big Analytics

The Blue Gene/Q ASIC

Source: EDN News, Hot Chips: The Puzzle of Many Cores

Page 33: Petascale Analytics - The World of Big Data Requires Big Analytics

The Blue Gene/Q Packaging Hierarchy

Source: The Register, IBM’s Blue Gene/Q Super Chip Grows 18th Core

Page 34: Petascale Analytics - The World of Big Data Requires Big Analytics

Opportunity: Blue Gene Active Storage

BG/Q + Flash Memory => Blue Gene Active Storage (BGAS)

BGQ Flash Card

Flash Capacity    320 GB
I/O Bandwidth     1.5 GB/s
IOPS              207,000+

Blue Gene/Q Active Storage Rack (512 BGQ Flash Cards)

Nodes             512
Storage Cap       640 TB
I/O Bandwidth     768 GB/s
Random IOPS       100 Million
Compute           104 TF
Bi-Section BW     512 GB/s

… scale it like BG/Q.

BGAS Capabilities Per Rack

• 104 TeraFLOPS – 512 nodes, 8192 cores -- 50% of Standard BG/Q System

• 512 GB/s Bi-Section Bandwidth - All-to-All Throughput of 2GB/s per Node

• 768 GB/s I/O bandwidth – 100TB Sort in ~330 sec (vs 10,000 sec today)

• 100 Million IOPS – Equivalent to order 1 Million Disks

Research and Development Challenges:

• Packaging: integrate Flash today, tomorrow PCM, Memristor, Racetrack, etc.

• System Software: Persistent Memory Management, k-v Store on BGAS

• Resilience: Single Path to Storage, BG/Q Network for General Workloads

• Integration: System Management, Middleware and Frameworks, Applications
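Several of the per-rack figures above can be cross-checked from the per-card figures. The sketch below assumes one flash card per node (the slide lists 512 nodes and 512 flash cards) and decimal units. The 640 TB storage figure is left out because it does not follow from 512 cards of 320 GB, and the slide does not say how it is derived. The single-pass sort time is only a lower bound for comparison, since the slide does not describe the sort algorithm.

```python
NODES = 512                  # nodes per rack, one flash card each (assumption)
CARD_BANDWIDTH_GB_S = 1.5    # I/O bandwidth per flash card
CARD_IOPS = 207_000          # IOPS per flash card ("207,000+")

# Aggregate rack figures derived from the per-card values.
rack_bandwidth_gb_s = NODES * CARD_BANDWIDTH_GB_S
rack_iops = NODES * CARD_IOPS

# Time to stream 100 TB once through the rack's aggregate I/O bandwidth.
SORT_DATA_GB = 100 * 1000
single_pass_seconds = SORT_DATA_GB / rack_bandwidth_gb_s

# How many such passes fit into the quoted ~330 seconds.
passes_in_quoted_time = 330 / single_pass_seconds

print(f"Rack I/O bandwidth: {rack_bandwidth_gb_s:.0f} GB/s")
print(f"Rack IOPS: {rack_iops:,}")
print(f"One pass over 100 TB: {single_pass_seconds:.0f} s")
print(f"Passes within 330 s: {passes_in_quoted_time:.1f}")
```

The derived bandwidth matches the 768 GB/s on the slide, and the derived IOPS (about 106 million) is consistent with the rounded "100 Million".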

Page 35: Petascale Analytics - The World of Big Data Requires Big Analytics

NAND Flash Challenges

1. Need to erase before writing

2. Data retention errors

3. Limited number of writes

4. Management of initial and runtime bad blocks

5. Data errors caused by read and write disturb

» Factors that influence reliability, performance, write endurance:

• Use of Single Level Cell (SLC) and Multi Level Cell (MLC) NAND technology
• Wear-out mechanism that limits service life: wear-leveling algorithm
• Ensuring data integrity through bad block management techniques
• Use of error detection and correction algorithms
• Write amplification
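Two of the factors above, wear leveling and write amplification, can be illustrated with a toy model. This is a deliberately simplified sketch, not how any particular flash controller works: the class `FlashBlocks`, its erase limit and the "least-erased block first" policy are assumptions made for the example.

```python
class FlashBlocks:
    """Toy model of erase blocks with a limited number of erase cycles."""

    def __init__(self, num_blocks, max_erases):
        self.erase_counts = [0] * num_blocks
        self.bad = set()          # blocks retired after reaching the erase limit
        self.max_erases = max_erases

    def pick_block(self):
        """Wear leveling: choose the usable block with the fewest erases so far."""
        usable = [b for b in range(len(self.erase_counts)) if b not in self.bad]
        if not usable:
            raise RuntimeError("all blocks are worn out")
        return min(usable, key=lambda b: self.erase_counts[b])

    def erase_and_write(self):
        """Erase a block before writing to it, and retire it once it is worn out."""
        block = self.pick_block()
        self.erase_counts[block] += 1
        if self.erase_counts[block] >= self.max_erases:
            self.bad.add(block)
        return block


def write_amplification(flash_bytes_written, host_bytes_written):
    """Ratio of bytes physically written to flash to bytes the host asked to write."""
    return flash_bytes_written / host_bytes_written


flash = FlashBlocks(num_blocks=4, max_erases=3)
for _ in range(8):
    flash.erase_and_write()
print(flash.erase_counts)  # [2, 2, 2, 2]
```

Spreading the eight writes evenly leaves every block at two erases instead of wearing out a single block; a write amplification above 1 means the device wears faster than the host workload alone would suggest.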


Page 36: Petascale Analytics - The World of Big Data Requires Big Analytics

Gartner’s Hype Cycle

Page 37: Petascale Analytics - The World of Big Data Requires Big Analytics


Thank you very much for your attention.