Petascale Analytics - The World of Big Data Requires Big Analytics
© 2011 IBM Corporation
Petascale Analytics: The World of Big Data Requires Big Analytics
October 2011
H. J. Schick
IBM Germany Research & Development GmbH
Source: The Evolution of Live in 60 Seconds
Source: Realtime Apache Hadoop at Facebook
Quiz: What comes after zettabyte?
1 yottabyte = 1,000,000,000,000,000,000,000,000 bytes
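The quiz answer can be checked with a small sketch of the decimal (SI) prefixes. The helper names `bytes_for` and `next_prefix` are illustrative and not part of the talk; the sketch assumes powers of 1000, as the byte count above does.

```python
# Decimal (SI) prefixes for byte counts, in increasing order.
SI_PREFIXES = ["kilo", "mega", "giga", "tera", "peta", "exa", "zetta", "yotta"]

def bytes_for(prefix: str) -> int:
    """Number of bytes in one <prefix>byte, using powers of 1000."""
    exponent = 3 * (SI_PREFIXES.index(prefix) + 1)
    return 10 ** exponent

def next_prefix(prefix: str) -> str:
    """Prefix that follows the given one in the list above."""
    return SI_PREFIXES[SI_PREFIXES.index(prefix) + 1]

if __name__ == "__main__":
    print(next_prefix("zetta"))        # yotta
    print(f"{bytes_for('yotta'):,}")   # 1,000,000,000,000,000,000,000,000
```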
Experiment:
Source: IDC Digital Universe Study, sponsored by EMC, May 2011
Google’s Server Design
Source: cnet News, Google Uncloaks Once-Secret Server, April 2009
The Digital Universe is a Perpetual Tsunami
1. How will we find the information we need when we need it?
2. How will we know what information we need to keep, and how will we keep it?
3. How will we follow the growing number of government and industry rules about retaining records, tracking transactions, and ensuring information privacy?
4. How will we protect the information we need to protect?
» Solution:
• New search and discovery tools
• Ways to add structure to unstructured data
• New storage and information management techniques
• More compliance tools
• Better security
Source: McKinsey Global Institute, Big data: The next frontier for innovation, competition and productivity, May 2011
The Evolution of Science
[Figure: timeline plotting the evolution of science against time. Eras labelled: Era of Natural Philosophy, Era of Modern Science, Industrial Revolution, Learning Systems (XXI Century).]
Milestones:
• Astronomy (Babylon, 1900 BC)
• Platonic Academy (387 BC)
• Mathematics (India, 499 BC)
• Scientific Revolution (1543 AD)
• Newton’s Laws (1687 AD)
• Relativity (1905 AD)
• Quantum Physics (1925 AD)
• Computing (1946 AD)
• DNA (1953 AD)
Today’s Systems – The Calculating Paradigm
• Algorithms and Applications: static programming
• Archives: structured data and text
• People hypothesize, determine “what it means”, run other applications…
Future Systems – The Learning Paradigm
• Training and Learning Engines: to build models and define insight
• Hypothesis Engines: to understand and plan actions
• Policy Engine: business, legal and ethical rules
• Verification Engines (e.g. simulations)
• Active Learning (natural interfaces)
• Outcome Engine: actuation and validation
Sources: Society, Nature, Institutions, Archives
New “Big Data” Brings New Opportunities, Requires New Analytics
[Figure: data scale (Kilo, Mega, Giga, Tera, Peta, Exa) plotted against decision frequency (Occasional, Frequent, Real-time; ticks: yr mo wk day hr min sec … ms s). Regions labelled: Traditional Data Warehouse and Business Intelligence, Data at Rest, Data in Motion. Annotations: up to 10,000 times larger, up to 10,000 times faster.]
Example workloads:
• Telco Promotions: 100,000 records/sec, 6B/day; 10 ms/decision; 270 TB for deep analytics
• DeepQA: 100s GB for deep analytics; 3 sec/decision
• Smart Traffic: 250K GPS probes/sec; 630K segments/sec; 2 ms/decision, 4K vehicles
• Homeland Security: 600,000 records/sec, 50B/day; 1-2 ms/decision; 320 TB for deep analytics
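The workload figures can be related to each other with some back-of-the-envelope arithmetic. The sketch below takes the slide's numbers at face value. Treating records/sec as a sustained rate and applying Little's law to estimate concurrent decisions are assumptions made here, not statements from the talk, and all function names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

def records_per_day(records_per_sec):
    """Daily volume if the per-second rate were sustained all day."""
    return records_per_sec * SECONDS_PER_DAY

def average_rate(records_per_day_total):
    """Average per-second rate implied by a daily total."""
    return records_per_day_total / SECONDS_PER_DAY

def decisions_in_flight(records_per_sec, latency_ms):
    """Little's law: average number of records being decided concurrently."""
    return records_per_sec * latency_ms / 1000

if __name__ == "__main__":
    # Telco Promotions: 100,000 records/sec, 6B/day, 10 ms/decision
    print(f"{records_per_day(100_000):,}")          # sustained-rate daily volume
    print(f"{average_rate(6e9):,.0f}")              # average rate implied by 6B/day
    print(decisions_in_flight(100_000, 10))
    # Homeland Security: 600,000 records/sec, 50B/day, up to 2 ms/decision
    print(f"{records_per_day(600_000):,}")
    print(decisions_in_flight(600_000, 2))
```

Under these assumptions, the Homeland Security rate sustained for a day gives roughly the quoted 50B records, while the Telco daily total implies an average rate below the quoted 100,000 records/sec, which would be consistent with that figure being a peak.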
Enabling Multiple Solutions & Appliances to Achieve a Smarter Planet
[Figure: building blocks labelled Peta2 Analytics Appliance, Reactive + Deep Analytics Platform, Big Analytics Ecosystem (Algorithms, Big Data, Skills), and Peta2 Data-centric System.]
Solutions:
• DeepEyes: Webcam Fusion
• DeepCurrent: Power Delivery
• DeepSafety: Police/Security
• DeepTraffic: Area Traffic Prediction
• DeepFriends: Social Network Monitor
• DeepWater: Water Management
• DeepBasket: Food Market Prediction
• DeepBreath: Air Quality Control
• DeepPulse: Political Polling
• DeepResponse: Emergency Coordination
• DeepThunder: Local Weather Prediction
• DeepSoil: Farm Prediction
Watson Today: Processes Unstructured Text & 200 Hypotheses / 3 seconds
Watson: 3,000 cores; 100 TFlops; 2 TB memory; ~200 kW
Evidence-Based Decision Support System:
• Answer: A large country in the Western Hemisphere whose capital has a similar name.
• Hypotheses generated from the “Answer” using the static data corpus: guess questions Q1, Q2 … Qi
• Statistical ensemble of 600 to 800 scoring engines (S1, S2, S3, … SN)
• ~30 machine learning models weigh the scores and produce a confidence for each question, 0<P<1
• The hypothetical question with the greatest confidence is chosen
• Question: What is Brazil?

Element                   Refresh Time
Data Corpus               2 Weeks
Hypothesis Engines        Weeks to Months
Scoring Engines           Weeks to Months
Decision Support Engine   4 Days
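The selection step described on this slide (scores are weighed, each candidate question gets a confidence between 0 and 1, and the most confident one is chosen) can be sketched as follows. This is not Watson's implementation: a single logistic model stands in for the ~30 machine learning models, three made-up scores stand in for the 600 to 800 scoring engines, and every name, weight and score below is hypothetical.

```python
import math

def confidence(scores, weights, bias=0.0):
    """Combine scoring-engine outputs into a confidence strictly between 0 and 1."""
    z = bias + sum(w * s for w, s in zip(weights, scores))
    return 1.0 / (1.0 + math.exp(-z))

def choose(candidates, weights, bias=0.0):
    """Return (candidate, confidence) for the candidate with the greatest confidence."""
    ranked = {name: confidence(s, weights, bias) for name, s in candidates.items()}
    best = max(ranked, key=ranked.get)
    return best, ranked[best]

# Hypothetical outputs of three scoring engines (S1..S3) per candidate question.
candidates = {
    "What is Brazil?": [0.9, 0.8, 0.7],
    "What is Canada?": [0.6, 0.1, 0.2],
    "What is Mexico?": [0.5, 0.7, 0.1],
}
weights = [1.5, 2.0, 1.0]  # illustrative weights, not learned from data

best, p = choose(candidates, weights, bias=-2.0)
print(best, round(p, 2))
```

In a real system the weights would be learned from training data rather than fixed by hand; the sketch only shows the shape of the weigh-then-pick-maximum step.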
Exascale Research and Development
Source: Exascale Research and Development – Request for Information, July 2011
Big Data Systems Require a Data-centric Architecture for Performance
Old Compute-centric Model:
• Data lives on disk and tape
• Move data to CPU as needed
• Deep storage hierarchy
New Data-centric Model:
• Data lives in persistent memory
• Many CPUs surround and use it
• Shallow/flat storage hierarchy
Enablers:
• Massive parallelism: manycore, FPGA
• Persistent memory: flash, phase change
Largest change in system architecture since the System/360. Huge impact on hardware, systems software, and application design.
Scale-in is the New Systems Battlefield
[Figure: system capacity (capability), from single device to device clusters (ticks 10, 100, 1K, 10K, 100K; Low, Med, High), plotted against system density (1/latency end-to-end; Low, Med, High, Extreme), bounded by physical limits. Labels on the chart include Exascale, Peta2 and Cloud Computing.]
• Scale-down: maximize feature density (atom transistor, atom storage)
• Scale-up: maximize device capacity (terabyte HDD, POWER 7)
• Scale-out: maximize system capacity (NAS, blade server)
• Scale-in: maximize system density, minimize end-to-end latency (flash SSD, 3D chips, FPGA, manycore, BPRAM/SCM, interconnect, in-memory DB, DAS)
Storage Class Memory - The Tipping Point for Data-centric Systems
HDD cost advantage continues (1/10 of SCM cost), but SCM dominates in performance (10,000x faster than HDD).

         Relative Cost   Relative Latency
DRAM     100             1
SCM      1               10
FLASH    15              1000
HDD      0.1             100000

SCM in 2015: $0.05 per GB, $50K per PB
[Figure labels: FLASH; (Phase Change); $0.10 / GB; $0.01 / GB]
Source: Chung Lam, IBM
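The cost and latency claims on this slide follow from its own numbers, as the short check below illustrates. It assumes decimal units (1 PB = 1,000,000 GB); the dictionary and function names are illustrative only.

```python
GB_PER_PB = 1_000_000  # assumption: decimal units, 1 PB = 10^6 GB

# (relative cost, relative latency) per technology, as given on the slide.
RELATIVE = {
    "DRAM": (100, 1),
    "SCM": (1, 10),
    "FLASH": (15, 1000),
    "HDD": (0.1, 100000),
}
COST, LATENCY = 0, 1

def cost_per_pb(dollars_per_gb):
    """Convert a price per GB into a price per PB."""
    return dollars_per_gb * GB_PER_PB

def ratio(a, b, column):
    """Ratio of technology a to technology b in the chosen column."""
    return RELATIVE[a][column] / RELATIVE[b][column]

if __name__ == "__main__":
    print(f"${cost_per_pb(0.05):,.0f} per PB")   # SCM at $0.05 per GB
    print(ratio("HDD", "SCM", COST))             # HDD cost relative to SCM
    print(ratio("HDD", "SCM", LATENCY))          # HDD latency relative to SCM
```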
Background: 3 Styles of Massively Parallel Systems
• Simulation (BlueGene): generative modeling, extreme physics; long running, small input, massive output; C/C++, Fortran, MPI, OpenMP
• Streaming (Streams): reactive analytics, extreme ingestion; data in motion: high velocity, mixed variety, high volume* (*over time); SPL, C, Java
• Hadoop/MapReduce (BigInsights): deep analytics, extreme scale-out; data at rest*: high volume, mixed variety, low velocity (*pre-partitioned); JAQL, Java; input data (on disk), mappers, reducers, output data
[Figure legend: ■ = compute node]
Fault-tolerant Hadoop Distributed File System (HDFS)
Source: Hadoop Overview, http://www.cloudera.com
Map Reduce Logical Data Flow
Source: O’Reilly, Hadoop – The Definitive Guide
Step 1: raw input records, one per line
0067011990999991950051507004...9999999N9+00001+99999999999...
0043011990999991950051512004...9999999N9+00221+99999999999...
0043011990999991950051518004...9999999N9-00111+99999999999...
0043012650999991949032412004...0500001N9+01111+99999999999...
0043012650999991949032418004...0500001N9+00781+99999999999...
Step 2: input to the map function as (byte offset, line) pairs
(0, 0067011990999991950051507004...9999999N9+00001+99999999999...)
(106, 0043011990999991950051512004...9999999N9+00221+99999999999...)
(212, 0043011990999991950051518004...9999999N9-00111+99999999999...)
(318, 0043012650999991949032412004...0500001N9+01111+99999999999...)
(424, 0043012650999991949032418004...0500001N9+00781+99999999999...)
Step 3: map output as (year, temperature) pairs
(1950, 0)
(1950, 22)
(1950, −11)
(1949, 111)
(1949, 78)
Step 4: input to the reduce function, sorted and grouped by key
(1949, [111, 78])
(1950, [0, 22, −11])
Step 5: reduce output, the maximum temperature per year
(1949, 111)
(1950, 22)
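The grouping and reduction stages of this data flow can be reproduced in a few lines of plain Python. This is a single-process sketch of the logic only, not Hadoop code: it starts from the (year, temperature) pairs emitted by the map stage, because the raw records on the slides are elided, and the function names `shuffle` and `reduce_max` are illustrative.

```python
from collections import defaultdict

# Map output as shown on the slides: (year, temperature) pairs.
map_output = [(1950, 0), (1950, 22), (1950, -11), (1949, 111), (1949, 78)]

def shuffle(pairs):
    """Group values by key and sort by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_max(key, values):
    """Reduce function: keep the maximum temperature seen for a year."""
    return key, max(values)

grouped = shuffle(map_output)
result = [reduce_max(year, temps) for year, temps in grouped]

print(grouped)  # [(1949, [111, 78]), (1950, [0, 22, -11])]
print(result)   # [(1949, 111), (1950, 22)]
```

In a real cluster the map and reduce functions run in parallel on many nodes and the framework performs the shuffle across the network; the sketch keeps everything in one list to make the key-value transformations visible.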
The Blue Gene/Q ASIC
Source: EDN News, Hot Chips: The Puzzle of Many Cores
The Blue Gene/Q Packaging Hierarchy
Source: The Register, IBM’s Blue Gene/Q Super Chip Grows 18th Core
Opportunity: Blue Gene Active Storage
BG/Q + Flash Memory => Blue Gene Active Storage (BGAS)

Per flash card:
Flash Capacity    320 GB
I/O Bandwidth     1.5 GB/s
IOPS              207,000+

Blue Gene/Q Active Storage Rack (512 BGQ flash cards), … scale it like BG/Q:
Nodes             512
Storage Cap       640 TB
I/O Bandwidth     768 GB/s
Random IOPS       100 Million
Compute           104 TF
Bi-Section BW     512 GB/s
BGAS Capabilities Per Rack
• 104 TeraFLOPS – 512 nodes, 8192 cores – 50% of Standard BG/Q System
• 512 GB/s Bi-Section Bandwidth - All-to-All Throughput of 2GB/s per Node
• 768 GB/s I/O bandwidth – 100TB Sort in ~330 sec (vs 10,000 sec today)
• 100 Million IOPS – Equivalent to order 1 Million Disks
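Two of the per-rack figures above follow directly from the per-card numbers, which the sketch below works through. It assumes one flash card per node and decimal units (100 TB = 100,000 GB); the single-pass time is a lower bound computed here for orientation and is not a figure from the talk.

```python
NODES = 512                   # nodes (and flash cards) per rack, from the slide
CARD_BANDWIDTH_GB_S = 1.5     # I/O bandwidth per flash card, from the slide
CARD_IOPS = 207_000           # IOPS per flash card, from the slide

def rack_bandwidth_gb_s(nodes=NODES):
    """Aggregate I/O bandwidth, assuming one card per node and linear scaling."""
    return nodes * CARD_BANDWIDTH_GB_S

def rack_iops(nodes=NODES):
    """Aggregate random IOPS under the same assumptions."""
    return nodes * CARD_IOPS

def single_pass_seconds(data_tb, bandwidth_gb_s):
    """Time to stream the data set through the rack once (decimal TB to GB)."""
    return data_tb * 1000 / bandwidth_gb_s

if __name__ == "__main__":
    bw = rack_bandwidth_gb_s()
    print(bw)                                   # 768.0 GB/s
    print(f"{rack_iops():,}")                   # 105,984,000, i.e. about 100 million
    print(round(single_pass_seconds(100, bw)))  # seconds for one pass over 100 TB
```

A single pass over 100 TB takes about 130 seconds at this bandwidth, so the quoted ~330 second sort would correspond to roughly two and a half passes over the data. That reading is an inference from the numbers, not something the slide states.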
Research and Development Challenges:
• Packaging: integrate Flash today, tomorrow PCM, Memristor, Racetrack, etc.
• System Software: Persistent Memory Management, k-v Store on BGAS
• Resilience: Single Path to Storage, BG/Q Network for General Workloads
• Integration: System Management, Middleware and Frameworks, Applications
NAND Flash Challenges
1. Need to erase before writing
2. Data retention errors
3. Limited number of writes
4. Management of initial and runtime bad blocks
5. Data errors caused by read and write disturb
» Factors that influence reliability, performance, write endurance:
• Use of Single Level Cell (SLC) and Multi Level Cell (MLC) NAND technology
• Wear out mechanism that limits service life: Wear-leveling algorithm
• Ensuring data integrity through bad block management techniques
• Use of error detection and correction algorithms
• Write amplification
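Two of the factors listed above, wear leveling with bad block management and write amplification, can be illustrated with a toy model. This is a minimal sketch under simplifying assumptions (whole-block allocation, a fixed bad-block list, no garbage collection) and does not describe any particular flash controller; the class and function names are hypothetical.

```python
class WearLeveler:
    """Toy wear leveling: always reuse the usable block with the fewest erases."""

    def __init__(self, num_blocks, bad_blocks=()):
        bad = set(bad_blocks)  # initial bad blocks are never allocated
        self.erase_counts = {b: 0 for b in range(num_blocks) if b not in bad}

    def allocate(self):
        """Pick the least-worn block, erase it (erase-before-write), return its id."""
        block = min(self.erase_counts, key=self.erase_counts.get)
        self.erase_counts[block] += 1
        return block

    def retire(self, block):
        """Runtime bad block management: stop using a block that has failed."""
        self.erase_counts.pop(block, None)


def write_amplification(flash_bytes_written, host_bytes_written):
    """Ratio of bytes physically written to flash to bytes the host asked to write."""
    return flash_bytes_written / host_bytes_written


if __name__ == "__main__":
    leveler = WearLeveler(num_blocks=4, bad_blocks=[2])
    print([leveler.allocate() for _ in range(6)])  # [0, 1, 3, 0, 1, 3]
    print(leveler.erase_counts)                    # {0: 2, 1: 2, 3: 2}
    print(write_amplification(300, 100))           # 3.0
```

Spreading erases evenly matters because each block tolerates only a limited number of writes; a write amplification above 1 means the device wears faster than the host workload alone would suggest.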
Gartner’s Hype Cycle
Thank you very much for your attention.