CS 644: Introduction to Big Data
Chapter 8. Enabling Big-data Scientific Workflows in High-performance Networks
Chase Wu
New Jersey Institute of Technology
Outline
• Introduction
• Challenges & Objectives
• A Three-layer Architecture Solution
  - Enabling Technologies: networking and computing
• Networking for Big Data
  - Software-defined Networking (SDN)
  - High-performance Networking (HPN)
• Computing for Big Data
  - Workflow Management and Optimization
• Simulation/Experimental Results
• Conclusion
Introduction
• Supercomputing for big-data science
  - Astrophysics
  - Computational biology
  - Climate research
  - Flow dynamics
  - Computational materials
  - Fusion simulation
  - Neutron sciences
  - Nanoscience
Terascale Supernova Initiative (TSI)
• Collaborative project
  - Supernova explosion
• TSI simulation
  - 1 terabyte a day with a small portion of parameters
  - From TSI to PSI
• Transfer to remote sites
  - Interactive distributed visualization
  - Collaborative data analysis
  - Computation monitoring
  - Computation steering

[Figure: a supercomputer or cluster connected to a client through a visualization channel, a visualization control channel, and a computation steering channel]
Challenges for Extreme-scale Scientific Applications
• A typical way of conducting research
  - Run simulation code on a supercomputer in batch mode
    Ø One day: 1 terabyte datasets
  - Move datasets to HPSS
    Ø About 8 hours
  - Transfer datasets to remote sites over the Internet
    Ø TCP-based transfer tools: up to one week
  - Filter out data of interest
  - Partition the dataset for parallel processing
  - Extract geometry data
  - Generate images on a rendering engine
  - Display results on desktop, laptop, powerwall, etc.
    Ø Visualization: several hours or days
• Start over if any parameter values are not set appropriately!
Challenges in Modern Sciences
• BIG DATA: from terabytes (T) to petabytes (P), exabytes (E), zettabytes (Z), yottabytes (Y), and beyond…
  - Simulation
    • Astrophysics, climate modeling, combustion research, etc.
  - Experimental
    • Spallation Neutron Source, Large Hadron Collider, etc.
  - Observational
    • Large-scale sensor networks, astronomical image data (Dark Energy Camera), etc.

No matter which type of data is considered, we need an end-to-end workflow solution for data transfer, processing, and analysis!
Big-data Scientific Workflows
• Require massively distributed resources
  - Hardware: computing facilities, storage systems, special rendering engines, display devices (tiled display, powerwall, etc.), network infrastructure, etc.
  - Software: domain-specific data analytics/processing tools, programs, etc.
  - Data type: real-time, archival
• Feature different complexities
  - Simple case: linear pipeline (a special case of DAG)
  - Complex case: DAG-structured graph
• Support different application types
  - Interactive: minimize total end-to-end delay for fast response
  - Streaming: maximize frame rate to achieve smooth data flow
Ultimate Goals
- Support distributed workflows in heterogeneous environments
- Optimize workflow performance to meet various user requirements
  Ø Delay, throughput, reliability, etc.
  Ø Remote visualization, online computational monitoring and steering, etc.
- Make the best use of computing and networking resources
Solution: A Three-layer Architecture
Enabling Technologies
• Three layers
  - Top: abstract scientific workflow
  - Middle: virtual overlay network (grid, cloud)
  - Bottom: physical high-performance network
• The top and bottom layers meet at the middle layer
  - From bottom to middle: resource abstraction
    • Bandwidth scheduling
    • Performance modeling and prediction
  - From top to middle: workflow mapping
    • Optimization: where to execute modules?
• Workflow execution
  - Actual data transfer: transport control
  - Actual module running: job scheduling
Networking Requirements
• Provision dedicated channels to meet different transport objectives
  - High bandwidths
    • Multiples of 10 Gbps up to terabit networking
    • Support bulk data transfers
  - Stable bandwidths
    • 100s of Mbps
    • Support interactive control operations
• Why not the Internet?
  - Only the backbone has high bandwidths (last-mile problem)
  - Packet-level resource sharing
  - Best-effort IP routing
  - TCP: hard to sustain 10s of Gbps or to stabilize
An Overview of TCP/IP Stack
Software-Defined Networking
• The Concept of Virtualization
• Virtualization of Computing
• Virtualization of Networking
• Software-Defined Network
• Possible Directions
Concept of Virtualization
• Decoupling HW/SW
• Abstraction and layering
• Using and demanding, but not owning or configuring
• Resource pool: flexible to slice, resize, combine, and distribute
• A degree of automation by software
Benefits of Virtualization
• An analogy: owning a huge house
  - Real estate, immovable property
  - Does not generate cash or income
• How to gain more profit?
  - Divide the huge house into suites and RENT them to people!
  - Renting suites: using but not owning
  - Transform a static investment into cash generators!!!
Virtualization of Computing
• Partitioning one physical machine
  - Virtual instances: running concurrently, sharing resources
• Hypervisor: Virtual Machine Monitor (VMM)
  - A software layer that presents an abstraction of the physical resources

Key Factor of Virtualization
Networks are Hard to Manage
• Operating a network is expensive
  - More than half the cost of a network
  - Yet, operator error causes most outages
• Buggy software in the equipment
  - Routers with 20+ million lines of code
  - Cascading failures, vulnerabilities, etc.
• The network is "in the way"
  - Especially a problem in data centers
  - …and home networks
Traditional Computer Networks
• Management plane: collect measurements and configure the equipment
Software-Defined Networking (SDN)
• Logically-centralized control: smart, slow
• API to the data plane (e.g., OpenFlow)
• Switches: dumb, fast
OpenFlow Protocol

[Figure: SDN architecture with the OpenFlow protocol. Networking applications (bandwidth-on-demand, dynamic optical bypass, unified recovery, traffic engineering, application-aware QoS) run on a network operating system with a unified control plane; a virtualization (slicing) plane and a switch abstraction expose an underlying data plane of packet, wavelength, time-slot, multi-layer, and packet & circuit switches. SDN provides choices.]
Architecture

[Figure: Onix / Network OS architecture. Control plane applications operate on a logical forwarding plane; a network hypervisor maps logical states and abstractions to real states; the network OS provides a distributed-system abstraction and an API over a Network Information Base, and distributes control commands to configure switches via OpenFlow.]
Switch Forwarding Pipeline
• As packets/flows traverse the network, they move in both the logical and the physical forwarding planes; the logical context is carried along with them.
Data Plane: Simple Packet Handling
• Simple packet-handling rules
  - Pattern: match packet header bits
  - Actions: drop, forward, modify, send to controller
  - Priority: disambiguate overlapping patterns
  - Counters: # bytes and # packets
• Example rules:
  1. src=1.2.*.*, dest=3.4.5.* → drop
  2. src=*.*.*.*, dest=3.4.*.* → forward(2)
  3. src=10.1.2.3, dest=*.*.*.* → send to controller
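The rule semantics above can be sketched in a few lines. This is a minimal illustration (not a real OpenFlow implementation): each table entry carries a wildcard pattern, a priority, and an action; the highest-priority matching rule wins, and a table miss goes to the controller. The function names and table layout are assumptions made for the example.

```python
def matches(pattern, addr):
    """Wildcard match on dotted-quad patterns: '3.4.*.*' matches '3.4.5.6'."""
    for p, a in zip(pattern.split("."), addr.split(".")):
        if p != "*" and p != a:
            return False
    return True

# (priority, src pattern, dst pattern, action) -- higher priority wins
flow_table = [
    (30, "1.2.*.*", "3.4.5.*", "drop"),
    (20, "*.*.*.*", "3.4.*.*", "forward(2)"),
    (10, "10.1.2.3", "*.*.*.*", "send_to_controller"),
]

def lookup(src, dst):
    """Return the action of the highest-priority rule matching (src, dst)."""
    candidates = [r for r in flow_table
                  if matches(r[1], src) and matches(r[2], dst)]
    if not candidates:
        return "send_to_controller"   # table miss: punt to the controller
    return max(candidates, key=lambda r: r[0])[3]
```

Note how priority resolves the overlap between rules 1 and 2: a packet from 1.2.3.4 to 3.4.5.6 matches both, but is dropped because rule 1 has the higher priority.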
Unifies Different Kinds of Boxes
• Router
  - Match: longest destination IP prefix
  - Action: forward out a link
• Switch
  - Match: destination MAC address
  - Action: forward or flood
• Firewall
  - Match: IP addresses and TCP/UDP port numbers
  - Action: permit or deny
• NAT
  - Match: IP address and port
  - Action: rewrite address and port
Controller: Programmability
• Controller applications run on a Network OS
• Events from switches: topology changes, traffic statistics, arriving packets
• Commands to switches: (un)install rules, query statistics, send packets
Example OpenFlow Applications
• Dynamic access control
• Seamless mobility/migration
• Server load balancing
• Network virtualization
• Using multiple wireless access points
• Energy-efficient networking
• Adaptive traffic monitoring
• Denial-of-Service attack detection

See http://www.openflow.org/videos/
Example: Dynamic Access Control
• Inspect the first packet of a connection
• Consult the access control policy
• Install rules to block or route the traffic
Seamless Mobility/Migration
• See the host send traffic at its new location
• Modify rules to reroute the traffic
Server Load Balancing
• Pre-install the load-balancing policy
• Split traffic based on source IP (e.g., src=0* to one replica, src=1* to another)
Network Virtualization
• Partition the space of packet headers among multiple controllers (e.g., Controller #1, Controller #2, Controller #3)
OpenFlow in the Wild
• Open Networking Foundation
  - Google, Facebook, Microsoft, Yahoo, Verizon, Deutsche Telekom, and many other companies
• Commercial OpenFlow switches
  - HP, NEC, Quanta, Dell, IBM, Juniper, …
• Network operating systems
  - NOX, Beacon, Floodlight, Nettle, ONIX, POX, Frenetic
• Network deployments
  - Eight campuses and two research backbone networks
  - Commercial deployments (e.g., Google backbone)
Controller Delay and Overhead
• The controller is much slower than the switch
• Processing packets at the controller leads to delay and overhead
• Need to keep most packets in the "fast path"
Distributed Controller
• For scalability and reliability
• Partition and replicate state across multiple Network OS instances, each running controller applications
High-performance Networks
• Production and testbed networks
  - UltraScience Net
  - ESnet OSCARS: offers MPLS tunnels and VLAN virtual circuits
  - Internet2 ION: offers MPLS tunnels and VLAN virtual circuits
  - UCLP: User Controlled Light Paths
  - CHEETAH: Circuit-switched High-speed End-to-End Transport Architecture
  - DRAGON: Dynamic Resource Allocation via GMPLS Optical Networks
UltraScience Net – In a Nutshell
• Experimental network research testbed
Control Plane
• Responsible for:
  1. Reserving link bandwidths
  2. Setting up end-to-end network paths
  3. Releasing resources when tasks are completed
Bandwidth Scheduling
• Bandwidth Scheduler: central component of the control plane
  - Computes paths and allocates link bandwidths
• Scheduling in USN
  - Fixed slot (start time, end time, target bandwidth): extension of Dijkstra's algorithm
  - All slots (duration, target bandwidth): extension of the Bellman-Ford algorithm
  - Both are solvable by polynomial-time algorithms
Network Model for Bandwidth Scheduling
• Graph: G(V, E)
• Link bandwidth
  - Segmented constant functions of time
  - Time-bandwidth (TB) segment: (t[i], t[i+1], b[i])
• Aggregated TB (ATB) for all links

[Figure: residual bandwidth of a link as a step function of time over slots bounded by t[0], t[1], t[2], t[3]]
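The TB representation above can be made concrete as a list of (start, end, bandwidth) segments per link; aggregating two links along a path aligns slot boundaries and takes the minimum available bandwidth in each resulting slot. A small sketch, assuming segments are sorted and non-overlapping (function names are illustrative):

```python
def bw_at(tb, t):
    """Residual bandwidth of a link at time t, given its TB segment list."""
    for start, end, b in tb:
        if start <= t < end:
            return b
    return 0.0   # no reservation window covers t

def aggregate(tb1, tb2):
    """Combine two links' TB lists into the path's TB list (min per slot)."""
    bounds = sorted({t for seg in tb1 + tb2 for t in seg[:2]})
    return [(lo, hi, min(bw_at(tb1, lo), bw_at(tb2, lo)))
            for lo, hi in zip(bounds, bounds[1:])]
```

For example, a link offering 10 units until t=4 and 2 afterward, aggregated with one offering 5 until t=2 and 8 afterward, yields slots (0,2,5), (2,4,8), (4,8,2).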
4 Types of Scheduling Problems (TON'13)
• Given: G = (V, E), ATB, vs, vd, data size δ
• Objective: minimize data transfer end time
• Fixed Path with Fixed Bandwidth (FPFB)
  - Compute a fixed path from vs to vd with a constant (fixed) bandwidth
• Fixed Path with Variable Bandwidth (FPVB)
  - Compute a fixed path from vs to vd with varying bandwidths across multiple time slots
• Variable Path with Fixed Bandwidth (VPFB)
  - Compute a set of paths from vs to vd at different time slots with the same (fixed) bandwidth
• Variable Path with Variable Bandwidth (VPVB)
  - Compute a set of paths from vs to vd at different time slots with varying bandwidths across multiple time slots
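To give a feel for the FPFB flavor, here is a simplified sketch (not the paper's algorithm): restricted to a single time slot, the best fixed path is the one with the maximum bottleneck bandwidth, found by a widest-path variant of Dijkstra's algorithm; the transfer of data size δ then finishes at t_start + δ / bottleneck. Graph encoding and function names are assumptions made for the example.

```python
import heapq

def widest_path_bw(graph, src, dst):
    """graph: {u: {v: bandwidth}}. Max bottleneck bandwidth from src to dst."""
    best = {src: float("inf")}
    heap = [(-best[src], src)]          # max-heap via negated widths
    while heap:
        neg_w, u = heapq.heappop(heap)
        w = -neg_w
        if u == dst:
            return w
        if w < best.get(u, 0):
            continue                    # stale heap entry
        for v, b in graph.get(u, {}).items():
            cand = min(w, b)            # path width = min edge bandwidth
            if cand > best.get(v, 0):
                best[v] = cand
                heapq.heappush(heap, (-cand, v))
    return 0.0

def fpfb_end_time(graph, src, dst, delta, t_start=0.0):
    """Transfer end time over the widest fixed path within one slot."""
    bw = widest_path_bw(graph, src, dst)
    return t_start + delta / bw if bw > 0 else float("inf")
```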
VPFB & VPVB
• Multiple paths are used in sequential order
• Path switching incurs a delay (overhead) τ
  - Negligible switching delay (τ = 0): VPFB-0, VPVB-0
  - Non-negligible switching delay (τ > 0): VPFB-1, VPVB-1
• When τ is large enough, VPFB-1 reduces to FPFB and VPVB-1 to FPVB
Problem Features
• FPFB is the most stringent; VPVB is the most flexible
• FPFB and VPFB restrict the bandwidth
  - Not always optimal to start the data transfer immediately
  - Suited for transport methods with a fixed rate: FRTP, PLUT, Tsunami, Hurricane, RBUDP
• FPVB and VPVB use variable bandwidth
  - Always start immediately
  - Suited for transport methods with a dynamically adapted rate: SABUL, RAPID, RUNAT, improved TCP
Complexity and Algorithm
• An optimal algorithm for each polynomially solvable problem
• A heuristic for each NP-complete problem
Workflow Mapping
Cost Models
• Workflow: a computing workflow consists of modules w_0, w_1, …, w_{m-1}. Module w_j (j = 1, …, m−1) receives data of size z_{j−1}, has computational complexity λ(w_j), and sends data of size z_j.
• Network: a computer network is modeled as an arbitrary directed graph G = (V, E) with |V| = n computing nodes v_0, v_1, …, v_{n−1} interconnected by directed communication links. Node v_k has processing power p_k and failure rate f_k; link l_{h,k} has bandwidth b_{h,k}, minimum link delay d_{h,k}, and failure rate f_{h,k}.
• Basic costs:
  - Message transmission time of data of size z over link l_{i,j}: d_{i,j} + z / b_{i,j}
  - Computing time of module w_i on node v_j: λ(w_i) · z_{w_i} / p_j
• Objectives: map modules to nodes to achieve minimum end-to-end delay (MED) or maximum frame rate (MFR).
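The two basic costs above can be checked numerically. A minimal sketch (function names and units are illustrative; `complexity` plays the role of λ(w_i)):

```python
def transmission_time(z, d_ij, b_ij):
    """Time to send z data units over a link with delay d_ij and bandwidth b_ij."""
    return d_ij + z / b_ij

def computing_time(z, complexity, p_j):
    """Time for a module of complexity lambda to process z units on node power p_j."""
    return complexity * z / p_j
```

For instance, sending 1000 units over a link with delay 0.5 and bandwidth 100 takes 10.5 time units; processing them with complexity 2 on a node of power 400 takes 5.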
• Workflow mapping and execution conditions
  - Single-node mapping: each module is mapped to exactly one node,
    Σ_{v ∈ V_c} x_{w,v} = 1, ∀w ∈ V_w,
    where x_{w,v} = 1 if w is mapped to v and x_{w,v} = 0 otherwise.
  - Module execution precedence: a module starts only after its incoming data transfer finishes,
    t^s_{w_j} ≥ t^f_{e_{i,j}}, ∀w_j ∈ V_w, e_{i,j} ∈ E_w.
  - Data transfer precedence: a data transfer starts only after its source module finishes,
    t^s_{e_{i,j}} ≥ t^f_{w_i}, ∀w_i ∈ V_w, e_{i,j} ∈ E_w.
  - Fair node sharing: the computing time of module w on node v is
    T_comp(w, v) = z_w · λ(w) · A(w, v) / p_v, ∀w ∈ V_w, v ∈ V_c,
    where A(w, v) = ∫_{t^s_w}^{t^f_w} α_v(t) dt and
    α_v(t) = Σ_{w ∈ V_w : (t − t^s_w)(t^f_w − t) ≥ 0} x_{w,v}
    counts the modules concurrently running on node v at time t.
  - Fair link sharing: the transfer time of edge e_{i,j} over link l_{h,k} is
    T_tran(e_{i,j}, l_{h,k}) = d_{h,k} + z_{i,j} · B(e_{i,j}, l_{h,k}) / b_{h,k}, ∀e_{i,j} ∈ E_w, l_{h,k} ∈ E_c,
    where B(e_{i,j}, l_{h,k}) = ∫_{t^s_{e_{i,j}}}^{t^f_{e_{i,j}}} β_{l_{h,k}}(t) dt and
    β_{l_{h,k}}(t) = Σ_{e_{i,j} ∈ E_w : (t − t^s_{e_{i,j}})(t^f_{e_{i,j}} − t) ≥ 0} x_{w_i,v_h} · x_{w_j,v_k}
    counts the data transfers concurrently sharing link l_{h,k} at time t.
• Performance Metrics
  - End-to-end delay (ED): when the critical path (CP) of q modules is mapped to a network path, with module w_{P[i]} placed on node v_{g[i]}:
    T_ED(P) = Σ T_comp + Σ T_tran
            = Σ_{i=0}^{q−1} z_{w_{P[i]}} · λ(w_{P[i]}) · A(w_{P[i]}, v_{g[i]}) / p_{g[i]}
              + Σ_{i=0}^{q−2} ( d_{g[i],g[i+1]} + z_{P[i],P[i+1]} · B(e_{P[i],P[i+1]}, l_{g[i],g[i+1]}) / b_{g[i],g[i+1]} )
  - Frame rate: the inverse of the global bottleneck (BN); for workflow G_w mapped to network G_c:
    T_BN(G_w, G_c) = max( max_{w ∈ V_w mapped to v ∈ V_c} T_comp(w, v),
                          max_{e_{i,j} ∈ E_w mapped to l_{j',k'} ∈ E_c} T_tran(e_{i,j}, l_{j',k'}) )
  - Overall failure rate:
    F = 1 − ( Π_{w_i ∈ V_w mapped on v_h ∈ V_c} (1 − f_h) ) · ( Π_{e_{i,j} ∈ E_w mapped on l_{h,k} ∈ E_c} (1 − f_{h,k}) )
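The overall failure-rate formula can be computed directly: a mapping succeeds only if every used node and every used link survives, so F is one minus the product of the survival probabilities. A minimal sketch (the function name is illustrative):

```python
def overall_failure_rate(node_rates, link_rates):
    """F = 1 - prod(1 - f_node) * prod(1 - f_link) over all used resources."""
    survival = 1.0
    for f in list(node_rates) + list(link_rates):
        survival *= 1.0 - f
    return 1.0 - survival
```

For example, one node with failure rate 0.1 and one link with failure rate 0.2 give F = 1 − 0.9 × 0.8 = 0.28.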
• Objective Functions
  - Minimum End-to-end Delay (MED): minimize T_ED over all possible mappings, such that F ≤ F̄
  - Maximum Frame Rate (MFR): minimize T_BN over all possible mappings (equivalent to maximizing the frame rate), such that F ≤ F̄
Pipeline Mapping -- a special case of DAG
• Problem categories
  - MED
    Ø No Node Reuse (MED-NNR)
    Ø Contiguous Node Reuse (MED-CNR)
    Ø Arbitrary Node Reuse (MED-ANR)
  - MFR
    Ø No Node Reuse or Share (MFR-NNR)
    Ø Contiguous Node Reuse and Share (MFR-CNR)
    Ø Arbitrary Node Reuse and Share (MFR-ANR)

[Figure: three mappings of a pipeline M0, M1, …, Mn-1 from source Vs to destination Vd over network nodes V1, …, Vn-2, illustrating no node reuse, contiguous node reuse, and arbitrary node reuse]
Problem Category and Complexity

Objective Function       | No Node Reuse | Contiguous Node Reuse | Arbitrary Node Reuse
Minimum End-to-end Delay | NP-complete   | NP-complete           | Polynomial (Dyn. Prog.)
Maximum Frame Rate       | NP-complete   | NP-complete           | NP-complete

• MED-ANR is polynomially solvable
  - Dynamic programming-based solution
• MED/MFR-NNR/CNR are NP-complete
  - Reduction from DISJOINT-CONNECTING-PATH (DCP)
• MFR-ANR is NP-complete
  - Reduction from Widest-path with Linear Capacity Constraints (WLCC)
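The dynamic-programming idea behind MED-ANR can be illustrated on a linear pipeline. This is a stripped-down sketch under simplifying assumptions (no fair node/link sharing, no failure-rate constraint, fully specified transfer costs); the encoding and function name are illustrative, not the paper's algorithm. With arbitrary node reuse, the state is simply "minimum delay with the current module on node v", updated module by module.

```python
def med_anr(modules, nodes, links):
    """modules: list of (complexity, out_size); nodes: {v: power};
    links: {(u, v): (delay, bandwidth)}. Returns the minimum end-to-end delay."""
    INF = float("inf")

    def tran(u, v, z):
        if u == v:                      # co-located modules: no transfer cost
            return 0.0
        if (u, v) not in links:
            return INF
        d, b = links[(u, v)]
        return d + z / b                # transmission time: delay + size/bandwidth

    # dp[v] = best delay so far with the current module mapped to node v
    c0, _ = modules[0]
    dp = {v: c0 / p for v, p in nodes.items()}
    for j in range(1, len(modules)):
        cj = modules[j][0]
        z_prev = modules[j - 1][1]      # data sent from module j-1 to module j
        dp = {v: min(dp[u] + tran(u, v, z_prev) for u in nodes) + cj / nodes[v]
              for v in nodes}
    return min(dp.values())
```

With arbitrary reuse the optimum substructure holds per node, giving O(m · n²) time; it is the no-reuse and contiguous-reuse variants that become NP-complete.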
From Special to General: DAG Mapping (TC'14)
• Mapping algorithm for MED: Recursive Critical Path algorithm for MED with Failure Rate Constraint (RCP-F)
• Mapping algorithm for MFR: Layer-oriented DP algorithm for MFR with Failure Rate Constraint (LDP-F)
  - Sort the DAG-structured workflow in a topological order
  - Map computing modules to network nodes on a layer-by-layer basis, considering:
    • Module dependency in the workflow
    • Network connectivity
    • Node and link failure rates
• LDP-F recurrence (computed for j = 1 to m−1 and all candidate nodes v_i): the partial bottleneck cost T_{w_j,v_i} of mapping module w_j to node v_i is taken over all predecessors w_h ∈ pre(w_j), mapped to nodes v_u:
  Ø if v_u ≠ v_i and the updated failure rate F′ = 1 − (1 − F)(1 − f_{u,i})(1 − f_i) satisfies F′ ≤ F̄:
    T_{w_j,v_i} = max( T_{w_h,v_u}, d_{u,i} + z_{h,j} · B(e_{h,j}, l_{u,i}) / b_{u,i}, z_{w_j} · λ(w_j) · A(w_j, v_i) / p_i )
  Ø if v_u = v_i (no transfer needed) and F′ = 1 − (1 − F)(1 − f_i) satisfies F′ ≤ F̄:
    T_{w_j,v_i} = max( T_{w_h,v_u}, z_{w_j} · λ(w_j) · A(w_j, v_i) / p_i )
[Figure: LDP-F illustration. Left: a DAG-structured workflow with modules w0, w1, …, wm−1 sorted into layers 0 through l−1 in a topological order. Right: the computer network with nodes vs = v0, v1, …, vd = vn−1. Bottom: the DP table with separated layers, storing partial solutions T0,0, T1,1, T2,2, T2,4, T3,3, T4,5, …, Ty,l, Tp,t, …, Tq,s, …, Tn−1,m−1 as modules are mapped layer by layer.]
A Prototype System: Distributed Remote Intelligent Visualization Environment (DRIVE)
• Two examples in the visualization of large-scale scientific applications
  - Jet air flow dynamics (pressure, raycasting)
  - TSI explosion (density, raycasting)
A Production System (JOGC'13): Scientific Workflow Automation and Management Platform (SWAMP)

Use Case: Spallation Neutron Source (SNS)
• SNS Workflow (abstract)
• SNS Workflow (concrete)
• SNS Workflow Results