Network for the Large-scale Hadoop cluster at Yahoo! JAPAN
2016/10/27
Kai Fukazawa, Yahoo Japan Corporation
Agenda
  Hadoop and Related Network
  Yahoo! JAPAN’s Hadoop Network Transition
  Network Related Problems and Solutions
    Network Related Problems
    Network Requirements of The Latest Cluster
    Adopted IP CLOS Network for Solving Problems
  Yahoo! JAPAN’s IP CLOS Network
    Architecture
    Performance Tests
    New Problems
  Future Plan
Hadoop and Related Network
Hadoop has various communication events
  Heartbeat
  Reports (Job/Block/Resource)
  Block Data Transfer
Traffic direction labels on the slides: North/South, East/West
Traffic volume labels on the slides: High, Low
“HDFS Architecture”. Apache Hadoop. http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html. (10/06/2016).
“Google I/O 2011: App Engine MapReduce”. (05/11/2011). Retrieved https://www.youtube.com/watch?v=EIxelKcyCC0. (10/06/2016).
“Introduction to Facebook‘s data center fabric”. (11/14/2014). Retrieved https://www.youtube.com/watch?v=mLEawo6OzFM. (10/06/2016).
Oversubscription
  Commonly expressed as the ratio of the bandwidth required to the bandwidth available
  Server side: 1Gbps NIC x 40 Nodes = 40Gbps
  UpLink: 10Gbps
  Oversubscription = 40 : 10 = 4 : 1
“Hadoop Operations by Eric Sammer (O’Reilly). Copyright 2012 Eric Sammer, 978-1-449-32705-7.”
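The oversubscription arithmetic on the slide above can be reproduced with a few lines of Python. This is a minimal sketch; the function names `oversubscription` and `as_ratio` are made up for this illustration and do not come from the talk.

```python
from fractions import Fraction


def oversubscription(nodes: int, nic_gbps: int, uplink_gbps: int) -> Fraction:
    """Total server-facing bandwidth divided by uplink bandwidth."""
    return Fraction(nodes * nic_gbps, uplink_gbps)


def as_ratio(ratio: Fraction) -> str:
    """Format a ratio in the 'N : 1' style used on the slides."""
    return f"{float(ratio):g} : 1"


# Example from the slide: 40 nodes with 1Gbps NICs behind a 10Gbps uplink.
print(as_ratio(oversubscription(nodes=40, nic_gbps=1, uplink_gbps=10)))  # 4 : 1
```

The same function can be applied to any rack layout by substituting its node count, NIC speed, and uplink capacity.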
Yahoo! JAPAN’s Hadoop Network Transition
[Chart: Cluster Volume (PB), axis 0 to 80, by cluster release]
  Cluster1 (Jun. 2011)
  Cluster2 (Jan. 2013)
  Cluster3 (Apr. 2014)
  Cluster4 (Dec. 2015)
  Cluster5 (Jun. 2016)
Cluster1
  Stack Architecture
  4 Switches/Stack
  Up to ~10 switches
  Nodes/Rack: 90 Nodes
  Server NIC: 1Gbps
  UpLink: 20Gbps
  Oversubscription: 4.5 : 1
Cluster2
  Spanning Tree Protocol
  Nodes/Rack: 40 Nodes
  Server NIC: 1Gbps
  UpLink: 10Gbps
  Oversubscription: 4 : 1
  Diagram label: Blocking
Cluster3
  L2 Fabric/Channel
  Nodes/Rack: 40 Nodes
  Server NIC: 1Gbps
  UpLink: 20Gbps
  Oversubscription: 2 : 1
  Diagram labels: L2 Fabric, 20Gbps, 20Gbps
Cluster4
  L2 Fabric/Channel
  Nodes/Rack: 16 Nodes
  Server NIC: 10Gbps
  UpLink: 80Gbps
  Oversubscription: 2 : 1
  Diagram labels: L2 Fabric, 80Gbps, 80Gbps
Yahoo! JAPAN’s Hadoop Network Transition
Release    Volume    #Nodes/Switch   NIC       Oversubscription
Cluster1   3PByte    90              1Gbps     4.5:1
Cluster2   20PByte   40              1Gbps     4:1
Cluster3   38PByte   40              1Gbps     2:1
Cluster4   58PByte   16              10Gbps    2:1
Cluster5   75PByte   ?               ?Gbps     ?:?
Network Related Problems
And Solutions
Network Related Problems
Effect of switch failure in the Stack Architecture
Load on the switch due to BUM Traffic
Limitations for the DataNode Decommission
Limitations for the Scale-out
Effect of switch failure in the Stack Architecture
  One of the switches forming the Stack failed
  The failure affected the other switches in the same Stack
  Communication was interrupted among 90 nodes (5 racks)
  Result: insufficient computing resources and processing stoppage
Load on the switch due to BUM (Broadcast, Unknown unicast, Multicast) Traffic
  L2 Fabric with 4400 Nodes
  ARP traffic from the servers increases the CPU load on the core switch
  Mitigation: tuning of the ARP cache entry timeout
  The underlying problem is the large network address range
Limitations for the DataNode Decommission
  The impact on jobs has to be considered
  The number of nodes being decommissioned has to be limited
Limitations for the Scale-out
  Stack Architecture: up to ~10 switches
  L2 Fabric Architecture: depends on the number of chassis
Network Requirements of The Latest Cluster
  120~200 Racks
  Scale-out possible up to 10000 Nodes
  100~200Gbps UpLink/Rack
  10Gbps NIC Server
  20 Nodes/Rack
  DataCenter located in US
How to solve these problems?
We adopted IP CLOS Network!
Adopted IP CLOS Network For Solving Problems
Over-the-top (OTT) companies such as Google, Facebook, Amazon, and Yahoo have adopted this DC network architecture
“Introducing data center fabric, the next-generation Facebook data center network”. Facebook Code. https://code.facebook.com/posts/360346274145943/introducing-data-center-fabric-the-next-generation-facebook-data-center-network/. (10/06/2016).
Improved scalability
Improved high availability
Copes with the increase in East-West traffic
Reduction in operating cost
Yahoo! JAPAN’s IP CLOS Network
Architecture
Box Switch Architecture
  No limitation on Scale-out
  Requires many switches
  [Diagram layers: Spine, Leaf, ToR]
[Diagram labels: Internet, Core Router, Spine, Leaf, Layer3, Layer2]
Why was this architecture adopted?
  Reduces the number of items to be managed (IP addresses, cables, interfaces, BGP neighbors, ...)
  Overcomes physical constraints, such as the one-floor limit
  Reduces cost
ECMP
Between Spine and Leaf is BGP
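ECMP spreads flows over the equal-cost paths that the Leaf learns toward the Spine. The sketch below illustrates the general idea of per-flow hashing in Python. It is not the hash used by any particular switch: the next-hop names, the choice of SHA-256, and the function `pick_uplink` are assumptions made only for this illustration.

```python
import hashlib

# Hypothetical next hops; real names and counts depend on the deployment.
UPLINKS = ["spine-1", "spine-2", "spine-3", "spine-4"]


def pick_uplink(src_ip, dst_ip, src_port, dst_port, proto="tcp", uplinks=UPLINKS):
    """Pick one equal-cost next hop from a hash of the flow 5-tuple.

    Every packet of a flow hashes to the same next hop, so packets within a
    flow stay in order while different flows spread across the uplinks.
    """
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(uplinks)
    return uplinks[index]


# The addresses and ports here are arbitrary sample values.
print(pick_uplink("192.168.0.10", "192.168.0.100", 40000, 50010))
```

Because a flow is pinned to one path, a single faulty uplink affects only the flows hashed onto it, while the other flows are unaffected.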
Addressing between the layers
  Between Spine and Leaf: /31
  Rack: /26, /27
Resolved the “BUM Traffic problem”
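The prefix lengths above bound how many hosts share one subnet, which is what keeps ARP and other broadcast traffic local to a rack. A minimal sketch using Python's standard `ipaddress` module shows the sizes; the helper `usable_hosts` and the 10.0.0.0 base address are placeholders chosen for this illustration.

```python
import ipaddress


def usable_hosts(prefix_len: int) -> int:
    """Number of assignable IPv4 addresses in a subnet of this prefix length.

    A /31 is treated as a point-to-point link where both addresses are
    usable (RFC 3021); other subnets lose the network and broadcast address.
    """
    total = ipaddress.ip_network(f"10.0.0.0/{prefix_len}").num_addresses
    return total if prefix_len >= 31 else total - 2


for prefix in (31, 27, 26):
    print(f"/{prefix}: {usable_hosts(prefix)} usable addresses")
# /31: 2 usable addresses
# /27: 30 usable addresses
# /26: 62 usable addresses
```

With at most a few dozen hosts per rack subnet, a broadcast reaches only that rack instead of thousands of nodes in one L2 domain.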
Leaf Uplink: 40Gbps x 4 = 160Gbps (uplinks ① ② ③ ④)
Servers: 10Gbps NIC x 20 Nodes = 200Gbps
Oversubscription: 200 : 160 = 1.25 : 1
  Resolved the “Limitations for the DataNode Decommission”
  Improved High Availability
Network Related Problems
  Effect of switch failure in the Stack Architecture
  Load on the switch due to BUM Traffic
  Limitations for the DataNode Decommission
  Limitations for the Scale-out
Status labels on the slide: ✔ ✔ ✔, Limited, Resolved
Yahoo! JAPAN’s Hadoop Network Transition
Release    Volume    #Nodes/Switch   NIC       Oversubscription
Cluster1   3PByte    90              1Gbps     4.5:1
Cluster2   20PByte   40              1Gbps     4:1
Cluster3   38PByte   40              1Gbps     2:1
Cluster4   58PByte   16              10Gbps    2:1
Cluster5   75PByte   20              10Gbps    1.25:1
Performance Tests (5TB Terasort)
Performance Tests (40TB DistCp)
  16 Nodes/Rack
  8Gbps/Node
  About 30Gbps x 4 = 120Gbps
New Problems
Delay in data transfer
  Error packets were generated on 1 of the 4 Uplinks
  That one Uplink caused the delay in data transfer
  DataNode log: “org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write packet to mirror”
IP address changes when the server's rack changes
  Each rack has its own network address
  Access control uses IP addresses
  Relocation therefore requires an ACL update
  Example: 192.168.0.10 in rack 192.168.0.0/26 becomes 192.168.0.100 in rack 192.168.0.64/26
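The relocation problem can be illustrated with the two rack subnets and the two addresses from the slide. This is a minimal sketch using Python's standard `ipaddress` module; the rack names, the helper functions, and the idea of an ACL stored as a set of addresses are assumptions made for the illustration, not a description of the actual access-control system.

```python
import ipaddress

# Rack subnets taken from the slide; the rack names are hypothetical.
RACK_SUBNETS = {
    "rack-a": ipaddress.ip_network("192.168.0.0/26"),
    "rack-b": ipaddress.ip_network("192.168.0.64/26"),
}


def rack_of(ip: str):
    """Return the name of the rack whose subnet contains this address."""
    addr = ipaddress.ip_address(ip)
    for name, subnet in RACK_SUBNETS.items():
        if addr in subnet:
            return name
    return None


def is_allowed(ip: str, acl) -> bool:
    """Check an address against an ACL held as a set of IPv4Address objects."""
    return ipaddress.ip_address(ip) in acl


acl = {ipaddress.ip_address("192.168.0.10")}
print(rack_of("192.168.0.10"), is_allowed("192.168.0.10", acl))    # rack-a True
print(rack_of("192.168.0.100"), is_allowed("192.168.0.100", acl))  # rack-b False
```

After the move, the same server is rejected until the ACL entry is updated to its new address.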
Future Plan
Detecting error packet failures before they affect the data transfer
  On error: Auto Shutdown
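One way to think about such an automatic shutdown is as a comparison of successive error-counter samples per uplink. The sketch below is purely illustrative and not the mechanism described in the talk. The function `uplinks_to_shut`, its thresholds, and the uplink names are hypothetical, and collecting the counters from the switches is left out.

```python
def uplinks_to_shut(prev, curr, max_new_errors=0, min_healthy=1):
    """Return the uplinks whose error counters grew since the last sample.

    prev and curr map an uplink name to its cumulative error count.
    Nothing is returned if shutting the faulty links would leave fewer
    than min_healthy uplinks in service.
    """
    faulty = [
        name
        for name, count in curr.items()
        if count - prev.get(name, count) > max_new_errors
    ]
    if len(curr) - len(faulty) < min_healthy:
        return []
    return sorted(faulty)


# Hypothetical samples for four uplinks; only uplink-2 shows new errors.
previous = {"uplink-1": 0, "uplink-2": 5, "uplink-3": 0, "uplink-4": 0}
current = {"uplink-1": 0, "uplink-2": 9, "uplink-3": 0, "uplink-4": 0}
print(uplinks_to_shut(previous, current))  # ['uplink-2']
```

The `min_healthy` guard reflects a design choice in this sketch: an automatic action should not take a rack fully offline when every counter rises at once.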
Use Erasure Coding
  Original raw data is striped in 64kB units
  Raw data: D1 D2 D3 D4 D5 D6
  Parity: P1 P2 P3
  Data and parity units are distributed across the cluster
  A Read gathers the striped units over the network
  Low Data Locality
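Why striping lowers data locality can be seen by mapping byte offsets to data units. The sketch below assumes 64kB cells placed round-robin across six data units (D1 to D6) with three parity units, following the figures on the slides. The function names and the round-robin placement rule are assumptions for illustration, not taken from the talk.

```python
CELL_SIZE = 64 * 1024  # striping cell size from the slide (64kB)
DATA_UNITS = 6         # D1..D6
PARITY_UNITS = 3       # P1..P3


def locate(offset: int):
    """Map a logical byte offset to (stripe index, data unit, offset in cell)."""
    cell = offset // CELL_SIZE
    stripe, unit = divmod(cell, DATA_UNITS)
    return stripe, f"D{unit + 1}", offset % CELL_SIZE


def units_touched(offset: int, length: int):
    """Data units that a read of `length` bytes (length >= 1) has to contact."""
    first = offset // CELL_SIZE
    last = (offset + length - 1) // CELL_SIZE
    return sorted({f"D{cell % DATA_UNITS + 1}" for cell in range(first, last + 1)})


def storage_overhead() -> float:
    """Raw bytes stored per byte of user data."""
    return (DATA_UNITS + PARITY_UNITS) / DATA_UNITS


print(locate(CELL_SIZE))              # (0, 'D2', 0)
print(units_touched(0, 6 * CELL_SIZE))
print(storage_overhead())             # 1.5
```

Under these assumptions, a sequential read of only 384kB already touches all six data units, so most of the bytes come over the network rather than from a local disk.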
Interconnecting various platforms
  Diagram label: BOTTLENECK
Isolation of computing and storage
  Diagram legend: Storage Machine, Computing Machine
Thank You for Listening!
Appendix
JANOG38 http://www.janog.gr.jp/meeting/janog38/program/clos