Network for the Large-scale Hadoop cluster at Yahoo! JAPAN

2016/10/27, Kai Fukazawa, Yahoo Japan Corporation

Page 1: Network for the Large-scale Hadoop cluster at Yahoo! JAPAN

2016/10/27

Kai Fukazawa, Yahoo Japan Corporation

Network for the Large-scale Hadoop cluster at Yahoo! JAPAN

Page 2: Network for the Large-scale Hadoop cluster at Yahoo! JAPAN

Agenda

Hadoop and Related Network
Yahoo! JAPAN’s Hadoop Network Transition
Network Related Problems and Solutions
  Network Related Problems
  Network Requirements of The Latest Cluster
  Adopted IP CLOS Network for Solving Problems
Yahoo! JAPAN’s IP CLOS Network
  Architecture
  Performance Tests
  New Problems
Future Plan

Page 3: Network for the Large-scale Hadoop cluster at Yahoo! JAPAN

Hadoop and Related Network

Page 4: Network for the Large-scale Hadoop cluster at Yahoo! JAPAN

Hadoop and Related Network

Hadoop has various communication events
  Heartbeat
  Reports (Job/Block/Resource)
  Block Data Transfer

“HDFS Architecture“. Apache Hadoop. http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html. (10/06/2016).

“Google I/O 2011: App Engine MapReduce”. (05/11/2011). Retrieved https://www.youtube.com/watch?v=EIxelKcyCC0. (10/06/2016).

Page 5: Network for the Large-scale Hadoop cluster at Yahoo! JAPAN

Hadoop and Related Network

Hadoop has various communication events
  Heartbeat
  Reports (Job/Block/Resource)
  Block Data Transfer

[Diagram labels: North/South, East/West, High, Low]

Page 9: Network for the Large-scale Hadoop cluster at Yahoo! JAPAN

Hadoop and Related Network

“Introduction to Facebook‘s data center fabric”. (11/14/2014). Retrieved https://www.youtube.com/watch?v=mLEawo6OzFM. (10/06/2016).

Page 10: Network for the Large-scale Hadoop cluster at Yahoo! JAPAN

Hadoop and Related Network

Oversubscription is commonly expressed as the ratio of the bandwidth required to the bandwidth available

  UpLink: 10Gbps
  Servers: 1Gbps NIC x 40Nodes = 40Gbps
  Oversubscription: 40 : 10 = 4 : 1

“Hadoop Operations by Eric Sammer (O’Reilly). Copyright 2012 Eric Sammer, 978-1-449-32705-7.”
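
The ratio on this slide can be reproduced in a few lines of Python. This is a minimal sketch: the function name `oversubscription` and its parameter names are illustrative and do not come from the slides.

```python
from fractions import Fraction


def oversubscription(nodes: int, nic_gbps: int, uplink_gbps: int) -> Fraction:
    """Bandwidth the servers could demand divided by the uplink bandwidth available."""
    return Fraction(nodes * nic_gbps, uplink_gbps)


# Example from the slide: 40 nodes with 1Gbps NICs behind a 10Gbps uplink.
ratio = oversubscription(nodes=40, nic_gbps=1, uplink_gbps=10)
print(f"{float(ratio):g} : 1")  # 4 : 1
```

A ratio of 1 : 1 would mean every server can send at full NIC speed out of the rack at the same time; larger values mean the uplink is shared more heavily.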

Page 11: Network for the Large-scale Hadoop cluster at Yahoo! JAPAN

Yahoo! JAPAN’s

Hadoop Network Transition

Page 12: Network for the Large-scale Hadoop cluster at Yahoo! JAPAN

Yahoo! JAPAN’s Hadoop Network Transition

Cluster1(Jun. 2011)

Cluster2(Jan. 2013)

Cluster3(Apr. 2014)

Cluster4(Dec. 2015)

Cluster5(Jun. 2016)

[Chart: Cluster Volume (PB), axis from 0 to 80]

Page 13: Network for the Large-scale Hadoop cluster at Yahoo! JAPAN

Yahoo! JAPAN’s Hadoop Network Transition

Cluster1

Stack Architecture
  Nodes/Rack: 90Nodes
  Server NIC: 1Gbps
  UpLink: 20Gbps
  Oversubscription: 4.5 : 1

[Diagram labels: 4 Switches/Stack, 20G, 20Gbps, Up to ~10 switches]

Page 20: Network for the Large-scale Hadoop cluster at Yahoo! JAPAN

Yahoo! JAPAN’s Hadoop Network Transition

Cluster2

Spanning Tree Protocol
  Nodes/Rack: 40Nodes
  Server NIC: 1Gbps
  UpLink: 10Gbps
  Oversubscription: 4 : 1

[Diagram labels: 10Gbps, Blocking]

Page 26: Network for the Large-scale Hadoop cluster at Yahoo! JAPAN

Yahoo! JAPAN’s Hadoop Network Transition

Cluster3

L2 Fabric/Channel
  Nodes/Rack: 40Nodes
  Server NIC: 1Gbps
  UpLink: 20Gbps
  Oversubscription: 2 : 1

[Diagram labels: L2 Fabric, 20Gbps, 20Gbps]

Page 31: Network for the Large-scale Hadoop cluster at Yahoo! JAPAN

Yahoo! JAPAN’s Hadoop Network Transition

Cluster4

L2 Fabric/Channel
  Nodes/Rack: 16Nodes
  Server NIC: 10Gbps
  UpLink: 80Gbps
  Oversubscription: 2 : 1

[Diagram labels: L2 Fabric, 80Gbps, 80Gbps]

Page 34: Network for the Large-scale Hadoop cluster at Yahoo! JAPAN

Yahoo! JAPAN’s Hadoop Network Transition

Release   Volume   #Nodes/Switch  NIC     Oversubscription
Cluster1  3PByte   90             1Gbps   4.5:1
Cluster2  20PByte  40             1Gbps   4:1
Cluster3  38PByte  40             1Gbps   2:1
Cluster4  58PByte  16             10Gbps  2:1
Cluster5  75PByte  ?              ?Gbps   ?:?

Page 35: Network for the Large-scale Hadoop cluster at Yahoo! JAPAN

Network Related Problems

And Solutions

Page 36: Network for the Large-scale Hadoop cluster at Yahoo! JAPAN

Network Related Problems

Effect of switch failure in the Stack Architecture

Load on the switch due to BUM Traffic

Limitations for the DataNode Decommission

Limitations for the Scale-out

Page 37: Network for the Large-scale Hadoop cluster at Yahoo! JAPAN

Network Related Problems

Effect of switch failure in the Stack Architecture
  One of the switches forming the Stack failed
  The failure affected the other switches in the same Stack
  Communication was interrupted among 90 nodes (5 racks)
  Result: insufficient computing resources and processing stoppage

Page 38: Network for the Large-scale Hadoop cluster at Yahoo! JAPAN

Network Related Problems

Load on the switch due to BUM (Broadcast, Unknown unicast, Multicast) Traffic
  L2 Fabric with 4400Nodes
  ARP traffic from the servers increases the load on the core switch CPU
  Workaround: tuning of the ARP cache entry timeout
  The underlying problem is the large network address range

Page 39: Network for the Large-scale Hadoop cluster at Yahoo! JAPAN

Network Related Problems

Limitations for the DataNode Decommission
  The impact on running jobs has to be considered
  The number of nodes decommissioned at one time has to be limited

Page 40: Network for the Large-scale Hadoop cluster at Yahoo! JAPAN

Network Related Problems

Limitations for the Scale-out
  Stack Architecture: up to ~10 switches
  L2 Fabric Architecture: depends on the number of chassis

Page 41: Network for the Large-scale Hadoop cluster at Yahoo! JAPAN

Network Requirements of The Latest Cluster

Requirements
  120~200 Racks
  Scale-out possible up to 10000 Nodes
  100~200Gbps UpLink/Rack
  10Gbps NIC Server
  20Nodes/Rack
  DataCenter located in US

Page 42: Network for the Large-scale Hadoop cluster at Yahoo! JAPAN

How to solve these problems?

We adopted IP CLOS Network!

Page 44: Network for the Large-scale Hadoop cluster at Yahoo! JAPAN

Adopted IP CLOS Network For Solving Problems

Google, Facebook, Amazon, Yahoo, and other Over The Top (OTT) companies have adopted this DC network architecture

“Introducing data center fabric, the next-generation Facebook data center network”. Facebook Code. https://code.facebook.com/posts/360346274145943/introducing-data-center-fabric-the-next-generation-facebook-data-center-network/. (10/06/2016).

Page 45: Network for the Large-scale Hadoop cluster at Yahoo! JAPAN

Adopted IP CLOS Network For Solving Problems

Improved scalability
Improved high availability
Copes with the increase in East-West traffic
Reduction in operating cost

Page 46: Network for the Large-scale Hadoop cluster at Yahoo! JAPAN

Yahoo! JAPAN’s IP CLOS Network

Page 47: Network for the Large-scale Hadoop cluster at Yahoo! JAPAN

Architecture

Box Switch Architecture
  No limitation on Scale-out
  Requires many switches

[Diagram: Spine, Leaf, and ToR tiers]

Page 48: Network for the Large-scale Hadoop cluster at Yahoo! JAPAN

Architecture

[Diagram labels: Internet, Core Router, Spine, Leaf, Layer3, Layer2; a later build of the slide adds a second Spine and Leaf group]

Page 50: Network for the Large-scale Hadoop cluster at Yahoo! JAPAN

Architecture

Why was this architecture adopted?
  Reduction in the items to be managed: IP addresses, cables, interfaces, BGP neighbors, ...
  Overcomes physical constraints, such as the one-floor limit
  Reduction in cost

Page 51: Network for the Large-scale Hadoop cluster at Yahoo! JAPAN

Architecture

Between Spine and Leaf: BGP
ECMP

[Diagram labels: Internet, Core Router, Spine, Leaf, Layer3, Layer2, BGP]
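
ECMP spreads traffic over equal-cost next hops, usually by hashing each flow so that all of its packets follow one path. The sketch below illustrates the idea only. The next-hop names, the choice of the 5-tuple as hash input, and the use of SHA-256 are assumptions for illustration; real switches use their own hardware hash functions, and the slides do not say which fields are hashed.

```python
import hashlib

# Hypothetical next hops: four equal-cost uplinks from one Leaf.
UPLINKS = ["spine1", "spine2", "spine3", "spine4"]


def pick_uplink(src_ip, dst_ip, src_port, dst_port, proto="tcp", uplinks=UPLINKS):
    """Hash the flow 5-tuple so every packet of a flow takes the same uplink."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(uplinks)
    return uplinks[index]


flow = ("192.168.0.10", "192.168.0.100", 50010, 40000)
print(pick_uplink(*flow))
```

Because the choice depends only on the flow, packets are not reordered within a flow, while many flows together spread across all uplinks.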

Page 52: Network for the Large-scale Hadoop cluster at Yahoo! JAPAN

Architecture

Between Spine and Leaf: /31
Rack: /26, /27

[Diagram labels: Internet, Core Router, Spine, Leaf, Layer3, Layer2]
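
The prefix lengths on this slide translate directly into address counts. A minimal sketch using Python's standard `ipaddress` module follows; the concrete prefixes (`10.0.0.0/31` and the `192.168.0.x` ranges) are placeholders, since the slide gives only the prefix lengths.

```python
import ipaddress


def addresses(cidr: str) -> int:
    """Number of addresses in the given prefix."""
    return ipaddress.ip_network(cidr).num_addresses


# Placeholder prefixes; only the lengths /31, /26 and /27 come from the slide.
for label, cidr in [
    ("Spine-Leaf link", "10.0.0.0/31"),
    ("Rack", "192.168.0.0/26"),
    ("Rack", "192.168.0.64/27"),
]:
    print(f"{label:16} {cidr:18} {addresses(cidr)} addresses")
```

A /31 holds exactly two addresses, one for each end of a point-to-point link. A /26 holds 64 addresses and a /27 holds 32, so each rack forms a small broadcast domain.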

Page 53: Network for the Large-scale Hadoop cluster at Yahoo! JAPAN

Architecture

Between Spine and Leaf: /31
Rack: /26, /27

Resolved the “BUM Traffic problem”: each rack has its own network address, so broadcast traffic such as ARP stays within the rack

Page 54: Network for the Large-scale Hadoop cluster at Yahoo! JAPAN

Architecture

Leaf Uplink: 40Gbps x 4 = 160Gbps
Servers: 10Gbps NIC x 20Nodes = 200Gbps
Oversubscription: 200 : 160 = 1.25 : 1

Resolved the “Limitations for the DataNode Decommission”
Improved High Availability
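
One way to read "Improved High Availability" is that losing one of the four uplinks reduces bandwidth instead of cutting the rack off. The sketch below works through that arithmetic with the figures from the slide. The function name and the failure scenarios are illustrative assumptions, not results from the talk.

```python
def oversubscription(nodes, nic_gbps, uplinks, uplink_gbps):
    """Server-facing bandwidth divided by the uplink bandwidth still available."""
    return (nodes * nic_gbps) / (uplinks * uplink_gbps)


# Figures from the slide: 20 nodes x 10Gbps NIC, 4 uplinks x 40Gbps.
for healthy in (4, 3, 2, 1):
    ratio = oversubscription(20, 10, healthy, 40)
    print(f"{healthy} uplinks up: {ratio:.2f} : 1")
```

With all four uplinks the ratio is 1.25 : 1. With three it rises to about 1.67 : 1, which is still below the 2 : 1 of Cluster3 and Cluster4.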

Page 59: Network for the Large-scale Hadoop cluster at Yahoo! JAPAN

Architecture

Effect of switch failure in the Stack Architecture
Load on the switch due to BUM Traffic
Limitations for the DataNode Decommission
Limitations for the Scale-out

✔✔✔

Limited Resolved

Page 61: Network for the Large-scale Hadoop cluster at Yahoo! JAPAN

Yahoo! JAPAN’s Hadoop Network Transition

Release   Volume   #Nodes/Switch  NIC     Oversubscription
Cluster1  3PByte   90             1Gbps   4.5:1
Cluster2  20PByte  40             1Gbps   4:1
Cluster3  38PByte  40             1Gbps   2:1
Cluster4  58PByte  16             10Gbps  2:1
Cluster5  75PByte  20             10Gbps  1.25:1

Page 62: Network for the Large-scale Hadoop cluster at Yahoo! JAPAN

Performance Tests (5TB Terasort)

Page 63: Network for the Large-scale Hadoop cluster at Yahoo! JAPAN

Performance Tests (40TB DistCp)

16Nodes/Rack, 8Gbps/Node
About 30Gbps x 4 = 120Gbps
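
The figures on this slide can be compared with a little arithmetic. This is a minimal sketch under three assumptions that the slide does not spell out: that 8Gbps/Node is the per-node transfer rate, that "about 30Gbps" is the load on each of the 4 uplinks, and that each uplink is the 40Gbps link from the Architecture slides.

```python
nodes_per_rack = 16
gbps_per_node = 8
uplinks = 4
gbps_per_uplink = 30      # "about 30Gbps" on the slide
uplink_capacity = 40      # assumed: per-uplink speed from the Architecture slides

server_side = nodes_per_rack * gbps_per_node               # traffic offered by the rack
uplink_side = uplinks * gbps_per_uplink                    # traffic carried by the uplinks
utilization = uplink_side / (uplinks * uplink_capacity)    # share of uplink capacity in use

print(f"servers: {server_side}Gbps, uplinks: {uplink_side}Gbps, "
      f"uplink utilization: {utilization:.0%}")
```

Under these assumptions the two sides are close (128Gbps against about 120Gbps), and the uplinks run at roughly three quarters of their 160Gbps capacity.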

Page 66: Network for the Large-scale Hadoop cluster at Yahoo! JAPAN

New Problems

Delay in data transfer
  Error packets were generated on 1 of the 4 Uplinks
  That one Uplink caused the delay in data transfer

DataNode log: “org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write packet to mirror”

[Diagram label: Slow]

Page 69: Network for the Large-scale Hadoop cluster at Yahoo! JAPAN

New Problems

The IP address changes when the server rack changes
  Each rack has its own network address
  Access control uses IP addresses
  ACLs must be updated whenever a server is relocated

Example: a server at 192.168.0.10 in rack 192.168.0.0/26 becomes 192.168.0.100 after moving to rack 192.168.0.64/26
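
The relocation problem can be shown with the two rack prefixes and the two addresses from this slide. In this sketch the rack names, the helper functions, and the host-based ACL are hypothetical; only the prefixes and addresses come from the slide.

```python
import ipaddress

# Prefixes are from the slide; the rack names are made up for illustration.
RACK_SUBNETS = {
    "rack A": ipaddress.ip_network("192.168.0.0/26"),
    "rack B": ipaddress.ip_network("192.168.0.64/26"),
}


def rack_of(ip):
    """Return the rack whose subnet contains ip, or None if no rack matches."""
    addr = ipaddress.ip_address(ip)
    for rack, net in RACK_SUBNETS.items():
        if addr in net:
            return rack
    return None


def acl_allows(ip, allowed_hosts):
    """Host-based ACL: only the exact addresses listed are allowed."""
    return ip in allowed_hosts


acl = {"192.168.0.10"}  # written while the server was still in rack A
print(rack_of("192.168.0.10"), acl_allows("192.168.0.10", acl))    # rack A True
print(rack_of("192.168.0.100"), acl_allows("192.168.0.100", acl))  # rack B False
```

After the move the same server no longer matches the ACL entry, which is why every relocation requires an ACL update.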

Page 70: Network for the Large-scale Hadoop cluster at Yahoo! JAPAN

Future Plan

Page 71: Network for the Large-scale Hadoop cluster at Yahoo! JAPAN

Future Plan

Detecting error packet failures before they affect the data transfer
  Error! -> Auto Shutdown
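
The detection step could look roughly like the sketch below, which flags uplinks whose error ratio exceeds a threshold so that they can be shut down. Everything here is an assumption for illustration: the function name, the counter format, the threshold value, and the sample numbers. The slides do not describe how the counters are collected or how the shutdown is triggered.

```python
def links_to_shut(counters, max_error_rate=1e-6):
    """Return the uplinks whose error ratio exceeds the threshold.

    counters maps a link name to (error_packets, total_packets).
    """
    bad = []
    for link, (errors, total) in counters.items():
        if total > 0 and errors / total > max_error_rate:
            bad.append(link)
    return bad


# Made-up counter values: one of four uplinks is generating error packets.
sample = {
    "uplink1": (0, 5_000_000),
    "uplink2": (0, 4_800_000),
    "uplink3": (1_200, 5_100_000),
    "uplink4": (0, 4_900_000),
}
print(links_to_shut(sample))  # ['uplink3']
```

Shutting down the flagged uplink leaves the remaining equal-cost paths to carry the traffic, at the price of a higher oversubscription ratio.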

Page 73: Network for the Large-scale Hadoop cluster at Yahoo! JAPAN

Future Plan

Use Erasure Coding
  Original raw data is striped in 64kB units
  Raw data: D1 D2 D3 D4 D5 D6
  Parity: P1 P2 P3

[Diagram: D1 to D6 and P1 to P3 placed separately; a Read gathers the striped data]

Low Data Locality
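
The striping layout explains the low data locality. The sketch below maps a byte offset to the data unit that holds it, using the 64kB cell size and the 6 data plus 3 parity units from the slide. The round-robin placement of cells and the function names are assumptions, and no actual parity computation is shown.

```python
CELL = 64 * 1024     # striping cell size from the slide
DATA_UNITS = 6       # D1..D6
PARITY_UNITS = 3     # P1..P3


def locate(offset):
    """Map a byte offset in the file to (stripe index, data unit name).

    Assumes cells are assigned to D1..D6 in round-robin order.
    """
    cell_index = offset // CELL
    stripe, unit = divmod(cell_index, DATA_UNITS)
    return stripe, f"D{unit + 1}"


def storage_overhead():
    """Bytes stored per byte of raw data."""
    return (DATA_UNITS + PARITY_UNITS) / DATA_UNITS


print(locate(0))             # (0, 'D1')
print(locate(5 * CELL))      # (0, 'D6')
print(locate(6 * CELL + 1))  # (1, 'D1')
print(storage_overhead())    # 1.5
```

Reading even 384kB of consecutive data touches all six data units. If those units sit on different nodes, most of a read crosses the network, which raises East-West traffic.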

Page 80: Network for the Large-scale Hadoop cluster at Yahoo! JAPAN

Future Plan

Interconnecting various platforms

[Diagram label: BOTTLENECK]

Page 81: Network for the Large-scale Hadoop cluster at Yahoo! JAPAN

Future Plan

Isolation of computing and storage

[Diagram legend: Storage Machine, Computing Machine]

Page 82: Network for the Large-scale Hadoop cluster at Yahoo! JAPAN

Thank You for Listening!

Page 83: Network for the Large-scale Hadoop cluster at Yahoo! JAPAN

Appendix

JANOG38: http://www.janog.gr.jp/meeting/janog38/program/clos