Implementing a layer 2 framework on linux network

Post on 04-Dec-2014


Takuya ASADA <syuu@dokukino.com> @syuu1228

I worked at an embedded software company on SMP support for router firmware

Ph.D. student at Tokyo University of Technology, researching improvements to network I/O architecture on modern x86 servers

Interested in: SMP, Network, Virtualization

GSoC ’11(FreeBSD) Multithread support for BPF

GSoC ’12(FreeBSD) BIOS support for BHyVe

Research assistant at IIJ research laboratory, implementing BCube for Linux

Today’s topic!

BCube is a new network architecture

Designed for shipping-container based modular data centers

Server-centric network structure
◦ Servers act as
  ◦ End hosts
  ◦ Relay nodes for each other

The paper was published at ACM SIGCOMM ’09 by Microsoft Research Asia

Each server has one connection to each layer

Switches never connect to other switches

Servers relay traffic for each other

[Figure: example BCube topology with 8 servers (000 to 111) and 2-port switches (labeled 0,0 to 2,1), with the nested BCube0, BCube1, and BCube2 units marked. Legend: switch, server]

BCube_k has k + 1 layers

BCube_x contains n BCube_(x-1)

BCube_0 contains n servers

Total servers = n^(k+1)
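The counting rules above can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, addresses are written as digit strings (so n ≤ 10), and the switch count is derived from the rules above (one port per server per layer, n ports per switch) rather than quoted from the slides.

```python
from itertools import product

def bcube_addresses(n, k):
    """All server addresses of a BCube_k built from n-port switches."""
    digits = [str(d) for d in range(n)]
    return ["".join(a) for a in product(digits, repeat=k + 1)]

def bcube_size(n, k):
    """Return (layers, servers, switches) for a BCube_k with n-port switches."""
    layers = k + 1
    servers = n ** (k + 1)
    # Every server uses one port per layer and every switch offers n ports,
    # so each layer needs servers / n = n**k switches.
    switches = layers * n ** k
    return layers, servers, switches

print(bcube_size(2, 2))        # the 8-server example: n = 2, k = 2
print(bcube_addresses(2, 2))
```

For n = 2 and k = 2 this gives 3 layers and 8 servers, addressed 000 to 111.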

High network capacity for various traffic patterns
◦ one-to-one
◦ one-to-all
◦ one-to-several
◦ all-to-all

Performance degrades gracefully as server and switch failures increase

Needs no special hardware; uses only commodity switches

Each server has a unique BCube address

Each digit gives the port number on the switch that the server connects to in the corresponding layer
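To make the digit-to-port rule concrete, here is a small sketch. It assumes the rightmost digit belongs to layer 0, and it names each switch by the remaining digits of the address, which is my reading of the convention in the BCube paper and not the labeling used in the figure. `switch_ports` is a hypothetical helper.

```python
def switch_ports(addr):
    """Map a server address to its (layer, switch id, port) attachments.

    Assumes digit strings such as "101", with the rightmost digit as layer 0.
    """
    k = len(addr) - 1
    links = []
    for layer in range(k + 1):
        pos = k - layer                       # string index of this layer's digit
        port = int(addr[pos])                 # the digit is the switch port number
        switch = addr[:pos] + addr[pos + 1:]  # remaining digits identify the switch
        links.append((layer, switch, port))
    return links

for layer, switch, port in switch_ports("101"):
    print(f"layer {layer}: switch {switch}, port {port}")
```

Two servers share a switch in a given layer exactly when their addresses differ only in that layer's digit, which is why one hop changes one digit.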

Default routing rule
◦ Correct the address digits from the top layer to the bottom layer
◦ Ex: route from 000 to 111: 000 → 100 → 110 → 111
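The default rule can be sketched as a loop over the address digits. The sketch assumes the leftmost digit is the top layer, which matches the example route; `default_route` is a hypothetical name, not a function from bcube.ko.

```python
def default_route(src, dst):
    """Hops from src to dst, fixing digits from the top layer (leftmost) down."""
    assert len(src) == len(dst)
    path = [src]
    cur = list(src)
    for pos in range(len(src)):      # leftmost digit = top layer
        if cur[pos] != dst[pos]:
            cur[pos] = dst[pos]      # one hop through that layer's switch
            path.append("".join(cur))
    return path

print(" -> ".join(default_route("000", "111")))   # 000 -> 100 -> 110 -> 111
```

Layers whose digits already match are skipped, so the hop count equals the number of differing digits.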

There are alternate routes between any two nodes

Can bypass failed servers and switches

Can also increase throughput by spreading traffic over parallel paths
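One way to see where the alternate routes come from: correcting the digits in a different order gives a different path. The sketch below tries each rotation of the top-to-bottom order. It is a simplified illustration with hypothetical helper names; it does not generate detour paths through layers whose digits already match.

```python
def route_in_order(src, dst, order):
    """Correct the digits of src toward dst in the given order of positions."""
    path, cur = [src], list(src)
    for pos in order:
        if cur[pos] != dst[pos]:
            cur[pos] = dst[pos]
            path.append("".join(cur))
    return path

def rotated_routes(src, dst):
    """One route per rotation of the top-to-bottom digit order."""
    positions = list(range(len(src)))
    routes = []
    for start in positions:
        order = positions[start:] + positions[:start]
        route = route_in_order(src, dst, order)
        if route not in routes:      # rotations can coincide when digits match
            routes.append(route)
    return routes

for r in rotated_routes("000", "111"):
    print(" -> ".join(r))
```

For 000 to 111 this yields three routes whose intermediate servers do not overlap, so a failure on one route leaves the other two usable.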

The source server decides the best path for a flow

Bypasses failed paths

To propagate the routing path, the source server writes it into the packet header

A BCube header is added between the Ethernet header and the IP header

It carries the src/dst addresses and the routing path, stored in the “Next Hop Index Array”

Packet layout: Ethernet Header | BCube Header | IP Header

BCube Header fields
◦ BCube dest address
◦ BCube source address
◦ Protocol type
◦ Next Hop Index Array

We are evaluating various "Data Center Network" technologies, especially for container-based modular data center architectures. BCube is one of the candidates.

Try to use existing code as much as possible

Minimum implementation at first

A BCube device binds multiple interfaces and is assigned a BCube address and an IP address

What is the most similar function that already exists on Linux? → Bridge!
◦ Forked bridge.ko and the brctl command, naming them bcube.ko and bcctl

brctl addbr <bridge>
brctl delbr <bridge>
↓
bcctl addbc <bcube> <bcaddr> <N> <K>
bcctl delbc <bcube>

Modified addbr/delbr into addbc/delbc; addbc takes 3 more args
◦ BCube address
◦ n and k parameters

Use the MAC address format/size for the BCube address

Use the BCube address as the HW address of the BCube device
◦ It works like a fake MAC address on the Linux network stack

101 → 00:00:01:00:01

brctl addif <bridge> <device>
brctl delif <bridge> <device>
↓
bcctl assignif <bcube> <layer> <device>
bcctl unassignif <bcube> <layer> <device>

Modified addif/delif into assignif/unassignif, adding the layer number to the args

Need to reconsider address resolution

Normal Ethernet
◦ IP Address → MAC Address (ARP)

BCube network
◦ IP Address → BCube Address → ARP?
◦ (Neighbor) BCube address → MAC Address → Need additional neighbor discovery protocol

Once broadcast works in the BCube implementation, ARP should work on it

But I haven’t implemented it yet, so I decided to configure it manually with the following command:
arp -i bc0 -s 10.0.0.6 00:00:00:01:00:10

Need an ARP-like protocol

Decided to configure this manually too, and implemented the following commands:
bcctl addneighbour <bcube> <layer> <bcaddr> <macaddr>
bcctl delneighbour <bcube> <layer> <bcaddr>

bcube.ko maintains a neighbor table and uses it when transmitting/forwarding packets
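A rough user-space model of such a neighbor table, keyed the same way as the addneighbour/delneighbour arguments, might look like this. The class name and the MAC address are made up for illustration; the real table lives inside bcube.ko and its layout is not shown in the slides.

```python
class NeighbourTable:
    """Per-layer map from a neighbour's BCube address to its MAC address."""

    def __init__(self):
        self._entries = {}

    def add(self, layer, bcaddr, macaddr):
        # Mirrors: bcctl addneighbour <bcube> <layer> <bcaddr> <macaddr>
        self._entries[(layer, bcaddr)] = macaddr

    def delete(self, layer, bcaddr):
        # Mirrors: bcctl delneighbour <bcube> <layer> <bcaddr>
        self._entries.pop((layer, bcaddr), None)

    def lookup(self, layer, bcaddr):
        """Return the MAC address, or None if the neighbour is unknown."""
        return self._entries.get((layer, bcaddr))

table = NeighbourTable()
table.add(2, "100", "52:54:00:00:00:04")   # made-up MAC address
print(table.lookup(2, "100"))
table.delete(2, "100")
print(table.lookup(2, "100"))
```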

bridge.ko maintains an FDB (forwarding database), a hash table used to look up the output port for a destination MAC address

Deleted the FDB and implemented a function that decides the next-hop BCube address, the output port, and the MAC address of the next hop

Haven’t implemented source routing yet, just default routing for now
◦ Top layer → Bottom layer
◦ Ex: route from 000 to 111: 000 → 100 → 110 → 111
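The per-packet decision under default routing can be sketched as follows. This is an illustration, not the bcube.ko code: `next_hop` is a hypothetical function, the neighbor table is modeled as a plain dict keyed by (layer, BCube address), and layer 0 is assumed to be the rightmost digit.

```python
def next_hop(own, dst, neighbours):
    """Return (next-hop BCube address, layer, next-hop MAC), or None if own == dst.

    neighbours maps (layer, bcaddr) -> MAC; layer 0 is the rightmost digit.
    """
    k = len(own) - 1
    for pos in range(k + 1):                 # scan from the top layer down
        if own[pos] != dst[pos]:
            hop = own[:pos] + dst[pos] + own[pos + 1:]
            layer = k - pos                  # output port = interface of this layer
            return hop, layer, neighbours[(layer, hop)]
    return None

neighbours = {(2, "100"): "52:54:00:00:00:04"}   # made-up MAC address
print(next_hop("000", "111", neighbours))
```

Each relay node runs the same decision with its own address, so the packet follows 000 → 100 → 110 → 111 without any path state in the header. An unknown neighbor raises `KeyError` here; a real implementation would have to drop or queue the packet instead.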

To add the BCube header between the Ethernet header and the IP header, I forked net/ethernet/eth.c

ETH_HLEN (14 bytes) → BCUBE_HLEN (24 bytes)

struct ethhdr (MAC header) → struct bcubehdr (MAC & BCube header)

eth_header_ops → bc_header_ops, to handle the BCube header
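The effect of the longer header can be illustrated in user space. The sketch assumes that BCUBE_HLEN covers the Ethernet part plus the BCube part, so the BCube part is 24 - 14 = 10 bytes; it treats those bytes as opaque because the field widths are not given here. The EtherType 0x88B5 (an IEEE local experimental value) and the function names are placeholders, not what bcube.ko uses.

```python
import struct

ETH_HLEN = 14      # dst MAC (6) + src MAC (6) + EtherType (2)
BCUBE_HLEN = 24    # Ethernet part plus BCube part
BCUBE_PART = BCUBE_HLEN - ETH_HLEN   # 10 bytes, inferred from the two constants
ETH_P_BCUBE = 0x88B5                 # assumption: local experimental EtherType

def encapsulate(dst_mac, src_mac, bcube_part, ip_packet):
    """Build Ethernet header + opaque BCube part + IP packet."""
    if len(bcube_part) != BCUBE_PART:
        raise ValueError("BCube part must be %d bytes" % BCUBE_PART)
    eth = struct.pack("!6s6sH", dst_mac, src_mac, ETH_P_BCUBE)
    return eth + bcube_part + ip_packet

def payload(frame):
    """Return the IP packet, which starts at BCUBE_HLEN instead of ETH_HLEN."""
    return frame[BCUBE_HLEN:]

frame = encapsulate(b"\x02" * 6, b"\x04" * 6, bytes(BCUBE_PART), b"IP...")
print(len(frame) - len(payload(frame)))   # header bytes in front of the IP packet
```

Any code that assumes the payload starts 14 bytes into the frame reads BCube header bytes instead of the IP header.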

Unfortunately, GRO accesses the Ethernet header directly and runs before BCube handles a packet, so it needs to be disabled

Found a way to implement a new L2 framework using the existing bridge implementation
◦ Much easier than implementing it from scratch

Development Status
◦ Implemented basic features, debugging now
◦ Considering more features
  ◦ broadcast / multicast
  ◦ intermediate node/switch failure detection and rerouting
  ◦ source routing
  ◦ address resolution protocol

Planning a more detailed evaluation in our data center testbed

Any comments and suggestions are welcome

This work was done as part of research assistance work at IIJ research laboratory.