High-Performance Packet Classification on GPU

Author: Shijie Zhou, Shreyas G. Singapura and Viktor K. Prasanna

Publisher: HPEC 2014

Presenter: Gang Chi

Date: 2014/12/2

Department of Computer Science and Information Engineering National Cheng Kung University, Taiwan R.O.C.

Introduction (1/2)

This paper investigates the GPU's characteristics in parallelism and memory access, and implements a packet classifier using CUDA.

The basic operations of this design are binary range-tree search and bitwise AND operation.

The design is optimized by storing the range-trees in shared memory as compact arrays, without explicit pointers.

National Cheng Kung University CSIE Computer & Internet Architecture Lab

Introduction (2/2)

With a rule set of 512 rules, this design achieves a throughput of 85 MPPS and an average processing latency of 4.9 µs per packet.

Compared with the implementation on a state-of-the-art multi-core platform, this design demonstrates a 1.9x improvement in throughput.

CUDA Memory Model

| Type | Location | Access latency (cycles) | Size |
|---|---|---|---|
| Global memory | Off-chip | >100 | 1~32 GB per GPU |
| L1 cache | On-chip | 1~32 | 16 or 48 KB per SMX |
| L2 cache | On-chip | 1~32 | 64 KB per SMX |
| Registers | On-chip | n/a | 32-bit x 65536 per SMX |

Algorithm

Phase 1: with N rules and K threads, each thread examines N/K rules and produces a local classification result.

Phase 2: the rule with the highest priority among the K local results is identified in log K steps.
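The two phases can be sketched sequentially in Python as a rough model of what the K threads do; this is not the authors' CUDA kernel. The names `local_results`, `global_result` and the `matches` list are hypothetical, and the sketch assumes that a lower rule index means higher priority, that K divides N, and that K is a power of two.

```python
NO_MATCH = 65536  # value used here when no rule matches (assumption for this sketch)

def local_results(matches, k):
    """Phase 1: thread t scans its own N/K rules and keeps its first match."""
    n = len(matches)
    assert n % k == 0, "sketch assumes K divides N"
    chunk = n // k
    results = []
    for t in range(k):  # on a GPU these K iterations would run in parallel
        hit = NO_MATCH
        for i in range(t * chunk, (t + 1) * chunk):
            if matches[i]:
                hit = i
                break
        results.append(hit)
    return results

def global_result(local):
    """Phase 2: pairwise min-reduction; returns (result, number of steps)."""
    vals = list(local)
    k = len(vals)
    assert k > 0 and (k & (k - 1)) == 0, "sketch assumes K is a power of two"
    steps = 0
    stride = k // 2
    while stride >= 1:
        for t in range(stride):  # each step halves the number of active threads
            vals[t] = min(vals[t], vals[t + stride])
        stride //= 2
        steps += 1
    return vals[0], steps

# Hypothetical input: 16 rules, of which rules 6, 9 and 13 match the packet.
matches = [False] * 16
for i in (6, 9, 13):
    matches[i] = True

local = local_results(matches, 4)
print(local)                 # [65536, 6, 9, 13]
print(global_result(local))  # (6, 2)
```

With K = 4 the reduction finishes in 2 steps, which is log2 K.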

Pre-process

Pre-process the rules to construct a binary range-tree for each individual field.

Every leaf node is assigned a bit vector (BV) that indicates which rules are matched when the search reaches that leaf.
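As a minimal sketch of this pre-processing for a single field, the rule ranges can be cut into elementary intervals, each of which plays the role of a leaf and carries one BV. The function names `build_field` and `lookup` and the example ranges are hypothetical; ranges are assumed to be inclusive integer pairs, BVs are written as lists with rule 0 first, and `bisect_right` stands in for the tree search.

```python
from bisect import bisect_right

def build_field(ranges):
    """ranges[i] = (lo, hi), inclusive, is rule i's range on this field.

    Returns (bounds, bvs): bounds are the sorted interval boundaries and
    bvs[j] is the bit vector of the j-th elementary interval (leaf).
    """
    assert ranges, "sketch assumes at least one rule"
    # Every range start and every (range end + 1) begins a new interval.
    bounds = sorted({lo for lo, _ in ranges} | {hi + 1 for _, hi in ranges})
    bvs = []
    for j in range(len(bounds) + 1):
        # One representative point per interval is enough, because no rule
        # boundary falls strictly inside an elementary interval.
        point = bounds[0] - 1 if j == 0 else bounds[j - 1]
        bvs.append([1 if lo <= point <= hi else 0 for lo, hi in ranges])
    return bounds, bvs

def lookup(bounds, bvs, value):
    """Return the BV of the leaf (interval) that contains value."""
    return bvs[bisect_right(bounds, value)]

# Hypothetical field with three rules.
bounds, bvs = build_field([(0, 3), (2, 5), (4, 7)])
print(bounds)                  # [0, 2, 4, 6, 8]
print(lookup(bounds, bvs, 3))  # [1, 1, 0]
print(lookup(bounds, bvs, 4))  # [0, 1, 1]
```

The value 3 lies inside the ranges of rules 0 and 1, so its leaf carries the BV 110; the value 4 lies inside rules 1 and 2, giving 011.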

Search

Each thread performs binary range-tree searches sequentially, field by field. After 5 tree searches, 5 BVs are produced.

The 5 BVs are merged by a bitwise AND operation to obtain a final BV.

The result is the index of the first non-zero bit.
• Ex: BV=00100, Result=2
• Ex: BV=00000, Result=65536
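The search-and-merge step for one packet can be sketched as follows. The per-field tables are hypothetical hand-written examples with 3 rules and a single boundary each, BVs are written as lists with rule 0 first, and `bisect_right` is used in place of the range-tree search; `first_set_bit` reproduces the two examples from the slide.

```python
from bisect import bisect_right

NO_MATCH = 65536  # result reported when the final BV is all zeros

def first_set_bit(bv):
    """Index of the first non-zero bit, or NO_MATCH if there is none."""
    for i, bit in enumerate(bv):
        if bit:
            return i
    return NO_MATCH

def classify(tables, packet):
    """tables[f] = (bounds, bvs) for field f; packet[f] is the field value."""
    final = None
    for (bounds, bvs), value in zip(tables, packet):
        bv = bvs[bisect_right(bounds, value)]  # one search per field
        # Merge with the running result by bitwise AND.
        final = bv if final is None else [a & b for a, b in zip(final, bv)]
    return first_set_bit(final)

print(first_set_bit([0, 0, 1, 0, 0]))  # 2
print(first_set_bit([0, 0, 0, 0, 0]))  # 65536

# Hypothetical tables: each field has one boundary at 10, so the first BV
# applies to values below 10 and the second to values of 10 and above.
tables = [
    ([10], [[1, 1, 0], [0, 1, 1]]),
    ([10], [[1, 1, 1], [0, 1, 1]]),
    ([10], [[0, 1, 1], [1, 1, 1]]),
    ([10], [[1, 1, 1], [1, 1, 1]]),
    ([10], [[1, 0, 1], [1, 1, 1]]),
]
print(classify(tables, (12, 12, 12, 12, 12)))  # 1
print(classify(tables, (5, 5, 5, 5, 5)))       # 65536
```

For the first packet the five BVs AND to 011, so rule 1 is reported; for the second packet no rule survives all five fields.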

Search in Binary Range Tree

Ex: Search 4
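One common way to store a binary search tree in a compact array without pointers is the breadth-first layout, where the children of node i sit at indices 2i+1 and 2i+2. The sketch below assumes that layout, which may differ from the one used in the paper, and assumes the number of keys is 2^d - 1 so the tree is perfect. The key values and the names `to_array_tree` and `search` are hypothetical.

```python
from bisect import bisect_right

def to_array_tree(sorted_keys):
    """Place sorted keys into breadth-first order (children at 2i+1, 2i+2)."""
    n = len(sorted_keys)
    assert n > 0 and ((n + 1) & n) == 0, "sketch assumes 2**d - 1 keys"
    tree = [0] * n
    it = iter(sorted_keys)

    def fill(i):
        # An in-order walk of the implicit tree consumes the keys in sorted order.
        if i < n:
            fill(2 * i + 1)
            tree[i] = next(it)
            fill(2 * i + 2)

    fill(0)
    return tree

def search(tree, x):
    """Walk from the root using index arithmetic only; return the leaf number."""
    n = len(tree)
    i = 0
    while i < n:
        # Go right when x >= key, otherwise go left.
        i = 2 * i + 1 + (x >= tree[i])
    return i - n  # leaves are numbered left to right, starting at 0

keys = [2, 4, 6, 8, 10, 12, 14]  # hypothetical range boundaries
tree = to_array_tree(keys)
print(tree)             # [8, 4, 12, 2, 6, 10, 14]
print(search(tree, 4))  # 2
```

Searching for 4 visits the keys 8, 4 and 6 and ends in leaf 2, the interval that starts at 4 and ends just before 6. The leaf number equals `bisect_right(keys, 4)`, so it can index directly into the array of per-leaf BVs.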

Identify Global Result

Experimental Platform

CUDA 5.0

Intel E5-2665 x2
• 2.4 GHz
• 8-core

NVIDIA K20 Kepler GPU
• 705.5 MHz
• 13 SMX with 2496 CUDA cores in total
• 5 GB GDDR5

Latency and Throughput

Comparison with implementation on Multi-core

[8] S. Zhou, Y. Qu and V. K. Prasanna, “Multi-core implementation of decomposition-based packet classification algorithms,” in Parallel Computing Technologies (PaCT), pp. 105-119, 2013.
