PyCoRAMを用いたグラフ処理FPGAアクセラレータ

PyCoRAMを用いたグラフ処理FPGAアクセラレータ

高前田（山崎）伸也，枝元正寛，姚駿，中島康彦奈良先端科学技術大学院大学情報科学研究科

2014年7月28日 15:15-15:45 SWoPP2014@新潟朱鷺メッセ

概要

n メモリ抽象化フレームワークPyCoRAMを用いたダイクストラ法FPGAアクセラレータを開発 l  PyCoRAM

•  HDLとPythonを両用するFPGAアクセラレータ開発フレームワーク

2014-07-28 Shinya T-Y. NAIST 2

FPGA as SoC (System-on-Chip)

n 沢山のパーツを単一FPGA上に集積しSoCとして利用 l  CPUコア

•  Microblaze (ソフトマクロ)

•  Cortex-A9 (ハードマクロ)

l アクセラレータロジック •  普通のRTLでモデリング

–  Verilog HDL, VHDL

•  新しい言語でモデリング –  Bluespec, AutoESL, Chisel, …

l  DDR3 DRAM

l  PCI-Express

l  Ethernet, …


FPGA

CPU HW Acc

DRAM I/F Ether PCI-E

Interconnect

HW Acc

IPコアベースのシステム開発 n  IPコアを開発・追加して繋げばシステム完成J

l 標準的なインターコネクトでIPコア達を接続

l  EDAツールが自動的にインターコネクトと（いくつかの）デバイス依存のインターフェースを生成してくれるから楽ちん

2014-07-28 Shinya T-Y. NAIST 4 Xilinx Platform Studio (XPS)

IP-core List

Interconnect

FPGA

CPU HW Acc

DRAM I/F Ether PCI-E

Interconnect

HW Acc

DRAM

IP-core Instances

どうやってアクセラレータIPを実装するか？ n 普通にHDLでアクセラレータを実装するのは芸が無い

l というかいろいろ面倒で嫌だ！ •  演算とメモリアクセスのスケジューリングロジック

–  ダブルバッファリングとか面倒

•  メモリシステムの制御回路 –  HDLでステートマシンを書くのは面倒だし間違えやすい

•  デバッグが面倒

l でもパイプラインの振る舞いはサイクルレベルで定義したい •  FPGAで性能を出すには高稼働率のパイプラインが重要

–  だから計算ロジックはHDLで書きたい

–  高位合成だとチューンがイマイチ難しい

n 抽象化されたメモリシステムが使えると幸せそう


CoRAMメモリアーキテクチャ

CoRAM [Chung+,FPGA’11] n  FPGAアクセラレータのためのメモリ抽象化

l 高位モデルによるメモリ管理でアクセラレータをポータブルに •  計算カーネルとメモリアクセスの分離

•  ソフトウェアのモデルによるメモリアクセスパターンの記述


HW Kernels (Computing Logics)

CoR

AM

M

emor

y

Read Write

Manage

Control Threads (Memory Access

Pattern)

CoRAM Channel

Read/Write Read/Write

Communication FIFOs (Registers)

Abstracted On-chip Memories

Off-chip Memory

PyCoRAM [Takamaeda+,CARL’13] n ベンダーEDK向けのPythonベースのCoRAM実装

l 計算カーネルのRTL記述とメモリアクセスパターンの Python記述からAXI4 IPコアを自動合成

l 出来上がったIPコアをEDKでポチポチつなげばシステム完成！

n 特徴 l  Pythonでのコントロールスレッド記述

•  Pythonで簡単にメモリアクセスパターンを記述できる –  独自の高位合成コンパイラでPython記述からVerilog HDLのRTLを合成

l  AMBA AXI4インターコネクトのサポート •  Xilinx Platform Studio (XPS)などを用いたIPコアベースの開発を支援

l 計算ロジックの複雑なデザインに対応 •  ハードウェアデザイン解析・生成のためのオープンソースツールキットPyverilogを活用


PyCoRAMマイクロアーキテクチャ


User I/O

User Logic

CoRAM Channel

CoRAM Register

Control Thread

DMAC

CoRAM Memory

DMAC

CoRAM Stream FSM

GPIO Modeled in RTL (Verilog HDL)

Memory Access Pattern

in Python

PyCoRAMマイクロアーキテクチャ


User I/O

User Logic

CoRAM Channel

CoRAM Register

Control Thread

DMAC

CoRAM Memory

DMAC

CoRAM Stream FSM

GPIO Modeled in RTL (Verilog HDL)

Memory Access Pattern

in Python

def calc_sum(times):� ram = CoramMemory(idx=0, datawidth=32, size=1024)� channel = CoramChannel(idx=0, datawidth=32)� addr = 0� sum = 0� for i in range(times):� ram.write(0, addr, 128)� channel.write(addr)� sum += channel.read()� addr += 128 * (32/8)� print(‘sum=’, sum)�calc_sum(8)�

# Transfer (off-chip DRAM to BRAM) # Notification to User-logic # Wait for Notification from User-logic # $display Verilog system task �

0�1�2�3�4�5�6�7�8�9�10�11�

CoramMemory1P�#(� .CORAM_THREAD_NAME("thread_name"),� .CORAM_ID(0),� .CORAM_ADDR_LEN(ADDR_LEN),� .CORAM_DATA_WIDTH(DATA_WIDTH)� )�inst_memory�(.CLK(CLK),� .ADDR(mem_addr),� .D(mem_d),� .WE(mem_we),� .Q(mem_q)� );�

PyCoRAMマイクロアーキテクチャの実装


PyCoRAM IP

AXI4 Interconnect

DRAM Controller FPGA

User I/O

User Logic

CoRAM Channel

CoRAM Register

Control Thread

DMAC

AXI I/F

CoRAM Memory

DMAC

AXI I/F

CoRAM Stream FSM

GPIO

行列積・ステンシル計算[高前田+,ARC2014-01]


Computing Logic (Verilog HDL) Control Thread

(Python)

sum

CoRAM Memory 0

B × +

CoRAM Memory 1

CoRAM Memory 2

Control Logic CoRAM Channel 0

8-stage Multiply Pipeline A

C

check sum +

Computing Logic (Verilog HDL) Control Thread

(Python) CoRAM

Memory 0

d1

CoRAM Memory 2

CoRAM Memory 3

Control Logic CoRAM Channel 0

41-stage Add-Divide

Pipeline

d0

rslt

d2

+ /

+ check sum

CoRAM Memory 1

行列積

9点ステンシル

メモリ性能 n メモリバンド幅利用率：理論最大の約86%を引き出す

n バンド幅律速なアプリには有効利用できそう l 密行列積・ステンシル計算では高い性能・開発効率を達成

•  長いバーストでバンド幅を有効利用

l じゃあレイテンシ律速なアプリではどうなの？


0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

1

4 8 16 32

Ban

dwid

th U

tiliz

atio

n�

SIMD size [byte]

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

1

4 8 16 32 64

Ban

dwid

th U

tiliz

atio

n�

SIMD size [byte]

Atlys (1.2GB/s MAX) ML605 (6.4GB/s MAX)

本発表の目標

n 不規則なメモリアクセスパターンを持つアプリにおける PyCoRAM適用可能性を明らかにする l 規則的なメモリアクセスパターンを持つアプリケーション（行列積・ステンシル）はバンド幅律速

l メモリアクセスレイテンシの影響が大きいアプリで使えるの？ •  まずは実装してみましょう

n 今回の題材：グラフ処理 l 最短経路探索（ダイクストラ法）

•  ボトルネックになりそうな箇所 –  未訪問ノードの管理→距離をキーとした優先度キュー

–  隣接ノード情報（コスト・親ノード）の読み書き→ページング


最短経路探索: ダイクストラ法 (1)


S

b

a

G

c 10

15

20

5

30

5

0

S, 0

Priority Queue Graph



S

b

a

G

c 10

15

20

5

30

5

0

10

15

a, 10


b, 15



S

b

a

G

c 10

15

20

5

30

5

0

10

15

b, 15


c, 30

30



S

b

a

G

c 10

15

20

5

30

5

0

10

15

30 20

45

c, 20


G, 45 c, 30



S

b

a

G

c 10

15

20

5

30

5

0

10

15

30 20

45

G, 25


G, 45 c, 30

25



S

b

a

G

c 10

15

20

5

30

5

0

10

15

30 20

45 25

c, 30

G, 45


データ構造 n ノード（Node）

l  4エントリの構造体 •  コスト・親ノードへのポインタ・エッジ情報テーブルへのポインタ・訪問済みフラグ

n エッジ（Edge Page） l 隣接ノード情報へのポインタを束ねて管理

l  CPUだとキャッシュに乗る・プリフェッチ可

l  FPGAではバースト転送できるようなる


Add

ress

Num Entries Next Page Pointer Neighbor Node Ptr

Cost

Neighbor Node Ptr Cost

Num Entries Next Page Pointer Neighbor Node Ptr

Cost

Neighbor Node Ptr Cost

Edg

e P

age

0 E

dge

Pag

e 1

Current Cost Parent Node Pointer Edge Page Pointer

Visited Flag


Visited Flag


Visited Flag N

ode

0 N

ode

0 N

ode

N-1

ソフトウェアによる実装 n  C言語で実装

l エッジはページ単位で管理：キャッシュに優しい

l 未訪問ノードは優先度付きキュー（バイナリヒープ）で管理


PyCoRAMを用いたダイクストラIPコア n  PyCoRAMを使って演算モジュールはVerilog HDLで実装メモリアクセス制御はPythonで実装


Read Node

InStream

Read Edge

InStream

Update Node

OutStream

Mark Visited

OutStream

Priority Queue

InStream OutStream Edge Page Addr

Cost

+

Next Node Addr

Node Addr

Next Node Cost

Node Addr Next Node Addr

Next Node Cost

FSM

DMAC DMAC DMAC DMAC DMAC DMAC Slave I/F

AXI4 Master Interfaces AXI4-lite

Slave Interfaces

Dijkstra Logic

(Modeled in Verilog HDL)

Mark Visited Cthread

Read Node Cthread

Read Edge Cthread Priority Queue Cthread Mark Visited

Cthread Main

CThread

Control Threads (Modeled in Python)

Use

r Def

initi

on (M

odel

ed in

Ver

ilog

HD

L an

d P

ytho

n)

Gen

erat

ed b

y

PyC

oRA

M

PyCoRAMを用いたダイクストラIPコア n ステージ1: 最小コストノード取り出し


Read Node

InStream

Read Edge

InStream

Update Node

OutStream

Mark Visited

OutStream

Priority Queue


Cost

+

Next Node Addr

Node Addr

Next Node Cost


Next Node Cost

FSM



Slave Interfaces

Dijkstra Logic



Read Node Cthread


Cthread Main

CThread


Use

r Def

initi

on (M

odel

ed in

Ver

ilog

HD

L an

d P

ytho

n)

Gen

erat

ed b

y

PyC

oRA

M

PyCoRAMを用いたダイクストラIPコア n ステージ2：ノード情報読み出し


Read Node

InStream

Read Edge

InStream

Update Node

OutStream

Mark Visited

OutStream

Priority Queue


Cost

+

Next Node Addr

Node Addr

Next Node Cost


Next Node Cost

FSM



Slave Interfaces

Dijkstra Logic



Read Node Cthread


Cthread Main

CThread


Use

r Def

initi

on (M

odel

ed in

Ver

ilog

HD

L an

d P

ytho

n)

Gen

erat

ed b

y

PyC

oRA

M

PyCoRAMを用いたダイクストラIPコア n ステージ3：ノードに訪問済みフラグ書き込み


Read Node

InStream

Read Edge

InStream

Update Node

OutStream

Mark Visited

OutStream

Priority Queue


Cost

+

Next Node Addr

Node Addr

Next Node Cost


Next Node Cost

FSM



Slave Interfaces

Dijkstra Logic



Read Node Cthread


Cthread Main

CThread


Use

r Def

initi

on (M

odel

ed in

Ver

ilog

HD

L an

d P

ytho

n)

Gen

erat

ed b

y

PyC

oRA

M

PyCoRAMを用いたダイクストラIPコア n パイプライン動作：（１）エッジ読み出し→ （２）隣接ノード読み出し→（３）隣接ノード更新


Read Node

InStream

Read Edge

InStream

Update Node

OutStream

Mark Visited

OutStream

Priority Queue


Cost

+

Next Node Addr

Node Addr

Next Node Cost


Next Node Cost

FSM



Slave Interfaces

Dijkstra Logic



Read Node Cthread


Cthread Main

CThread


Use

r Def

initi

on (M

odel

ed in

Ver

ilog

HD

L an

d P

ytho

n)

Gen

erat

ed b

y

PyC

oRA

M

1 2 3

優先度付きキュー（ヒープ） n  CoRAMメモリ x2 + BRAM x1

l 外部から読み出すための CoramInChannel

l 外部へ書き込むための CoramOutChannel

l コストが小さいノード群を格納する BRAM


Control Thread

(Modeled in Python

FSM Channel FIFO

Out

DMAC DMAC

Priority Queue Logic (Modeled in Verilog HDL)

Memory Bus (To DRAM)

DMA requests

Parent Left Right Child

Compare Logic

BRAM In

d, 20

f, 45 b, 30

a, 40 e, 50 BRAM Zone

評価 n  FPGAボード実機で評価

l ボード: Digilent Atlys •  FPGA: Spartan-6 LX45

•  DRAM: DDR2-800 (1.6GB/s), 128MB

l ツール: Xilinx PlanAhead 14.7, XPS 14.7

n 汎用PC上と比較 l  Intel Core i7 3770K (3.5GHz), DDR3-1600 (12.8GB/s ×2)

l  Linux (Ubuntu 14.04), gcc 4.8.2 (-O3)

n グラフ l  XORSHIFT乱数を用いてランダムに生成

l ノード数: 5000，エッジ数: 100000 •  より大規模なグラフはデバッグが間に合わなかったので今後の課題ということで･･･


評価環境：FPGAシステム n ホストからUART経由で制御・グラフ構築にMicroblaze


Dijkstra Accelerator

CoRAM Abstraction

Read Node

CThread

Dijkstra Logic

Read Node

Read Edge

Update Node

Mark Visited

Read Edge

CThread

Update Node

CThread

Mark Visited

CThread

Main Control Tread

Priority Queue

Priority Queue

CThread

UART Loader

CoRAM Abstraction

UART Loader Logic

Control Thread

Microblaze 3-stage

16KB Local memory 2KB I-Cache 2KB D-Cache

AXI4 Interconnect (128-bit, Crossbar)

DRAM Controller (DDR2-800 16-bit (1.6GB/s))

AXI4-lite Interconnect (32-bit, Shared bus)

Host PC

FPGA (Spartan-6 LX45)

DRAM (128MB)

実装の詳細 n  AXI4インターコネクト：4構成

l クロスバー2種：パイプラインレジスタ等を持つ高性能タイプ •  C128: 128ビット幅

•  C32: 32ビット幅

l 共有バス2種：リソース使用量を優先した省エリアタイプ •  S128: 128ビット幅

•  S32: 32ビット幅

n  AXI4バスでは異なるマスターポート間では Read/WriteのIn-order順番が保証されていない l 先にバスにリクエストが発行したからといって必ず先に処理されるわけではない

l 特に Write → Read の依存関係には注意が必要 l 解決策

•  AXI4バスの設定で書き込みポートのPriorityを高くする


実行時間 n  FPGA上の実装は汎用PCと比べて25倍程度低速L

l メモリバンド幅あたりの性能でも1.5倍程度悪い･･･ l なぜか？

•  データセットが小さい・OoOプロセッサのMLP抽出能力は凄い


498.9 492.7

413.4 404.3

16.2 0.0

100.0

200.0

300.0

400.0

500.0

600.0

C128 C32 S128 S32 x86

Exe

cutio

n Ti

me

[mse

c]�

25x

実行時間

n  FPGAでの実行時間を比べてみると直感と真逆の結果 l クロスバーよりも共有バスの方が高性能！

l バス幅が狭い方が高性能！

n なぜか？ l 共有バスの方がレイテンシが短い

l バス幅が短い方がレイテンシが短い

l 必要なモノ：高バンド幅ではなく短レイテンシ


498.9 492.7

413.4 404.3

16.2 0.0

100.0

200.0

300.0

400.0

500.0

600.0

C128 C32 S128 S32 x86

Exe

cutio

n Ti

me

[mse

c]�

FPGAリソース使用量


0 1000 2000 3000 4000 5000 6000 7000 8000

C128 C32 S128 S32

# O

ccup

ied

Slic

es�

Dijkstra Loader CPU Peripheral Interconnect

0

2000

4000

6000

8000

10000

12000

14000

C128 C32 S128 S32

# O

ccup

ied

Reg

s�


0

5000

10000

15000

20000

C128 C32 S128 S32

# O

ccup

ied

LUTs�


0

1000

2000

3000

4000

5000

6000

Dijkstra Reg Dijkstra LUT Loader Reg Loader LUT

# O

ccup

ied

Res

ourc

es� DMAC

Control Thread

User Logic

まとめ

n メモリ抽象化フレームワークPyCoRAMを用いたダイクストラ法FPGAアクセラレータを開発 l  PyCoRAM

•  HDLとPythonを両用するFPGAアクセラレータ開発フレームワーク

l 汎用CPUと比べて25倍低速・バンド幅あたりの性能で1.5倍悪い

l  4種類のインターコネクトの構成で評価 •  どうやらスループットよりもレイテンシが重要

n ツール・フレームワークはgithubにて公開中 l  PyCoRAM: http://shtaxxx.github.io/PyCoRAM/

l  Pyverilog: http://shtaxxx.github.io/Pyverilog/


PyCoRAMを用いたグラフ処理FPGAアクセラレータ

Technology

Transcript of PyCoRAMを用いたグラフ処理FPGAアクセラレータ