資料2-2 最新のHPC技術を生かしたAI・ビッグデータインフラの ... · 2019. 11....

最新のHPC技術を生かしたAI・ビッグデータインフラの東工大TSUBAME3.0

東京工業大学学術国際情報センター (GSIC) 教授

産業技術総合研究所人工知能研究センター (AIRC) 特定フェロー

産総研・東工大リアルワールドビッグデータオープンイノベーションラボラボ長

松岡聡

文部科学省プレゼン資料

2017/4/26

t-tsukamoto

テキストボックス

資料２－２

ACM Gordon Bell Prize 2011ACM ゴードンベル賞2.0ペタフロップス達成

(京コンピュータと同時受賞)

TSUBAME2とシミュレーションアプリケーションのACMゴードンベル賞受賞

Special Achievements in Scalability and Time-to-Solution“Peta-Scale Phase-Field Simulation for Dendritic Solidification on the TSUBAME 2.0 Supercomputer” 2

大規模なビッグデータ・AI処理の二つの計算特性ビッグデータ・AIとして

グラフ解析：ソーシャルネットワークソート・ハッシュ：DB・ログデータ解析

記号処理：従来型の人工知能

HPCとして整数・疎行列計算

データ移動中心・大規模メモリ疎でかつランダムなデータ

ビッグデータ・AIとして密行列演算：深層学習

推論・学習・生成

HPCとして密行列計算ただし縮退精度中心

密で比較的整列されたネットワークやデータ

高速化・大規模化高速化・大規模化スパコンによる加速ただし、両者への適合

計算特性がお互い全く異なるが、スパコンにおけるシミュレーション計算も本質的にはどちらか

3

(ビッグデータの) BYTES 能力が、特にデータのバンド幅と容量がBD/AIの加速には大変重要だが、近年の(Linpack)FLOPS偏重のスパコンには多くが欠けている

• システム全体のバンド幅(BYTE/S)と容量(BYTES)がHPC-BD/AI統合には必要(メモリバンド幅だけではNG)

• 左側の疎なデータ中心のアプリには自明

• 右側のDNNでも、高速化のためのマルチノードの並列化で強スケーリングする場合や、CTや地層などの3次元の大規模化するデータ解析を行う場合はやはり必要

(Source: https://www.spineuniverse.com/image-library/anterior-3d-ct-scan-progressive-kyphoscoliosis)

(Source: http://www.dgi.com/images/cvmain_overview/CV4DOverview_Model_001.jpg)

CaffeNetのTSUBAME-KFC/DLにおける学習時の一回のイ

テレーションの時間(Mini-batch size 256)

殆どの時間が通信やメモリ内の捜査に使用

Number of nodes

殆どが通信時間GPU上での計算は僅か3.9%

BD/AIとHPCを統合するスパコンでは、メモリやネットワークの容量や性能を確保するアーキやそのシステムソフトが大変重要 4

ビッグデータ・AI時代のBYTES中心スパコン研究開発要件-東工大GSIC等・産総研AIRC、およびジョイントラボRWBC-OILでの活動-

Bytes-Centric Supercomputer

• BYTES中心スパコンの研究開発：倍精度FLOPS増加脱却 (TSUBAME3)• ペタ～エクサバイトの大規模データがマシン内で自由に移動

• 最新の不揮発性メモリを用いた「ファットノード」の深いメモリ階層と、テラビット級超高速ネットワークによるノード間結合

• BYTES中心スパコン上の整数・疎行列計算アルゴリズム・ソフト

• BYTES中心スパコン上の深層学習アルゴリズムのスケーリング・加速

• シミュレーションサイエンスと加速されたデータサイエンスの統合

• HPC加速されたBD/AI研究を行う研究組織体(例：東工大・産総研OIL)• 我が国のＡＩ基盤として産官学連携・民間技術移転(TSUBAME3->ABCI)

• 世界トップクラスのHPC-AIクラウド基盤を開発5

JST-CREST “Extreme Big Data” Project (2013-2018)

SupercomputersCompute&Batch-Oriented

More fragile

Cloud IDCVery low BW & EfficiencyHighly available, resilient

Convergent Architecture (Phases 1~4) Large Capacity NVM, High-Bisection NW

PCB

TSV Interposer

High Powered Main CPU

Low Power CPU

DRAMDRAMDRAMNVM/Flash

NVM/Flash

NVM/Flash

Low Power CPU

DRAMDRAMDRAMNVM/Flash

NVM/Flash

NVM/Flash

2Tbps HBM4~6HBM Channels1.5TB/s DRAM & NVM BW

30PB/s I/O BW Possible1 Yottabyte / Year

EBD System Softwareincl. EBD Object System

Large Scale Metagenomics

Massive Sensors and Data Assimilation in Weather Prediction

Ultra Large Scale Graphs and Social Infrastructures

Exascale Big Data HPC

Co-Design

FLOPS中心からBYTES中心のHPCとBD/AIの統合

Graph Store

EBD BagCo-Design

KVS

KVS

KVS

EBD KVS

Cartesian PlaneCo-Design

Given a top-class supercomputer, how fast can we accelerate next generation big data c.f. conventional Clouds?

Issues regarding Architecture, algorithms, system software in co-design

Performance Model?Accelerators?

6

TSUBAME-KFC/DL: ウルトラグリーン・スパコン研究設備（文部科学省概算要求・2011-2015・約2億円)→機械学習スパコンへUG

高温冷却系冷媒油 35~45℃⇒ 水 25~35℃

(TSUBAME2は7~17℃)

冷却塔：水 25~35℃

⇒ 自然大気へ

液浸冷却＋高温大気冷却＋高密度実装＋電力制御のスパコン技術を統合

TSUBAME3.0のプロトタイプ

コンテナ型研究設備20フィートコンテナ(16m2)

無人自動制御運転

将来はエネルギー回収も

高密度実装・油浸冷却210TFlops (倍精度)630TFlops (単精度)

1ラック

2013年11月/2014年6月Green500ランキング世界一位(日本初)

2015アップグレード500TFlops (倍精度)

機械学習等1.5PFlops (単精度)世界最高性能密度

(やはり1ラック=>7ラックで京相当)1ラック60KWの高密度発熱 7

TSUBAME3.0

2006年 TSUBAME1.080テラフロップス・アジア一位

世界7位

2010年 TSUBAME2.02.4ペタフロップス・世界4位運用スパコングリーン世界一

2013年TSUBAME2.5アップグレード(補正予算)

5.7ペタフロップス

2013年TSUBAME-KFCスパコングリーン世界一

2017年 TSUBAME3.012.1ペタフロップス

HPCIリーディングスパコンHPC/BD/AIの統合

大規模学術及び

産業利用アプリ2011年ACMゴードンベル賞

2017年リーディングスパコンTSUBAME3.0へ「最先端の技術チャレンジに挑むスパコン」

1. HPCIリーディングスパコン：TSUBAME2と合算し京の二倍の性能：15-20ペタフロップス、4-5ペタバイト/秒メモリバンド幅、ペタビット級光ネットワーク

2. BD/AI Data Centric Architecture 大規模シリコンストレージによりマルチテラバイト/秒のI/O、機械学習・AI含むビッグデータ用ミドルウェアやアプリ

3. グリーンスパコン10ギガフロップス/W以上の性能電力性能・グリーン・PUE < 1.05

4. クラウドスパコン：種々のクラウドサービスと高速性の両立

5. 「見える化」スパコン：超複雑なマシンの状態が「見える」

8

http://www.new.facebook.com/album.php?profile&id=20531316728

http://www.new.facebook.com/album.php?profile&id=20531316728

TSUBAME 3.0 のシステム概要全GPU・全ノード・全メモリ階層での

BYTES中心スケーラブルアーキテクチャ

フルバイセクションバンド幅のインテル® Omni-Path® 光ネットワーク

432 Terabits/秒双方向全インターネット平均通信量の2倍

DDNのストレージシステム（並列FS 15.9PB+ Home 45TB）

５４０の計算ノード SGI ICE® XAインテル®Xeon® CPU×２＋NVIDIA Pascal GPU×４

256GBメモリ、２TBのNVMe対応インテル®SSD47.2 AI-Petaflops, 12.1 Petaflops

2017年8月本稼働

9

TSUBAME3: A Massively BYTES Centric Architecture for Converged BD/AI and HPCIntra-node GPU via NVLink

20~40GB/sIntra-node GPU via NVLink

20~40GB/s

Inter-node GPU via OmniPath12.5GB/s fully switched

HBM2 64GB2.5TB/s

DDR4256GB 150GB/s

Intel Optane1.5TB 12GB/s(planned)

NVMe Flash2TB 3GB/s

16GB/s PCIeFully Switched


~4 Terabytes/node Hierarchical Memory for Big Data / AI (c.f. K-compuer 16GB/node) Over 2 Petabytes in TSUBAME3, Can be moved at 54 Terabyte/s or 1.7 Zetabytes / year

Terabit class network/node800Gbps (400+400)

full bisection

Any “Big” Data in the system can be moved

to anywhere via RDMA speeds

minimum 12.5GBytes/s

also with Stream Processing

Scalable to all 2160 GPUs, not just 8

10

TSUBAME3: A Massively BYTES Centric Architecture for Converged BD/AI and HPCIntra-node GPU via NVLink

20~40GB/sIntra-node GPU via NVLink

20~40GB/s

Inter-node GPU via OmniPath12.5GB/s fully switched

HBM2 64GB2.5TB/s

DDR4256GB 150GB/s

Intel Optane1.5TB 12GB/s(planned)

NVMe Flash2TB 3GB/s



~4 Terabytes/node Hierarchical Memory for Big Data / AI (c.f. K-compuer 16GB/node) Over 2 Petabytes in TSUBAME3, Can be moved at 54 Terabyte/s or 1.7 Zetabytes / year

Any “Big” Data in the system can be moved

to anywhere via RDMA speeds

minimum 12.5GBytes/s

also with Stream Processing

Scalable to all 2160 GPUs, not just 8

11

TSUBAME3.0 SGI ICE-XA Blade (new)- No exterior cable mess (power, NW, water)- Plan to become a future HPE product

12

TSUBAME3.0 Datacenter

15 SGI ICE-XA Racks2 Network Racks3 DDN Storage Racks20 Total Racks

Compute racks cooled with 32 degrees warm water, Yearound ambient coolingAv. PUE = 1.033

13

Japanese Open Supercomputing Sites Aug. 2017 (pink=HPCI Sites)Peak Rank

Institution System Double FP Rpeak

Nov. 2016 Top500

1 U-Tokyo/Tsukuba UJCAHP

Oakforest-PACS - PRIMERGY CX1640 M1, Intel Xeon Phi 7250 68C 1.4GHz, Intel Omni-Path

24.9 6

2 Tokyo Institute of Technology GSIC

TSUBAME 3.0 - HPE/SGI ICE-XA custom NVIDIA Pascal P100 + Intel Xeon, Intel OmniPath

12.1 NA

3 Riken AICS K computer, SPARC64 VIIIfx 2.0GHz, Tofu interconnect Fujitsu

11.3 7

4 Tokyo Institute of Technology GSIC

TSUBAME 2.5 - Cluster Platform SL390s G7, Xeon X5670 6C 2.93GHz, Infiniband QDR, NVIDIA K20x NEC/HPE

5.71 40

5 Kyoto University Camphor 2 – Cray XC40 Intel Xeon Phi 68C 1.4Ghz 5.48 336 Japan Aerospace

eXploration AgencySORA-MA - Fujitsu PRIMEHPC FX100, SPARC64 XIfx 32C 1.98GHz, Tofu interconnect 2

3.48 30

7 Information Tech.Center, Nagoya U

Fujitsu PRIMEHPC FX100, SPARC64 XIfx 32C 2.2GHz, Tofu interconnect 2

3.24 35

8 National Inst. for Fusion Science(NIFS)

Plasma Simulator - Fujitsu PRIMEHPC FX100, SPARC64 XIfx 32C 1.98GHz, Tofu interconnect 2

2.62 48

9 Japan Atomic Energy Agency (JAEA)

SGI ICE X, Xeon E5-2680v3 12C 2.5GHz, Infiniband FDR 2.41 54

10 AIST AI Research Center (AIRC)

AAIC (AIST AI Cloud) – NEC/SMC Cluster, NVIDIA Pascal P100 + Intel Xeon, Infiniband EDR

2.2 NA 14

ビッグデータ・AI用浮動小数演算の精度

0 10 20 30 40 50 60 70

理研

東大

東工大

半精度以上の演算性能での拠点比較

(AI-FLOPS)

TSUBAME3.0 T2.5

京

Oakforest-PACS (JCAHPC)

Reedbush(U&H)

PFLOPS

倍精度 64bit 単精度 32bit 半精度 16bit整数演算

数値解析

シミュレーション計算

コンピュータ・グラフィックス

ゲーミング

ビッグデータ

機械学習・人工知能

65.8 Petaflops

TSUBAME3+2.5+KFCの合算で、機械学習/AIの性能はスパコン・クラウドすべて含め日本トップ

T-KFC

~6700 GPUs + ~4000 CPUs

15

Sparse BYTES: The Graph500 – 2015~2016 – world #1 x 4K Computer #1 Tokyo Tech[Matsuoka EBD CREST] Univ.

Kyushu [Fujisawa Graph CREST], Riken AICS, Fujitsu

List Rank GTEPS Implementation

November 2013 4 5524.12 Top-down only

June 2014 1 17977.05 Efficient hybrid

November 2014 2 19585.2 Efficient hybridJune, Nov 2015June Nov 2016 1 38621.4 Hybrid + Node

Compression

BYTES Rich Machine + Superior

BYTES algoithm

88,000 nodes, 660,000 CPU Cores1.3 Petabyte mem20GB/s Tofu NW

≫

LLNL-IBM Sequoia1.6 million CPUs1.6 Petabyte mem

0

500

1000

1500

64 nodes(Scale 30)

65536 nodes(Scale 40)

Elap

sed

Tim

e (m

s)

Communi…Computati…

73% total exec time wait in

communication

TaihuLight10 million CPUs1.3 Petabyte mem

Effective x13 performance c.f. Linpack

#1 38621.4 GTEPS(#7 10.51PF Top500)

#2 23755.7 GTEPS(#1 93.01PF Top500)

#3 23751 GTEPS(#4 17.17PF Top500)

BYTES, not FLOPS! 16

Distributed Large-Scale Dynamic Graph Data Store (work with LLNL, [SC16 etc.])

Node Level Dynamic Graph Data Store

Extend for multi-processes using an asyncMPI communication framework

Follows an adjacency-list format and leverages an open address hashing to construct its tables

2 billion insertions/s

Inse

rted

Bill

ion E

dges/

sec

Number of Nodes (24 processes per node)

Multi-node Experiment

STINGER• A state-of-the-art dynamic graph processing

framework developed at Georgia TechBaseline model• A naïve implementation using Boost library (C++) and

the MPI communication framework

京での結果を受け、(1) 容量としてのBYTES=深いメモリ階層への対応、(2) 超高速なグラフの変化への対応

K. Iwabuchi, S. Sallinen, R. Pearce, B. V. Essen, M. Gokhale, and S. Matsuoka, Towards a distributed large-scale dynamic graph data store. In 2016 IEEE Interna- tional Parallel and Distributed Processing Symposium Workshops (IPDPSW)

C.f. STINGER (single-node, on memory)

0

200

6 12 24Spe

ed

Up

Parallels

Baseline DegAwareRHH 212x

Dynamic Graph Construction (on-memory & NVM)

K Computer large memory but very expensive DRAM only

Develop algorithms and SW exploiting large hierarchical memory

世界トップクラスの更新性能とスケーラビリティを達成した永続的なグラフストアの開発に成功

17

Xtr2sort: Out-of-core Sorting Acceleration using GPU and Flash NVM [IEEE BigData2016]

• Sample-sort-based Out-of-core Sorting Approach for Deep Memory Hierarchy Systems w/ GPU and Flash NVM– I/O chunking to fit device memory capacity of GPU – Pipeline-based Latency hiding to overlap data transfers between NVM, CPU, and

GPU using asynchronous data transfers, e.g., cudaMemCpyAsync(), libaio

GPU

GPU + CPU + NVM

CPU + NVM

How to combine deepening memory layers for future HPC/Big Data workloads, targeting Post Moore Era?

x4.39

BYTES中心のHPCアルゴリズム：GPUのバンド幅高速ソートと、不揮発性メモリによる大容量化の両立

18

Open Source Release of EBD System Software (install on T3/Amazon/ABCI)

• mrCUDA• rCUDA extension enabling remote-

to-local GPU migration • https://github.com/EBD-

CREST/mrCUDA• GPU 3.0• Co-Funded by NVIDIA

• Huron FS (w/LLNL)• I/O Burst Buffer for Inter Cloud

Environment• https://github.com/EBD-

CREST/cbb• Apache License 2.0• Co-funded by Amazon

• ScaleGraph Python• Python Extension for ScaleGraph

X10-based Distributed Graph Library • https://github.com/EBD-

CREST/scalegraphpython• Eclipse Public License v1.0

• GPUSort• GPU-based Large-scale Sort• https://github.com/EBD-

CREST/gpusort• MIT License

• Others, including dynamic graph store

19

https://github.com/EBD-CREST/mrCUDA

https://github.com/EBD-CREST/cbb

https://github.com/EBD-CREST/scalegraphpython

https://github.com/EBD-CREST/gpusort

資料2-2 最新のHPC技術を生かしたAI・ビッグデータインフラの ... · 2019. 11....

Documents

Transcript of 資料2-2 最新のHPC技術を生かしたAI・ビッグデータインフラの ... · 2019. 11....