Scale-out Computing Model on Massive Core System: From HPC to Fabric-Based SoC
Dr. Fu Li [email protected]
Quantum Cloud Future (Beijing) Technologies Co., Ltd.
-
Cook Book
1. What is a Massive Core System (MCS)?
1.1. HPC system
1.2. GPU system
1.3. Microslides: fabric-based SoC
2. Why is scale-out computing important in MCS?
3. How to make MCS faster?
3.1. MPI and OpenMP in HPC
3.2. Memory coalescing and cudaDMA in GPU computing
4. QCF's scale-out computing model for Microslides
4.1. The hardware (Socionext)
4.2. The architecture
4.3. The results (ARM vs x86 vs GPU)
-
Introduction to Quantum Cloud
[Figure: background topics: Quantum Theory and Spectroscopy, Molecular Dynamics, Fast Fourier Transform, Statistical Mechanics, HPC, MPI, OpenMP, CUDA, GPU switch, PacketShader, Content-Centric Networking, Cloud Storage, Doppler ASIC, Boba FPGA]
With a background in quantum calculation, we 1) perform large-scale molecular dynamics simulations on HPC clusters using Amber and GROMACS, and 2) optimize Fourier transforms and matrix operations on multicore systems.
Then we found that the GPU is a great tool for both molecular dynamics and matrix operations.
Later we found similar systems with massive numbers of CPU cores.
Today we will show some practical examples of our scale-out algorithms on these systems.
-
System and Cores: Communication Matters
[Figure: Number of Cores (1 to 100,000) versus System Power Consumption in Watts (10 to 1M). General-purpose systems: PC Server, Blade Server, Super Computer. Special-purpose systems: GPU, GPU Cluster. ARM systems: Traditional ARM Server, ARM SoC. Microslides (QCF & SOCIONEXT, general-purpose): Microslides of ARM CPU, Microslides of ARM SoC. Timeline labels: 2006, 2012, 2018. Connection labels: intra-CPU connection, inter-CPU connection, cluster connection.]
-
Data Communication Between Systems Is the Obstacle
[Figure: two systems side by side, each with the hierarchy cores, Cache L1, Cache L2/L3, Intra CPU Fabric, Sockets Bus, Memory, Networking; additional labels: Cache/Storage, I/O]
Hierarchical structure is critical for the Von Neumann architecture.
Levels of parallelism:
• instruction-level parallelism
• OS-level parallelism
• algorithm-level parallelism
Techniques:
• batch, share-nothing, stateless computing
• big RAM, avoiding context switching, TLB, cache-conscious design
• big.LITTLE, GPU, FPGA
• fast cache, cache prefetch, vector processing, SIMD/AVX
Consolidation will be the next wave of innovation in chip design and system optimization:
• I/O consolidation: networking, bus, fabric
• Storage consolidation: memory, cache, networking buffer
-
Parallel and Scaling
-
Fabric-Based ARM SoC
From SOCIONEXT
• PCIe fabric for networking
• 768 cores
• c2c 10 Gbps, 36 microsecond latency
• 1 TB DDR4 RAM
• 700 watts TDP per chassis

System     watt/core
ARM SoC    1
x86        16 ~ 25
GPU        0.3 ~ 0.5
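
The chassis figures above can be turned into two back-of-envelope numbers. The sketch below is only illustrative: the function names are hypothetical, and the transfer-time formula (fixed latency plus size divided by bandwidth) is an assumed simple model, not something stated on the slide.

```python
def watts_per_core(chassis_tdp_watts: float, cores: int) -> float:
    """Average power budget per core for one chassis."""
    return chassis_tdp_watts / cores


def transfer_time_us(message_bytes: int,
                     latency_us: float = 36.0,
                     bandwidth_gbps: float = 10.0) -> float:
    """Estimated time to move one message over a c2c link, in microseconds.

    Assumed model: fixed latency plus serialization time (size / bandwidth).
    Defaults are the slide's 36 microsecond latency and 10 Gbps bandwidth.
    """
    bits = message_bytes * 8
    return latency_us + bits / (bandwidth_gbps * 1e9) * 1e6


# 700 W TDP spread over 768 cores, as listed on the slide.
print(f"{watts_per_core(700, 768):.2f} W/core")
# A 1 MB message is dominated by bandwidth, a 64-byte message by latency.
print(f"{transfer_time_us(1_000_000):.1f} us, {transfer_time_us(64):.2f} us")
```

Under this model, 700 W over 768 cores comes to a little under 1 W per core, consistent with the ARM SoC row of the table, and small messages cost almost exactly the fixed latency, which is why batching messages matters on such a fabric.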
-
Cluster Management Tools
PBS
• Basic: batch process
• Pros: very fast, very flexible, normally used with MPI
• Cons: no isolation
• Scenario: scientific calculation
OpenStack
• Basic: KVM
• Pros: very secure, very stable, system-level isolation
• Cons: high overhead, slow
• Scenario: private cloud
Kubernetes
• Basic: container
• Pros: fast, secure, production ready
• Cons: container apps only, not flexible enough
• Scenario: application CI
Mesos
• Basic: container/non-container
• Pros: fast, compatible with processes and containers, production ready, can be secure
• Cons: complexity
• Scenario: datacenter OS
-
Share-Nothing + Message Queue Architecture
Stateless computing architecture
[Diagram: a host with compute cores and one I/O core]
Use an "individual" (dedicated) core to do I/O for the host, to increase throughput.
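
A minimal sketch of this pattern, under several assumptions: Python threads stand in for cores, `queue.Queue` stands in for the message queue, squaring a number stands in for the real computation, and all names (`compute_worker`, `io_worker`, `run_pipeline`) are hypothetical. It is meant to show the shape of the design, not QCF's implementation.

```python
import queue
import threading

_STOP = object()  # sentinel telling a thread to shut down


def compute_worker(tasks: queue.Queue, results: queue.Queue) -> None:
    """Stateless worker: every message carries all the data needed to process it."""
    while True:
        item = tasks.get()
        if item is _STOP:
            break
        task_id, value = item
        results.put((task_id, value * value))  # placeholder computation


def io_worker(results: queue.Queue, sink: list) -> None:
    """The single dedicated I/O thread: the only code that touches the sink."""
    while True:
        item = results.get()
        if item is _STOP:
            break
        sink.append(item)


def run_pipeline(values, n_workers: int = 3) -> list:
    tasks, results, sink = queue.Queue(), queue.Queue(), []
    workers = [threading.Thread(target=compute_worker, args=(tasks, results))
               for _ in range(n_workers)]
    io_thread = threading.Thread(target=io_worker, args=(results, sink))
    for t in workers + [io_thread]:
        t.start()
    for task_id, value in enumerate(values):
        tasks.put((task_id, value))
    for _ in workers:
        tasks.put(_STOP)  # one stop message per compute worker
    for t in workers:
        t.join()
    results.put(_STOP)  # queued after every result, so nothing is dropped
    io_thread.join()
    return sorted(sink)


print(run_pipeline([1, 2, 3, 4]))
```

The compute workers share no state and never perform I/O themselves; they only exchange messages. All output is funnelled through one consumer, which is the role the slide gives to the dedicated I/O core.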
-
Example: PacketShader on GPU
-
Example: Rendering on Arm
[Figure: Render@Baremetal, Intel vs ARM, scenes: buggy, fishy cat, bmps, teeglasFX, splash, poked]
[Figure: Render@Container, bmw27 and classroom benchmarks, series: Baremetal, 1 container, 2 containers, 4 containers]
• 3x improvement under concurrency
• 1.8x improvement with multiple concurrent instances
-
Example: Rendering on Arm
[Figure: Intel vs ARM SoC compared on three metrics: performance, scaled 1, scaled 2]
scaled 1: performance scaled by frequency and core count
scaled 2: performance scaled by frequency, core count, and watts
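
The slide does not give the exact formulas for the two scaled metrics. The sketch below assumes the simplest reading, dividing raw performance by clock frequency and core count (scaled 1) and additionally by power (scaled 2); the function and parameter names are hypothetical, and the numbers in the usage lines are made up for illustration.

```python
def scaled_1(performance: float, freq_ghz: float, cores: int) -> float:
    """Performance per GHz per core (assumed reading of 'scaled 1')."""
    return performance / (freq_ghz * cores)


def scaled_2(performance: float, freq_ghz: float, cores: int, watts: float) -> float:
    """Performance per GHz per core per watt (assumed reading of 'scaled 2')."""
    return scaled_1(performance, freq_ghz, cores) / watts


# Illustrative inputs only; these are not measurements from the slides.
print(scaled_1(24.0, freq_ghz=2.0, cores=4))
print(scaled_2(24.0, freq_ghz=2.0, cores=4, watts=1.5))
```

Normalizing this way lets a low-frequency, low-power core be compared with a high-frequency, high-power one on a per-resource basis, which is the comparison the Intel vs ARM SoC charts appear to make.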
-
Example: AI on Arm
Caffe@Container: ARM vs Intel vs GPU (scaled)
[Figure: scaled results for CIFAR 10 - 1, CIFAR 10 - 2, CIFAR 10 - 3; series: Intel, ARM, GPU 1070]
-
Example: AI on Arm SoC
[Figure: two panels, Training and Inference, each comparing Intel vs SoC for caffe, scaled caffe, darknet, scaled darknet]
-
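Quantum Cloud Future (Beijing) Technology Co. Ltd. (hereafter "Quantum Cloud") is a vertical-industry cloud computing company focused mainly on the film and television industry.
Quantum Cloud specializes in moving the film and television industry to the cloud. Working with internationally known film studios and visual effects companies, it provides film and television customers with one-stop solutions including production software, graphics workstations, high-performance storage, and rendering services.
ADDRESS Room 2101, Office Tower A, Sanlitun SOHO, No. 8 Gongti North Road, Chaoyang District, Beijing
NUMBER 010-53518265
EMAIL [email protected]
WEBSITE www.lzyco.com
THANKS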