@ 2018 Alibaba Group
Applied AI Architecture @ Alibaba Infrastructure
Lingjie XU, Director, Heterogeneous Computing
AIS
Who I am
Joined Alibaba Spring 2017
Leading Applied AI Architecture team, focusing on AI HW acceleration and HW/SW co-play
Held multiple senior architect and management roles in GPU domain
Who we are
Data Platform
Alibaba Cloud
Logistics: Cainiao
E-Commerce: Taobao, Tmall, Alibaba.com, 1688.com, AliExpress
Finance: Alipay, Micro-Credit, Insurance, Funds
Alibaba Infrastructure Service
Juhuasuan
Cloud Partners
Private cloud, Special cloud, Public cloud
Alibaba Infrastructure
IDC: Modular, Eco-Friendly, Automation
Network: 100G, SDN, Security
GOC: Monitor, Analyze, Act
Server: High Perf, Low Power, Scalability
Technology Overview
Business Platform
BI, Recommendation, NLP, Vision, Search, …
Algorithm Platform Data Platform Computing Platform
OS Middleware Storage Database
IDC Server Processor Network Operation
Efficiency
Security
Datacenters
Zhangbei Datacenter
(Fresh air cooling system)
Best PUE <1.2
New Frontier: Server immersion cooling
PUE ~1.0
Qiandaohu Datacenter
(Lake water cooling system)
PUE < 1.3
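For readers unfamiliar with the metric: PUE (power usage effectiveness) is total facility energy divided by the energy delivered to IT equipment, so a value near 1.0 means almost no overhead for cooling and power delivery. The sketch below shows only that arithmetic. The function names and the sample energy figures are illustrative assumptions, not Alibaba measurements.

```python
def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """Power usage effectiveness: total facility energy / IT equipment energy."""
    if it_equipment_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")
    return total_facility_kwh / it_equipment_kwh


def overhead_kwh_per_it_kwh(pue_value: float) -> float:
    """Non-IT energy (cooling, power delivery, ...) spent per kWh of IT load."""
    return pue_value - 1.0


# Hypothetical meter readings, chosen only to illustrate the ratio.
example = pue(total_facility_kwh=1200.0, it_equipment_kwh=1000.0)
print(f"example PUE: {example:.2f}")

# PUE bounds quoted on the slide, treated here as plain numbers.
for site, bound in [("Zhangbei", 1.2), ("Immersion cooling", 1.0), ("Qiandaohu", 1.3)]:
    print(f"{site}: about {overhead_kwh_per_it_kwh(bound):.1f} kWh overhead per IT kWh")
```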
Network
Massive Scale + Diverse Applications + Bursty Traffic + Fast Growth
Global Infrastructure
Compute & Storage
NPU
GN6: 8-way GPU Server
• SXM2 or PCIe
• Decoupled modular design
• Configurable topology
Topologies: Balanced, Common, Cascade
[Figure: three block diagrams, one per topology, each showing CPU0, CPU1, four PCIe switches, and eight GPUs]
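The slide names three topologies but the wiring of each is not recoverable from the transcript. The sketch below assumes the reading these names commonly have for 8-GPU PCIe servers: Balanced splits the GPUs across both CPUs, Common hangs both switch groups off one CPU, and Cascade chains one switch behind the other. It also simplifies to two switches rather than the four drawn on the slide. All node names (`SW0`, `SOCKET_LINK`, and so on) are hypothetical. Under those assumptions it shows how the choice changes the path between two GPUs.

```python
from typing import Dict, List

SOCKET_LINK = "SOCKET_LINK"  # hypothetical name for the CPU0<->CPU1 interconnect


def build_topology(mode: str) -> Dict[str, str]:
    """Return a child -> upstream-parent map for an assumed 8-GPU layout."""
    uplinks = {"balanced": "CPU1", "common": "CPU0", "cascade": "SW0"}
    if mode not in uplinks:
        raise ValueError(f"unknown mode: {mode}")
    # GPUs 0-3 sit under switch SW0, GPUs 4-7 under switch SW1.
    parent = {f"GPU{i}": ("SW0" if i < 4 else "SW1") for i in range(8)}
    parent["SW0"] = "CPU0"
    parent["SW1"] = uplinks[mode]  # the only difference between the modes
    parent["CPU0"] = SOCKET_LINK
    parent["CPU1"] = SOCKET_LINK
    return parent


def path_to_root(parent: Dict[str, str], node: str) -> List[str]:
    """Walk upstream from a node until there is no further parent."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def route(parent: Dict[str, str], src: str, dst: str) -> List[str]:
    """Path from src to dst through their lowest common upstream node."""
    up_src = path_to_root(parent, src)
    up_dst = path_to_root(parent, dst)
    common = next(n for n in up_src if n in up_dst)
    return up_src[: up_src.index(common) + 1] + list(
        reversed(up_dst[: up_dst.index(common)])
    )


for mode in ("balanced", "common", "cascade"):
    hops = route(build_topology(mode), "GPU0", "GPU7")
    print(mode, "->", " > ".join(hops), "| crosses sockets:", SOCKET_LINK in hops)
```

In this model only the Balanced layout sends GPU0-to-GPU7 traffic across the socket interconnect, while Cascade keeps it entirely on the switches. That trade-off between CPU bandwidth per GPU and GPU-to-GPU path length is the usual reason to make the topology configurable.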
Data, Computing Power, Algorithm
The Wave of AI Revolution
Deep Learning @ Alibaba
CloudSearch
PAI
iDST
Ant
Ads
City Brain
New Retail
Database Acceleration
Video Analysis
NLP
Cloud
Deep Learning Evolution
ImageNet Classification Top-5 Error %
ILSVRC'10: 28.2
ILSVRC'11: 25.8
ILSVRC'12 AlexNet: 16.4
ILSVRC'13: 11.7
ILSVRC'14 VGG: 7.3
ILSVRC'14 GoogleNet: 6.7
ILSVRC'15 ResNet: 3.57
ILSVRC'16 GBD-Net: 2.99
Depth labels on the chart: shallow, 8 layers, 19 layers, 22 layers, 152 layers, 269 layers
[Chart axes: Layers (0 to 300), Error % (0 to 30)]
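The metric on this chart, top-5 error, counts a prediction as wrong only when the true label is absent from the model's five highest-scoring classes. A minimal sketch of that computation follows. The function name and the toy scores are illustrative assumptions, not ILSVRC data.

```python
from typing import List, Sequence


def top_k_error_percent(scores: Sequence[Sequence[float]],
                        labels: Sequence[int],
                        k: int = 5) -> float:
    """Percentage of samples whose true label is not among the k best scores."""
    if len(scores) != len(labels) or not labels:
        raise ValueError("scores and labels must be non-empty and the same length")
    wrong = 0
    for row, label in zip(scores, labels):
        # Class indices ordered from highest to lowest score.
        ranked: List[int] = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label not in ranked[:k]:
            wrong += 1
    return 100.0 * wrong / len(labels)


# Toy example with 6 classes and 2 samples (made-up scores).
toy_scores = [
    [0.10, 0.20, 0.30, 0.15, 0.05, 0.20],  # true class 4 has the lowest score: a miss
    [0.50, 0.10, 0.10, 0.10, 0.10, 0.10],  # true class 0 is ranked first: a hit
]
toy_labels = [4, 0]
print(top_k_error_percent(toy_scores, toy_labels))
```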
PaiLiTao
• Category Prediction
• Object Detection
• Feature Extraction
• Index Searching
• Sorting & Output
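The five stages listed for PaiLiTao read as a sequential image-search pipeline. The sketch below shows one way such stages could chain together. Every stage body is a stand-in: the catalog, the item names, the `category_hint` field, and the brute-force cosine search are all hypothetical and say nothing about how PaiLiTao is actually implemented.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]

# Hypothetical catalog index: category -> [(item id, feature vector)].
INDEX: Dict[str, List[Tuple[str, Vector]]] = {
    "shoes": [("shoe-1", [1.0, 0.0, 0.2]),
              ("shoe-2", [0.9, 0.1, 0.3]),
              ("shoe-3", [0.0, 1.0, 0.0])],
    "bags": [("bag-1", [0.0, 0.2, 1.0])],
}


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def predict_category(image: dict) -> str:
    return image["category_hint"]      # stand-in for a category classifier


def detect_object(image: dict) -> Vector:
    return image["pixels"]             # stand-in: treat the whole image as the object


def extract_features(region: Vector) -> Vector:
    return region                      # stand-in for a CNN embedding


def search_index(category: str, feature: Vector) -> List[Tuple[float, str]]:
    # Brute-force similarity against every item in the predicted category.
    return [(cosine(feature, vec), item) for item, vec in INDEX[category]]


def visual_search(image: dict, top_k: int = 2) -> List[str]:
    category = predict_category(image)            # 1. category prediction
    region = detect_object(image)                 # 2. object detection
    feature = extract_features(region)            # 3. feature extraction
    hits = search_index(category, feature)        # 4. index searching
    hits.sort(reverse=True)                       # 5. sorting & output
    return [item for _, item in hits[:top_k]]


query = {"category_hint": "shoes", "pixels": [1.0, 0.0, 0.2]}
print(visual_search(query))
```

Predicting the category first lets the search stage scan only one slice of the index, which is one plausible reason for the stage order shown on the slide.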
OCR
• Tens of millions of images
• CNN model
• Single-character accuracy 99.6%
• Overall accuracy 93%
• 8-way distributed GPU solution
• 7x training speed
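The last two bullets can be combined into a scaling-efficiency figure. The sketch below assumes the 7x speedup is measured against a single GPU, which the slide does not state. The 70-hour baseline is a made-up number used only to show the arithmetic.

```python
def scaling_efficiency(speedup: float, workers: int) -> float:
    """Fraction of ideal linear speedup actually achieved."""
    if workers <= 0:
        raise ValueError("workers must be positive")
    return speedup / workers


def scaled_training_hours(single_worker_hours: float, speedup: float) -> float:
    """Wall-clock training time after applying the measured speedup."""
    return single_worker_hours / speedup


efficiency = scaling_efficiency(speedup=7.0, workers=8)
print(f"scaling efficiency: {efficiency:.1%}")

# Hypothetical baseline: a job that takes 70 hours on one GPU.
print(f"8-GPU time: {scaled_training_hours(70.0, 7.0):.1f} h")
```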
Deep Learning Everywhere
Translation, Voice, Insurance
Heterogeneous Machine Learning Platform
*data from NVIDIA GTC 2017
Hardware Accelerated AI
• Training: Compute Intensive, Time Cost
• Inference: Service Oriented, Response Time
• Eco-System: Framework, Libs, Precisions
• Hardware Dividends for everyone
Tipping point:
• Google TPUs
• Volta TensorCore
• New hardware accelerators for AI
Edge – Forces of Gravity
Data
Latency
Privacy
TCO
Customer
Device
Premises
Hyperscale DC
<1 km, 1-8 km, 8-20 km
Regional DC
Core DC
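To give the distance labels on this slide a rough latency scale, the sketch below converts fiber distance into round-trip propagation delay. It assumes about 5 microseconds per kilometre, the usual approximation for light in optical fiber, and ignores switching, queuing, and processing time. The constant and function names are illustrative, and the slide does not say which distance belongs to which tier.

```python
FIBER_US_PER_KM = 5.0  # approximate one-way delay of light in optical fiber


def round_trip_us(distance_km: float) -> float:
    """Round-trip propagation delay in microseconds over a fiber path."""
    if distance_km < 0:
        raise ValueError("distance must be non-negative")
    return 2.0 * distance_km * FIBER_US_PER_KM


# Upper ends of the distance ranges quoted on the slide.
for km in (1.0, 8.0, 20.0):
    print(f"{km:>4.0f} km: {round_trip_us(km):.0f} us round trip (propagation only)")
```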
Function Computing
Lower Dev Cost
On-demand Use
Platform Differentiation
Increased Utilization
Less Control
Not Portable
Private Cloud
Increased costs for optimization
Technology Challenges
• Quality of Service
• Infrastructure Utilization
• Accelerator Efficiency
• Capacity Granularity
• Multi-Tenancy Management
• Demand Projection
• Scheduling
• Compatibility
Opportunities & Challenges
Fine Grained Monitoring For Efficiency
Perf, Power, Stability
Deep Customization
Best TCO, Competitive
Empower “Traditional” Algorithms