Post on 28-Dec-2015
"The Architecture of Massively Parallel Processor CP-PACS"
Taisuke Boku, Hiroshi Nakamura, et al.
University of Tsukuba, Japan
by Emre Tapcı
Outline
Introduction
Specification of CP-PACS
Pseudo Vector Processor PVP-SW
Interconnection Network of CP-PACS
  Hyper-crossbar Network
  Remote DMA message transfer
  Message broadcasting
  Barrier synchronization
Performance Evaluation
Conclusion, References, Questions & Comments
Introduction
CP-PACS: Computational Physics by Parallel Array Computer Systems
Goal: to construct a dedicated MPP for computational physics, in particular the study of Quantum Chromodynamics (QCD).
Center for Computational Physics, University of Tsukuba, Japan
Specification of CP-PACS
MIMD parallel processing system with distributed memory.
Each Processing Unit (PU) has a RISC processor and a local memory.
2048 such PUs, connected by an interconnection network.
128 I/O units that support distributed disk space.
Specification of CP-PACS
Theoretical performance
To solve problems like QCD and astro-fluid dynamics, a great number of PUs is required.
For budget and reliability reasons, the number of PUs is limited to 2048.
Specification of CP-PACS
Node processor
Improve the function of the node processors first.
Caches do not work efficiently on ordinary RISC processors for these workloads.
A new technique for the cache function is introduced: PVP-SW.
Specification of CP-PACS
Interconnection Network
3-dimensional Hyper-Crossbar (3-D HXB).
Peak throughput of a single link: 300 MB/sec.
Provides:
  Hardware message broadcasting
  Block-stride message transfer
  Barrier synchronization
Specification of CP-PACS
I/O system
128 I/O units, each equipped with a RAID-5 hard disk system.
528 GB total system disk space.
The RAID-5 system increases fault tolerance.
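As a minimal illustration of why RAID-5 increases fault tolerance (a toy model, not the actual controller logic of the CP-PACS I/O units): each stripe stores an XOR parity block, so any single lost block can be rebuilt from the survivors.

```python
# Toy model of RAID-5 single-disk recovery via XOR parity.
# Illustrative only; real RAID-5 also rotates the parity block
# across disks, which this sketch omits.

def parity(blocks):
    """XOR all blocks together, byte by byte."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

data = [b"AAAA", b"BBBB", b"CCCC"]    # stripe across 3 data disks
p = parity(data)                      # parity block on a 4th disk

# Disk 1 fails: rebuild its block from the remaining disks + parity.
rebuilt = parity([data[0], data[2], p])
assert rebuilt == data[1]
```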
Pseudo Vector Processor PVP-SW
MPPs require high-performance node processors.
A node processor cannot achieve high performance unless the cache system works efficiently. In these applications:
  Little temporal locality exists.
  The data space of the application is much larger than the cache size.
Pseudo Vector Processor PVP-SW
Vector processors:
  Main memory is pipelined.
  The vector length of load/store is long.
  Load/store is executed in parallel with arithmetic execution.
We require these properties in our node processor, so PVP-SW is introduced. It is pseudo-vector.
Pseudo Vector Processor PVP-SW
The number of registers cannot simply be increased, because the register field in instructions is limited.
So a new technique, Slide-Windowed Registers, is introduced.
Pseudo Vector Processor PVP-SW
Slide-Windowed Registers
Physical registers consist of logical windows; a window consists of 32 registers.
The total number of registers is 128.
Global registers & window registers:
  Global registers are static and shared by all windows.
  Window registers are not shared.
Only one window is active at a time.
Pseudo Vector Processor PVP-SW
Slide-Windowed Registers
The active window is identified by a pointer, FW-STP.
New instructions are introduced to deal with FW-STP:
  FWSTPSet: sets a new location for FW-STP.
  FRPreload: loads data from memory into a window.
  FRPoststore: stores data into memory from a window.
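The sliding-window mechanism can be sketched as a small software model (register counts are from the slides; the window-base behavior is my reading of the text, and global registers are omitted for brevity):

```python
# Simplified model of slide-windowed registers.
# Assumed layout (from the slides): 128 physical registers total,
# a logical window of 32 registers, selected by the FW-STP pointer.

PHYS = 128      # physical registers in the file
WINDOW = 32     # registers visible through one window

class SlideWindowRegs:
    def __init__(self):
        self.phys = [0] * PHYS
        self.fw_stp = 0                      # window base pointer (FW-STP)

    def fwstp_set(self, base):
        """Model of FWSTPSet: slide the active window (wraps around)."""
        self.fw_stp = base % PHYS

    def read(self, r):
        """Logical register r (0..31) maps into the physical file."""
        return self.phys[(self.fw_stp + r) % PHYS]

    def write(self, r, value):
        self.phys[(self.fw_stp + r) % PHYS] = value

regs = SlideWindowRegs()
regs.write(0, 42)        # write logical r0 in the window at base 0
regs.fwstp_set(32)       # slide to the next window
regs.write(0, 7)         # same logical name, different physical register
regs.fwstp_set(0)        # slide back
assert regs.read(0) == 42
```

Sliding the window between loop iterations is what lets FRPreload fetch the next iteration's data into a not-yet-active window while the current window is being computed on.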
Pseudo Vector Processor PVP-SW
Interconnection Network of CP-PACS
The topology is a Hyper-Crossbar Network (HXB).
8 x 17 x 16, connecting 2048 PUs and 128 I/O units.
On each dimension, the PUs are interconnected by a crossbar. For example, on the Y dimension, a Y x Y crossbar is used.
Routing is simple: route on the 3 dimensions consecutively.
Wormhole routing is employed.
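Routing the dimensions consecutively can be sketched as follows (a toy model; because each dimension is a full crossbar, one hop settles a whole dimension, so any two nodes are at most 3 hops apart):

```python
# Toy dimension-order routing on a 3-D hyper-crossbar.
# Each dimension is a full crossbar, so a message needs at most
# one hop per dimension: network diameter is 3.

def hxb_route(src, dst):
    """Return the sequence of nodes visited, fixing X, then Y, then Z."""
    path = [src]
    cur = list(src)
    for dim in range(3):             # X, Y, Z in order
        if cur[dim] != dst[dim]:
            cur[dim] = dst[dim]      # one crossbar hop settles this dimension
            path.append(tuple(cur))
    return path

route = hxb_route((0, 0, 0), (7, 16, 15))
assert len(route) - 1 <= 3           # at most 3 hops
# route == [(0,0,0), (7,0,0), (7,16,0), (7,16,15)]
```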
Interconnection Network of CP-PACS
Wormhole routing and the HXB together have these properties:
  Small network diameter.
  A same-sized torus can be simulated.
  Message broadcasting by hardware.
  A binary hypercube can be emulated.
  Throughput under even random transfer is high.
Interconnection Network of CP-PACS
Remote DMA transfer
Making a system call to the OS and copying data to the OS area is messy. Instead, access the remote node's memory directly.
Remote DMA is good because:
  Tedious mode switching (kernel/user mode) is avoided.
  Redundant data copying (between user and kernel space) is not done.
Interconnection Network of CP-PACS
Message Broadcasting
Supported by hardware.
  First, perform the broadcast on one dimension.
  Then perform it on the other dimensions.
Hardware mechanisms are present to prevent the deadlock caused by two nodes broadcasting at the same time.
Hardware partitioning is possible: a broadcast message is sent only to the nodes in the sender's partition.
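The dimension-by-dimension broadcast above can be modeled as: the sender's crossbar delivers the message along one axis, then every holder forwards it along the next axis, and so on, covering the whole machine in three phases. A sketch (grid size is illustrative):

```python
# Toy model of dimension-wise hardware broadcast on a 3-D crossbar grid.
# Phase 1: spread along X; phase 2: every holder spreads along Y;
# phase 3: every holder spreads along Z.  After 3 phases, all nodes hold it.

from itertools import product

def broadcast(shape, src):
    holders = {src}
    for dim in range(3):                    # one phase per dimension
        new = set()
        for node in holders:
            for k in range(shape[dim]):     # a crossbar reaches all peers at once
                peer = list(node)
                peer[dim] = k
                new.add(tuple(peer))
        holders |= new
    return holders

shape = (4, 4, 4)                           # small grid for the demo
covered = broadcast(shape, (0, 0, 0))
assert covered == set(product(range(4), range(4), range(4)))
```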
Interconnection Network of CP-PACS
Barrier Synchronization
A synchronization mechanism is required in IPC systems.
CP-PACS supports a hardware barrier synchronization facility. It makes use of special synchronization packets, distinct from the usual data packets.
CP-PACS also allows partitioned pieces of the network to use barrier synchronization.
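In software terms, a barrier has the following semantics: no participant proceeds past the barrier until every participant has arrived. The CP-PACS facility achieves this in hardware with dedicated sync packets; this thread-based sketch only shows the behavior, not the mechanism.

```python
# Semantics of barrier synchronization, modeled with threads.
# On CP-PACS this is done in hardware with special sync packets;
# this sketch demonstrates behavior only.

import threading

N = 8
barrier = threading.Barrier(N)
order = []
lock = threading.Lock()

def worker(i):
    with lock:
        order.append(("before", i))
    barrier.wait()                 # no thread passes until all N arrive
    with lock:
        order.append(("after", i))

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N)]
for t in threads: t.start()
for t in threads: t.join()

# Every "before" event precedes every "after" event.
first_after = min(k for k, (tag, _) in enumerate(order) if tag == "after")
assert all(tag == "before" for tag, _ in order[:first_after])
assert first_after == N
```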
Performance Evaluation
Based on the LINPACK benchmark: LU decomposition of a matrix.
The outer-product method is used, based on a 2-dimensional block-cyclic distribution.
All floating-point and data load/store operations are done in the PVP-SW manner.
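The 2-D block-cyclic distribution maps blocks of the matrix onto the 2-D PU grid cyclically in both dimensions, so every PU keeps a share of the active submatrix throughout the factorization. A minimal sketch (grid shape and block size are illustrative choices, not the paper's parameters):

```python
# 2-D block-cyclic mapping of matrix elements onto a PU grid.
# Block (bi, bj) is owned by PU (bi mod P, bj mod Q).
# P, Q, and the block size NB are illustrative, not from the paper.

P, Q = 4, 4          # PU grid shape
NB = 64              # block size in elements

def owner(i, j):
    """Which PU owns matrix element (i, j)?"""
    return ((i // NB) % P, (j // NB) % Q)

assert owner(0, 0) == (0, 0)
assert owner(64, 0) == (1, 0)        # next block row -> next PU row
assert owner(256, 0) == (0, 0)       # wraps cyclically after P block rows
```

The cyclic wraparound is the load-balancing point: as LU elimination shrinks the trailing submatrix, its blocks remain spread over all P x Q PUs instead of draining from one corner of the grid.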
Performance Evaluation
[Figure: Rmax (Gflops) vs. number of PUs (2^1 to 2^12); Rmax ranges from 0 to about 400 Gflops.]
Performance Evaluation
[Figure: Nmax (matrix size) vs. number of PUs (2^1 to 2^12); Nmax grows to about 120000.]
Performance Evaluation
[Figure: Rmax/peak effectiveness (%) vs. number of PUs (2^1 to 2^12); values range from about 56% to 68%.]
Conclusion
CP-PACS is operational at the University of Tsukuba, working on large-scale QCD calculations.
Sponsored by Hitachi Ltd. and a Grant-in-Aid of the Ministry of Education, Science and Culture, Japan.
References
T. Boku, H. Nakamura, K. Nakazawa, Y. Iwasaki, The Architecture of Massively Parallel Processor CP-PACS, Institute of Information Sciences and Electronics, University of Tsukuba.
Questions & Comments