Introducing collaboration members – Korea University (KU) ALICE TPC online tracking algorithm on a...
-
Upload
eleanore-short -
Category
Documents
-
view
218 -
download
3
Transcript of Introducing collaboration members – Korea University (KU) ALICE TPC online tracking algorithm on a...
![Page 1: Introducing collaboration members – Korea University (KU) ALICE TPC online tracking algorithm on a GPU Computing Platforms – GPU Computing Platforms Joohyung.](https://reader036.fdocuments.net/reader036/viewer/2022062517/56649f045503460f94c186f6/html5/thumbnails/1.jpg)
Introducing collaboration members – Korea University (KU)ALICE TPC online tracking algorithm on a GPU
Computing Platforms – GPU Computing Platforms
Joohyung SunProf. Hyeonjoong Cho
ALICE Collaboration
Korea University, Sejong
4th ALICE ITS upgrade, MFT and O2 Asian Workshop 2014 @ Pusan
![Page 2: Introducing collaboration members – Korea University (KU) ALICE TPC online tracking algorithm on a GPU Computing Platforms – GPU Computing Platforms Joohyung.](https://reader036.fdocuments.net/reader036/viewer/2022062517/56649f045503460f94c186f6/html5/thumbnails/2.jpg)
Collaboration Institute, Korea UniversityResearch goalALICE TPC online tracking algorithm on a GPUSpecification of benchmark platform
Introduction
![Page 3: Introducing collaboration members – Korea University (KU) ALICE TPC online tracking algorithm on a GPU Computing Platforms – GPU Computing Platforms Joohyung.](https://reader036.fdocuments.net/reader036/viewer/2022062517/56649f045503460f94c186f6/html5/thumbnails/3.jpg)
3
Introducing Korea UniversityProf. Hyeonjoong Cho, Embedded Systems and Real-time Computing Laboratory
Meeting of June 19th 2014 in KISTI♦ Proposal of contribution of KISTI and the Korea Univer-
sity to the ALICE O2♦ Participants from KISTI, Korea University, and CERN
♦ One of the suggested possible collaborations Benchmarking of detector-specific algorithms on some agreed
hardware platforms Multi-cores CPU, many-cores CPU, GPGPU, etc.
![Page 4: Introducing collaboration members – Korea University (KU) ALICE TPC online tracking algorithm on a GPU Computing Platforms – GPU Computing Platforms Joohyung.](https://reader036.fdocuments.net/reader036/viewer/2022062517/56649f045503460f94c186f6/html5/thumbnails/4.jpg)
4
Collaboration institute♦ Prof. H. Cho, Institute Team Leader, Korea University, Se-
jong, Republic of Korea♦ J. Sun, Deputy, Korea University, Sejong, Republic of Korea
Application benchmark on a modern GPU♦ Benchmarking different types of processors
Kepler- and Maxwell-based architecture GPU Maxwell GPU is the successor to the Kepler and is the latest GPU in
this year
♦ Reengineering detector data processing algorithms (GPU tracker)
Apply NVIDIA Kepler’s technologies
Hyper-Q and Dynamic parallelism
Our Research GoalProf. H. Cho and J. Sun, Korea University, Republic of Korea
![Page 5: Introducing collaboration members – Korea University (KU) ALICE TPC online tracking algorithm on a GPU Computing Platforms – GPU Computing Platforms Joohyung.](https://reader036.fdocuments.net/reader036/viewer/2022062517/56649f045503460f94c186f6/html5/thumbnails/5.jpg)
5
The online event reconstruction ♦ Performs by the High-Level Trigger♦ The most complicated algorithm♦ Adapted to GPUs
GPU evolves into a general-purpose,
massively parallel processor NVIDIA Fermi, CUDA, and AMD
OpenCL
ALICE TPC Online Tracking Algorithm on a GPUDetector-specific algorithms with parallel frameworks
HLT reconstruction scheme
(Reference: David Rohr, CHEP2012)
![Page 6: Introducing collaboration members – Korea University (KU) ALICE TPC online tracking algorithm on a GPU Computing Platforms – GPU Computing Platforms Joohyung.](https://reader036.fdocuments.net/reader036/viewer/2022062517/56649f045503460f94c186f6/html5/thumbnails/6.jpg)
6
Specification of benchmark platform♦ CPU: Intel i7-4770 CPUs @ 3.4 GHz, 4-cores (HT, 8-cores)
♦ GPU: NVIDIA Tesla K20c GPU Kepler-based architecture 13 Multiprocessors 192 CUDA cores per multiprocessor 706 MHz (0.71 GHz) GPU Clock rate 2600 MHz Memory Clock rate 320-bit Memory Bus Width Maximum number of threads per multiprocessor: 2048 Maximum number of threads per block: 1024 Concurrent copy and kernel execution: Yes with 2 copy engines
Our Research GoalBenchmarking platform
![Page 7: Introducing collaboration members – Korea University (KU) ALICE TPC online tracking algorithm on a GPU Computing Platforms – GPU Computing Platforms Joohyung.](https://reader036.fdocuments.net/reader036/viewer/2022062517/56649f045503460f94c186f6/html5/thumbnails/7.jpg)
7
Only one work queue♦ It can execute a work at a time♦ CPUs are not able to fully utilize GPU resources
Fermi and Previous Generation GPUsLow the usage of GPU resources
Low usage of GPU resources Even though the GPU has
plenty of computational re-sources
![Page 8: Introducing collaboration members – Korea University (KU) ALICE TPC online tracking algorithm on a GPU Computing Platforms – GPU Computing Platforms Joohyung.](https://reader036.fdocuments.net/reader036/viewer/2022062517/56649f045503460f94c186f6/html5/thumbnails/8.jpg)
8
Enabling multiple CPU cores to launch work on a single GPU simultaneously
♦ Increasing GPU utilization ♦ Slashing CPU idle times
Hyper-QMaximizing the usage of GPU resources
32 work queues Fully scheduled, synchronized,
and managed all by itself GPUs receive works from
queues at the same time All of the works is being done
concurrently
![Page 9: Introducing collaboration members – Korea University (KU) ALICE TPC online tracking algorithm on a GPU Computing Platforms – GPU Computing Platforms Joohyung.](https://reader036.fdocuments.net/reader036/viewer/2022062517/56649f045503460f94c186f6/html5/thumbnails/9.jpg)
9
Previous CUDA programming modelThe communication between host and device
Previous CUDA programming model
The communications between CPU and GPU Can affect the application’s per-
formance Each cost as a time is not negli-
gible
![Page 10: Introducing collaboration members – Korea University (KU) ALICE TPC online tracking algorithm on a GPU Computing Platforms – GPU Computing Platforms Joohyung.](https://reader036.fdocuments.net/reader036/viewer/2022062517/56649f045503460f94c186f6/html5/thumbnails/10.jpg)
10
Enabling GPU to dynamically spawn new threads
♦ By adapting to the data ♦ Without going back to the host CPU
Dynamic ParallelismCreating work on-the-fly
CUDA programming model in Kepler
Effectively allows to be run directly on GPU
Saving the time for communications
![Page 11: Introducing collaboration members – Korea University (KU) ALICE TPC online tracking algorithm on a GPU Computing Platforms – GPU Computing Platforms Joohyung.](https://reader036.fdocuments.net/reader036/viewer/2022062517/56649f045503460f94c186f6/html5/thumbnails/11.jpg)
Previous worksCurrent progressOptimization with NVIDIA Visual Profiler
Progress
![Page 12: Introducing collaboration members – Korea University (KU) ALICE TPC online tracking algorithm on a GPU Computing Platforms – GPU Computing Platforms Joohyung.](https://reader036.fdocuments.net/reader036/viewer/2022062517/56649f045503460f94c186f6/html5/thumbnails/12.jpg)
12
Some results of benchmarking HLT tracker on each GPU
♦ NVIDIA Fermi (current version) 174 ms♦ NVIDIA GTX780 (Kepler) 155 ms♦ NVIDIA Titan (Kepler) 146 ms♦ AMD GCN 160 ms
Previous WorksBenchmarking HLT tracker
Reference: P. Buncic and et al., “O2 Project”, ALICE LHCC Referee Meeting
![Page 13: Introducing collaboration members – Korea University (KU) ALICE TPC online tracking algorithm on a GPU Computing Platforms – GPU Computing Platforms Joohyung.](https://reader036.fdocuments.net/reader036/viewer/2022062517/56649f045503460f94c186f6/html5/thumbnails/13.jpg)
13
Application benchmark♦ Tested on Kepler-based architecture GPU
Maxwell-based architecture GPU will be benchmarked
♦ To fully utilize the compute and data movement capabili-ties of the GPU
Optimization Hyper-Q is applied for enabling concurrent copy and kernel execu-
tion Dynamic parallelism will be applied for reducing the number of
communications between CPU and GPU
Our Current ProgressALICE TPC online tracking algorithm on a GPU
![Page 14: Introducing collaboration members – Korea University (KU) ALICE TPC online tracking algorithm on a GPU Computing Platforms – GPU Computing Platforms Joohyung.](https://reader036.fdocuments.net/reader036/viewer/2022062517/56649f045503460f94c186f6/html5/thumbnails/14.jpg)
14
Profiling GeForce GTX650 with 2 streams♦ Works are managed by one work queue
♦ All other copy and kernel executions wait for previ-ous executions
Comparison of Hyper-Q between GTX650 and K20cNVIDIA Visual Profiler (nvvp)
Kernel execution of algorithm
Memory copy execution from CPU to GPU
![Page 15: Introducing collaboration members – Korea University (KU) ALICE TPC online tracking algorithm on a GPU Computing Platforms – GPU Computing Platforms Joohyung.](https://reader036.fdocuments.net/reader036/viewer/2022062517/56649f045503460f94c186f6/html5/thumbnails/15.jpg)
Profiling Tesla K20c with 2 streams♦ 32 work queues for concurrent executions
♦ Copy and kernels are executed concurrently
Comparison of Hyper-Q between GTX650 and K20cNVIDIA Visual Profiler (nvvp)
15
Kernel execution of algorithm
Memory copy execution from CPU to GPU
If the number of streams is increased,
then how does it work?
![Page 16: Introducing collaboration members – Korea University (KU) ALICE TPC online tracking algorithm on a GPU Computing Platforms – GPU Computing Platforms Joohyung.](https://reader036.fdocuments.net/reader036/viewer/2022062517/56649f045503460f94c186f6/html5/thumbnails/16.jpg)
16
Tesla K20c with 8 streams
More Concurrent ExecutionsTesla K20c, 8 streams
♦ Copy and kernels in some of the streams more than two are executed concurrently
How about using more streams?
![Page 17: Introducing collaboration members – Korea University (KU) ALICE TPC online tracking algorithm on a GPU Computing Platforms – GPU Computing Platforms Joohyung.](https://reader036.fdocuments.net/reader036/viewer/2022062517/56649f045503460f94c186f6/html5/thumbnails/17.jpg)
17
Measuring specific compute kernels’ time per the number of streams
♦ The number of streams: 2~36
Observation from Multiple StreamsThe number of streams
PreInitRowBlocks <<< >>>cudaMemcpyAsync (…, cudaMemcpyHostToDevice, …)AliHLTTPCCANeighboursFinder <<< >>>AliHLTTPCCANeighboursCleaner <<< >>>AliHLTTPCCAStartHitsFinder <<< >>>AliHLTTPCCAStartHitsSorter <<< >>>
It significantly reduces the tracking time
compared with using 2 streams
![Page 18: Introducing collaboration members – Korea University (KU) ALICE TPC online tracking algorithm on a GPU Computing Platforms – GPU Computing Platforms Joohyung.](https://reader036.fdocuments.net/reader036/viewer/2022062517/56649f045503460f94c186f6/html5/thumbnails/18.jpg)
18
The number of copy engines in GPU♦ E.g. Tesla K20c has only 2 copy engines♦ Limit as the number of works can be executed concur-
rently
Too short kernel execution time♦ It could be finished before another kernel execution is ar-
rived♦ The longest kernel execution time
Only about 2 ms during this test
This observation will be a key♦ For optimizing Hyper-Q
Possible Reasons for ObservationThe key for optimization
![Page 19: Introducing collaboration members – Korea University (KU) ALICE TPC online tracking algorithm on a GPU Computing Platforms – GPU Computing Platforms Joohyung.](https://reader036.fdocuments.net/reader036/viewer/2022062517/56649f045503460f94c186f6/html5/thumbnails/19.jpg)
19
Korea University♦ Prof. Hyeonjoong Cho, Institute Team Leader, Korea Uni-
versity, Sejong, Republic of Korea♦ Joohyung Sun, Deputy, Korea University, Sejong, Repub-
lic of Korea
Next research plans♦ Benchmarking Maxwell-based architecture GPU
GeForce GTX 980, about $ 549
♦ Efficiently applying GPU’s technologies Hyper-Q with scheduling of streams Dynamic parallelism with device memory management
SummaryNext research plans
![Page 20: Introducing collaboration members – Korea University (KU) ALICE TPC online tracking algorithm on a GPU Computing Platforms – GPU Computing Platforms Joohyung.](https://reader036.fdocuments.net/reader036/viewer/2022062517/56649f045503460f94c186f6/html5/thumbnails/20.jpg)
20
Appendix. Actual Code for Dynamic ParallelismCreating work on-the-fly
dgetrf(N,N) {
for j=1 to N
for i=1 to 64
idamx <<< >>>
memcpy
dswap <<< >>>
memcpy
dscal <<< >>>
dger <<< >>>
next i
memcpy
dlaswap <<< >>>
dtrsm <<< >>>
dgemm <<< >>>
next j
}
LU decomposition (Fermi)
idamx ();
dswap ();
dscal ();
dger ();
dlaswap ();
dtrsm ();
dgemm ();
.CPU code GPU code
![Page 21: Introducing collaboration members – Korea University (KU) ALICE TPC online tracking algorithm on a GPU Computing Platforms – GPU Computing Platforms Joohyung.](https://reader036.fdocuments.net/reader036/viewer/2022062517/56649f045503460f94c186f6/html5/thumbnails/21.jpg)
21
Appendix. Actual Code for Dynamic ParallelismCreating work on-the-fly
dgetrf(N,N) {
dgetrf <<< >>>
synchronize();
}
LU decomposition (Kepler)
dgetrf(N,N) {
for j=1 to N
for i=1 to 64
idamx <<< >>>
dswap <<< >>>
dscal <<< >>>
dger <<< >>>
next i
dlaswap <<< >>>
dtrsm <<< >>>
dgemm <<< >>>
next j
}
CPU code GPU code
CP
U is
Fre
e
![Page 22: Introducing collaboration members – Korea University (KU) ALICE TPC online tracking algorithm on a GPU Computing Platforms – GPU Computing Platforms Joohyung.](https://reader036.fdocuments.net/reader036/viewer/2022062517/56649f045503460f94c186f6/html5/thumbnails/22.jpg)
22
Appendix. Example of LU Decomposition Profiling LU Decomposition using NVIDIA Visual Profiler (nvvp)
Tesla K20c, Context 1 (CUDA)
Memcpy
cgetrf_cdpentry <<< >>>