2014 International Conference on Issues and Challenges in Intelligent Computing Techniques (ICICT)
EFFICIENT PARALLEL PROCESSING BY IMPROVED CPU-GPU
INTERACTION
Harsh Khatter, Department of Computer Science and Engineering, ABES Engineering College, Ghaziabad, India, harsh_[email protected]
Abstract: In this digital world, more than 90% of desktop and notebook computers have integrated Graphics Processing Units (GPUs) for better graphics processing. The GPU is not only for graphics applications but for non-graphics applications too. In the past few years, the programmable graphics processor has evolved into an increasingly convincing computational resource. However, the GPU sits idle when the graphics job queue is empty, which decreases its efficiency. This paper focuses on various tactics to overcome this problem and to make CPU-GPU processing more powerful and efficient. The GPU is especially well suited to problem sets expressed as data-parallel computations, where the same program is executed on many data elements concurrently. The objective of this paper is to exploit the capabilities and flexibility of recent GPU hardware combined with high-level GPU programming languages: to accelerate the building of images in a frame buffer intended for output to a display, and to provide tremendous acceleration for numerically intensive scientific applications. This paper also sheds light on the major application areas where the GPU is in use and where it can be used in the future.
1. INTRODUCTION
Over the past few years, the graphics processing unit has rapidly come into prominence. Despite its name, the graphics processing unit (GPU) is not only for graphics processing but also for highly parallel programmable tasks. The GPU has the capability to adapt itself to the task and its processing requirements.
Recent years have seen a trend towards using graphics processing units (GPUs) as accelerators for general-purpose computing. The inexpensive, single-chip, massively parallel architecture of the GPU has evidently brought significant factors of speedup to many numerical applications. However, the development of a high-quality GPU application is challenging, due to the large optimization space and the complex, unpredictable effects of optimizations on GPU program performance [11].
Feng and Xiao [10] stated that the graphics processing unit (GPU) has evolved from being a fixed-function processor with programmable stages into a programmable processor with many fixed-function components that deliver massive parallelism. By modifying the GPU's stream processor to support "general-purpose computation" on the GPU (GPGPU), applications that perform massive vector operations can realize many orders-of-magnitude improvement in performance over a traditional processor, i.e. the CPU.
978-1-4799-2900-9/14/$31.00 ©2014 IEEE

Vaishali Aggarwal, Department of Computer Science and Engineering, KIET, Ghaziabad, India, [email protected]
Originally, the massive parallelism offered by the GPU only supported calculations for 3D computer graphics,
such as texture mapping, polygon rendering, vertex
rotation and translation, and oversampling and
interpolation to reduce aliasing. However, because many of
these graphics computations entail matrix and vector
operations, the GPU is also increasingly being used for
non-graphical calculations.
GPUs typically map well only to data-parallel or task-parallel applications whose execution requires relatively minimal communication between streaming multiprocessors (SMs) on the GPU [4], [6], [7], [9].
In the general case, the GPU offers massive parallelism, with hundreds of cores. Thousands of threads are involved in its processing, and their execution is cheaper than execution on the CPU. A special programming methodology, named CUDA, is available for GPUs.
There is a myth that the GPU is designed only for graphics processing and graphics tasks. In fact, the GPU is designed for a particular class of applications with the following characteristics; over the past decade, developers have identified other applications with similar characteristics and successfully mapped them onto the GPU.
• Computational requirement is large.
• Parallelism is substantial.
• Throughput is more important than latency.
2. GPU APPLICATIVE AREAS: A REVIEW
2.1 Traditional Apps: Graphics and Gaming
Initially, there were plain VGA video cards for better graphics and resolution in personal computers. With time, the VGA card gained more power to process graphics and adjust colours, and it evolved into a more capable graphics processing engine, named the graphics processing unit (GPU). For the last decade, nearly every laptop has had a built-in GPU alongside the CPU. The most common use of the GPU is in gaming, for better visualization. In addition, higher-resolution support is also possible due to GPU processors.
However, the problem was that the GPU sits idle when there is no graphics-processing job, so applications have changed somewhat: the GPU is now treated like another CPU. Modern applications work on this principle.
2.2 Modern Applications
Nowadays, advanced GPUs deliver high compute capacity at low cost, making them attractive for use in scientific computing. The major applications are listed below:
2.2.1 Parallel Computing:
• Specialized data-parallel computing is a special feature of advanced GPU computing, where the CPU and GPU work in parallel. A number of applications are based on this data-parallelism approach.
• Dan Chen et al. proposed a parallelized EEMD method, named G-EEMD, developed using general-purpose computing on the graphics processing unit (GPGPU). Spectral entropy facilitated by G-EEMD was proposed to analyse EEG data for estimating the depth of anaesthesia (DoA) in real time [1].
The GPU is not restricted to analytical or theoretical work; a number of real-time applications are based on GPU processing.
2.2.2 Real-Time Applications:
• Stephen Rosenzweig et al. developed a GPU-based real-time acoustic radiation force impulse (ARFI) data-processing system for use in clinical feasibility studies of ARFI imaging. The two principal computational steps required to achieve this goal are data interpolation and displacement estimation [8].
• K. Zhang and J. U. Kang implemented numerical dispersion compensation for both standard and full-range Fourier-domain optical coherence tomography on the GPU architecture [5]. In terms of architecture, the GPU is built for different application demands than the CPU: large, parallel computation requirements with an emphasis on throughput rather than latency.
• The GPU also works well in cryptography, such as encryption and decryption. While the CPU performs encryption, the GPU can work in parallel on key-generation algorithms. Similarly, the GPU can work in parallel with the CPU on other tasks to enhance the CPU's performance, utilising the GPU's power when it would otherwise sit idle. Applications are not restricted to the domains mentioned here; countless areas use the GPU and CPU together for a better working environment.
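The cryptographic division of labour described above can be sketched as follows. This is a hypothetical illustration, not the method of [3]: both workers run as CPU threads here (on real hardware the key-generation side would be a GPU kernel), and the function names and the toy XOR cipher are assumptions.

```python
# Hypothetical sketch: one worker "encrypts" while a second worker
# generates keys in parallel, mirroring the CPU/GPU split in the text.
from concurrent.futures import ThreadPoolExecutor

def encrypt(data, key):
    """Toy XOR 'encryption' standing in for the CPU-side cipher work."""
    return bytes(b ^ key for b in data)

def generate_keys(n, seed=7):
    """Stand-in for GPU-side key generation: derive n single-byte keys."""
    return [(seed * (i + 1)) % 256 for i in range(n)]

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=2) as pool:
        enc = pool.submit(encrypt, b"hello", 42)   # the "CPU" task
        keys = pool.submit(generate_keys, 4)       # the "GPU" task, in parallel
        ciphertext, next_keys = enc.result(), keys.result()
    assert encrypt(ciphertext, 42) == b"hello"     # XOR round-trips
    print(next_keys)
```

The point is only the overlap: the key-generation worker makes progress while the cipher worker is busy, so neither unit sits idle.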
3. CPU-GPU INTERACTION
GPU processing is completely handled by the CPU, although the working of the two can differ: the CPU usually assigns tasks to the GPU. Data parallelism is one example of this CPU-GPU collaboration. To shed some light on general CPU-GPU analogies: the CPU has inner loops, whereas the GPU is based on kernels; the CPU works on streams or data arrays, the GPU on textures; the CPU performs memory reads, the GPU texture samples. The major goal is to promote data parallelism using CPU-GPU interaction. The most recent CPUs have up to 8-12 cores, while GPUs possess hundreds or even thousands of stream-processor cores [2].
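The inner-loop versus kernel analogy above can be made concrete with a minimal sketch. All names and the 8-element example are illustrative; the GPU side is simulated with `map`, which plays the role of the hardware applying one kernel to every element.

```python
# Minimal sketch of the CPU "inner loop" vs. GPU "kernel" analogy.

def cpu_style(data):
    """CPU analogy: an explicit inner loop walks the data array."""
    out = []
    for x in data:          # the "inner loop"
        out.append(x * x + 1)
    return out

def kernel(x):
    """GPU analogy: a kernel describes the work for ONE element only."""
    return x * x + 1

def gpu_style(data):
    """The hardware (simulated here by map) applies the kernel to every
    element concurrently; the kernel itself contains no loop."""
    return list(map(kernel, data))

if __name__ == "__main__":
    data = list(range(8))
    assert cpu_style(data) == gpu_style(data)
    print(gpu_style(data))  # → [1, 2, 5, 10, 17, 26, 37, 50]
```

The two produce identical results; the difference is where the iteration lives, which is exactly the analogy the text draws.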
The GPU has several key advantages over CPU architectures for highly parallel, compute-intensive workloads, including higher memory bandwidth, significantly higher floating-point throughput, and thousands of hardware thread contexts with hundreds of parallel compute pipelines executing programs in a SIMD fashion. The GPU provides parallel processing of larger quantities of data than a general-purpose CPU can, and can be an attractive alternative to CPU clusters in high-performance computing environments [3].
Now, the question is how the GPU is involved in achieving data parallelism. The following points underline the importance of the GPU in data parallelism:
• GPUs are designed for graphics and are geared towards highly parallel tasks.
• GPUs process independent vertices and fragments. Temporary registers are zeroed, no shared or static data is used, and there are no read-modify-write buffers. Only independent parallel work is performed, which achieves data-parallel processing.
• The GPU architecture is ALU-based, consisting of multiple vertex and pixel pipelines with multiple ALUs per pipe.
• The GPU also hides memory latency with more computation.
3.1 Constraints in CPU-GPU computing
When working with the CPU and GPU together, a number of constraints are observed. The list of constraints is given below:
• High latency between CPU and GPU
• Handling control-flow graphs
• I/O access
• Bit operations
• Limited data structures
Based on these constraints, possible solutions are proposed in the next section.
4. PROPOSED SOLUTION TO BRIDGE GAP
The GPU is specially designed for graphics computing, but apart from this it is also capable of other general-purpose tasks. However, various parameters determine which task will be performed by the CPU and which by the GPU, and judging this for every task is the problem. It is handled by decision making using thresholds. Some solutions, organized by category, are discussed below:
• Based on burst time and communication time:
To keep watch on the latency threshold, the burst time of the task/process has to be noted and compared with the threshold. For example, if the threshold is the communication cost C, then the GPU will be selected only if the burst time is greater than twice the communication cost between CPU and GPU; otherwise the task will be processed by the CPU.
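The threshold rule above can be sketched as a small selection function. The function name and the example values are illustrative assumptions; the paper only specifies the rule "choose the GPU if burst time > 2 × communication cost C".

```python
# Minimal sketch of the burst-time vs. communication-cost threshold rule.

def choose_processor(burst_time, comm_cost):
    """Return which unit should run the task under the 2*C threshold rule."""
    if burst_time > 2 * comm_cost:
        return "GPU"   # long-running task: worth paying the transfer cost
    return "CPU"       # short task: CPU-GPU transfer would dominate

if __name__ == "__main__":
    # With communication cost C = 5, only tasks longer than 10 go to the GPU.
    print(choose_processor(12, 5))  # → GPU
    print(choose_processor(8, 5))   # → CPU
```

The factor of two ensures the round trip to the GPU and back is amortized before the GPU is chosen.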
• Based on I/O access and memory access:
The GPU should have direct access to I/O and memory so that it does not require the CPU to interact with peripherals. In collaborative tasks, priority must be given to the CPU, i.e. the CPU has higher priority than the GPU for I/O access.
• Based on the tasks:
Small and user-defined tasks will be processed by the GPU, while system processes will be taken by the CPU.
• Bulky tasks:
Bulky tasks can be divided equally between the CPU and GPU for processing, as happens in parallel computing.
• Based on resources:
Based on resource allocation and the resources required by tasks, the GPU will take tasks with smaller resource requirements while the CPU takes the others.
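The task-based, bulky-task, and resource-based rules above can be combined into one hypothetical dispatcher. The task fields and the resource threshold of 2 units are assumptions for illustration; the paper does not define concrete data structures or limits.

```python
# Hypothetical dispatcher for the category rules in Section 4.

def assign(task):
    """Route a task to 'CPU', 'GPU', or 'SPLIT' (bulky: divided equally)."""
    if task.get("system"):             # system processes stay on the CPU
        return "CPU"
    if task.get("bulky"):              # bulky tasks are split between both
        return "SPLIT"
    if task.get("resources", 0) <= 2:  # small resource footprint -> GPU
        return "GPU"
    return "CPU"                       # resource-heavy tasks stay on the CPU

if __name__ == "__main__":
    tasks = [
        {"name": "scheduler", "system": True},
        {"name": "filter", "resources": 1},
        {"name": "render-batch", "bulky": True},
        {"name": "db-join", "resources": 5},
    ]
    for t in tasks:
        print(t["name"], "->", assign(t))
```

The rule ordering matters: system work is pinned to the CPU before any size or resource test is applied, matching the priority the text gives the CPU.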
Still, some revision of the GPU architecture is required so that it also has the decision-making capability to distribute tasks and work among its cores. The GPU now has a cache and many cores, and it can perform certain specific tasks more easily than the CPU; however, the GPU is intended especially for graphics and lightweight tasks. As the 4th-generation Intel Core i7 attempts to merge the GPU and CPU, there is a requirement to make the GPU as capable as the CPU. Merging the CPU and GPU has some demerits: the memory and cache are shared, and the load is carried by one device, so if the graphics load is heavy it will stall normal processing work too. A GPU architecture has thousands of cores, which can easily cope with graphics and image parameters, whereas current CPUs have at most six cores. Therefore, merging the CPU and GPU on a single hybrid processor is not a solution.
5. CONCLUSION
The focus of this paper is on improved CPU-GPU interaction for efficient parallel computing and on how this hybrid approach contributes to parallel computing. The parametric constraints still need revision for efficient parallel processing; the paper has discussed this in brief, with some suitable solutions for choosing between the processing units, i.e. CPU or GPU.
A number of application areas where the GPU and CPU work well together have been discussed. Several parameters governing task distribution between the CPU and GPU for improved, efficient computing have been presented. The paper has also highlighted various constraints of the GPU and how they can be overcome. The working can be further improved by a change in programming perspective; well-designed algorithms on CUDA can also improve the efficiency of CPU-GPU processing. This collaboration of CPU and GPU will surely enhance the efficiency of processing tasks.
REFERENCES
[1] Dan Chen, Duan Li, Muzhou Xiong, Hong Bao, and Xiaoli Li, 'GPGPU-Aided Ensemble Empirical-Mode Decomposition for EEG Analysis during Anesthesia', IEEE Transactions on Information Technology in Biomedicine, Vol. 14, No. 6, November 2010.
[2] Faliu Yi, Inkyu Moon, Jeong-A Lee, and Bahram Javidi, 'Fast 3D Computational Integral Imaging Using Graphics Processing Unit', Journal of Display Technology, Vol. 8, No. 12, December 2012.
[3] Fan Wu, Chung-han Chen, and Hira Narang, 'An Efficient Acceleration of Symmetric Key Cryptography Using General Purpose Graphics Processing Unit', Proceedings of the Fourth International Conference on Emerging Security Information, Systems and Technologies, pp. 228-233, 18-25 July 2010.
[4] G. M. Striemer and A. Akoglu, 'Sequence Alignment with GPU: Performance and Design Challenges', International Symposium on Parallel & Distributed Processing (IPDPS), Rome, pp. 1-10, 23-29 May 2009.
[5] K. Zhang and J. U. Kang, 'Real-time numerical dispersion compensation using graphics processing unit for Fourier-domain optical coherence tomography', Electronics Letters, Vol. 47, No. 5, pp. 309-310, March 2011.
[6] P. Harish and P. J. Narayanan, 'Accelerating Large Graph Algorithms on the GPU Using CUDA', Proceedings of the 14th International Conference on High Performance Computing (HiPC'07), pp. 197-208, December 2007.
[7] S. Ryoo, C. I. Rodrigues, S. S. Baghsorkhi, S. S. Stone, D. B. Kirk, and W. W. Hwu, 'Optimization Principles and Application Performance Evaluation of a Multithreaded GPU Using CUDA', Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 73-82, February 2008.
[8] Stephen Rosenzweig, Mark Palmeri, and Kathryn Nightingale, 'GPU-Based Real-Time Small Displacement Estimation With Ultrasound', IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control, Vol. 58, No. 2, February 2011.
[9] V. Volkov and B. Kazian, 'Fitting FFT onto the G80 Architecture', pp. 25-29, April 2006. http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project6-report.pdf
[10] Wu-chun Feng and Shucai Xiao, 'To GPU Synchronize or Not GPU Synchronize?', Proceedings of the 2010 IEEE International Symposium on Circuits and Systems (ISCAS), Paris, pp. 3801-3804, 30 May - 02 June 2010.
[11] Yixun Liu, Eddy Z. Zhang, and Xipeng Shen, 'A Cross-Input Adaptive Framework for GPU Program Optimizations', International Symposium on Parallel & Distributed Processing (IPDPS), Rome, pp. 1-10, 23-29 May 2009.