2014 International Conference on Issues and Challenges in Intelligent Computing Techniques (ICICT)

EFFICIENT PARALLEL PROCESSING BY IMPROVED CPU-GPU INTERACTION

Harsh Khatter
Department of Computer Science and Engineering
ABES Engineering College, Ghaziabad, India
harsh_khatter@yahoo.com

Vaishali Aggarwal
Department of Computer Science and Engineering
KIET, Ghaziabad, India
sweetvish08@gmail.com

Abstract: In this digital world, more than 90% of desktop and notebook computers have integrated Graphics Processing Units (GPUs) for better graphics processing. The Graphics Processing Unit is not only for graphics applications but for non-graphics applications too. In the past few years, the graphics programmable processor has evolved into an increasingly convincing computational resource. But the GPU sits idle if the graphics job queue is empty, which decreases the GPU's efficiency. This paper focuses on various tactics to overcome this problem and to make CPU-GPU processing more powerful and efficient. The graphics programmable processor, or graphics processing unit, is especially well suited to problem sets expressed as data-parallel computation, with the same program executed on many data elements concurrently. The objective of this paper is to exploit the capabilities and flexibility of recent GPU hardware combined with high-level GPU programming languages: to accelerate the building of images in a frame buffer intended for output to a display, and to provide tremendous acceleration for numerically intensive scientific applications. This paper also sheds some light on the major applicative areas where the GPU is in use and where the GPU can be used in future.

1. INTRODUCTION

Over the past few years, the graphics processing unit has rapidly come into prominence. Despite what its name suggests, the graphics processing unit (GPU) is not only for graphics processing but also for highly parallel programmable tasks. The GPU has the capability to adapt itself to the task and its processing requirements.

Recent years have seen a trend of using graphics processing units (GPUs) as accelerators for general-purpose computing. The inexpensive, single-chip, massively parallel architecture of the GPU has evidently brought factors of speedup to many numerical applications. However, the development of a high-quality GPU application is challenging, due to the large optimization space and the complex, unpredictable effects of optimizations on GPU program performance [11].

Feng and Xiao [10] stated that the graphics processing unit (GPU) has evolved from being a fixed-function processor with programmable stages into a programmable processor with many fixed-function components that deliver massive parallelism. By modifying the GPU's stream processor to support "general-purpose computation" on the GPU (GPGPU), applications that perform massive vector operations can realize many orders-of-magnitude improvement in performance over a traditional processor, i.e. the CPU.



Originally, the massive parallelism offered by the GPU only supported calculations for 3D computer graphics,

such as texture mapping, polygon rendering, vertex

rotation and translation, and oversampling and

interpolation to reduce aliasing. However, because many of

these graphics computations entail matrix and vector

operations, the GPU is also increasingly being used for

non-graphical calculations.

GPUs typically map well only to data-parallel or task-parallel applications whose execution requires relatively minimal communication between streaming multiprocessors (SMs) on the GPU [4], [6], [7], [9].

In the general case, the GPU comprises massive parallelism, with hundreds of cores in it. Thousands of threads are involved in its processing, and their execution is cheaper than execution on the CPU. A special programming methodology named CUDA is available for GPUs.
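As a concrete illustration, here is a minimal CUDA sketch (our own example, not taken from the paper): a single kernel is launched across roughly a million lightweight threads, each of which processes one array element independently. Names and sizes are illustrative.

    // Minimal CUDA sketch: one kernel, launched over thousands of threads.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void saxpy(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n)                                      // guard the array bound
            y[i] = a * x[i] + y[i];                     // independent per-element work
    }

    int main() {
        const int n = 1 << 20;                          // about one million elements
        float *x, *y;
        cudaMallocManaged(&x, n * sizeof(float));       // unified memory, for brevity
        cudaMallocManaged(&y, n * sizeof(float));
        for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        int block = 256;                                // threads per block
        int grid = (n + block - 1) / block;             // enough blocks to cover n
        saxpy<<<grid, block>>>(n, 2.0f, x, y);          // thousands of threads at once
        cudaDeviceSynchronize();                        // wait for the GPU to finish

        printf("y[0] = %f\n", y[0]);                    // expect 4.0
        cudaFree(x);
        cudaFree(y);
        return 0;
    }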

There is a myth in many people's minds that the GPU is designed only for graphics processing and graphics tasks. Basically, the GPU is designed for a particular class of applications with the following characteristics; over the past decade, developers have identified other applications with similar characteristics and successfully mapped these applications onto the GPU.

• Computational requirement is large.
• Parallelism is substantial.
• Throughput is more important than latency.

2. GPU APPLICATIVE AREAS: A REVIEW

2.1 Traditional Apps: Graphics and Gaming

Initially, there were plain VGA video cards for better graphics and resolution in personal computers. With time, the VGA card gained more power to process graphics and adjust colours. Further, it evolved into better graphics processing engines, named graphics processing units (GPUs). For the last decade, every laptop has had an in-built GPU along with the CPU. The most common use of the GPU is in gaming, for better visualization. In addition, higher resolution support is also possible due to GPU processors.

But the problem was that the GPU sits idle when there is no graphics processing job, so applications have changed a bit: the GPU is treated like another CPU. Modern applications work on this phenomenon.


2.2 Modern Applications

Nowadays, advanced GPUs deliver high compute capacity at low cost, making them attractive for use in scientific computing. The major application areas are listed below:

2.2.1 Parallel Computing:

• Specialized data-parallel computing is a special feature of advanced GPU computing, where the CPU and GPU work in parallel. A number of applications are based on this data-parallelism approach.

• Dan Chen et al. proposed a parallelized EEMD method, developed using general-purpose computing on the graphics processing unit (GPGPU) and named G-EEMD. Spectral entropy facilitated by G-EEMD was proposed to analyse EEG data for estimating the depth of anaesthesia (DoA) in a real-time manner [1].

The GPU is not restricted to analytical or theoretical work; a number of real-time applications are based on GPU processing.

2.2.2 Real-Time Applications:

• Stephen Rosenzweig et al. elaborated their work to develop a GPU-based real-time acoustic radiation force impulse (ARFI) data processing system for use in clinical feasibility studies of ARFI imaging. The two principal computational steps required to achieve this goal are data interpolation and displacement estimation [8].

• K. Zhang and J. U. Kang implemented numerical dispersion compensation for both standard and full-range Fourier-domain optical coherence tomography on the GPU architecture [5]. In terms of architecture, the GPU is built for different application demands than the CPU: large, parallel computation requirements with an emphasis on throughput rather than latency.

• The GPU also works well in cryptography, for example in encryption and decryption methods. On one side, where the CPU is doing encryption, the GPU works in parallel on key generation algorithms, as the sketch below illustrates. Similarly, the GPU works well on other tasks in parallel with the CPU to enhance the CPU's performance. This also utilises the GPU's power when the GPU would otherwise sit idle. Applications are not restricted only to the mentioned domains; countless areas are using the GPU and CPU together for a better working environment.
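The sketch below illustrates this overlap. Because CUDA kernel launches are asynchronous with respect to the host, the CPU can keep encrypting while the GPU derives keys; keygen_kernel and cpu_encrypt are hypothetical placeholders, not routines from any real cryptographic library.

    #include <cstddef>
    #include <cuda_runtime.h>

    // Hypothetical per-thread key derivation (toy arithmetic, not a real scheme).
    __global__ void keygen_kernel(unsigned int *keys, int n, unsigned int seed) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            keys[i] = (seed ^ i) * 2654435761u;
    }

    // Hypothetical host-side cipher standing in for the CPU's encryption work.
    void cpu_encrypt(unsigned char *buf, size_t len) {
        for (size_t i = 0; i < len; ++i) buf[i] ^= 0xAB;
    }

    int main() {
        const int n = 1 << 16;
        unsigned int *d_keys;
        cudaMalloc(&d_keys, n * sizeof(unsigned int));

        cudaStream_t stream;
        cudaStreamCreate(&stream);

        // The launch returns immediately, so CPU and GPU now work in parallel.
        keygen_kernel<<<(n + 255) / 256, 256, 0, stream>>>(d_keys, n, 12345u);

        static unsigned char data[1 << 20];
        cpu_encrypt(data, sizeof(data));    // CPU work overlaps the GPU kernel

        cudaStreamSynchronize(stream);      // join before the keys are consumed
        cudaStreamDestroy(stream);
        cudaFree(d_keys);
        return 0;
    }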

3. CPU-GPU INTERACTION

GPU processing is completely handled by the CPU, though the working of the two can differ: the CPU usually assigns tasks to the GPU. Data parallelism is also an example of this collaboration in CPU-GPU processing. To put some light on general CPU-GPU analogies: the CPU's inner loops correspond to the GPU's kernels; the CPU works on streams or data arrays while the GPU works on textures; and the CPU performs memory read operations where the GPU performs texture samples (the sketch after this paragraph makes the first correspondence concrete). The major goal is to promote data parallelism using CPU-GPU interaction. The most recent CPUs have up to 8-12 cores, while GPUs possess hundreds or even thousands of stream processor cores [2].
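A minimal sketch with illustrative names: the body of the CPU's inner loop becomes the body of the GPU kernel, and the loop index becomes the thread index.

    // CPU version: iterations of the inner loop run one after another.
    void scale_cpu(float *out, const float *in, int n) {
        for (int i = 0; i < n; ++i)
            out[i] = 2.0f * in[i];
    }

    // GPU version: the loop body becomes the kernel body; each thread
    // handles the "iteration" selected by its own index, all at once.
    __global__ void scale_kernel(float *out, const float *in, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = 2.0f * in[i];
    }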

The GPU has several key advantages over CPU architectures for highly parallel, compute-intensive workloads, including higher memory bandwidth, significantly higher floating-point throughput, and thousands of hardware thread contexts with hundreds of parallel compute pipelines executing programs in a SIMD fashion. The GPU provides parallel processing of large quantities of data relative to what can be provided by a general CPU. The GPU can be an attractive alternative to CPU clusters in high-performance computing environments [3].

Now, the question is: how is the GPU involved in achieving data parallelism? The following points underline the importance of the GPU in data parallelism:

• GPUs are designed for graphics and geared towards highly parallel tasks.
• GPUs process independent vertices and fragments. Temporary registers are zeroed, no shared or static data is used, and there are no read-modify-write buffers. Only parallel work remains, which achieves data-parallel processing.
• The GPU architecture is ALU-based, consisting of multiple vertex and pixel pipelines, with multiple ALUs per pipe.
• The GPU also hides memory latency with more computation.

3.1 Constraints in CPU-GPU computing

When working with the CPU and GPU together, a number of constraints are observed. The list of constraints is given below:

• High latency between CPU and GPU
• Handling control flow graphs
• I/O access
• Bit operations
• Limited data structures

Based on these constraints, a number of possible solutions are proposed in the next section. A simple way to quantify the first constraint, the CPU-GPU transfer latency, is sketched below.
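Assuming a CUDA platform, a host-to-device copy can be timed with CUDA events; the measured cost can then serve as the communication cost C used in Section 4. The payload size here is illustrative.

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        const size_t bytes = 1 << 20;            // 1 MiB payload
        void *h, *d;
        cudaMallocHost(&h, bytes);               // pinned host memory for honest timing
        cudaMalloc(&d, bytes);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);  // elapsed milliseconds
        printf("Host-to-device transfer of %zu bytes: %.3f ms\n", bytes, ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFreeHost(h);
        cudaFree(d);
        return 0;
    }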

4. PROPOSED SOLUTIONS TO BRIDGE THE GAP

The GPU is specially designed for graphics computing, but apart from this it is also capable of doing other general-purpose tasks. Which task will be performed by the CPU and which by the GPU depends on various parameters, and judging this for every task is a problem. It is done by decision making using thresholds. Some solutions, organised by category, are discussed below:

• Based on burst time, communication time and selection:
To keep a watch on the latency threshold, the burst time of the task/process has to be noted and compared with the threshold. For example, if the threshold is the communication cost C, then the GPU will be selected if the burst time is greater than twice the communication cost between CPU and GPU; otherwise the task will be processed by the CPU only (a host-side sketch of this rule appears after this list).


• Based on I/O access and memory access:
The GPU should have direct access to I/O and memory so that it does not require the CPU to interact with peripherals. In collaborative tasks a priority must be set for the CPU, i.e. the CPU is at higher priority than the GPU for I/O access.

• Based on the tasks:
Small and user-defined tasks will be processed by the GPU, while system processes will be taken by the CPU.

• Bulky tasks:
Bulky tasks can be divided equally between CPU and GPU processing, as happens in parallel computing (see the dispatch sketch after this list).

• Based on resources:
Based on resource allocation and the resource requirements of tasks, the GPU will take tasks with smaller resource requirements, while the CPU takes the other ones.
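The following host-side sketch combines two of the rules above: the burst-time threshold (select the GPU only if the burst time exceeds twice the communication cost C) and the equal split of bulky tasks. run_on_cpu and run_on_gpu are hypothetical task runners standing in for real implementations.

    #include <cstddef>
    #include <cstdio>

    // Hypothetical task runners; real code would launch a kernel or a CPU routine.
    void run_on_cpu(const float *data, size_t n) { std::printf("CPU: %zu items\n", n); }
    void run_on_gpu(const float *data, size_t n) { std::printf("GPU: %zu items\n", n); }

    // burst: estimated burst time of the task; C: measured CPU-GPU
    // communication cost (same time unit); bulky: flag for oversized tasks.
    void dispatch(const float *data, size_t n, double burst, double C, bool bulky) {
        if (bulky) {
            // Bulky task: divide the work equally between GPU and CPU.
            run_on_gpu(data, n / 2);
            run_on_cpu(data + n / 2, n - n / 2);
        } else if (burst > 2.0 * C) {
            // Burst time exceeds twice the communication cost: offload to the GPU.
            run_on_gpu(data, n);
        } else {
            // Otherwise the transfer overhead would outweigh the gain: stay on the CPU.
            run_on_cpu(data, n);
        }
    }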

Some revision of the GPU architecture is still required so that it will also have the decision-making capability of distributing tasks and work among other cores. The GPU now has a cache, the GPU has cores, and the GPU can perform certain specific tasks more easily than the CPU. But the GPU remains specialised for graphics and light tasks. As the 4th-generation i7 is trying to merge the GPU and CPU, there is a requirement to make the GPU as capable as the CPU. Merging CPU and GPU has some demerits: the memory is shared, the cache is shared, and the load may be taken by only one of them; if the graphics workload is bulky, it will stall normal processing work too. In the GPU architecture there are thousands of cores, which can easily cope with graphics and image parameters, but in the CPU we have at most 6 cores now. Therefore, it is not a solution to merge the CPU and GPU on a single hybrid processor.

5. CONCLUSION

The focus of this paper is on improved CPU-GPU interaction for efficient parallel computing, and on how this hybrid approach leads to parallel computing. The parametric constraints still need revision for efficient parallel processing. The paper has discussed this in brief, with some suitable solutions for choosing the processing unit, i.e. CPU or GPU.

A number of applicative areas are discussed where GPU-CPU collaboration is working well. A number of parameters are discussed in the paper which reflect the distribution of tasks between the CPU and GPU for improved and efficient computing. The paper also highlighted various constraints of the GPU and how they can be overcome. Further, the working will also be improved by a change in programming perspective; well-designed algorithms on CUDA can also improve the efficiency of CPU-GPU processing. This collaboration of CPU and GPU will surely enhance the efficiency of processing tasks.

REFERENCES

[1] Dan Chen, Duan Li, Muzhou Xiong, Hong Bao, and Xiaoli Li, 'GPGPU-Aided Ensemble Empirical-Mode Decomposition for EEG Analysis during Anesthesia', IEEE Transactions on Information Technology in Biomedicine, Vol. 14, No. 6, November 2010.

[2] Faliu Yi, Inkyu Moon, Jeong-A Lee, and Bahram Javidi, 'Fast 3D Computational Integral Imaging Using Graphics Processing Unit', Journal of Display Technology, Vol. 8, No. 12, December 2012.

[3] Fan Wu, Chung-han Chen, and Hira Narang, 'An Efficient Acceleration of Symmetric Key Cryptography Using General Purpose Graphics Processing Unit', Proceedings of the Fourth International Conference on Emerging Security Information, Systems and Technologies, pp. 228-233, 18-25 July 2010.

[4] G. M. Striemer and A. Akoglu, 'Sequence Alignment with GPU: Performance and Design Challenges', International Symposium on Parallel & Distributed Processing (IPDPS), Rome, pp. 1-10, 23-29 May 2009.

[5] K. Zhang and J. U. Kang, 'Real-time numerical dispersion compensation using graphics processing unit for Fourier-domain optical coherence tomography', Electronics Letters, Vol. 47, No. 5, pp. 309-310, March 2011.

[6] P. Harish and P. J. Narayanan, 'Accelerating Large Graph Algorithms on the GPU Using CUDA', Proceedings of the 14th International Conference on High Performance Computing (HiPC'07), pp. 197-208, December 2007.

[7] S. Ryoo, C. I. Rodrigues, S. S. Baghsorkhi, S. S. Stone, D. B. Kirk, and W. W. Hwu, 'Optimization Principles and Application Performance Evaluation of a Multithreaded GPU Using CUDA', Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 73-82, February 2008.

[8] Stephen Rosenzweig, Mark Palmeri, and Kathryn Nightingale, 'GPU-Based Real-Time Small Displacement Estimation With Ultrasound', IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control, Vol. 58, No. 2, February 2011.

[9] V. Volkov and B. Kazian, 'Fitting FFT onto the G80 Architecture', pp. 25-29, April 2006. http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project6-report.pdf

[10] Wu-chun Feng and Shucai Xiao, 'To GPU Synchronize or Not GPU Synchronize?', Proceedings of the 2010 IEEE International Symposium on Circuits and Systems (ISCAS), Paris, pp. 3801-3804, 30 May - 2 June 2010.

[11] Yixun Liu, Eddy Z. Zhang, and Xipeng Shen, 'A Cross-Input Adaptive Framework for GPU Program Optimizations', International Symposium on Parallel & Distributed Processing (IPDPS), Rome, pp. 1-10, 23-29 May 2009.
