List Ranking and Parallel Prefix
Sathish Vadhiyar
List Ranking on GPUs
Linked list prefix computations – computing prefix sums over the elements contained in a linked list
The linked list is represented as an array
Irregular memory accesses – the successor of each node of a linked list can be located anywhere in the array
List ranking – a special case of list prefix computations in which all the values are the identity, i.e., 1
List ranking
L is a singly linked list; each node contains two fields – a data field and a pointer to its successor
Prefix sums – each node's data field is updated with the sum of the values of its predecessors and itself
L is represented by an array X with fields X[i].prefix and X[i].succ
Sequential Algorithm
Simple and effective; two passes
Pass 1: identify the head node
Pass 2: traverse from the head, following the successor pointers and accumulating the prefix sums in traversal order
Works well in practice
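The two passes above can be sketched in plain C++ (a CPU sketch; the Node field names mirror the X[i].prefix / X[i].succ convention of the slides, and succ = -1 marking the tail is an assumption):

```cpp
#include <vector>

// Node of the array-represented list: a data value (overwritten with the
// prefix sum) and the index of its successor; succ == -1 marks the tail.
struct Node { int prefix; int succ; };

// Two-pass sequential list ranking: pass 1 finds the head (the one index
// that never appears as a successor), pass 2 accumulates prefix sums in
// traversal order.
void sequential_list_ranking(std::vector<Node>& X) {
    int n = X.size();
    std::vector<bool> is_successor(n, false);
    for (const Node& node : X)                 // pass 1: mark successors
        if (node.succ != -1) is_successor[node.succ] = true;
    int head = 0;
    for (int i = 0; i < n; ++i)
        if (!is_successor[i]) { head = i; break; }
    int sum = 0;
    for (int i = head; i != -1; i = X[i].succ) // pass 2: accumulate
        X[i].prefix = (sum += X[i].prefix);
}
```

With all values set to 1, the prefix field of each node becomes its rank (distance from the head).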
Parallel Algorithm: Prefix computations on arrays
The array X is partitioned into subarrays
Local prefix sums of each subarray are calculated in parallel
The prefix sum of the last element of each subarray is written to a separate array Y
Prefix sums of the elements in Y are calculated
Each prefix sum of Y is added to the corresponding block of X
A divide and conquer strategy
Example
Input: 1 2 3 4 5 6 7 8 9
Divide: 1 2 3 | 4 5 6 | 7 8 9
Local prefix sums: 1,3,6 | 4,9,15 | 7,15,24
Last elements: 6, 15, 24; their prefix sums: 6, 21, 45
Result: 1, 3, 6, 10, 15, 21, 28, 36, 45
Divide
Local prefix sum
Passing last elements to a processor
Computing prefix sum of last elements on the processor
Adding global prefix sum to local prefix sums in each processor
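The steps above can be sketched sequentially in C++ (a CPU sketch; the loop over blocks is what the processors would run in parallel, and the block count s is a parameter chosen by the caller):

```cpp
#include <algorithm>
#include <vector>

// Blocked (divide-and-conquer) prefix sum over an array, mirroring the
// steps above: local prefix sums per block, prefix sums of the blocks'
// last elements (array Y), then adding each Y value back to its block.
std::vector<int> blocked_prefix_sum(std::vector<int> x, int s) {
    int n = x.size();
    int block = (n + s - 1) / s;                        // subarray size
    std::vector<int> y;                                 // last prefix of each block
    for (int b = 0; b < n; b += block) {                // local prefix sums
        for (int i = b + 1; i < std::min(b + block, n); ++i)
            x[i] += x[i - 1];
        y.push_back(x[std::min(b + block, n) - 1]);
    }
    for (size_t j = 1; j < y.size(); ++j)               // prefix sums of Y
        y[j] += y[j - 1];
    for (int b = block, j = 0; b < n; b += block, ++j)  // add Y back to blocks
        for (int i = b; i < std::min(b + block, n); ++i)
            x[i] += y[j];
    return x;
}
```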
Prefix computation on list
The previous strategy cannot be applied directly here
Dividing the array X that represents the list produces subarrays, each of which can contain many sublist fragments
A head node would have to be found for each fragment
Parallel List Ranking (Wyllie’s algorithm)
Involves repeated pointer jumping
The successor pointer of each element is repeatedly updated so that it jumps over its successor, until it reaches the end of the list
As each thread traverses and updates its successor, the ranks are updated
A process or thread is assigned to each element of the list
Parallel List Ranking (Wyllie’s algorithm)
Leads to heavy synchronization among threads
In CUDA, this means many kernel invocations
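A sequential simulation of the pointer jumping, assuming -1 marks the end of the list. Note that this classic formulation computes each node's distance to the tail (a suffix count), from which the position from the head follows as n - rank + 1; the double-buffered copies stand in for the synchronization between GPU rounds:

```cpp
#include <vector>

// Wyllie's pointer jumping, simulated sequentially: each round, every node
// adds its successor's rank and jumps over it; after ceil(log2 n) rounds
// every successor pointer has reached the end of the list. On a GPU, one
// thread handles each node, with a synchronization between rounds.
void wyllie_list_ranking(std::vector<int>& rank, std::vector<int>& succ) {
    int n = rank.size();
    bool changed = true;
    while (changed) {
        changed = false;
        std::vector<int> new_rank = rank, new_succ = succ;  // round buffer
        for (int i = 0; i < n; ++i) {
            if (succ[i] != -1) {
                new_rank[i] += rank[succ[i]];  // absorb successor's rank
                new_succ[i] = succ[succ[i]];   // jump over the successor
                changed = true;
            }
        }
        rank = new_rank;
        succ = new_succ;
    }
}
```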
Parallel List Ranking (Helman and JaJa)
Randomly select s nodes, called splitters; the head node is also a splitter
Form s sublists: in each sublist, start from a splitter as the head node and traverse until another splitter is reached
Form prefix sums in each sublist
Form another list, L', consisting of only these splitters in the order they are traversed; the value of each entry of this list is the prefix sum total calculated in the respective sublist
Calculate prefix sums for this list
Add these sums to the values of the sublists
Parallel List Ranking on GPUs: Steps
Step 1: Compute the location of the head of the list
Each of the indices between 0 and n-1, except the head node, occurs exactly once among the successors
Hence head node = n(n-1)/2 – SUM_SUCC, where SUM_SUCC = sum of the successor values
The sum can be computed on the GPU using a parallel reduction
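Step 1 as code (a CPU sketch; the summation loop is what the GPU would do as a parallel reduction). Assuming the tail is marked by succ = -1, the marker contributes -1 to the sum, hence the +1 correction relative to the formula above:

```cpp
#include <vector>

// Head of the list by arithmetic, no traversal: indices 0..n-1 sum to
// n(n-1)/2, and every index except the head appears exactly once among the
// successors. The tail's succ = -1 marker shifts the sum down by one, so
// head = n(n-1)/2 - (sum_succ + 1).
int find_head(const std::vector<int>& succ) {
    long long n = succ.size(), sum_succ = 0;
    for (int s : succ) sum_succ += s;   // parallel reduction on the GPU
    return static_cast<int>(n * (n - 1) / 2 - (sum_succ + 1));
}
```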
Parallel List Ranking on GPUs: Steps
Step 2: Select s random nodes to split list into s random sublists
For every subarray of X of size n/s, select a random location as a splitter
Highly data parallel; the selections can be made independently of each other
Parallel List Ranking on GPUs: Steps
Step 3: Using standard sequential algorithm, compute prefix sums of each sublist separately
The most computationally demanding step
The s sublists are allocated equally among CUDA blocks, and then equally among the threads in a block
Each thread computes the prefix sums of each of its sublists, and copies the prefix value of the last element of sublist i to Sublist[i]
Parallel List Ranking on GPUs: Steps
Step 4: Compute prefix sum of splitters, where the successor of a splitter is the next splitter encountered when traversing the list
This list is small; hence it can be ranked on the CPU
Parallel List Ranking on GPUs: Steps
Step 5: Update values of prefix sums computed in step 3 using splitter prefix sums of step 4
This can be done using coalesced memory accesses – accesses by threads to contiguous locations
Choosing s
Larger values of s increase the chance that threads deal with equal numbers of nodes (better load balance)
However, too large a value of s incurs the overhead of sublist creation and aggregation
Parallel Prefix on GPUs
Using a binary tree
An upward reduction phase (reduce phase or up-sweep phase): the tree is traversed from the leaves to the root, forming partial sums at the internal nodes
A down-sweep phase: the tree is traversed from the root to the leaves, using the partial sums computed in the reduction phase
Up Sweep
Down Sweep
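The two phases can be checked against a CPU mirror of the work-efficient (Blelloch) exclusive scan; the inner loops over thid correspond to the parallel threads of the device code, and n must be a power of two:

```cpp
#include <vector>

// CPU mirror of the up-sweep / down-sweep exclusive scan: the up-sweep
// builds partial sums in place up the tree, the down-sweep clears the root
// and distributes the partial sums back down.
void blelloch_scan(std::vector<float>& temp) {
    int n = temp.size(), offset = 1;
    for (int d = n >> 1; d > 0; d >>= 1) {   // up-sweep: partial sums
        for (int thid = 0; thid < d; ++thid) {
            int ai = offset * (2 * thid + 1) - 1;
            int bi = offset * (2 * thid + 2) - 1;
            temp[bi] += temp[ai];
        }
        offset *= 2;
    }
    temp[n - 1] = 0;                         // clear the root
    for (int d = 1; d < n; d *= 2) {         // down-sweep: distribute sums
        offset >>= 1;
        for (int thid = 0; thid < d; ++thid) {
            int ai = offset * (2 * thid + 1) - 1;
            int bi = offset * (2 * thid + 2) - 1;
            float t = temp[ai];
            temp[ai] = temp[bi];
            temp[bi] += t;
        }
    }
}
```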
Host Code
int main() {
    const unsigned int num_threads = num_elements / 2;
    /* cudaMalloc d_idata and d_odata */
    cudaMemcpy(d_idata, h_data, mem_size, cudaMemcpyHostToDevice);

    // One block: the kernel below scans a single shared-memory window,
    // and the dynamic shared-memory size is the third launch argument
    dim3 grid(1, 1, 1);
    dim3 threads(num_threads, 1, 1);
    unsigned int shared_mem_size = sizeof(float) * num_elements;
    scan_workefficient<<<grid, threads, shared_mem_size>>>(d_odata, d_idata, num_elements);

    cudaMemcpy(h_data, d_odata, sizeof(float) * num_elements, cudaMemcpyDeviceToHost);
    /* cudaFree d_idata and d_odata */
}
Device Code
__global__ void scan_workefficient(float *g_odata, float *g_idata, int n)
{
    // Dynamically allocated shared memory for scan kernels
    extern __shared__ float temp[];
    int thid = threadIdx.x;
    int offset = 1;

    // Cache the computational window in shared memory
    temp[2*thid]   = g_idata[2*thid];
    temp[2*thid+1] = g_idata[2*thid+1];

    // Build the sum in place up the tree
    for (int d = n>>1; d > 0; d >>= 1)
    {
        __syncthreads();
        if (thid < d)
        {
            int ai = offset*(2*thid+1)-1;
            int bi = offset*(2*thid+2)-1;
            temp[bi] += temp[ai];
        }
        offset *= 2;
    }

    // Scan back down the tree: clear the last element
    if (thid == 0)
        temp[n - 1] = 0;

    // Traverse down the tree building the scan in place
    for (int d = 1; d < n; d *= 2)
    {
        offset >>= 1;
        __syncthreads();
        if (thid < d)
        {
            int ai = offset*(2*thid+1)-1;
            int bi = offset*(2*thid+2)-1;
            float t   = temp[ai];
            temp[ai]  = temp[bi];
            temp[bi] += t;
        }
    }
    __syncthreads();

    // Write results to global memory
    g_odata[2*thid]   = temp[2*thid];
    g_odata[2*thid+1] = temp[2*thid+1];
}
References
M. S. Rehman, K. Kothapalli, and P. J. Narayanan. Fast and Scalable List Ranking on the GPU. ICS 2009.
Z. Wei and J. JaJa. Optimization of Linked List Prefix Computations on Multithreaded GPUs Using CUDA. IPDPS 2010.