Performance Primitives for Massive Multithreading
P J Narayanan
Centre for Visual Information Technology
IIIT, Hyderabad
Lessons from GPU Computing
• Massively multithreaded: several thousand to millions of threads needed for good performance
• Good performance depends on many factors
– Resource utilization: shared memory, registers
– Memory access: locality, arithmetic intensity
• The optimum point may change with the architecture
– Retuning is infeasible for every developer
• Solution: use standard libraries or primitives
– Implemented well, keeping the trade-offs in mind
– Used by everyone: build your algorithms using them
What are the primitives?
• Standard data-parallel primitives
– scan, reduce
– sort, split
• But also:
– segmented split
– scatter, gather, data-copy
– transpose
• Could have domain-specific primitives
– Graph theory, numerical algorithms
– Computer vision, image processing
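A minimal sequential sketch of two of these primitives, to fix their semantics (function names are mine, not CUDPP's API; real GPU versions run these in O(log n) parallel steps):

```python
def inclusive_scan(xs, op=lambda a, b: a + b):
    """Inclusive prefix scan: out[i] = xs[0] op xs[1] op ... op xs[i]."""
    out, acc = [], None
    for x in xs:
        acc = x if acc is None else op(acc, x)
        out.append(acc)
    return out

def split(items, keys):
    """Stable split: reorder items so that equal keys are grouped,
    preserving the input order within each group."""
    order = sorted(range(len(items)), key=lambda i: keys[i])
    return [items[i] for i in order]

print(inclusive_scan([3, 1, 7, 0, 4]))            # [3, 4, 11, 11, 15]
print(split(['a', 'b', 'c', 'd'], [1, 0, 1, 0]))  # ['b', 'd', 'a', 'c']
```

The stability of split is what later lets the K-Means pipeline group vectors by cluster id without scrambling ties.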
Computing Using Primitives
• A typical program will/should have 75-80% of its work done through such primitives
• The application developer writes glue kernels to connect and clean up the components
– The code for this is simple and perhaps unchanging
– Even inefficient implementations are non-critical
• Example: a program with running time T uses primitives for 75% of its operations, and a new architecture doubles their performance
• New running time (with no speedup for the non-primitive part):
– 0.5 × (0.75 T) + 0.25 T = 0.625 T, instead of the ideal 0.5 T
– 0.6 T if 80% used primitives, and 0.55 T if 90%
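The arithmetic above is an Amdahl-style estimate and can be checked directly (variable names are mine):

```python
def new_time(T, f, speedup=2.0):
    """Running time when only the primitive fraction f of the work
    gets the architectural speedup; the rest runs unchanged."""
    return f * T / speedup + (1 - f) * T

for f in (0.75, 0.80, 0.90):
    print(f, new_time(1.0, f))  # 0.625, 0.6, 0.55 of T
```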
Primitive vs Library
• Both are motivated by similar thinking: reuse!
• A primitive is typically an algorithmic step that finds diverse use
– Used as a low-level step of an algorithm
• A library function provides end-to-end functionality
– Used to achieve a high-level functionality
– Could be a “primitive” at a sufficiently high level!
• Use a library if available. It avoids development effort, even compared to using primitives!
K-Means Clustering
• An iteration (with N vectors of d dimensions and K clusters):
– Each vector finds its distance to each cluster center
• O(NKd) operations
– Each vector attaches itself to the closest centre and takes its label
• O(NK) operations to find the minimum distance
– Compute the mean of each cluster, i.e. of the vectors with the same label
• O(Nd) operations to find the K means
• GPU implementation of clustering of 128-dimensional SIFT vectors, a frequent problem in Computer Vision.
SIFT Clustering
• Problem: cluster a few (4-8) million 128-dimensional SIFT vectors into a few (1-2) thousand clusters using K-Means
• Representation: row major, i.e. the d components of each of the N vectors stored together, tightly (N rows of d each)
• Given: initial cluster means (could be random vectors)
• Output: K cluster means and N labels, one for each input vector, giving cluster membership
• Large amount of computation; well suited to a GPU-like architecture
Data Representation
[Figure: input vectors as an N × d array in row major, cluster centers as a K × d array in row major, and a length-N cluster labels array]
Distance Computation
1. Loop over the K clusters, loading c cluster centers into shared memory at a time.
2. A block of t threads loops over all d components of t input vectors, loading component v_i and accumulating (C_i − v_i)^2.
3. Write the distances into a K × N array, with the K distances for a vector stored consecutively.

Shared memory is used to the maximum, and all memory accesses are perfectly coalesced.
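A sequential sketch of the quantity this kernel computes (pure Python, names are mine; the real kernel tiles the centers through shared memory rather than looping per element):

```python
def distance_matrix(vectors, centers):
    """dist[k][n] = squared Euclidean distance from vector n to center k,
    mirroring the K x N layout the kernel writes out."""
    return [[sum((c - x) ** 2 for c, x in zip(center, vec))
             for vec in vectors]
            for center in centers]

vecs = [[0, 0], [3, 4]]
cens = [[0, 0], [3, 0]]
print(distance_matrix(vecs, cens))  # [[0, 25], [9, 16]]
```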
After Distance Evaluations
[Figure: the K × N vector-to-cluster distance matrix, one row per cluster center]
Finding Closest Center
• We need the index of the centre that gave the minimum distance.
• A block of t threads loads t distances for a particular centre, keeping track of the minimum distance and the corresponding index across the K centers.
• Write the index into a new labels array of length N.

All memory accesses are perfectly coalesced.
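Sequentially, this step is an argmin down each column of the K × N distance matrix (an illustrative sketch; the kernel keeps a running minimum instead of re-reading the matrix):

```python
def closest_centers(dist):
    """dist is K x N; return, for each of the N vectors, the index of
    the nearest center -- the labels array of length N."""
    K, N = len(dist), len(dist[0])
    return [min(range(K), key=lambda k: dist[k][n]) for n in range(N)]

print(closest_centers([[0, 25], [9, 16]]))  # [0, 1]
```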
New Cluster Centers
• The new labels are given in the input vector order.
• Next step: find the mean of all vectors with the same label. Find their sum first.
• Rearrange the input vectors so that the vectors of each category are placed together.
• Column-major storage makes the memory accesses non-coalesced and inefficient.
• Rearrange and convert to row major. Summing is easy thereafter!
Finding New Centers
1. gIndex = splitGatherIndex(newLabels)
2. dCopy = gather(inputVectors, gIndex)
3. temp = transpose(dCopy)
4. Perform a segmented add-reduce of temp, with segments at label boundaries; store the results in a d × K array newCenters
5. inputVectors = transpose(dCopy)
6. centers = transpose(newCenters)

Now the input vectors are rearranged, with new cluster centers. (We also need to keep track of a composition of the gIndex values to maintain the connection to the original input vectors.)
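The segmented add-reduce of step 4, followed by the division by cluster size, amounts to a per-label mean. A sequential sketch (names are mine, not the CUDPP API):

```python
def new_centers(vectors, labels, K):
    """Sum the vectors of each label (a segmented add-reduce) and
    divide by the segment size to get each new cluster center."""
    d = len(vectors[0])
    sums = [[0.0] * d for _ in range(K)]
    counts = [0] * K
    for vec, lab in zip(vectors, labels):
        counts[lab] += 1
        for j, x in enumerate(vec):
            sums[lab][j] += x
    # empty clusters keep their zero sum rather than dividing by zero
    return [[s / counts[k] for s in sums[k]] if counts[k] else sums[k]
            for k in range(K)]

print(new_centers([[0, 0], [2, 2], [4, 0]], [0, 0, 1], 2))
# [[1.0, 1.0], [4.0, 0.0]]
```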
• Input: input vectors (n), cluster centers, dim, k
• Output: new membership array (n × 1), new cluster centers (k × dim), global index (n × 1)
Storage per Block
[Figure: four input vectors and one center of dim components held per block]

Four input vectors are loaded per block, and their differences from the center are stored in shared memory, consuming 2 × 2048 bytes; the center is also in shared memory. On the differences we perform tree-based addition for each vector.
Algorithm Flow
• Perform distance evaluations between the input and the current centers to generate the new membership array
• Apply split sort on the membership array, sorting by cluster center ids
• Create a flag array and perform a segmented scan to get a histogram for each cluster
• Rearrange the data as per cluster ids
• Perform a transpose on the rearranged data for coalesced access
• Use CUDPP segmented scan on the rearranged data, followed by CUDPP compact, to extract the summation
• Divide the summation by the histogram generated for each cluster to get the new cluster centers
• Update the global index
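The flag-plus-segmented-scan step can be sketched sequentially as follows (illustrative names; after split sort, a flag marks the start of each cluster's segment, and scanning ones within each segment yields the cluster's population):

```python
def segment_histogram(sorted_labels):
    """From labels already grouped by split sort, build the segment-start
    flags and each cluster's count via a segmented scan of ones."""
    n = len(sorted_labels)
    flags = [1 if i == 0 or sorted_labels[i] != sorted_labels[i - 1] else 0
             for i in range(n)]
    scan, acc = [], 0
    for f in flags:
        acc = 1 if f else acc + 1  # restart the running count at each flag
        scan.append(acc)
    # the last scan value in each segment is that cluster's count
    counts = [scan[i] for i in range(n) if i == n - 1 or flags[i + 1]]
    return flags, counts

print(segment_histogram([0, 0, 0, 1, 1, 2]))
# ([1, 0, 0, 1, 0, 1], [3, 2, 1])
```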
• The global index is initialized by GlobalIndex[i] = i
• After sorting the membership array, we have sorted_membership_index[], i.e. the order in which the vectors are to be arranged
• The sorted membership index after split sort is used to get the global index: GlobalIndex[sorted_membership_index[i]] = i
• In the final global index, i is the actual vector id of the input vectors, and GlobalIndex[i] is the position of the i-th vector in the final rearranged input data
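The update rule GlobalIndex[sorted_membership_index[i]] = i is an inverse permutation; a tiny sketch (names are mine):

```python
def update_global_index(sorted_membership_index):
    """Invert the split-sort permutation: for each rearranged position i,
    record where vector sorted_membership_index[i] now lives."""
    g = [0] * len(sorted_membership_index)
    for i, vec_id in enumerate(sorted_membership_index):
        g[vec_id] = i
    return g

# Vectors 2, 0 and 1 were placed at rearranged positions 0, 1 and 2:
print(update_global_index([2, 0, 1]))  # [1, 2, 0]
```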
Distance Evaluation
• A sequential approach takes O(dim) steps
• Simple tree-based parallel approach
• Takes O(log(dim)) steps to evaluate the net distance
• In a block, only 256/2^i threads are active during the i-th iteration of a distance evaluation
• Effectively performed in shared memory
• Reduces the number of steps from dim to log(dim)
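A sequential analogue of the tree-based addition (a sketch assuming, as on these slides, that the length is a power of two such as 128; each sweep models one parallel step in which half the threads stay active):

```python
def tree_reduce(xs):
    """Pairwise tree addition in log2(n) sweeps, mirroring the
    shared-memory reduction: 'thread' tid adds element tid + stride."""
    xs = list(xs)
    stride = len(xs) // 2
    while stride > 0:
        for tid in range(stride):        # only `stride` threads active
            xs[tid] += xs[tid + stride]
        stride //= 2
    return xs[0]

print(tree_reduce([1, 2, 3, 4, 5, 6, 7, 8]))  # 36, in log2(8) = 3 sweeps
```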
Distance Evaluation
[Figure: tree-based addition of 8 values in log 8 = 3 iterations; the number of partial sums halves each iteration until one total remains]
Algorithm for Distance evaluation
• Algorithm (Input: d_input, d_centers, dim, no_centers)

for i = 0 to no_centers − 1 do
    shared[threadIdx.x] = (d_input[id] − d_centers[i])^2
    for j = dim/2 down to 1, halving j each time do
        if threadIdx.x < j then
            shared[threadIdx.x] += shared[threadIdx.x + j]
        end if
        __syncthreads()
    end for
    if min > shared[0] then
        min = shared[0]
    end if
end for
Kernel Level Execution
[Figure: thread activity per iteration for dim = 128 — 256 threads active initially, then 128, 64, ..., down to 1 in the final iteration]

• Every iteration, the number of active threads reduces by a factor of 2
Kernel Functions
• Distance – evaluates the distance between vectors (block 128 × 4, grid n/4p × p)
• Get_long_membership – creates a variable of type long containing the membership id and the corresponding vector id
• SplitSort – sorts the membership array by cluster ids
• CUDPP segmented scan – scan operation on the sorted membership array
• Get_flag – generates flags for the CUDPP operations (block 256 × 1)
• Gather_histogram – gathers the final values after the scan
• Rearrange_data – arranges the input by cluster ids (block 128 × 4, grid n/4p × p)
• Transpose – transposes the rearranged data
• CUDPPCompact – extracts the summed-up center values
Rearranging data
[Figure: the N × d input vectors in row major are rearranged, using the sorted membership array, so that vector ids belonging to the same cluster (e.g. 45, 59, ..., 23) are placed together, still in row major]
Center Evaluation
[Figure: rearranged input vectors (one row per vector id, dim columns) and their transpose (one row per dimension, one column per vector id)]

We may apply a segmented scan on the transposed vectors, which is a coalesced operation; the flag values can be obtained from the histograms generated for each cluster.
Global Index
[Figure: the global index array of length n, indexed by vector id, across iterations up to the final one]

• The global index array is updated after every iteration
• GlobalIndex[membership_sorted_index[i]] = i
Why use Split Sort, Transpose?
• Evaluating the new centers requires concurrent writes, which are not easily parallelizable
• Split sort orders the membership array, grouping vector ids belonging to the same cluster together
• This helps rearrange the entire set of input vectors as per their clusters
• Transpose provides coalesced access for center evaluation using segmented scan
Issues
• As the input size increases, the major portion of the time is consumed by distance evaluations
• The input size and the number of clusters largely control the performance
Result
• Kmeans++ is used to generate the initial centers
• Time taken to generate the initial cluster centers:

Input size | Cluster centers | CPU (P4, 2.4 GHz) | GPU (GTX 280)
1,000      | 80              | 4,480 ms          | 12.177 ms
10,000     | 800             | 39,341.2 ms       | 670.06 ms
100,000    | 8,000           | 897,326.5 ms      | 62,547.035 ms
1 million  | 80,000          | 9,943,472.8 ms    | 126,392.1 ms
Results
• Variation with the number of input vectors (128 dimensions)
• Time taken per iteration to generate the new membership array and new cluster centers (excluding the time for kmeans++):

Input size | Cluster centers | CPU (P4, 2.4 GHz) | GPU (GTX 280)
1,000      | 80              | 370 ms            | 9.91 ms
10,000     | 800             | 82,900 ms         | 487.3 ms
100,000    | 8,000           | 679,923.1 ms      | 36,623.58 ms
1 million  | 8,000           | 5,189,450.4 ms    | 45,789.29 ms
Result
• Variation with the number of cluster centers
• N = 100,000, dimension = 128

Input size | Cluster centers | GPU (GTX 280)
100,000    | 500             | 2,188.91 ms
100,000    | 1,000           | 4,486.37 ms
100,000    | 2,000           | 9,241.58 ms
100,000    | 4,000           | 18,419.71 ms
100,000    | 8,000           | 36,623.58 ms
Result
• Variation with the dimension of the SIFT vector
• N = 100,000, cluster centers = 8,000

Input size | Dimension | GPU (GTX 280)
100,000    | 16        | 118.91 ms
100,000    | 32        | 997.3 ms
100,000    | 64        | 8,623.58 ms
100,000    | 128       | 36,623.58 ms
Result
• Coalesced vs non-coalesced
• The coalesced version involves a transpose followed by a segmented scan; the non-coalesced version involves a gather followed by a segmented scan

Input size – Cluster centers | Non-coalesced | Coalesced
1,000 – 80                   | 0.043 ms      | 0.077 ms
10,000 – 800                 | 1.28 ms       | 0.217 ms
100,000 – 8,000              | 15.45 ms      | 1.955 ms
100,000 – 80,000             | 83.24 ms      | 19.343 ms
Result
• Membership vs new centers
• The membership generation consumes the major chunk of the time

Input size – Cluster centers | Membership   | New centers
1,000 – 80                   | 4.23 ms      | 5.68 ms
10,000 – 800                 | 369.07 ms    | 118.23 ms
100,000 – 8,000              | 36,465.2 ms  | 158.38 ms
1,000,000 – 8,000            | 45,559.48 ms | 229.83 ms