CS 395 Last Lecture
Summary, Anti-summary, and Final Thoughts
Summary (1) Architecture
• Modern architecture designs are driven by energy constraints
• Shortening latencies is too costly, so we use parallelism in hardware to increase potential throughput
• Some parallelism is implicit (out-of-order superscalar processing), but it has limits
• Other forms are explicit (vectorization and multithreading) and rely on software to unlock them
Summary (2) Memory
• Memory technologies trade off energy and cost for capacity, with SRAM registers on one end and spinning platter hard disks on the other
• Locality (relationships between memory accesses) can help us get the best of all cases
• Caching is the hardware-only solution to capturing locality, but software-driven solutions exist too (memcache for files, etc.)
Summary (3) Software
• Want to fully occupy your hardware?
  – Express locality (tiling)
  – Vectorize (compiler or manual)
  – Multithread (e.g. OpenMP)
  – Accelerate (e.g. CUDA, OpenCL)
• Take the cost into consideration. Unless you’re optimizing in your free time, your time isn’t free.
Research Perspective (2010)
• Can we generalize and categorize the most important, generally applicable GPU Computing software optimizations?
  – Across multiple architectures
  – Across many applications
• What kinds of performance trends are we seeing from successive GPU generations?
• Conclusion – GPUs aren’t special, and parallel programming is getting easier
Application Survey
• Surveyed the GPU Computing Gems chapters
• Studied the Parboil benchmarks in detail
Results:
• Eight (for now) major categories of optimization transformations
  – Performance impact of individual optimizations on certain Parboil benchmarks included in the paper
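1. (Input) Data Access Tiling
[Figure: three data paths. Direct access to DRAM; DRAM to cache through an implicit copy, followed by local access; DRAM to scratchpad through an explicit copy, followed by local access.]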
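As a rough illustration of the explicit-copy path, the sketch below multiplies two square matrices tile by tile. It is not taken from the lecture: the function name `matmul_tiled`, the `tile` parameter, and the use of a small Python list as a stand-in for a scratchpad are all assumptions made for the example.

```python
def matmul_tiled(a, b, n, tile=2):
    """Multiply two n x n matrices (lists of lists), one tile of b at a time."""
    c = [[0.0] * n for _ in range(n)]
    for kk in range(0, n, tile):
        k_end = min(kk + tile, n)
        for jj in range(0, n, tile):
            j_end = min(jj + tile, n)
            # Explicit copy: stage one tile of b in a small local buffer,
            # which plays the role of a scratchpad in this sketch.
            b_tile = [b[k][jj:j_end] for k in range(kk, k_end)]
            # Local access: every row of a reuses the staged tile.
            for i in range(n):
                for k in range(kk, k_end):
                    a_ik = a[i][k]
                    row = b_tile[k - kk]
                    for j in range(jj, j_end):
                        c[i][j] += a_ik * row[j - jj]
    return c
```

Each staged tile is read once from the large matrix and then reused for all `n` rows of `a`, which is the kind of reuse tiling is meant to expose.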
2. (Output) Privatization
• Avoid contention by aggregating updates locally
• Requires storage resources to keep copies of data structures
[Figure: private results are merged into local results, which are merged into global results]
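A minimal sketch of privatization, using a histogram as the contended data structure. The names `private_histogram` and `histogram_privatized` and the worker count are assumptions for illustration; Python threads are used only to show the structure, not to demonstrate a speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def private_histogram(chunk, num_bins):
    """Count values in one chunk into a histogram owned by a single worker."""
    hist = [0] * num_bins
    for value in chunk:
        hist[value] += 1
    return hist


def histogram_privatized(data, num_bins, num_workers=4):
    """Build per-worker private histograms, then merge them once at the end."""
    chunk_size = max(1, -(-len(data) // num_workers))  # ceiling division
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(lambda c: private_histogram(c, num_bins), chunks))
    # Single merge step: the only place where the shared result is updated.
    result = [0] * num_bins
    for hist in partials:
        for b, count in enumerate(hist):
            result[b] += count
    return result
```

The cost named on the slide is visible here: every worker holds its own full copy of the histogram until the merge.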
Running Example: SpMV
Ax = v
[Figure: sparse matrix A stored as Row, Col, and Data arrays, with vectors x and v]
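For reference, a minimal sketch of sparse matrix-vector multiplication (SpMV). It assumes the Row, Col, and Data labels describe a coordinate (COO) layout with one entry per nonzero; that reading, and the name `spmv_coo`, are assumptions rather than something the slides state.

```python
def spmv_coo(rows, cols, data, x, num_rows):
    """Compute v = A x, where A is given as parallel COO arrays."""
    v = [0.0] * num_rows
    for r, c, a in zip(rows, cols, data):
        # Each nonzero A[r][c] contributes a * x[c] to output element v[r].
        v[r] += a * x[c]
    return v


# A = [[1, 0, 2],
#      [0, 3, 0],
#      [4, 0, 5]]
rows = [0, 0, 1, 2, 2]
cols = [0, 2, 1, 0, 2]
data = [1.0, 2.0, 3.0, 4.0, 5.0]
v = spmv_coo(rows, cols, data, [1.0, 2.0, 3.0], 3)
```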
3. “Scatter to Gather” Transformation
Ax = v
[Figure: sparse matrix A (Row, Col, and Data arrays), input vector x, and output vector v]
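One way to picture the transformation on SpMV, sketched under assumptions: a scatter formulation walks the nonzeros and updates whichever output element each one belongs to, while a gather formulation assigns one unit of work per output element and reads the inputs it needs. The sketch regroups nonzeros by row (a CSR-style layout) so each output can gather its own entries. The helper names `coo_to_csr` and `spmv_gather` are hypothetical.

```python
def coo_to_csr(rows, cols, data, num_rows):
    """Regroup COO nonzeros by row so each output row owns a contiguous range."""
    row_ptr = [0] * (num_rows + 1)
    for r in rows:
        row_ptr[r + 1] += 1                 # count nonzeros per row
    for r in range(num_rows):
        row_ptr[r + 1] += row_ptr[r]        # prefix sum gives row start offsets
    next_slot = row_ptr[:-1]                # slicing copies the start offsets
    csr_cols = [0] * len(data)
    csr_data = [0.0] * len(data)
    for r, c, a in zip(rows, cols, data):
        slot = next_slot[r]
        csr_cols[slot] = c
        csr_data[slot] = a
        next_slot[r] += 1
    return row_ptr, csr_cols, csr_data


def spmv_gather(row_ptr, csr_cols, csr_data, x):
    """Each output element gathers its own inputs; no two rows write the same slot."""
    v = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += csr_data[k] * x[csr_cols[k]]
        v.append(acc)
    return v
```

In the gather form every write target has a single owner, so a parallel version would need no atomic updates on the output vector.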
4. Binning
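The slide gives only the name, so the following is one common reading of binning, offered as an assumption: sort input elements into bins that each cover a region of space, so that a later query inspects only the bins that can contain relevant data. The 1-D setting and the names `build_bins` and `neighbors_within` are hypothetical.

```python
from collections import defaultdict


def build_bins(points, cell):
    """Map each point index to the bin covering its coordinate."""
    bins = defaultdict(list)
    for idx, p in enumerate(points):
        bins[int(p // cell)].append(idx)
    return bins


def neighbors_within(points, bins, cell, query, cutoff):
    """Return indices of points within cutoff of query, scanning nearby bins only."""
    lo = int((query - cutoff) // cell)
    hi = int((query + cutoff) // cell)
    found = []
    for b in range(lo, hi + 1):
        for idx in bins.get(b, ()):
            if abs(points[idx] - query) <= cutoff:
                found.append(idx)
    return sorted(found)
```

With a small cutoff, a query touches a handful of bins instead of every point, which is the saving binning aims for.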
5. Regularization (Load Balancing)
6. Compaction
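The slide names the technique without detail. A common form is stream compaction, sketched here under that assumption: mark the elements worth keeping, use an exclusive prefix sum over the marks to give each kept element its output slot, then copy. The function name `compact` and the `keep` predicate are hypothetical.

```python
from itertools import accumulate


def compact(values, keep):
    """Pack the elements for which keep(v) is true into a dense output list."""
    flags = [1 if keep(v) else 0 for v in values]
    inclusive = list(accumulate(flags))
    # Exclusive prefix sum: offsets[i] is the output slot for element i.
    offsets = [s - f for s, f in zip(inclusive, flags)]
    total = inclusive[-1] if inclusive else 0
    out = [None] * total
    for i, v in enumerate(values):
        if flags[i]:
            out[offsets[i]] = v
    return out
```

Because each kept element knows its slot in advance, the final copy has no ordering dependence between elements, which is what makes this form friendly to parallel hardware.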
7. Data Layout Transformation
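A familiar instance of a data layout transformation is converting an array of structures (AoS) into a structure of arrays (SoA), so that accesses to one field touch consecutive memory. The sketch assumes that reading of the slide; the name `aos_to_soa` and the flat interleaved input format are choices made for the example.

```python
from array import array


def aos_to_soa(aos, num_fields):
    """Split a flat interleaved array [x0, y0, z0, x1, y1, z1, ...] by field."""
    # A strided slice of an array yields a new contiguous array for that field.
    return [aos[f::num_fields] for f in range(num_fields)]


# Two records with three fields each, stored interleaved.
aos = array("d", [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
xs, ys, zs = aos_to_soa(aos, 3)
```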
8. Granularity Coarsening
• Parallel execution often requires redundant work and coordination work
  – Merging multiple threads into one allows reuse of results, reducing redundancy
[Figure: essential and redundant work over time, comparing 4-way parallel with 2-way parallel execution]
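A small sketch of the idea, under assumed names: in the fine-grained version every task repeats a shared setup computation, while the coarsened version merges `group` tasks into one and computes the setup once. `fine_grained`, `coarsened`, and the `setup` callback are hypothetical, and the tasks run sequentially here purely to count work.

```python
def fine_grained(n, group, setup):
    """One task per output: each task recomputes the setup value for its group."""
    return [setup(i // group) + i for i in range(n)]


def coarsened(n, group, setup):
    """One task per group: the setup value is computed once and reused."""
    out = []
    for g in range(0, n, group):
        shared = setup(g // group)  # redundant work removed by merging tasks
        for i in range(g, min(g + group, n)):
            out.append(shared + i)
    return out
```

With `n = 8` and `group = 4`, the fine-grained version calls `setup` eight times and the coarsened version calls it twice, at the price of exposing only two independent tasks instead of eight.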
How much faster do applications really get each hardware generation?
Unoptimized Code Has Improved Drastically
• Orders of magnitude speedup in many cases
• Hardware does not solve all problems
  – Coalescing (lbm)
  – Highly contentious atomics (bfs)
Optimized Code Is Improving Faster than “Peak Performance”
• Caches capture locality that scratchpads can’t capture efficiently (spmv, stencil)
• Increased local storage capacity enables extra optimization (sad)
• Some benchmarks need atomic throughput more than flops (bfs, histo)
Optimization Still Matters
• Hardware never changes algorithmic complexity (cutcp)
• Caches do not solve layout problems for big data (lbm)
• Coarsening still makes a big difference (cutcp, sgemm)
• Many artificial performance cliffs are gone (sgemm, tpacf, mri-q)
Stuff we haven’t covered
• Good tools exist for profiling code beyond simple timing (cache misses, etc.). If you can’t find why a particular piece of code is taking so long, look into hardware performance counters.
• Patterns and practice
  – We covered some of the major patterns of optimization, but only the basic ones. Many optimization patterns are algorithmic.
Fill Out Evaluations!