Go Wrap up
Parallel Architectures
Chris Rossbach
cs378 Fall 2018
10/15/2018
Outline for Today
• Questions?
• Administrivia
• Agenda
  • Go
  • Parallel Architectures (GPU background)
• Rob Pike’s 2012 Go presentation is excellent, and I borrowed from it: https://talks.golang.org/2012/concurrency.slide
Faux Quiz questions
• How are promises and futures different from, or the same as, goroutines?
• What is the difference between a goroutine and a thread?
• What is the difference between a channel and a lock?
• How is a channel different from a concurrent FIFO?
• What is the CSP model?
• What are the tradeoffs between explicit vs implicit naming in message passing?
• What are the tradeoffs between blocking vs. non-blocking send/receive in a shared memory environment? In a distributed one?
• What is hardware multi-threading; what problem does it solve?
• What is the difference between a vector processor and a scalar?
• Implement a parallel scan or reduction
• How are GPU workloads different from GPGPU workloads?
• How does SIMD differ from SIMT?
• List and describe some pros and cons of vector/SIMD architectures.
• GPUs historically have elided cache coherence. Why? What impact does it have on the programmer?
• List some ways that GPUs use concurrency but not necessarily parallelism.
Google Search
• Workload:
• Accept query
• Return page of results (with ugh, ads)
• Get search results by sending query to
  • Web Search
  • Image Search
  • YouTube
  • Maps
  • News, etc.
• How to implement this?
Search 1.0
• Google function takes query and returns a slice of results (strings)
• Invokes Web, Image, Video search serially
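A minimal sketch of what Search 1.0 could look like, loosely following the shape of the example in Rob Pike's talk. The `Search` type, the `fakeSearch` helper, and the fixed latencies are assumptions made for illustration; real backends would be network calls. Because the three calls run one after another, the total latency is the sum of the three.

```go
package main

import (
	"fmt"
	"time"
)

// Search is a hypothetical backend: it takes a query and returns one result.
type Search func(query string) string

// fakeSearch builds a stand-in backend that sleeps for a fixed latency.
func fakeSearch(kind string, latency time.Duration) Search {
	return func(query string) string {
		time.Sleep(latency)
		return fmt.Sprintf("%s result for %q", kind, query)
	}
}

var (
	Web   = fakeSearch("web", 30*time.Millisecond)
	Image = fakeSearch("image", 20*time.Millisecond)
	Video = fakeSearch("video", 10*time.Millisecond)
)

// GoogleSerial queries each backend in turn, so the latencies add up.
func GoogleSerial(query string) []string {
	var results []string
	results = append(results, Web(query))
	results = append(results, Image(query))
	results = append(results, Video(query))
	return results
}

func main() {
	start := time.Now()
	results := GoogleSerial("golang")
	fmt.Println(results)
	fmt.Println("elapsed:", time.Since(start)) // at least 60ms with these latencies
}
```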
Search 2.0
• Run Web, Image, Video searches concurrently, wait for results
• No locks, conditions, callbacks
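A sketch of the concurrent version under the same assumptions (the `Search` type, `fakeSearch`, and the latencies are hypothetical stand-ins). Each backend runs in its own goroutine and sends its result on a shared channel; the caller receives three times. Results arrive in completion order, not call order.

```go
package main

import (
	"fmt"
	"time"
)

// Search is a hypothetical backend: it takes a query and returns one result.
type Search func(query string) string

// fakeSearch builds a stand-in backend that sleeps for a fixed latency.
func fakeSearch(kind string, latency time.Duration) Search {
	return func(query string) string {
		time.Sleep(latency)
		return fmt.Sprintf("%s result for %q", kind, query)
	}
}

var (
	Web   = fakeSearch("web", 30*time.Millisecond)
	Image = fakeSearch("image", 20*time.Millisecond)
	Video = fakeSearch("video", 10*time.Millisecond)
)

// GoogleConcurrent fans the query out to all backends and waits for every reply.
func GoogleConcurrent(query string) []string {
	c := make(chan string)
	go func() { c <- Web(query) }()
	go func() { c <- Image(query) }()
	go func() { c <- Video(query) }()

	var results []string
	for i := 0; i < 3; i++ {
		results = append(results, <-c) // blocks until some backend answers
	}
	return results
}

func main() {
	start := time.Now()
	fmt.Println(GoogleConcurrent("golang"))
	fmt.Println("elapsed:", time.Since(start)) // roughly the slowest backend, not the sum
}
```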
Search 2.1
• Don’t wait for slow servers: No locks, conditions, callbacks!
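One way to sketch "don't wait for slow servers" is a `select` over the result channel and a single `time.After` timer. The backends, latencies, and the `budget` constant below are assumptions for illustration. The channel is buffered here (a choice made in this sketch) so that a backend finishing after the deadline can still send and exit instead of blocking forever.

```go
package main

import (
	"fmt"
	"time"
)

// Search is a hypothetical backend: it takes a query and returns one result.
type Search func(query string) string

// fakeSearch builds a stand-in backend that sleeps for a fixed latency.
func fakeSearch(kind string, latency time.Duration) Search {
	return func(query string) string {
		time.Sleep(latency)
		return fmt.Sprintf("%s result for %q", kind, query)
	}
}

var (
	Web   = fakeSearch("web", 5*time.Millisecond)
	Image = fakeSearch("image", 10*time.Millisecond)
	Video = fakeSearch("video", 2*time.Second) // deliberately slow
)

// budget is an assumed overall deadline for the whole query.
const budget = 200 * time.Millisecond

// GoogleTimeout returns whatever results arrive before the limit expires.
func GoogleTimeout(query string, limit time.Duration) []string {
	c := make(chan string, 3) // buffered: late senders never block
	go func() { c <- Web(query) }()
	go func() { c <- Image(query) }()
	go func() { c <- Video(query) }()

	var results []string
	timeout := time.After(limit) // one timer for the whole loop
	for i := 0; i < 3; i++ {
		select {
		case r := <-c:
			results = append(results, r)
		case <-timeout:
			return results // give up on the stragglers
		}
	}
	return results
}

func main() {
	fmt.Println(GoogleTimeout("golang", budget))
}
```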
Search 3.0
• Reduce tail latency with replication. No locks, conditions, callbacks!
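A sketch of the replication idea: send the same query to several replicas of a backend and take whichever answers first. The `First` helper, the replica names, and their latencies are hypothetical. As in the timeout sketch, the channel is buffered so the losing replicas can finish and exit.

```go
package main

import (
	"fmt"
	"time"
)

// Search is a hypothetical backend: it takes a query and returns one result.
type Search func(query string) string

// fakeSearch builds a stand-in backend that sleeps for a fixed latency.
func fakeSearch(kind string, latency time.Duration) Search {
	return func(query string) string {
		time.Sleep(latency)
		return fmt.Sprintf("%s result for %q", kind, query)
	}
}

var (
	slowReplica = fakeSearch("web-replica-1", 2*time.Second)
	fastReplica = fakeSearch("web-replica-2", 5*time.Millisecond)
)

// First queries every replica concurrently and returns the earliest answer.
func First(query string, replicas ...Search) string {
	c := make(chan string, len(replicas)) // room for every replica's reply
	for _, r := range replicas {
		go func(s Search) { c <- s(query) }(r)
	}
	return <-c
}

func main() {
	fmt.Println(First("golang", slowReplica, fastReplica))
}
```

The slow replica no longer determines the response time, which is the tail-latency benefit the slide refers to; the cost is the extra work done by the replicas whose answers are discarded.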
Go: magic? …or threadpools and concurrent Qs?
• We’ve seen several abstractions for
  • Control flow/execution
  • Communication
• Lots of discussion of pros and cons
• Ultimately still CPUs + instructions
• Go: just sweeping issues under the language interface?
  • Why is it OK to have 100,000s of goroutines?
  • Why isn’t composition an issue?
Go implementation details
• M = “machine”: an OS thread
• P = (processing) context
• G = goroutine
• Each ‘P’ has a local queue of runnable goroutines
• Goroutine scheduling is cooperative (true of Go in 2018; Go 1.14 later added asynchronous preemption)
  • Switch out on complete or block
• Very lightweight (fibers!)
• Scheduler does work-stealing
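A small sketch that makes the M/P/G picture observable from user code. It starts many goroutines that all block on one channel, then asks the runtime how many goroutines exist. The `parkGoroutines` helper and the count of 10,000 are assumptions chosen to keep memory use modest; the `runtime` calls are standard library functions.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// parkGoroutines starts n goroutines that block on a channel, records how many
// goroutines the runtime reports while they are all parked, then releases them.
func parkGoroutines(n int) int {
	release := make(chan struct{})
	var started, done sync.WaitGroup
	started.Add(n)
	done.Add(n)
	for i := 0; i < n; i++ {
		go func() {
			started.Done()
			<-release // blocks: the scheduler parks this G and runs another
			done.Done()
		}()
	}
	started.Wait() // every goroutine has started and none can finish yet
	parked := runtime.NumGoroutine()
	close(release) // wake them all
	done.Wait()
	return parked
}

func main() {
	fmt.Println("CPUs:", runtime.NumCPU())
	fmt.Println("Ps (GOMAXPROCS):", runtime.GOMAXPROCS(0)) // 0 queries without changing
	fmt.Println("goroutines alive at once:", parkGoroutines(10000))
}
```

The number of Ps stays small (by default one per CPU) while the number of Gs is in the thousands, which is why blocked goroutines cost little more than their stacks.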
1000s of goroutines?
func testQ(consumers int) {
    startTimes["testQ"] = time.Now()
    var wg sync.WaitGroup
    wg.Add(consumers)
    ch := make(chan int)
    for i := 0; i < consumers; i++ {
        go func(id int) {
            aval, amore := <-ch
            if amore {
                info("reader #%d got %d value\n", id, aval)
            } else {
                info("channel reader #%d terminated with nothing.\n", id)
            }
            wg.Done()
        }(i)
    }
    time.Sleep(1000 * time.Millisecond)
    close(ch)
    wg.Wait()
    stopTimes["testQ"] = time.Now()
}
• Creates a channel
• Creates “consumers” goroutines
• Each of them tries to read from the channel
• Main either:
  • Sleeps for 1 second, closes the channel
  • sends “consumers” values
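A self-contained sketch of the experiment described above, covering both things main can do. The `runReaders` name and the counters are hypothetical, the timing and logging helpers from the slide are left out, and the one-second sleep is dropped so the example runs quickly. The second value returned by a receive (`ok`) distinguishes "got a value" from "channel was closed".

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// runReaders starts `consumers` goroutines that each perform one receive on an
// unbuffered channel. If send is true, main sends one value per reader before
// closing; otherwise it just closes the channel. It reports how many readers
// received a value and how many observed the close.
func runReaders(consumers int, send bool) (gotValue, sawClose int64) {
	var wg sync.WaitGroup
	wg.Add(consumers)
	ch := make(chan int)
	for i := 0; i < consumers; i++ {
		go func() {
			defer wg.Done()
			if _, ok := <-ch; ok {
				atomic.AddInt64(&gotValue, 1)
			} else {
				atomic.AddInt64(&sawClose, 1) // closed channel yields zero value, ok == false
			}
		}()
	}
	if send {
		for i := 0; i < consumers; i++ {
			ch <- i // each send is handed to exactly one blocked reader
		}
	}
	close(ch) // wakes every reader still waiting
	wg.Wait()
	return gotValue, sawClose
}

func main() {
	fmt.Println(runReaders(1000, false))
	fmt.Println(runReaders(1000, true))
}
```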
Channel implementation
• You can just read it:
  • https://golang.org/src/runtime/chan.go
• Some highlights:
  • Race detection built in
  • Fast path: just write to the receiver’s stack
  • Often has no capacity: a scheduler hint!
  • Buffered channel implementation fairly standard
• Transputers did this in hardware in the 1980s and 90s, btw.
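To give a feel for what a "fairly standard" buffered implementation means, here is a sketch of a bounded FIFO built from a ring buffer, a mutex, and two condition variables. This is not the runtime's code: `BoundedQueue` and its methods are hypothetical names, and the real channel implementation parks goroutines directly rather than using `sync.Cond`.

```go
package main

import (
	"fmt"
	"sync"
)

// BoundedQueue is a hypothetical fixed-capacity FIFO of ints.
type BoundedQueue struct {
	mu       sync.Mutex
	notEmpty *sync.Cond
	notFull  *sync.Cond
	buf      []int
	head     int // index of the oldest element
	count    int // number of elements currently stored
}

// NewBoundedQueue allocates a queue holding at most capacity elements.
func NewBoundedQueue(capacity int) *BoundedQueue {
	q := &BoundedQueue{buf: make([]int, capacity)}
	q.notEmpty = sync.NewCond(&q.mu)
	q.notFull = sync.NewCond(&q.mu)
	return q
}

// Send blocks while the queue is full, then appends v at the tail.
func (q *BoundedQueue) Send(v int) {
	q.mu.Lock()
	defer q.mu.Unlock()
	for q.count == len(q.buf) {
		q.notFull.Wait()
	}
	q.buf[(q.head+q.count)%len(q.buf)] = v
	q.count++
	q.notEmpty.Signal()
}

// Recv blocks while the queue is empty, then removes the oldest element.
func (q *BoundedQueue) Recv() int {
	q.mu.Lock()
	defer q.mu.Unlock()
	for q.count == 0 {
		q.notEmpty.Wait()
	}
	v := q.buf[q.head]
	q.head = (q.head + 1) % len(q.buf)
	q.count--
	q.notFull.Signal()
	return v
}

func main() {
	q := NewBoundedQueue(2)
	go func() {
		for i := 1; i <= 5; i++ {
			q.Send(i)
		}
	}()
	sum := 0
	for i := 0; i < 5; i++ {
		sum += q.Recv()
	}
	fmt.Println("sum:", sum) // prints "sum: 15"
}
```

Unlike this queue, a Go channel also supports `close`, participates in `select`, and in the unbuffered case hands the value directly from sender to receiver.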
A modern GPU: Volta V100
• 80 SMs
  • Streaming Multiprocessor (also: CU or ACE)
  • 64 cores/SM
  • 5120 threads! (80 × 64)
  • 15.7 TFLOPS
  • Roughly: all of k-means 1,000s of times/sec
• 640 Tensor cores
• HBM2 memory
  • 4096-bit bus
  • No cache coherence!
• 16 GB memory
  • PCIe-attached
How do you program a machine like this? pthread_create()?
GPUs: Outline
• Background from many areas
  • Architecture
    • Vector processors
    • Hardware multi-threading
  • Graphics
    • Graphics pipeline
    • Graphics programming models
  • Algorithms
    • parallel architectures → parallel algorithms
• Programming GPUs
  • CUDA
  • Basics: getting something working
  • Advanced: making it perform
![Page 51: Go Wrap up Parallel Architecturesrossbach/cs378h/...Google Search •Workload: •Accept query •Return page of results (with ugh, ads) •Get search results by sending query to •Web](https://reader034.fdocuments.net/reader034/viewer/2022051916/6007aa1236c20378da3fd1a1/html5/thumbnails/51.jpg)
Architecture Review: PipelinesProcessor algorithm:
main() {
while(true)
do_next_instruction();
}
![Page 52: Go Wrap up Parallel Architecturesrossbach/cs378h/...Google Search •Workload: •Accept query •Return page of results (with ugh, ads) •Get search results by sending query to •Web](https://reader034.fdocuments.net/reader034/viewer/2022051916/6007aa1236c20378da3fd1a1/html5/thumbnails/52.jpg)
Architecture Review: PipelinesProcessor algorithm:
main() {
while(true)
do_next_instruction();
}
do_next_instruction() {instruction = fetch();ops, regs = decode(instruction);execute_calc_addrs(ops, regs);access_memory(ops, regs);write_back(regs);
}
![Page 53: Go Wrap up Parallel Architecturesrossbach/cs378h/...Google Search •Workload: •Accept query •Return page of results (with ugh, ads) •Get search results by sending query to •Web](https://reader034.fdocuments.net/reader034/viewer/2022051916/6007aa1236c20378da3fd1a1/html5/thumbnails/53.jpg)
Architecture Review: PipelinesProcessor algorithm:
main() {
while(true)
do_next_instruction();
}
do_next_instruction() {instruction = fetch();ops, regs = decode(instruction);execute_calc_addrs(ops, regs);access_memory(ops, regs);write_back(regs);
}
main() { pthread_create(do_instructions);pthread_create(do_decode);pthread_create(do_execute);…pthread_join(…);…
}
![Page 54: Go Wrap up Parallel Architecturesrossbach/cs378h/...Google Search •Workload: •Accept query •Return page of results (with ugh, ads) •Get search results by sending query to •Web](https://reader034.fdocuments.net/reader034/viewer/2022051916/6007aa1236c20378da3fd1a1/html5/thumbnails/54.jpg)
Architecture Review: PipelinesProcessor algorithm:
main() {
while(true)
do_next_instruction();
}
do_instructions() {while(true) {
instruction = fetch();enqueue(DECODE, instruction);
}}
do_decode() {while(true) {
instruction = dequeue();ops, regs = decode(instruction); enqueue(EX, instruction);
}}
do_execute() {while(true) {
instruction = dequeue();execute_calc_addrs(ops, regs);enqueue(MEM, instruction);
}}
….
do_next_instruction() {instruction = fetch();ops, regs = decode(instruction);execute_calc_addrs(ops, regs);access_memory(ops, regs);write_back(regs);
}
main() { pthread_create(do_instructions);pthread_create(do_decode);pthread_create(do_execute);…pthread_join(…);…
}
![Page 55: Go Wrap up Parallel Architecturesrossbach/cs378h/...Google Search •Workload: •Accept query •Return page of results (with ugh, ads) •Get search results by sending query to •Web](https://reader034.fdocuments.net/reader034/viewer/2022051916/6007aa1236c20378da3fd1a1/html5/thumbnails/55.jpg)
Architecture Review: PipelinesProcessor algorithm:
main() {
while(true) {
do_next_instruction();
}
![Page 56: Go Wrap up Parallel Architecturesrossbach/cs378h/...Google Search •Workload: •Accept query •Return page of results (with ugh, ads) •Get search results by sending query to •Web](https://reader034.fdocuments.net/reader034/viewer/2022051916/6007aa1236c20378da3fd1a1/html5/thumbnails/56.jpg)
Architecture Review: PipelinesProcessor algorithm:
main() {
while(true) {
do_next_instruction();
}
do_next_instruction() {
instruction = fetch();
ops, regs = decode(instruction);
execute_calc_addrs(ops, regs);
access_memory(ops, regs);
write_back(regs);
}
![Page 57: Go Wrap up Parallel Architecturesrossbach/cs378h/...Google Search •Workload: •Accept query •Return page of results (with ugh, ads) •Get search results by sending query to •Web](https://reader034.fdocuments.net/reader034/viewer/2022051916/6007aa1236c20378da3fd1a1/html5/thumbnails/57.jpg)
Architecture Review: PipelinesProcessor algorithm:
main() {
while(true) {
do_next_instruction();
}
do_next_instruction() {
instruction = fetch();
ops, regs = decode(instruction);
execute_calc_addrs(ops, regs);
access_memory(ops, regs);
write_back(regs);
}
![Page 58: Go Wrap up Parallel Architecturesrossbach/cs378h/...Google Search •Workload: •Accept query •Return page of results (with ugh, ads) •Get search results by sending query to •Web](https://reader034.fdocuments.net/reader034/viewer/2022051916/6007aa1236c20378da3fd1a1/html5/thumbnails/58.jpg)
Architecture Review: PipelinesProcessor algorithm:
main() {
while(true) {
do_next_instruction();
}
do_next_instruction() {
instruction = fetch();
ops, regs = decode(instruction);
execute_calc_addrs(ops, regs);
access_memory(ops, regs);
write_back(regs);
}
![Page 59: Go Wrap up Parallel Architecturesrossbach/cs378h/...Google Search •Workload: •Accept query •Return page of results (with ugh, ads) •Get search results by sending query to •Web](https://reader034.fdocuments.net/reader034/viewer/2022051916/6007aa1236c20378da3fd1a1/html5/thumbnails/59.jpg)
Architecture Review: PipelinesProcessor algorithm:
main() {
while(true) {
do_next_instruction();
}
do_next_instruction() {
instruction = fetch();
ops, regs = decode(instruction);
execute_calc_addrs(ops, regs);
access_memory(ops, regs);
write_back(regs);
}
What is the name of this kind of parallelism?
![Page 60: Go Wrap up Parallel Architecturesrossbach/cs378h/...Google Search •Workload: •Accept query •Return page of results (with ugh, ads) •Get search results by sending query to •Web](https://reader034.fdocuments.net/reader034/viewer/2022051916/6007aa1236c20378da3fd1a1/html5/thumbnails/60.jpg)
Architecture Review: PipelinesProcessor algorithm:
main() {
while(true) {
do_next_instruction();
}
do_next_instruction() {
instruction = fetch();
ops, regs = decode(instruction);
execute_calc_addrs(ops, regs);
access_memory(ops, regs);
write_back(regs);
}
What is the name of this kind of parallelism?
Works well if pipeline is kept fullWhat kinds of things cause “bubbles”/stalls?
![Page 61: Go Wrap up Parallel Architecturesrossbach/cs378h/...Google Search •Workload: •Accept query •Return page of results (with ugh, ads) •Get search results by sending query to •Web](https://reader034.fdocuments.net/reader034/viewer/2022051916/6007aa1236c20378da3fd1a1/html5/thumbnails/61.jpg)
Architecture Review: PipelinesProcessor algorithm:
main() {
while(true) {
do_next_instruction();
}
do_next_instruction() {
instruction = fetch();
ops, regs = decode(instruction);
execute_calc_addrs(ops, regs);
access_memory(ops, regs);
write_back(regs);
}
What is the name of this kind of parallelism?
How can we get *more* parallelism?
Works well if pipeline is kept fullWhat kinds of things cause “bubbles”/stalls?
![Page 62: Go Wrap up Parallel Architecturesrossbach/cs378h/...Google Search •Workload: •Accept query •Return page of results (with ugh, ads) •Get search results by sending query to •Web](https://reader034.fdocuments.net/reader034/viewer/2022051916/6007aa1236c20378da3fd1a1/html5/thumbnails/62.jpg)
Architecture Review: PipelinesProcessor algorithm:
main() {
while(true) {
do_next_instruction();
}
do_next_instruction() {
instruction = fetch();
ops, regs = decode(instruction);
execute_calc_addrs(ops, regs);
access_memory(ops, regs);
write_back(regs);
}
What is the name of this kind of parallelism?
How can we get *more* parallelism?
Works well if pipeline is kept fullWhat kinds of things cause “bubbles”/stalls?
![Page 63: Go Wrap up Parallel Architecturesrossbach/cs378h/...Google Search •Workload: •Accept query •Return page of results (with ugh, ads) •Get search results by sending query to •Web](https://reader034.fdocuments.net/reader034/viewer/2022051916/6007aa1236c20378da3fd1a1/html5/thumbnails/63.jpg)
Architecture Review: Pipelines
Processor algorithm:
main() {
  while(true) {
    do_next_instruction();
  }
}
do_next_instruction() {
  instruction = fetch();
  ops, regs = decode(instruction);
  execute_calc_addrs(ops, regs);
  access_memory(ops, regs);
  write_back(regs);
}
What is the name of this kind of parallelism?
How can we get *more* parallelism?
Works well if the pipeline is kept full
What kinds of things cause “bubbles”/stalls?
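To make "works well if the pipeline is kept full" concrete, here is a minimal Go sketch of the cycle counts involved. It assumes an idealized pipeline in which every stage takes one cycle and each bubble costs exactly one extra cycle; the function names `unpipelinedCycles` and `pipelinedCycles` and the five-stage split are illustrative assumptions, not something taken from the slides.

```go
package main

import "fmt"

// unpipelinedCycles models a processor that finishes every stage of one
// instruction before it starts the next instruction.
func unpipelinedCycles(stages, n int) int {
	return stages * n
}

// pipelinedCycles models an ideal pipeline: the first instruction needs
// `stages` cycles, each later instruction completes one cycle after its
// predecessor, and every bubble (stall) adds one cycle.
func pipelinedCycles(stages, n, bubbles int) int {
	if n == 0 {
		return 0
	}
	return stages + (n - 1) + bubbles
}

func main() {
	const stages = 5 // assumed: fetch, decode, execute, memory, write-back
	for _, n := range []int{1, 10, 1000} {
		fmt.Printf("n=%4d  unpipelined=%5d  pipelined=%5d  one bubble per 5 instrs=%5d\n",
			n,
			unpipelinedCycles(stages, n),
			pipelinedCycles(stages, n, 0),
			pipelinedCycles(stages, n, n/5))
	}
}
```

For long instruction sequences the pipelined count approaches one instruction per cycle, and every bubble pushes it back up, which is why stalls matter so much.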
![Page 71: Go Wrap up Parallel Architecturesrossbach/cs378h/...Google Search •Workload: •Accept query •Return page of results (with ugh, ads) •Get search results by sending query to •Web](https://reader034.fdocuments.net/reader034/viewer/2022051916/6007aa1236c20378da3fd1a1/html5/thumbnails/71.jpg)
Multi-core/SMPs
main() {
  for(i=0; i<CORES; i++) {
    pthread_create(do_instructions);
  }
}
do_instructions() {
  while(true) {
    instruction = fetch();
    ops, regs = decode(instruction);
    execute_calc_addrs(ops, regs);
    access_memory(ops, regs);
    write_back(regs);
  }
}
Other techniques extract parallelism here, inside the per-core instruction loop, and try to let the machine find it
• Pros: Simple
• Cons: programmer has to find the parallelism!
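The "programmer has to find the parallelism" cost can be illustrated with a short Go sketch. The hardware only provides independent cores; splitting the work, giving each worker its own slot for results, and combining the partial results are all done by hand. The function `parallelSum` and its chunking scheme are hypothetical, chosen only for illustration.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// parallelSum splits data into one contiguous chunk per worker, sums each
// chunk in its own goroutine, and then combines the partial sums.
func parallelSum(data []int, workers int) int {
	if workers < 1 {
		workers = 1
	}
	partial := make([]int, workers) // one slot per worker, so no locking is needed
	chunk := (len(data) + workers - 1) / workers
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		lo := w * chunk
		if lo > len(data) {
			lo = len(data)
		}
		hi := lo + chunk
		if hi > len(data) {
			hi = len(data)
		}
		wg.Add(1)
		go func(w, lo, hi int) {
			defer wg.Done()
			for _, v := range data[lo:hi] {
				partial[w] += v
			}
		}(w, lo, hi)
	}
	wg.Wait()
	total := 0
	for _, p := range partial {
		total += p
	}
	return total
}

func main() {
	data := make([]int, 1000)
	for i := range data {
		data[i] = i + 1
	}
	// Sum of 1..1000, computed with one worker per available core.
	fmt.Println(parallelSum(data, runtime.NumCPU()))
}
```

Nothing in the sequential loop `for _, v := range data` would have run on more than one core without this restructuring.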
![Page 73: Go Wrap up Parallel Architecturesrossbach/cs378h/...Google Search •Workload: •Accept query •Return page of results (with ugh, ads) •Get search results by sending query to •Web](https://reader034.fdocuments.net/reader034/viewer/2022051916/6007aa1236c20378da3fd1a1/html5/thumbnails/73.jpg)
Superscalar processors
Remove extra instruction streams
![Page 80: Go Wrap up Parallel Architecturesrossbach/cs378h/...Google Search •Workload: •Accept query •Return page of results (with ugh, ads) •Get search results by sending query to •Web](https://reader034.fdocuments.net/reader034/viewer/2022051916/6007aa1236c20378da3fd1a1/html5/thumbnails/80.jpg)
Superscalar processors
main() {
  for(i=0; i<CORES; i++)
    pthread_create(decode_exec);
  while(true) {
    instruction = fetch();
    enqueue(instruction);
  }
}
decode_exec() {
  instruction = dequeue();
  ops, regs = decode(instruction);
  execute_calc_addrs(ops, regs);
  access_memory(ops, regs);
  write_back(regs);
}
Doesn’t look that different, does it? Why do it?
Enables parallelism among independent instructions.
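The emphasis on independent instructions can be seen in a small Go sketch of in-order multi-issue. It is a toy model under stated assumptions: every instruction has one-cycle latency, issue is strictly in program order, and only read-after-write dependences are tracked. The `instr` type and the `cyclesNeeded` function are hypothetical names introduced here.

```go
package main

import "fmt"

// instr is a toy instruction: it writes dst and reads srcs.
type instr struct {
	dst  string
	srcs []string
}

// cyclesNeeded issues up to width instructions per cycle, in program order.
// An instruction may not issue before the cycle after its producers issued.
func cyclesNeeded(prog []instr, width int) int {
	if len(prog) == 0 {
		return 0
	}
	if width < 1 {
		width = 1
	}
	doneAt := map[string]int{} // register -> first cycle its new value is readable
	cycle, used := 0, 0        // cycle being filled, issue slots used in it
	for _, in := range prog {
		earliest := cycle
		for _, s := range in.srcs {
			if c, ok := doneAt[s]; ok && c > earliest {
				earliest = c
			}
		}
		if earliest > cycle {
			cycle, used = earliest, 0 // wait for a producer
		} else if used == width {
			cycle, used = cycle+1, 0 // all issue slots taken this cycle
		}
		used++
		doneAt[in.dst] = cycle + 1
	}
	return cycle + 1
}

func main() {
	independent := []instr{
		{"r1", []string{"a"}}, {"r2", []string{"b"}},
		{"r3", []string{"c"}}, {"r4", []string{"d"}},
	}
	chain := []instr{
		{"r1", []string{"a"}}, {"r2", []string{"r1"}},
		{"r3", []string{"r2"}}, {"r4", []string{"r3"}},
	}
	for _, width := range []int{1, 2, 4} {
		fmt.Printf("width=%d independent=%d cycles, chain=%d cycles\n",
			width, cyclesNeeded(independent, width), cyclesNeeded(chain, width))
	}
}
```

Extra issue width shortens the independent program but does nothing for the dependence chain, which is the limit on how much parallelism the machine can find by itself.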
![Page 83: Go Wrap up Parallel Architecturesrossbach/cs378h/...Google Search •Workload: •Accept query •Return page of results (with ugh, ads) •Get search results by sending query to •Web](https://reader034.fdocuments.net/reader034/viewer/2022051916/6007aa1236c20378da3fd1a1/html5/thumbnails/83.jpg)
Vector/SIMD processors
Why decode the same instruction sequence over and over?
![Page 87: Go Wrap up Parallel Architecturesrossbach/cs378h/...Google Search •Workload: •Accept query •Return page of results (with ugh, ads) •Get search results by sending query to •Web](https://reader034.fdocuments.net/reader034/viewer/2022051916/6007aa1236c20378da3fd1a1/html5/thumbnails/87.jpg)
Vector/SIMD processors
main() {
  for(i=0; i<CORES; i++)
    pthread_create(exec);
  while(true) {
    ops, regs = fetch_decode();
    enqueue(ops, regs);
  }
}
exec() {
  ops, regs = dequeue();
  execute_calc_addrs(ops, regs);
  access_memory(ops, regs);
  write_back(regs);
}
Single instruction stream, multiple computations
But now all my instructions need multiple operands!
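A rough Go sketch of what "instructions need multiple operands" means: the instruction is decoded once, but its operands are whole vectors, and the same operation is applied to every lane. The `vectorInstr` type and `exec` function are hypothetical; the lane loop stands in for hardware lanes that would run at the same time.

```go
package main

import "fmt"

// vectorInstr is a hypothetical decoded instruction whose operands are
// whole vectors rather than single registers.
type vectorInstr struct {
	op   func(x, y float32) float32
	a, b []float32
}

// exec applies the single decoded operation to every lane. In hardware the
// lanes execute simultaneously; this loop only models the result.
func exec(in vectorInstr) []float32 {
	n := len(in.a)
	if len(in.b) < n {
		n = len(in.b)
	}
	out := make([]float32, n)
	for lane := 0; lane < n; lane++ {
		out[lane] = in.op(in.a[lane], in.b[lane])
	}
	return out
}

func main() {
	add := func(x, y float32) float32 { return x + y }
	in := vectorInstr{
		op: add,
		a:  []float32{1, 2, 3, 4},
		b:  []float32{10, 20, 30, 40},
	}
	fmt.Println(exec(in)) // one fetch/decode, four additions
}
```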
![Page 90: Go Wrap up Parallel Architecturesrossbach/cs378h/...Google Search •Workload: •Accept query •Return page of results (with ugh, ads) •Get search results by sending query to •Web](https://reader034.fdocuments.net/reader034/viewer/2022051916/6007aa1236c20378da3fd1a1/html5/thumbnails/90.jpg)
Vector Processors
• Process multiple data elements simultaneously.
• Common in supercomputers of the 1970s, 80s, and 90s.
• Modern CPUs support some vector processing instructions
  • Usually called SIMD
• Can operate on a few vector elements per clock cycle in a pipeline, or
  • SIMD: operate on all elements per clock cycle
• (1962) University of Illinois Illiac IV: completed 1972, 64 ALUs, 100-150 MFlops
• (1973) TI’s Advanced Scientific Computer (ASC): 20-80 MFlops
• (1975) Cray-1: first to have vector registers instead of keeping data in memory
Single instruction stream, multiple data
Programming model has to change
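One way the programming model changes is that loops get restructured around the vector width. The Go sketch below contrasts a scalar loop with a strip-mined version that handles `lanes` elements per step and then cleans up the remainder. The width of 4 and the function names are assumptions for illustration; real SIMD code would use compiler vectorization or intrinsics rather than an inner loop.

```go
package main

import "fmt"

const lanes = 4 // hypothetical vector width, in elements

// addScalar is the conventional loop: one element per step.
func addScalar(dst, a, b []float32) {
	for i := range dst {
		dst[i] = a[i] + b[i]
	}
}

// addStripMined processes `lanes` elements per vector step, then finishes
// the leftover elements one at a time. It assumes a and b are at least as
// long as dst, and returns how many vector and scalar steps it took.
func addStripMined(dst, a, b []float32) (vectorOps, scalarOps int) {
	n := len(dst)
	i := 0
	for ; i+lanes <= n; i += lanes {
		// Conceptually one vector instruction: all lanes at once.
		for l := 0; l < lanes; l++ {
			dst[i+l] = a[i+l] + b[i+l]
		}
		vectorOps++
	}
	for ; i < n; i++ { // remainder that does not fill a whole vector
		dst[i] = a[i] + b[i]
		scalarOps++
	}
	return vectorOps, scalarOps
}

func main() {
	a := make([]float32, 10)
	b := make([]float32, 10)
	for i := range a {
		a[i], b[i] = float32(i), float32(10*i)
	}
	dst := make([]float32, 10)
	v, s := addStripMined(dst, a, b)
	fmt.Println(dst, "vector steps:", v, "scalar steps:", s)
}
```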
![Page 94: Go Wrap up Parallel Architecturesrossbach/cs378h/...Google Search •Workload: •Accept query •Return page of results (with ugh, ads) •Get search results by sending query to •Web](https://reader034.fdocuments.net/reader034/viewer/2022051916/6007aa1236c20378da3fd1a1/html5/thumbnails/94.jpg)
Vector Processors
Implementation:
• Instruction fetch control logic shared
• Same instruction stream executed on
  • Multiple pipelines
  • Multiple different operands in parallel
GPUs: same basic idea
![Page 97: Go Wrap up Parallel Architecturesrossbach/cs378h/...Google Search •Workload: •Accept query •Return page of results (with ugh, ads) •Get search results by sending query to •Web](https://reader034.fdocuments.net/reader034/viewer/2022051916/6007aa1236c20378da3fd1a1/html5/thumbnails/97.jpg)
When does vector processing help?
What are the potential bottlenecks here?
When can it improve throughput?
Only helps if memory can keep the pipeline busy!
![Page 104: Go Wrap up Parallel Architecturesrossbach/cs378h/...Google Search •Workload: •Accept query •Return page of results (with ugh, ads) •Get search results by sending query to •Web](https://reader034.fdocuments.net/reader034/viewer/2022051916/6007aa1236c20378da3fd1a1/html5/thumbnails/104.jpg)
Hardware multi-threading
• Address memory bottleneck
• Share execution units across instruction streams
• Switch on stalls
• Looks like multiple cores to the OS
• Three variants:
  • Coarse
  • Fine-grain
  • Simultaneous
![Page 105: Go Wrap up Parallel Architecturesrossbach/cs378h/...Google Search •Workload: •Accept query •Return page of results (with ugh, ads) •Get search results by sending query to •Web](https://reader034.fdocuments.net/reader034/viewer/2022051916/6007aa1236c20378da3fd1a1/html5/thumbnails/105.jpg)
Running example
Thread A, Thread B, Thread C, Thread D
• Colors: pipeline full
• White: stall
![Page 112: Go Wrap up Parallel Architecturesrossbach/cs378h/...Google Search •Workload: •Accept query •Return page of results (with ugh, ads) •Get search results by sending query to •Web](https://reader034.fdocuments.net/reader034/viewer/2022051916/6007aa1236c20378da3fd1a1/html5/thumbnails/112.jpg)
Coarse-grained multithreading
• Single thread runs until a costly stall
  • E.g. 2nd level cache miss
• Another thread starts during stall
  • Pipeline fill time requires several cycles!
• Does not cover short stalls
• Hardware support required
  • PC and register file for each thread
  • Little other hardware
• Looks like another physical CPU to OS/software
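The behavior listed above can be sketched as a toy single-issue simulator in Go. The model is an assumption made for illustration: each thread is a list of extra stall cycles, one entry per instruction; a stall of at least `threshold` cycles triggers a switch; every switch costs `refill` bubble cycles; shorter stalls are simply waited out. The name `coarseGrained` is hypothetical.

```go
package main

import "fmt"

// coarseGrained runs one thread until it hits a long stall (or finishes),
// then switches to another ready thread at a cost of `refill` bubbles.
// threads[t][k] is the extra stall after thread t's k-th instruction.
func coarseGrained(threads [][]int, threshold, refill int) (cycles, bubbles int) {
	if threshold < 1 {
		threshold = 1
	}
	n := len(threads)
	pc := make([]int, n)      // next instruction per thread
	readyAt := make([]int, n) // first cycle each thread may issue again
	remaining := 0
	for _, t := range threads {
		remaining += len(t)
	}
	cur, cycle := 0, 0
	for remaining > 0 {
		done := pc[cur] == len(threads[cur])
		if done || readyAt[cur]-cycle >= threshold {
			switched := false
			for k := 1; k < n; k++ {
				t := (cur + k) % n
				if pc[t] < len(threads[t]) && readyAt[t] <= cycle {
					cur, switched = t, true
					break
				}
			}
			if switched {
				cycle += refill // pipeline must refill for the new thread
				bubbles += refill
			} else {
				cycle++ // nobody else can run: the stall is exposed
				bubbles++
			}
			continue
		}
		if readyAt[cur] > cycle { // short stall: not worth switching
			cycle++
			bubbles++
			continue
		}
		readyAt[cur] = cycle + 1 + threads[cur][pc[cur]]
		pc[cur]++
		remaining--
		cycle++
	}
	return cycle, bubbles
}

func main() {
	a := []int{0, 3, 0} // second instruction misses in cache: 3 stall cycles
	b := []int{0, 0, 0}
	fmt.Println(coarseGrained([][]int{a}, 2, 1))
	fmt.Println(coarseGrained([][]int{a, b}, 2, 1))
}
```

With a second thread available, the long stall is covered by useful work, but each switch still pays the refill penalty.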
![Page 119: Go Wrap up Parallel Architecturesrossbach/cs378h/...Google Search •Workload: •Accept query •Return page of results (with ugh, ads) •Get search results by sending query to •Web](https://reader034.fdocuments.net/reader034/viewer/2022051916/6007aa1236c20378da3fd1a1/html5/thumbnails/119.jpg)
Fine-grained multithreading
• Threads interleave instructions
  • Round-robin
  • Skip stalled threads
• Hardware support required
  • Separate PC and register file per thread
  • Hardware to control alternating pattern
• Naturally hides delays
  • Data hazards, cache misses
  • Pipeline runs with rare stalls
• Doesn’t make full use of multi-issue
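A toy Go simulation of the round-robin policy, under assumptions made for illustration: the pipeline issues one instruction per cycle, switching threads is free, and each thread is a list of extra stall cycles, one entry per instruction. The function name `fineGrained` is hypothetical.

```go
package main

import "fmt"

// fineGrained issues one instruction per cycle, rotating round-robin over
// the threads and skipping any thread that is stalled or finished.
// threads[t][k] is the extra stall after thread t's k-th instruction.
func fineGrained(threads [][]int) (cycles, bubbles int) {
	n := len(threads)
	pc := make([]int, n)      // next instruction per thread
	readyAt := make([]int, n) // first cycle each thread may issue again
	remaining := 0
	for _, t := range threads {
		remaining += len(t)
	}
	next, cycle := 0, 0
	for remaining > 0 {
		issued := false
		for k := 0; k < n; k++ {
			t := (next + k) % n
			if pc[t] < len(threads[t]) && readyAt[t] <= cycle {
				readyAt[t] = cycle + 1 + threads[t][pc[t]]
				pc[t]++
				remaining--
				next = (t + 1) % n // resume the rotation after this thread
				issued = true
				break
			}
		}
		if !issued {
			bubbles++ // every unfinished thread is stalled this cycle
		}
		cycle++
	}
	return cycle, bubbles
}

func main() {
	a := []int{0, 3, 0} // second instruction stalls thread A for 3 cycles
	b := []int{0, 0, 0}
	fmt.Println(fineGrained([][]int{a}))
	fmt.Println(fineGrained([][]int{a, b}))
}
```

Because only one instruction issues per cycle, a multi-issue core would still leave slots unused, which is the last bullet's point.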
![Page 125: Go Wrap up Parallel Architecturesrossbach/cs378h/...Google Search •Workload: •Accept query •Return page of results (with ugh, ads) •Get search results by sending query to •Web](https://reader034.fdocuments.net/reader034/viewer/2022051916/6007aa1236c20378da3fd1a1/html5/thumbnails/125.jpg)
Simultaneous Multithreading (SMT)
• Instructions from multiple threads issued on same cycle
  • Uses register renaming
  • Uses dynamic scheduling facility of multi-issue architecture
• Hardware support:
  • Register files, PCs per thread
  • Temporary result registers pre commit
  • Support to sort out which threads get results from which instructions
• Maximal utilization of execution units
(Figure annotations: “Skip A”, “Skip C”)
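As a rough sketch of same-cycle issue from several threads, the Go model below fills up to `width` issue slots per cycle with instructions from different ready threads. It is a deliberate simplification: each thread contributes at most one instruction per cycle, and register renaming and dynamic scheduling are not modeled. The name `smt` and the stall-list encoding are assumptions.

```go
package main

import "fmt"

// smt fills up to `width` issue slots per cycle, taking at most one
// instruction from each ready thread, and counts the slots left empty.
// threads[t][k] is the extra stall after thread t's k-th instruction.
func smt(threads [][]int, width int) (cycles, emptySlots int) {
	if width < 1 {
		width = 1
	}
	n := len(threads)
	pc := make([]int, n)      // next instruction per thread
	readyAt := make([]int, n) // first cycle each thread may issue again
	remaining := 0
	for _, t := range threads {
		remaining += len(t)
	}
	next, cycle := 0, 0
	for remaining > 0 {
		used, start := 0, next
		for k := 0; k < n && used < width; k++ {
			t := (start + k) % n
			if pc[t] < len(threads[t]) && readyAt[t] <= cycle {
				readyAt[t] = cycle + 1 + threads[t][pc[t]]
				pc[t]++
				remaining--
				used++
				next = (t + 1) % n
			}
		}
		emptySlots += width - used
		cycle++
	}
	return cycle, emptySlots
}

func main() {
	a := []int{0, 3, 0} // second instruction stalls thread A for 3 cycles
	b := []int{0, 0, 0}
	fmt.Println(smt([][]int{a, b}, 1)) // one slot: behaves like fine-grained
	fmt.Println(smt([][]int{a, b}, 2)) // two slots: A and B issue together
}
```

In this model the two-wide run finishes in the time of thread A's own critical path; more threads would be needed to fill the slots that remain empty.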
![Page 126: Go Wrap up Parallel Architecturesrossbach/cs378h/...Google Search •Workload: •Accept query •Return page of results (with ugh, ads) •Get search results by sending query to •Web](https://reader034.fdocuments.net/reader034/viewer/2022051916/6007aa1236c20378da3fd1a1/html5/thumbnails/126.jpg)
Why Vector and Multithreading Background?
GPU:
• A very wide vector machine
• Massively multi-threaded to hide memory latency
• Originally designed for graphics pipelines…
![Page 127: Go Wrap up Parallel Architecturesrossbach/cs378h/...Google Search •Workload: •Accept query •Return page of results (with ugh, ads) •Get search results by sending query to •Web](https://reader034.fdocuments.net/reader034/viewer/2022051916/6007aa1236c20378da3fd1a1/html5/thumbnails/127.jpg)
Graphics ~= Rendering
35 10/30/2018
![Page 131: Go Wrap up Parallel Architecturesrossbach/cs378h/...Google Search •Workload: •Accept query •Return page of results (with ugh, ads) •Get search results by sending query to •Web](https://reader034.fdocuments.net/reader034/viewer/2022051916/6007aa1236c20378da3fd1a1/html5/thumbnails/131.jpg)
Graphics ~= Rendering
Inputs• 3D world model(objects, materials)
• Geometry modeled w triangle meshes, surface normals• GPUs subdivide triangles into “fragments” (rasterization)• Materials modeled with “textures”• Texture coordinates, sampling “map” textures
geometry
• Light locations and properties• Attempt to model surtface/light interactions with
modeled objects/materials
• View point
3510/30/2018
![Page 132: Go Wrap up Parallel Architecturesrossbach/cs378h/...Google Search •Workload: •Accept query •Return page of results (with ugh, ads) •Get search results by sending query to •Web](https://reader034.fdocuments.net/reader034/viewer/2022051916/6007aa1236c20378da3fd1a1/html5/thumbnails/132.jpg)
Graphics ~= Rendering
Inputs• 3D world model(objects, materials)
• Geometry modeled w triangle meshes, surface normals• GPUs subdivide triangles into “fragments” (rasterization)• Materials modeled with “textures”• Texture coordinates, sampling “map” textures
geometry
• Light locations and properties• Attempt to model surtface/light interactions with
modeled objects/materials
• View point
Output
3510/30/2018
![Page 133: Go Wrap up Parallel Architecturesrossbach/cs378h/...Google Search •Workload: •Accept query •Return page of results (with ugh, ads) •Get search results by sending query to •Web](https://reader034.fdocuments.net/reader034/viewer/2022051916/6007aa1236c20378da3fd1a1/html5/thumbnails/133.jpg)
Graphics ~= Rendering
Inputs
• 3D world model (objects, materials)
  • Geometry modeled with triangle meshes, surface normals
  • GPUs subdivide triangles into “fragments” (rasterization)
  • Materials modeled with “textures”
  • Texture coordinates, sampling “map” textures onto geometry
• Light locations and properties
  • Attempt to model surface/light interactions with modeled objects/materials
• View point
Output
• 2D projection seen from the view-point
![Page 144: Go Wrap up Parallel Architecturesrossbach/cs378h/...Google Search •Workload: •Accept query •Return page of results (with ugh, ads) •Get search results by sending query to •Web](https://reader034.fdocuments.net/reader034/viewer/2022051916/6007aa1236c20378da3fd1a1/html5/thumbnails/144.jpg)
Grossly over-simplified rendering algorithm
foreach (vertex v in model)
    map v_model → v_view
fragment[] frags = {};
foreach triangle t (v0, v1, v2)
    frags.add(rasterize(t));
foreach fragment f in frags
    choose_color(f);
display(visible_fragments(frags));
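To make the stages of the over-simplified algorithm concrete, here is a toy software version in Python. It is only a sketch under several assumptions that are not in the slides: the model-to-view mapping is a plain translation by a hypothetical `eye` position, each triangle carries one flat color (so `choose_color` is trivial), depth is averaged per triangle, and a smaller depth is treated as nearer. All function names other than `rasterize` and `visible_fragments` are illustrative.

```python
def to_view(v, eye):
    """Map a model-space vertex into view space (toy version: translate by eye)."""
    return (v[0] - eye[0], v[1] - eye[1], v[2] - eye[2])


def edge(a, b, p):
    """Signed test telling which side of the 2D edge a->b the point p lies on."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def rasterize(tri, color, width, height):
    """Emit one fragment (x, y, depth, color) per pixel whose center is inside tri."""
    v0, v1, v2 = tri
    depth = (v0[2] + v1[2] + v2[2]) / 3.0  # flat per-triangle depth (simplification)
    frags = []
    for y in range(height):
        for x in range(width):
            p = (x + 0.5, y + 0.5)  # pixel center
            w = (edge(v1, v2, p), edge(v2, v0, p), edge(v0, v1, p))
            # Inside if p is on the same side of all three edges.
            if all(e >= 0 for e in w) or all(e <= 0 for e in w):
                frags.append((x, y, depth, color))
    return frags


def visible_fragments(frags):
    """Hidden-surface elimination: keep the nearest fragment at each pixel."""
    nearest = {}
    for x, y, depth, color in frags:
        if (x, y) not in nearest or depth < nearest[(x, y)][0]:
            nearest[(x, y)] = (depth, color)
    return nearest


def render(triangles, eye, width, height):
    """Run the vertex, rasterization, and visibility stages; return text rows."""
    frags = []
    for verts, color in triangles:
        tri = [to_view(v, eye) for v in verts]              # per-vertex work
        frags.extend(rasterize(tri, color, width, height))  # per-triangle work
    visible = visible_fragments(frags)                      # per-pixel work
    return ["".join(visible[(x, y)][1] if (x, y) in visible else "."
                    for x in range(width))
            for y in range(height)]


# Two overlapping triangles; "B" is nearer (depth 1) than "A" (depth 5).
scene = [
    (((0, 0, 5), (4, 0, 5), (0, 4, 5)), "A"),
    (((0, 0, 1), (2, 0, 1), (0, 2, 1)), "B"),
]
image = render(scene, eye=(0, 0, 0), width=4, height=4)
for row in image:
    print(row)
```

Each loop in the sketch iterates over independent items (vertices, triangles, pixels), which is the data parallelism the later slides attribute to the graphics pipeline.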
![Page 146: Go Wrap up Parallel Architecturesrossbach/cs378h/...Google Search •Workload: •Accept query •Return page of results (with ugh, ads) •Get search results by sending query to •Web](https://reader034.fdocuments.net/reader034/viewer/2022051916/6007aa1236c20378da3fd1a1/html5/thumbnails/146.jpg)
Algorithm → Graphics Pipeline
foreach (vertex v in model)
    map v_model → v_view
fragment[] frags = {};
foreach triangle t (v0, v1, v2)
    frags.add(rasterize(t));
foreach fragment f in frags
    choose_color(f);
display(visible_fragments(frags));
OpenGL pipeline
To first order, DirectX looks the same!
![Page 151: Go Wrap up Parallel Architecturesrossbach/cs378h/...Google Search •Workload: •Accept query •Return page of results (with ugh, ads) •Get search results by sending query to •Web](https://reader034.fdocuments.net/reader034/viewer/2022051916/6007aa1236c20378da3fd1a1/html5/thumbnails/151.jpg)
Graphics pipeline → GPU architecture
GeForce 6 series
Limited “programmability” of shaders:
• Minimal/no control flow
• Maximum instruction count
![Page 156: Go Wrap up Parallel Architecturesrossbach/cs378h/...Google Search •Workload: •Accept query •Return page of results (with ugh, ads) •Get search results by sending query to •Web](https://reader034.fdocuments.net/reader034/viewer/2022051916/6007aa1236c20378da3fd1a1/html5/thumbnails/156.jpg)
Late Modernity: unified shaders
Mapping to graphics pipeline no longer apparent
Processing elements no longer specialized to a particular role
Model supports real control flow, larger instruction count
![Page 157: Go Wrap up Parallel Architecturesrossbach/cs378h/...Google Search •Workload: •Accept query •Return page of results (with ugh, ads) •Get search results by sending query to •Web](https://reader034.fdocuments.net/reader034/viewer/2022051916/6007aa1236c20378da3fd1a1/html5/thumbnails/157.jpg)
Mostly Modern: Pascal
![Page 158: Go Wrap up Parallel Architecturesrossbach/cs378h/...Google Search •Workload: •Accept query •Return page of results (with ugh, ads) •Get search results by sending query to •Web](https://reader034.fdocuments.net/reader034/viewer/2022051916/6007aa1236c20378da3fd1a1/html5/thumbnails/158.jpg)
Definitely Modern: Turing
![Page 159: Go Wrap up Parallel Architecturesrossbach/cs378h/...Google Search •Workload: •Accept query •Return page of results (with ugh, ads) •Get search results by sending query to •Web](https://reader034.fdocuments.net/reader034/viewer/2022051916/6007aa1236c20378da3fd1a1/html5/thumbnails/159.jpg)
Modern Enough: Pascal SM
![Page 162: Go Wrap up Parallel Architecturesrossbach/cs378h/...Google Search •Workload: •Accept query •Return page of results (with ugh, ads) •Get search results by sending query to •Web](https://reader034.fdocuments.net/reader034/viewer/2022051916/6007aa1236c20378da3fd1a1/html5/thumbnails/162.jpg)
Cross-generational observations
GPUs designed for parallelism in graphics pipeline:
• Data
  • Per-vertex
  • Per-fragment
  • Per-pixel
• Task
  • Vertex processing
  • Fragment processing
  • Rasterization
  • Hidden-surface elimination
• MLP (memory-level parallelism)
  • HW multi-threading for hiding memory latency
Even as GPU architectures become more general, certain assumptions persist:
1. Data parallelism is trivially exposed
2. All problems look like painting a box with colored dots
But what if my problem isn’t painting a box?!!?!
![Page 163: Go Wrap up Parallel Architecturesrossbach/cs378h/...Google Search •Workload: •Accept query •Return page of results (with ugh, ads) •Get search results by sending query to •Web](https://reader034.fdocuments.net/reader034/viewer/2022051916/6007aa1236c20378da3fd1a1/html5/thumbnails/163.jpg)
The big ideas still present in GPUs
• Simple cores
• Single instruction stream
  • Vector instructions (SIMD) OR
  • Implicit HW-managed sharing (SIMT)
• Hide memory latency with HW multi-threading
![Page 164: Go Wrap up Parallel Architecturesrossbach/cs378h/...Google Search •Workload: •Accept query •Return page of results (with ugh, ads) •Get search results by sending query to •Web](https://reader034.fdocuments.net/reader034/viewer/2022051916/6007aa1236c20378da3fd1a1/html5/thumbnails/164.jpg)
Programming Model
• GPUs are I/O devices, managed by user-code
• “kernels” == “shader programs”
• 1000s of HW-scheduled threads per kernel
• Threads grouped into independent blocks.
  • Threads in a block can synchronize (barrier)
  • This is the *only* synchronization
• “Grid” == “launch” == “invocation” of a kernel
  • a group of blocks (or warps)
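The grid/block/thread hierarchy can be illustrated with a small Python simulation. This is a sketch, not a GPU API: `launch` and `scale_kernel` are hypothetical names, the loops run serially whereas hardware schedules the threads concurrently, and the global index formula (block index times block size plus thread index) follows the usual CUDA convention, which is an assumption beyond what the slide states.

```python
def launch(kernel, num_blocks, threads_per_block, *args):
    """Simulate one kernel launch (a grid): run the kernel once per thread.

    Real hardware runs these threads concurrently; the nested loops here
    only enumerate the same (block, thread) index space.
    """
    for block_idx in range(num_blocks):
        for thread_idx in range(threads_per_block):
            kernel(block_idx, thread_idx, threads_per_block, *args)


def scale_kernel(block_idx, thread_idx, block_dim, data, out, factor):
    """Each thread handles at most one element, chosen from its indices."""
    i = block_idx * block_dim + thread_idx  # global index (CUDA-style convention)
    if i < len(data):                       # guard: the grid may overshoot the data
        out[i] = data[i] * factor


data = [1, 2, 3, 4, 5]
out = [0] * len(data)
# 2 blocks of 4 threads give 8 threads for 5 elements; 3 threads do nothing.
launch(scale_kernel, 2, 4, data, out, 10)
print(out)
```

Because blocks are independent, nothing in `scale_kernel` relies on the order in which blocks or threads run.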
![Page 166: Go Wrap up Parallel Architecturesrossbach/cs378h/...Google Search •Workload: •Accept query •Return page of results (with ugh, ads) •Get search results by sending query to •Web](https://reader034.fdocuments.net/reader034/viewer/2022051916/6007aa1236c20378da3fd1a1/html5/thumbnails/166.jpg)
Parallel Algorithms
• Sequential algorithms often do not permit easy parallelization
  • Does not mean the work has no parallelism
  • A different approach can yield parallelism
  • but often changes the algorithm
  • Parallelizing != just adding locks to a sequential algorithm
• Parallel Patterns
  • Map
  • Scatter, Gather
  • Reduction
  • Scan
  • Search, Sort
If you can express your algorithm using these patterns, an apparently fundamentally sequential algorithm can be made parallel
![Page 168: Go Wrap up Parallel Architecturesrossbach/cs378h/...Google Search •Workload: •Accept query •Return page of results (with ugh, ads) •Get search results by sending query to •Web](https://reader034.fdocuments.net/reader034/viewer/2022051916/6007aa1236c20378da3fd1a1/html5/thumbnails/168.jpg)
Map
• Inputs
  • Array A
  • Function f(x)
• map(A, f) → apply f(x) on all elements in A
• Parallelism trivially exposed
  • f(x) can be applied in parallel to all elements, in principle
for(i=0; i<numPoints; i++) {
    labels[i] = findNearestCenter(points[i]);
}
map(points, findNearestCenter)
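A minimal Python sketch of the same map follows. The slide does not define `findNearestCenter`, so the version here (squared Euclidean distance to a list of hypothetical `centers`) is an assumption, as is the use of a thread pool to stand in for parallel hardware.

```python
from concurrent.futures import ThreadPoolExecutor


def find_nearest_center(point, centers):
    """Return the index of the center closest to point (squared Euclidean distance)."""
    def dist2(c):
        return sum((p - q) ** 2 for p, q in zip(point, centers[c]))
    return min(range(len(centers)), key=dist2)


def parallel_map(f, items, workers=4):
    """Apply f to every item; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(f, items))


points = [(0, 0), (10, 10), (1, 1)]
centers = [(0, 0), (9, 9)]

# No call to f depends on any other call, so the work can be split freely.
labels = parallel_map(lambda p: find_nearest_center(p, centers), points)
print(labels)
```

The sequential loop and the map produce the same `labels`; the map form simply makes the independence of the iterations explicit.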
![Page 170: Go Wrap up Parallel Architecturesrossbach/cs378h/...Google Search •Workload: •Accept query •Return page of results (with ugh, ads) •Get search results by sending query to •Web](https://reader034.fdocuments.net/reader034/viewer/2022051916/6007aa1236c20378da3fd1a1/html5/thumbnails/170.jpg)
Scatter and Gather
• Gather:
  • Read items from multiple (indexed) locations into one contiguous array
• Scatter:
  • Write items from one contiguous array to multiple (indexed) locations
for (i=0; i<N; ++i)
    x[i] = y[idx[i]];
gather(x, y, idx)
for (i=0; i<N; ++i)
    y[idx[i]] = x[i];
scatter(x, y, idx)
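The two loops translate directly into Python. This sketch returns new lists rather than writing into caller-supplied arrays as the slide's `gather(x, y, idx)` and `scatter(x, y, idx)` do, which is a design choice made here for clarity; the `size` parameter is likewise an addition.

```python
def gather(y, idx):
    """x[i] = y[idx[i]]: indexed reads, contiguous writes."""
    return [y[j] for j in idx]


def scatter(x, idx, size):
    """y[idx[i]] = x[i]: contiguous reads, indexed writes."""
    y = [None] * size
    for i, j in enumerate(idx):
        y[j] = x[i]
    return y


y = [10, 20, 30, 40]
idx = [3, 0, 2, 1]

x = gather(y, idx)
print(x)

# idx is a permutation, so scattering with the same indices undoes the gather.
restored = scatter(x, idx, len(y))
print(restored)
```

If `idx` contains duplicates, a parallel scatter has conflicting writes to the same location, so the result depends on which write lands last; gather has no such hazard because it only reads.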
![Page 173: Go Wrap up Parallel Architecturesrossbach/cs378h/...Google Search •Workload: •Accept query •Return page of results (with ugh, ads) •Get search results by sending query to •Web](https://reader034.fdocuments.net/reader034/viewer/2022051916/6007aa1236c20378da3fd1a1/html5/thumbnails/173.jpg)
Reduce
• Input
  • Associative operator op
  • Ordered set s = [a, b, c, … z]
• Reduce(op, s) returns a op b op c … op z
for(i=0; i<N; ++i) {
    accum += (point[i]*point[i]);
}
accum = reduce(+, map(point, square))   // square(x) = x*x
Why must op be associative?
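One way to see why associativity matters: a parallel reduction combines elements pairwise in a tree, so the grouping of operations differs from the left-to-right loop. The Python sketch below (`tree_reduce` is an illustrative name, and the levels run serially here) gives the same answer as a sequential fold for `+`, but not for subtraction, which is not associative.

```python
import functools
import operator


def tree_reduce(op, s):
    """Reduce by combining neighbors pairwise, level by level.

    Every pair within one level is independent, so a parallel machine
    could evaluate a whole level at once (about log2(n) levels in total).
    """
    items = list(s)
    if not items:
        raise ValueError("tree_reduce needs at least one element")
    while len(items) > 1:
        nxt = [op(items[i], items[i + 1]) for i in range(0, len(items) - 1, 2)]
        if len(items) % 2:           # odd element out is carried to the next level
            nxt.append(items[-1])
        items = nxt
    return items[0]


values = [1, 2, 3, 4, 5]

# Associative op: tree grouping and left-to-right grouping agree.
tree_sum = tree_reduce(operator.add, values)
seq_sum = functools.reduce(operator.add, values)

# The slide's loop, written as a map (square) followed by a reduce (+).
sum_of_squares = tree_reduce(operator.add, [v * v for v in values])

# Non-associative op: the two groupings disagree.
tree_diff = tree_reduce(operator.sub, values)
seq_diff = functools.reduce(operator.sub, values)

print(tree_sum, seq_sum, sum_of_squares, tree_diff, seq_diff)
```

Floating-point addition is only approximately associative, so a tree reduction over floats can differ slightly from the sequential loop.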
![Page 175: Go Wrap up Parallel Architecturesrossbach/cs378h/...Google Search •Workload: •Accept query •Return page of results (with ugh, ads) •Get search results by sending query to •Web](https://reader034.fdocuments.net/reader034/viewer/2022051916/6007aa1236c20378da3fd1a1/html5/thumbnails/175.jpg)
Scan (prefix sum)
• Input
  • Associative operator op
  • Ordered set s = [a, b, c, … z]
  • Identity I
• scan(op, s) = [I, a, (a op b), (a op b op c) …]
• Scan is the workhorse of parallel algorithms:
  • Sort, histograms, sparse matrix, string compare, …
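The definition above is an exclusive scan: output element i combines everything before position i. The Python sketch below shows it next to a step-doubling inclusive scan in the style usually attributed to Hillis and Steele. Naming that variant is my addition, not the slide's, and each step is simulated serially here even though all of its element updates are independent.

```python
import operator


def exclusive_scan(op, s, identity):
    """Sequential reference: [I, a, a op b, a op b op c, ...]."""
    out = []
    acc = identity
    for x in s:
        out.append(acc)
        acc = op(acc, x)
    return out


def step_doubling_inclusive_scan(op, s):
    """Inclusive scan in about log2(n) steps.

    In each step, every element at index i >= step combines with the
    element `step` positions to its left; all updates in a step read the
    previous step's values, so they could run in parallel.
    """
    cur = list(s)
    step = 1
    while step < len(cur):
        cur = [cur[i] if i < step else op(cur[i - step], cur[i])
               for i in range(len(cur))]
        step *= 2
    return cur


s = [3, 1, 4, 1, 5]
exclusive = exclusive_scan(operator.add, s, 0)
inclusive = step_doubling_inclusive_scan(operator.add, s)
print(exclusive)
print(inclusive)
```

Shifting the inclusive result right by one and inserting the identity at the front gives the exclusive result, which is why either form can serve as the building block for the algorithms listed above.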
![Page 176: Go Wrap up Parallel Architecturesrossbach/cs378h/...Google Search •Workload: •Accept query •Return page of results (with ugh, ads) •Get search results by sending query to •Web](https://reader034.fdocuments.net/reader034/viewer/2022051916/6007aa1236c20378da3fd1a1/html5/thumbnails/176.jpg)
Summary
• Re-expressing apparently sequential algorithms as combinations of parallel patterns is a common technique when targeting GPUs