Lecture3
CS-416 Parallel and Distributed Systems
Jawwad Shamsi
Lecture #3
20th January 2010
Announcement
• Possible name change to
  – High Performance Computing
Recap
• Pipelining
• Vector Instruction
• Super-Scalar Execution
Super-Scalar Execution
Dependencies
• Data Dependency
• Resource Dependency
• Branch Dependency
Dynamic Instruction Issue
• 3rd Segment
  – Processor needs capability of
    • Out-of-order sequencing
Limitations of Memory Systems
• Latency
• Bandwidth
Effect of Latency - Example
• 1 GHz processor (1 ns cycle)
  – 100 ns memory latency
  – Two multiply-add units
• Four instructions in each cycle of 1 ns
  – Peak rating
    • 4 GFLOPS
    • Memory latency is 100 cycles
    • Block size is one word
    • Processor must wait 100 cycles before it can process the data
  – Achievable speed is limited to 1 floating-point operation per 100 ns
  – 10 MFLOPS
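The arithmetic on this slide can be checked with a short C sketch. The function names and parameters are illustrative only, and the sketch assumes one floating-point operation per fetched word with no overlap between memory access and computation.

```c
#include <assert.h>

/* Peak rate in MFLOPS: clock rate (MHz) times FLOPs issued per cycle. */
double peak_mflops(double clock_mhz, double flops_per_cycle) {
    assert(clock_mhz > 0.0 && flops_per_cycle > 0.0);
    return clock_mhz * flops_per_cycle;
}

/* Rate in MFLOPS when every fetch stalls the processor for latency_ns.
   Assumption (not from the slides): flops_per_fetch FLOPs are done per
   fetched word and nothing overlaps with the memory access. */
double latency_bound_mflops(double latency_ns, double flops_per_fetch) {
    assert(latency_ns > 0.0);
    /* 1 FLOP per ns equals 1000 MFLOPS. */
    return 1000.0 * flops_per_fetch / latency_ns;
}
```

With the slide's numbers, `peak_mflops(1000.0, 4.0)` gives 4000 MFLOPS (4 GFLOPS), while `latency_bound_mflops(100.0, 1.0)` gives 10 MFLOPS.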
Effect of Bandwidth
• Processor: 1 GHz
• DRAM with 100-cycle latency
• Block size is one word, so the processor takes 100 cycles to fetch each word
• Therefore, the algorithm performs one FLOP every 100 cycles, for a peak speed of 10 MFLOPS
• Increase block size?
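To see what a larger block could buy, the following C sketch generalizes the calculation above. It is a rough model under stated assumptions: each block fetch costs the full latency, every word in the block is used (unit-stride access), and one FLOP is performed per word. The function name and parameters are hypothetical.

```c
#include <assert.h>

/* Effective MFLOPS when one block of words_per_block words arrives per
   latency_cycles. Assumptions (illustrative, not from the slides): every
   fetched word is used, and flops_per_word FLOPs are done per word. */
double effective_mflops(double clock_mhz, int latency_cycles,
                        int words_per_block, double flops_per_word) {
    assert(clock_mhz > 0.0 && latency_cycles > 0 && words_per_block > 0);
    /* Average cycles spent waiting per useful word. */
    double cycles_per_word = (double)latency_cycles / words_per_block;
    return clock_mhz * flops_per_word / cycles_per_word;
}
```

Under these assumptions a one-word block gives 10 MFLOPS and a four-word block gives 40 MFLOPS. The gain disappears if the program uses only one word from each block it fetches.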
for (i = 0; i < 1000; i++) {
    column_sum[i] = 0.0;
    for (j = 0; j < 1000; j++)
        column_sum[i] += b[j][i];
}
• Pre-fetching
• Multi-Threading
Impact of bandwidth on multithreaded programs
• Threads share memory
  – Cache
    • Cache size will be limited
    • Limited cache-hit ratio
  – Decrease in effective bandwidth
Simple Execution
for (i = 0; i < n; i++)
    c[i] = dot_product(get_row(a, i), b);
Threaded Execution
for (i = 0; i < n; i++)
    c[i] = create_thread(dot_product, get_row(a, i), b);
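The `create_thread` call on the slide is pseudocode. One possible realization, sketched here with POSIX threads, starts one thread per row and has each thread write its dot product into `c[i]`. The `dot_task` struct and the function names are assumptions made for this example, and one thread per row is chosen only to mirror the slide, not as a tuning recommendation.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Work for one thread: one row of a, the vector b, and where to store the result. */
typedef struct {
    const double *row;
    const double *b;
    int n;
    double *out;
} dot_task;

/* Thread body: dot product of one row with b. */
static void *dot_product_thread(void *arg) {
    dot_task *t = arg;
    double sum = 0.0;
    for (int k = 0; k < t->n; k++)
        sum += t->row[k] * t->b[k];
    *t->out = sum;
    return NULL;
}

/* c = a * b for an n x n row-major matrix a, one thread per row.
   Returns 0 on success, -1 if allocation or thread creation fails. */
int threaded_matvec(int n, const double *a, const double *b, double *c) {
    assert(n > 0);
    pthread_t *threads = malloc((size_t)n * sizeof *threads);
    dot_task *tasks = malloc((size_t)n * sizeof *tasks);
    if (threads == NULL || tasks == NULL) {
        free(threads);
        free(tasks);
        return -1;
    }
    int started = 0, status = 0;
    for (int i = 0; i < n; i++) {
        tasks[i] = (dot_task){ a + (size_t)i * n, b, n, &c[i] };
        if (pthread_create(&threads[i], NULL, dot_product_thread, &tasks[i]) != 0) {
            status = -1;
            break;
        }
        started++;
    }
    /* Wait for every thread that was actually started. */
    for (int i = 0; i < started; i++)
        pthread_join(threads[i], NULL);
    free(threads);
    free(tasks);
    return status;
}
```

While one thread waits on memory, another can compute, which is how multithreading hides latency. Depending on the toolchain, the program may need to be built with `-pthread`.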
for (i = 0; i < 1000; i++)
    column_sum[i] = 0.0;
for (j = 0; j < 1000; j++)
    for (i = 0; i < 1000; i++)
        column_sum[i] += b[j][i];
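The two loop orders for the column sum can be put side by side in a self-contained C sketch. It assumes the matrix is stored row-major in a flat array, so `b[j][i]` becomes `b[j * n + i]`. The function names and the size parameter `n` are introduced here for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Column sums walking down each column: consecutive accesses to b are
   n elements apart, so each one may touch a different cache line. */
void column_sum_strided(int n, const double *b, double *column_sum) {
    assert(n > 0);
    for (int i = 0; i < n; i++) {
        column_sum[i] = 0.0;
        for (int j = 0; j < n; j++)
            column_sum[i] += b[(size_t)j * n + i];
    }
}

/* Same sums walking along each row: accesses to b are unit stride,
   so every word of a fetched cache line is used. */
void column_sum_rowwise(int n, const double *b, double *column_sum) {
    assert(n > 0);
    for (int i = 0; i < n; i++)
        column_sum[i] = 0.0;
    for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++)
            column_sum[i] += b[(size_t)j * n + i];
}
```

Both functions produce the same sums. Only the order in which memory is touched differs, which is why the second form can benefit from a larger block size.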