Lecture3

CS-416 Parallel and Distributed Systems

Jawwad Shamsi

Lecture #3

20th January 2010

Announcement

• Possible name change to
  – High Performance Computing

Recap

• Pipelining
• Vector Instruction
• Super-Scalar Execution

Super-Scalar Execution

Dependencies

• Data dependency
• Resource dependency
• Branch dependency

Dynamic Instruction Issue

• 3rd Segment
  – Processor needs capability of
    • Out-of-order sequencing

Limitations of Memory Systems

• Latency
• Bandwidth

Effect of Latency - Example

• 1 GHz processor (1 ns cycle)
  – 100 ns memory latency
  – Two multiply-add units
• Four instructions in each cycle of 1 ns
  – Peak rating
    • 4 GFLOPS
    • Memory latency is 100 cycles
    • Block size is one word
    • Processor must wait 100 cycles before it can process the data
  – Peak speed of the computation is limited to 1 floating-point operation every 100 ns
  – 10 MFLOPS
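
The arithmetic in this example can be checked with a short C sketch. The function names and parameters are illustrative assumptions, not part of the lecture, and the model assumes that every floating-point operation has to wait for one full memory access.

```c
#include <assert.h>

/* Peak rate in MFLOPS: clock rate in MHz times FLOPs issued per cycle. */
double peak_mflops(double clock_mhz, double flops_per_cycle)
{
    return clock_mhz * flops_per_cycle;
}

/* Memory-bound rate in MFLOPS when each FLOP needs one fetch
   that costs latency_ns nanoseconds (assumed model). */
double latency_bound_mflops(double latency_ns)
{
    assert(latency_ns > 0.0);
    /* one FLOP per latency_ns nanoseconds = 1000 / latency_ns MFLOPS */
    return 1000.0 / latency_ns;
}
```

Calling peak_mflops(1000.0, 4.0) gives 4000 MFLOPS (4 GFLOPS), while latency_bound_mflops(100.0) gives 10 MFLOPS, a factor of 400 below the peak rating.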

Effect of Bandwidth

• Processor: 1 GHz
• DRAM with 100-cycle latency
• Block size is one word, so the processor takes 100 cycles to fetch each word
• Therefore, the algorithm performs one FLOP every 100 cycles, for a peak speed of 10 MFLOPS

• Increase Block Size??
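
To see why a larger block size is worth asking about, here is a rough C sketch. It rests on a simplifying assumption that the slide does not state: a single access still costs the full latency but returns a whole block of consecutive words, and the computation uses every word, performing one FLOP per word. The function name and parameters are hypothetical.

```c
#include <assert.h>

/* Assumed model: each memory access costs latency_cycles and returns
   block_words useful words; the algorithm does one FLOP per word.
   Returns the memory-bound rate in MFLOPS for a clock given in MHz. */
double block_bound_mflops(double clock_mhz, int latency_cycles, int block_words)
{
    assert(latency_cycles > 0 && block_words > 0);
    return clock_mhz * block_words / latency_cycles;
}
```

Under this model a one-word block at 1 GHz and 100 cycles gives 10 MFLOPS and a four-word block gives 40 MFLOPS. The gain only materializes when the program actually touches consecutive words, which is what the column-sum loops in this lecture probe.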

    for (i = 0; i < 1000; i++) {
        column_sum[i] = 0.0;
        /* walks down column i of b: each access is one full row apart */
        for (j = 0; j < 1000; j++)
            column_sum[i] += b[j][i];
    }

• Pre-fetching
• Multi-threading

Impact of bandwidth on multithreaded programs

• Threads share memory
  – Cache
    • Cache size will be limited
    • Limited cache-hit ratio
  – Decrease in effective bandwidth

Simple Execution

    for (i = 0; i < n; i++)
        c[i] = dot_product(get_row(a, i), b);

Threaded Execution

    for (i = 0; i < n; i++)
        c[i] = create_thread(dot_product, get_row(a, i), b);
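
The create_thread call on the slide is pseudocode. One possible way to realize it, sketched here with POSIX threads, is to start one thread per row and join them all before reading c. The names N, dot_task, dot_worker, and threaded_matvec are hypothetical, and the tiny matrix size is chosen only for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define N 4   /* illustrative matrix size, not from the lecture */

/* Arguments for one worker: a row of a, the vector b, and the result slot. */
struct dot_task {
    const double *row;
    const double *b;
    double *out;
};

/* Thread entry point: computes one dot product of length N. */
static void *dot_worker(void *arg)
{
    assert(arg != NULL);
    struct dot_task *t = arg;
    double sum = 0.0;
    for (size_t k = 0; k < N; k++)
        sum += t->row[k] * t->b[k];
    *t->out = sum;
    return NULL;
}

/* c = a * b with one thread per row.
   Returns 0 on success, -1 if a thread could not be created. */
int threaded_matvec(double a[N][N], const double b[N], double c[N])
{
    pthread_t tid[N];
    struct dot_task task[N];
    int created = 0, status = 0;

    for (int i = 0; i < N; i++) {
        task[i].row = a[i];
        task[i].b = b;
        task[i].out = &c[i];
        if (pthread_create(&tid[i], NULL, dot_worker, &task[i]) != 0) {
            status = -1;
            break;
        }
        created++;
    }
    /* wait for every thread that was started before c is read */
    for (int i = 0; i < created; i++)
        pthread_join(tid[i], NULL);
    return status;
}
```

Each thread writes only its own c[i], so no locking is needed. While one thread waits on memory, another can compute, which is how multithreading hides latency. On most systems this is compiled with the -pthread flag.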

    for (i = 0; i < 1000; i++)
        column_sum[i] = 0.0;
    for (j = 0; j < 1000; j++)
        for (i = 0; i < 1000; i++)
            column_sum[i] += b[j][i];
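
The two column-sum loop orderings in this lecture compute the same sums; only the order in which b is read differs. The sketch below wraps each ordering in a function so they can be compared or timed side by side. The function names and the DIM constant are illustrative, not from the lecture.

```c
#include <assert.h>
#include <stddef.h>

#define DIM 1000

/* Column-at-a-time order: the inner loop moves down one column,
   so consecutive accesses to b are a whole row (DIM words) apart. */
void column_sum_strided(double b[DIM][DIM], double column_sum[DIM])
{
    assert(b != NULL && column_sum != NULL);
    for (size_t i = 0; i < DIM; i++) {
        column_sum[i] = 0.0;
        for (size_t j = 0; j < DIM; j++)
            column_sum[i] += b[j][i];
    }
}

/* Row-at-a-time order: the inner loop moves along one row,
   so consecutive accesses to b are adjacent words in memory. */
void column_sum_contiguous(double b[DIM][DIM], double column_sum[DIM])
{
    assert(b != NULL && column_sum != NULL);
    for (size_t i = 0; i < DIM; i++)
        column_sum[i] = 0.0;
    for (size_t j = 0; j < DIM; j++)
        for (size_t i = 0; i < DIM; i++)
            column_sum[i] += b[j][i];
}
```

Because C stores b row by row, the second ordering uses every word of each fetched block, while the first uses one word per block when a row is larger than a block. A 1000 x 1000 array of doubles is about 8 MB, so b should be declared static or allocated on the heap rather than on the stack.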
