AOK-Lecture03.pdf


  • Arsitektur dan Organisasi Komputer (Computer Architecture and Organization): Computer Performance

    Lecture 03 (27 Feb 2014)

    Henry Novianus Palit

  • Designing for Performance (1)

    Computer trends:

    The cost of computer systems continues to drop

    The performance and capacity of computer systems continue to rise; e.g., today's laptops have the computing power of an IBM mainframe from 10-15 years ago

    The speed of a computer in executing a program is affected by:

    design of its instruction set

    design of its hardware

    design of its software (including the OS and the compiler)

    technology in which the hardware is implemented

    Arsitektur & Organisasi Komputer 2

  • Designing for Performance (2)

    The speed of switching between 0 and 1 states in logic circuits is largely determined by the size of the transistors that implement the circuits; i.e., smaller transistors switch faster

    Reducing transistor sizes has two advantages:

    Instructions can be executed faster

    More transistors can be placed on a chip, leading to more logic functionality and more memory storage capacity

    Gordon Moore (Intel co-founder): the number of transistors incorporated in a chip will approximately double every 24 months

    The raw speed of a microprocessor will not achieve its potential unless it is fed by a constant stream of work (i.e., computer instructions)


  • Designing for Performance (3)

    Techniques to exploit the raw speed of a processor:

    Pipelining (a kind of instruction-level parallelism): the processor works on multiple instructions at once by moving data or instructions through a conceptual pipe, with all stages of the pipe processing simultaneously; e.g., while one instruction is being executed, the next instruction is being fetched and decoded

    Branch prediction: the processor looks ahead in the instruction code fetched from memory, predicts which branches or groups of instructions are likely to be processed next, and prefetches those instructions (possibly multiple branches ahead); thus, it increases the amount of work available for the processor to execute

    Data flow analysis: the processor analyzes which instructions depend on each other's results in order to create an optimized schedule of instructions; thus, it prevents unnecessary delay

    Speculative execution: using branch prediction and data flow analysis, some processors speculatively execute instructions ahead of their actual appearance in the program execution, holding the results in temporary locations; thus, it keeps the execution engines busy

  • Five-Stage Pipeline


  • Parallelism (1)

    Instruction-level parallelism

    Pipelining (see also the previous slides): pipelining allows a trade-off between latency (how long it takes to execute an instruction) and processor bandwidth (how many instructions/sec the CPU can complete)

    Superscalar architectures: a dual pipeline, or a single pipeline with multiple functional units

    Two instructions issued together must neither conflict over resource usage (e.g., registers) nor depend on the result of the other; this is either guaranteed by the compiler or detected and eliminated during execution by extra hardware

    Most of the functional units in stage 4 take appreciably longer than one clock cycle to execute, certainly the ones that access memory or do floating-point arithmetic
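    The latency/bandwidth trade-off of pipelining can be sketched with a simple cycle count. This is a minimal illustration under idealized assumptions that are not on the slide: every stage takes exactly one clock cycle and there are no stalls or hazards. The function names are hypothetical.

```python
def pipeline_cycles(num_instructions: int, num_stages: int) -> int:
    """Cycles to finish a batch on an ideal pipeline (one cycle per stage, no stalls)."""
    if num_instructions == 0:
        return 0
    # The first instruction needs num_stages cycles; each later one completes
    # one cycle after its predecessor.
    return num_stages + num_instructions - 1


def unpipelined_cycles(num_instructions: int, num_stages: int) -> int:
    """Cycles when each instruction must finish all stages before the next starts."""
    return num_stages * num_instructions


n, k = 1000, 5
pipelined = pipeline_cycles(n, k)
sequential = unpipelined_cycles(n, k)
print(f"pipelined: {pipelined} cycles, sequential: {sequential} cycles")
print(f"throughput gain: {sequential / pipelined:.2f}x")
```

    Under these assumptions each instruction still takes five cycles from fetch to completion (latency is unchanged), but the completion rate approaches one instruction per cycle (bandwidth improves by nearly the number of stages).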


  • Superscalar Architectures


  • Parallelism (2)

    Processor-level parallelism

    Multicore processors: multiple processing units fabricated on a single chip, e.g., dual-core, quad-core, hex-core, etc.

    Data parallel processors

    SIMD (Single Instruction-stream Multiple Data-stream) processors: consist of a large number of identical processors that perform the same sequence of instructions on different sets of data (e.g., GPUs / Graphics Processing Units)

    Vector processors: very efficient at executing a sequence of operations on pairs of data elements, but all of the operations are performed in a single, heavily pipelined functional unit (e.g., SSE / Streaming SIMD Extensions from Intel)

  • SIMD Processor

    Processing steps per cycle:

    The scheduler selects two threads to execute on the processor

    The next instruction from each thread then executes on up to 16 SIMD cores

    If each thread is able to use all 16 SIMD cores, a fully loaded GPU with 32 SMs (stream multiprocessors) can perform 512 operations per cycle; a similar-sized general-purpose quad-core CPU would struggle to achieve 1/32 as much processing

  • Parallelism (3)

    Processor-level parallelism (cont'd)

    Multiprocessors (SMP / symmetric multiprocessing)

    Computer systems that contain many processors, each possibly containing multiple cores

    Used either for executing a number of different application tasks concurrently or for executing subtasks of a single large task in parallel

    All processors usually have access to all of the memory (shared-memory multiprocessor)

    Multicomputers (distributed or cluster computing)

    An interconnected group of computers is used to achieve high total computational power

    Computers normally have access only to their own memory units

    Data is shared by exchanging messages over a communication network (message-passing multicomputer)

  • Multiprocessor

    Figures: (a) a single-bus multiprocessor; (b) a multiprocessor with local memories (NUMA / Non-Uniform Memory Access)

  • Tianhe-2

    World's Fastest Computer (as of November 2013)

    source: http://www.top500.org

    Processors: Intel Xeon E5-2692 (12C) + Intel Xeon Phi 31S1P

    Total cores: 3,120,000

    Memory: 1 PB

    Interconnect: TH Express-2

    Linpack performance: 33,862.7 TFlop/s (peak = 54,902.4 TFlop/s)

    Power: 17.8 MW

    OS: Kylin Linux

    MPI: MPICH2

  • Performance Assessment

    Performance is a key parameter in evaluating a computer system, along with cost, size, security, reliability, and power consumption

    Raw speed is far less important than how a processor performs when executing a given application

    Some measures of a computer's performance:

    Clock speed

    Instruction execution rate

    Benchmarks

    Amdahl's Law

    Little's Law

  • Clock Speed

    Clock speed or clock rate is measured in cycles/second or Hertz (Hz)

    Clock signals are typically generated by a quartz crystal, which generates a constant signal wave while power is applied; the wave is in turn converted into a digital voltage pulse stream

    Since the execution of an instruction involves a number of steps (such as fetching the instruction from memory, decoding the instruction, loading and storing data, and performing arithmetic and logical operations), most instructions require multiple clock cycles to complete

    A straight comparison of clock speeds on different processors does not tell the whole story about performance (e.g., when pipelining is used)

  • Instruction Execution Rate (1)

    For a given processor, the number of clock cycles required varies for different types of instructions

    Average Cycles Per Instruction (CPI) for a given program is

    $CPI = \frac{\sum_{i=1}^{n} (CPI_i \times I_i)}{I_c}$

    where

    $CPI_i$ = number of cycles required for instruction type i

    $I_i$ = number of executed instructions of type i

    $I_c$ = instruction count (number of instructions)

    $n$ = number of instruction types

    The processor time (T) needed to execute a given program is

    $T = I_c \times CPI \times \tau$

    where $\tau$ = the constant cycle time = 1/f (f = the constant frequency)
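    The two formulas above can be checked with a short calculation. This is a minimal sketch; the instruction counts, per-type CPI values, and the 200 MHz clock are hypothetical numbers chosen for illustration, not data from the lecture.

```python
def average_cpi(mix):
    """Average CPI from a list of (cpi_i, instruction_count_i) pairs."""
    total_instructions = sum(count for _, count in mix)        # I_c
    total_cycles = sum(cpi * count for cpi, count in mix)      # sum of CPI_i * I_i
    return total_cycles / total_instructions


def processor_time(instruction_count, cpi, clock_hz):
    """T = I_c * CPI * tau, with tau = 1 / f."""
    tau = 1.0 / clock_hz
    return instruction_count * cpi * tau


# Hypothetical program: three instruction types.
mix = [(1, 50_000), (3, 30_000), (5, 20_000)]
cpi = average_cpi(mix)
instruction_count = sum(count for _, count in mix)
t = processor_time(instruction_count, cpi, clock_hz=200e6)
print(f"CPI = {cpi:.2f}, T = {t * 1e3:.2f} ms")
```

    With these numbers the program needs 240,000 cycles for 100,000 instructions, so the average CPI is 2.4 and the run takes 1.2 ms at 200 MHz.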

  • Instruction Execution Rate (2)

    To differentiate memory and processor cycle times, the preceding equation can be rewritten as

    $T = I_c \times [p + (m \times k)] \times \tau$

    where

    $p$ = number of processor cycles needed to decode and execute the instruction

    $m$ = number of memory references needed (on average)

    $k$ = ratio between memory and processor cycle times

    System attributes influence the performance factors ($I_c$, $p$, $m$, $k$, $\tau$)

  • Instruction Execution Rate (3)

    A common measure of performance for a processor is the rate at which instructions are executed, expressed as millions of instructions per second (MIPS)

    $\text{MIPS rate} = \frac{I_c}{T \times 10^6} = \frac{f}{CPI \times 10^6}$

    Another common performance measure deals only with floating-point instructions (which are common in many scientific and game applications) and is expressed as millions of floating-point operations per second (MFLOPS)

    $\text{MFLOPS rate} = \frac{\text{number of executed floating-point operations in a program}}{\text{execution time} \times 10^6}$

  • Benchmarks (1)

    MIPS and MFLOPS are often inadequate for evaluating a processor's performance (e.g., CISC and RISC machines may have different MIPS rates even though both take about the same amount of time to run a program)

    In the early 1990s, performance measurement shifted to using a set of benchmark programs

    Desirable characteristics of a benchmark program:

    Written in a high-level language, making it portable across machines

    Representative of a particular kind of programming style such as systems programming, numerical programming, or commercial programming

    Can be measured easily

    Has wide distribution

  • Benchmarks (2)

    SPEC (Standard Performance Evaluation Corporation) benchmarks: defined and maintained by an industry consortium (e.g., SPEC CPU2006, SPECjvm98, SPECjbb2000, SPECweb99, SPECmail2001)

    Averaging results: run a number of different benchmark programs on each machine and then average the results

    Simple arithmetic mean: $R_A = \frac{1}{m} \sum_{i=1}^{m} R_i$

    Harmonic mean: $R_H = \frac{m}{\sum_{i=1}^{m} \frac{1}{R_i}}$

    where $R_i$ = high-level language instruction execution rate for benchmark i, and $m$ = number of benchmark programs
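    The difference between the two means is easiest to see on numbers. The sketch below uses three hypothetical execution rates (in MIPS) that are not from the lecture; the function names are illustrative.

```python
def arithmetic_mean(rates):
    """R_A = (1/m) * sum of R_i."""
    return sum(rates) / len(rates)


def harmonic_mean(rates):
    """R_H = m / sum of (1 / R_i)."""
    return len(rates) / sum(1.0 / r for r in rates)


# Hypothetical execution rates (MIPS) for three benchmark programs.
rates = [100.0, 200.0, 400.0]
print(f"arithmetic mean = {arithmetic_mean(rates):.2f}")
print(f"harmonic mean   = {harmonic_mean(rates):.2f}")
```

    For these rates the arithmetic mean is about 233.33 and the harmonic mean is about 171.43. The harmonic mean is pulled toward the slowest benchmark, which is why it is the lower of the two whenever the rates differ.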

  • Benchmarks (3)

    SPEC benchmarks are concerned with a speed metric and a rate metric

    Speed metric: measures the ability of a machine to complete a single task

    Results are reported as the ratio of the reference run time to the run time of the system under test:

    $r_i = \frac{Tref_i}{Tsut_i}$

    The overall performance measure for the system under test is calculated by averaging the ratio values with a geometric mean:

    $r_G = \left( \prod_{i=1}^{n} r_i \right)^{1/n}$

    Rate metric: measures the throughput or rate of a machine carrying out a number of tasks

    Multiple copies of the benchmarks (i.e., as many as the number of processors) are run simultaneously and a ratio is reported:

    $r_i = \frac{N \times Tref_i}{Tsut_i}$

    Here, $Tsut_i$ is the elapsed time from the start of the execution of the program on all N processors until the completion of all copies

    A geometric mean is again used to determine the overall performance measure

  • Amdahl's Law (1)

    Proposed by Gene Amdahl; it deals with the potential speedup of a program using multiple processors compared to a single processor

    Consider a program running on a single processor such that a fraction (1 − f) of the execution time involves code that is inherently serial and a fraction f involves code that is infinitely parallelizable with no scheduling overhead

    Let T be the total execution time of the program using a single processor

    The speedup using a parallel processor with N processors that fully exploits the parallel portion of the program is

    $\text{Speedup} = \frac{\text{time to execute program on a single processor}}{\text{time to execute program on } N \text{ parallel processors}} = \frac{T(1-f) + Tf}{T(1-f) + \frac{Tf}{N}} = \frac{1}{(1-f) + \frac{f}{N}}$
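    The speedup formula can be evaluated for a growing number of processors to see how quickly it saturates. This is a minimal sketch; the parallel fraction f = 0.9 and the processor counts are hypothetical values chosen for illustration.

```python
def amdahl_speedup(f: float, n: int) -> float:
    """Speedup = 1 / ((1 - f) + f / N) for parallel fraction f on N processors."""
    return 1.0 / ((1.0 - f) + f / n)


f = 0.9  # hypothetical: 90% of the execution time is parallelizable
for n in (1, 10, 100, 1000):
    print(f"N = {n:4d}: speedup = {amdahl_speedup(f, n):.3f}")

# The serial fraction sets the ceiling: 1 / (1 - f).
print(f"upper bound = {1.0 / (1.0 - f):.1f}")
```

    With f = 0.9, ten processors give a speedup of about 5.26 and a hundred give about 9.17, while the bound 1/(1 − f) is 10; most of the benefit is already gained with a modest number of processors.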

  • Amdahl's Law (2)

    Two conclusions can be drawn:

    When f is small, the use of parallel processors has little effect

    As N approaches infinity, speedup is bounded by 1/(1 − f), so there are diminishing returns for using more processors

    These conclusions are too pessimistic, as a server can execute multiple threads or multiple tasks in parallel and exploit data parallelism

    Speedup in general can be expressed as

    $\text{Speedup} = \frac{\text{performance after enhancement}}{\text{performance before enhancement}} = \frac{\text{execution time before enhancement}}{\text{execution time after enhancement}}$

    If a targeted enhancement is applied to fraction f, and that fraction's speedup after the enhancement is $SU_f$, the overall speedup is

    $\text{Speedup} = \frac{1}{(1-f) + \frac{f}{SU_f}}$

  • Little's Law

    Based on queuing theory, Little's Law can be applied to any system that is statistically in steady state and in which there is no leakage

    General setup:

    Suppose there is a steady-state system where items arrive at an average rate of λ items per unit time

    Each item stays in the system for an average of W units of time

    There is an average of L items in the system at any one time

    Little's Law relates these three variables as L = λW

    Under steady-state conditions, the average number of items in a queuing system equals the average rate at which items arrive multiplied by the average time that an item spends in the system
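    A small numeric illustration of L = λW follows. The arrival rate and residence time are hypothetical values (they are not from the lecture), and the function names are illustrative.

```python
def items_in_system(arrival_rate: float, time_in_system: float) -> float:
    """Little's Law: L = lambda * W."""
    return arrival_rate * time_in_system


def time_in_system(items: float, arrival_rate: float) -> float:
    """Little's Law solved for W: W = L / lambda."""
    return items / arrival_rate


# Hypothetical steady-state system: 4 requests arrive per second on average,
# and each request spends 2.5 seconds in the system on average.
L = items_in_system(arrival_rate=4.0, time_in_system=2.5)
print(f"average number of requests in the system: {L}")
print(f"recovered W: {time_in_system(L, 4.0)} s")
```

    Knowing any two of the three quantities is enough to recover the third, which is what makes the law useful when only queue length and arrival rate can be measured.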


  • Example: MIPS rate

    Consider a program that is executed on a 400-MHz processor. The instruction mix and the CPI for each instruction type are given in the table below. Calculate the MIPS rate!

    Solution:

    Average CPI = (1 × 0.6) + (2 × 0.18) + (4 × 0.12) + (8 × 0.1) = 2.24

    MIPS rate = (400 × 10^6) / (2.24 × 10^6) ≈ 178.6
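    The solution can be reproduced in a few lines. The CPI values and mix fractions below are the ones that appear in the solution; the function and variable names are illustrative.

```python
def mips_rate(clock_hz: float, cpi: float) -> float:
    """MIPS rate = f / (CPI * 10^6)."""
    return clock_hz / (cpi * 1e6)


# (CPI, fraction of executed instructions) pairs taken from the solution.
mix = [(1, 0.60), (2, 0.18), (4, 0.12), (8, 0.10)]
cpi = sum(cycles * fraction for cycles, fraction in mix)
print(f"average CPI = {cpi:.2f}")
print(f"MIPS rate   = {mips_rate(400e6, cpi):.1f}")
```

    Because the mix is given as fractions that sum to 1, the weighted sum is already the average CPI and no division by the instruction count is needed.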


  • Example: Speed Metric

    The table below shows the SPEC integer speed ratios for twelve benchmark programs on the Sun Blade 6250. Calculate the speed metric!

    Solution:

    Speed metric = (17.5 × 14.0 × 13.7 × 17.6 × 14.7 × 18.6 × 17.0 × 31.3 × 23.7 × 9.23 × 10.9 × 14.7)^(1/12) ≈ 16.09
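    The geometric mean in the solution can be verified numerically. The twelve ratios are the ones listed in the solution; computing the mean through logarithms is a choice made here to avoid a large intermediate product, not something the slide prescribes.

```python
import math


def geometric_mean(values):
    """n-th root of the product, computed as exp(mean(log(v)))."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# SPEC integer speed ratios from the solution above.
ratios = [17.5, 14.0, 13.7, 17.6, 14.7, 18.6, 17.0, 31.3, 23.7, 9.23, 10.9, 14.7]
print(f"speed metric = {geometric_mean(ratios):.2f}")
```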


  • Example: Speedup (Amdahl's Law)

    Problem:

    Suppose that a task makes extensive use of floating-point operations, with 40% of the time consumed by floating-point operations

    With a new hardware design, the floating-point module is sped up by a factor of K

    Calculate the maximum overall speedup!

    Solution:

    $\text{Speedup} = \frac{1}{(1 - 0.4) + \frac{0.4}{K}} = \frac{1}{0.6 + \frac{0.4}{K}}$

    As K grows, the term 0.4/K approaches zero; thus, independent of K, the maximum speedup is 1/0.6 ≈ 1.67
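    Evaluating the expression for a few values of K shows how the overall speedup approaches the 1.67 limit. The fraction 0.4 comes from the problem statement; the particular K values are hypothetical and the function name is illustrative.

```python
def enhanced_speedup(f: float, su_f: float) -> float:
    """Overall speedup = 1 / ((1 - f) + f / SU_f) when fraction f is sped up by SU_f."""
    return 1.0 / ((1.0 - f) + f / su_f)


f = 0.4  # fraction of time spent in floating-point operations (from the problem)
for k in (2, 10, 100, 1000):
    print(f"K = {k:4d}: overall speedup = {enhanced_speedup(f, k):.4f}")

print(f"limit as K grows: {1.0 / (1.0 - f):.4f}")
```

    Doubling the floating-point speed (K = 2) gives an overall speedup of 1.25, and K = 10 gives 1.5625, so even a very fast floating-point unit cannot push the overall speedup past about 1.67.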

  • Homework

    Four benchmark programs are executed on three computers with the following results:

    The table shows the execution time in seconds, with 100,000,000 instructions executed in each program.

    Calculate the MIPS values for each computer for each program!

    Calculate the arithmetic and harmonic means assuming equal weights for the four programs, and rank the computers based on arithmetic mean and harmonic mean!
