CS267/E233 Applications of Parallel Computers Lecture 1: Introduction
Transcript of CS267/E233 Applications of Parallel Computers Lecture 1: Introduction
Kathy Yelick
www.cs.berkeley.edu/~yelick/cs267_spr07
Why powerful computers are parallel (circa 1991-2006)
Tunnel Vision by Experts
• “I think there is a world market for maybe five computers.”
- Thomas Watson, chairman of IBM, 1943.
• “There is no reason for any individual to have a computer in their home”
- Ken Olson, president and founder of Digital Equipment Corporation, 1977.
• “640K [of memory] ought to be enough for anybody.”
- Bill Gates, chairman of Microsoft, 1981.
• “On several recent occasions, I have been asked whether parallel computing will soon be relegated to the trash heap reserved for promising technologies that never quite make it.”
- Ken Kennedy, CRPC Director, 1994.
Slide source: Warfield et al.
Technology Trends: Microprocessor Capacity
• 2x transistors/chip every 1.5 years -- called “Moore’s Law”
Microprocessors have become smaller, denser, and more powerful.
Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months.
Slide source: Jack Dongarra
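A quick back-of-the-envelope check (not on the slide) of what this doubling rate implies:

```python
# Growth implied by "2x transistors/chip every 1.5 years" over one decade.
years = 10
print(2 ** (years / 1.5))   # ~101.6, i.e., roughly 100x per decade
```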
Microprocessor Transistors per Chip
• Growth in transistors per chip
• Increase in clock rate
[Charts: transistors per chip, 1970-2005, for the i4004 through the Pentium and R10000 (~10³ to ~10⁸ transistors); and clock rate in MHz, 1970-2000 (~0.1 to ~1000 MHz)]
Impact of Device Shrinkage
• What happens when the feature size (transistor size) shrinks by a factor of x?
• Clock rate goes up by x because wires are shorter
- actually less than x, because of power consumption
• Transistors per unit area go up by x²
• Die size also tends to increase
- typically by another factor of ~x
• Raw computing power of the chip goes up by ~x⁴!
- of this, typically the x³ is devoted on-chip to either
- parallelism: hidden parallelism such as ILP
- locality: caches
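Written out as a single expression (a sketch of the same argument, with the factors taken from the bullets above):

$$\text{raw power} \;\propto\; \underbrace{x}_{\text{clock}} \times \underbrace{x^2}_{\text{transistors/area}} \times \underbrace{x}_{\text{die area}} \;=\; x^4, \qquad x = 2 \;\Rightarrow\; 16\times \text{ per shrink step}$$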
But there are limiting forces
• Moore’s 2nd law (Rock’s law): costs go up
[Image: demo of 0.06 micron CMOS. Source: Forbes Magazine]
• Yield: what percentage of the chips are usable?
- E.g., the Cell processor (PS3) is sold with 7 out of 8 “on” to improve yield
Manufacturing costs and yield problems limit use of density
Power Density Limits Serial Performance
More Limits: How fast can a serial computer be?
• Consider a 1 Tflop/s sequential machine:
- Data must travel some distance, r, to get from memory to CPU.
- To get 1 data element per cycle, data must make 10¹² trips per second at the speed of light, c = 3×10⁸ m/s. Thus r < c/10¹² = 0.3 mm.
• Now put 1 Tbyte of storage in a 0.3 mm × 0.3 mm area:
- Each bit occupies about 1 square Angstrom, the size of a small atom.
• No choice but parallelism
[Figure: a 1 Tflop/s, 1 Tbyte sequential machine must fit in a square of side r = 0.3 mm]
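The bound is easy to check numerically; here is a minimal sketch of the argument above (assuming one memory round trip per flop, with signals at light speed):

```python
import math

c = 3.0e8            # speed of light, m/s
rate = 1.0e12        # 1 Tflop/s -> 10^12 memory trips per second
r = c / rate         # maximum memory-to-CPU distance, meters
print(f"r = {r * 1e3:.1f} mm")                  # 0.3 mm

bits = 8.0e12        # 1 Tbyte = 8 x 10^12 bits, packed in an r x r square
side = math.sqrt(r**2 / bits)                   # edge length per bit, meters
print(f"~{side * 1e10:.1f} Angstrom per bit")   # ~1 Angstrom: atomic scale
```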
Revolution is Happening Now
• Chip density is continuing to increase ~2x every 2 years
- Clock speed is not
- Number of processor cores may double instead
• There is little or no hidden parallelism (ILP) to be found
• Parallelism must be exposed to and managed by software
Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)
Why Parallelism (2007)?
• These arguments are no longer theoretical
• All major processor vendors are producing multicore chips
- Every machine will soon be a parallel machine
- All programmers will be parallel programmers???
• New software model
- Want a new feature? Hide the “cost” by speeding up the code first
- All programmers will be performance programmers???
• Some of this may eventually be hidden in libraries, compilers, and high-level languages
- But a lot of work is needed to get there
• Big open questions:
- What will be the killer apps for multicore machines?
- How should the chips be designed, and how will they be programmed?
Outline
• Why powerful computers must be parallel processors (all of them, including your laptop)
• Large important problems require powerful computers (even computer games)
• Why writing (fast) parallel programs is hard
• Principles of parallel computing performance
• Structure of the course
Why we need powerful computers
Units of Measure in HPC
• High Performance Computing (HPC) units are:
- Flop: floating point operation
- Flop/s: floating point operations per second
- Bytes: size of data (a double precision floating point number is 8 bytes)
• Typical sizes are millions, billions, trillions…

| Prefix | Rate | Storage |
|---|---|---|
| Mega | Mflop/s = 10⁶ flop/sec | Mbyte = 2²⁰ = 1048576 ~ 10⁶ bytes |
| Giga | Gflop/s = 10⁹ flop/sec | Gbyte = 2³⁰ ~ 10⁹ bytes |
| Tera | Tflop/s = 10¹² flop/sec | Tbyte = 2⁴⁰ ~ 10¹² bytes |
| Peta | Pflop/s = 10¹⁵ flop/sec | Pbyte = 2⁵⁰ ~ 10¹⁵ bytes |
| Exa | Eflop/s = 10¹⁸ flop/sec | Ebyte = 2⁶⁰ ~ 10¹⁸ bytes |
| Zetta | Zflop/s = 10²¹ flop/sec | Zbyte = 2⁷⁰ ~ 10²¹ bytes |
| Yotta | Yflop/s = 10²⁴ flop/sec | Ybyte = 2⁸⁰ ~ 10²⁴ bytes |
• See www.top500.org for current list of fastest machines
Simulation: The Third Pillar of Science
• Traditional scientific and engineering paradigm:
1) Do theory or paper design.
2) Perform experiments or build system.
• Limitations:
- Too difficult -- build large wind tunnels.
- Too expensive -- build a throw-away passenger jet.
- Too slow -- wait for climate or galactic evolution.
- Too dangerous -- weapons, drug design, climate experimentation.
• Computational science paradigm:
3) Use high performance computer systems to simulate the phenomenon
- Based on known physical laws and efficient numerical methods.
Some Particularly Challenging Computations
• Science
- Global climate modeling
- Biology: genomics; protein folding; drug design
- Astrophysical modeling
- Computational chemistry
- Computational materials science and nanoscience
• Engineering
- Semiconductor design
- Earthquake and structural modeling
- Computational fluid dynamics (airplane design)
- Combustion (engine design)
- Crash simulation
• Business
- Financial and economic modeling
- Transaction processing, web services and search engines
• Defense
- Nuclear weapons -- test by simulations
- Cryptography
Economic Impact of HPC
• Airlines:
- System-wide logistics optimization systems on parallel systems.
- Savings: approx. $100 million per airline per year.
• Automotive design:
- Major automotive companies use large systems (500+ CPUs) for:
- CAD-CAM, crash testing, structural integrity and aerodynamics.
- One company has a 500+ CPU parallel system.
- Savings: approx. $1 billion per company per year.
• Semiconductor industry:
- Semiconductor firms use large systems (500+ CPUs) for:
- device electronics simulation and logic validation.
- Savings: approx. $1 billion per company per year.
• Securities industry:
- Savings: approx. $15 billion per year for U.S. home mortgages.
$5B World Market in Technical Computing
[Chart: technical computing market share by segment, 1998-2003. Segments: Biosciences; Chemical Engineering; Classified Defense; Digital Content Creation and Distribution; Economics/Financial; Electrical Design/Engineering Analysis; Geoscience and Geo-engineering; Imaging; Mechanical Design and Drafting; Mechanical Design/Engineering Analysis; Scientific Research and R&D; Simulation; Technical Management and Support; Other]
Source: IDC 2004, from NRC Future of Supercomputing Report
Global Climate Modeling Problem
• Problem is to compute:
f(latitude, longitude, elevation, time) →
temperature, pressure, humidity, wind velocity
• Approach:
- Discretize the domain, e.g., a measurement point every 10 km
- Devise an algorithm to predict weather at time t+Δt given the state at time t
• Uses:
- Predict major events, e.g., El Niño
- Use in setting air emissions standards
Source: http://www.epm.ornl.gov/chammp/chammp.html
Global Climate Modeling Computation
• One piece is modeling the fluid flow in the atmosphere
- Solve Navier-Stokes equations
- Roughly 100 flops per grid point with a 1 minute timestep
• Computational requirements (a worked sketch follows this list):
- To match real time, need 5×10¹¹ flops in 60 seconds = 8 Gflop/s
- Weather prediction (7 days in 24 hours) → 56 Gflop/s
- Climate prediction (50 years in 30 days) → 4.8 Tflop/s
- To use in policy negotiations (50 years in 12 hours) → 288 Tflop/s
• To double the grid resolution, computation goes up by 8x to 16x
• State-of-the-art models require integration of atmosphere, ocean, sea-ice, and land models, plus possibly carbon cycle, geochemistry and more
• Current models are coarser than this
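A minimal sketch reproducing the requirement arithmetic; the only assumed input is the slide's ~5×10¹¹ flops of work per simulated minute, and small discrepancies with the slide's figures come from rounding:

```python
FLOPS_PER_SIM_MINUTE = 5.0e11    # grid points x ~100 flops, from the slide

def required_rate(sim_minutes, wall_seconds):
    """Sustained flop/s needed to simulate sim_minutes of model time
    in wall_seconds of real time."""
    return FLOPS_PER_SIM_MINUTE * sim_minutes / wall_seconds

DAY, YEAR = 24 * 60, 365 * 24 * 60                # in simulated minutes
print(required_rate(1, 60))                       # ~8.3e9:  real time (slide: 8 Gflop/s)
print(required_rate(7 * DAY, 24 * 3600))          # ~5.8e10: weather  (slide: 56 Gflop/s)
print(required_rate(50 * YEAR, 30 * 24 * 3600))   # ~5.1e12: climate  (slide: 4.8 Tflop/s)
print(required_rate(50 * YEAR, 12 * 3600))        # ~3.0e14: policy   (slide: 288 Tflop/s)
```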
High Resolution Climate Modeling on NERSC-3 -- P. Duffy, et al., LLNL
A 1000 Year Climate Simulation
• Warren Washington and Jerry Meehl, National Center for Atmospheric Research; Bert Semtner, Naval Postgraduate School; John Weatherly, U.S. Army Cold Regions Research and Engineering Laboratory; et al.
• http://www.nersc.gov/news/science/bigsplash2002.pdf
• Demonstration of the Community Climate System Model (CCSM2)
• A 1000-year simulation shows long-term, stable representation of the earth’s climate.
• 760,000 processor hours used
• Temperature change shown
Climate Modeling on the Earth Simulator System
• Development of the Earth Simulator (ES) started in 1997, in order to develop a comprehensive understanding of global environmental changes such as global warming.
• 26.58 Tflops was obtained by a global atmospheric circulation code.
• 35.86 Tflops (87.5% of peak performance) was achieved on the Linpack benchmark (world’s fastest machine from 2002-2004).
• Construction was completed at the end of February 2002, and practical operation started on March 1, 2002.
Astrophysics: Binary Black Hole Dynamics
• Massive supernova cores collapse to black holes.
• At a black hole’s center, spacetime breaks down.
• Critical test of theories of gravity, from General Relativity to Quantum Gravity.
• Indirect observation: most galaxies have a black hole at their center.
• Gravity waves show black holes directly, including detailed parameters.
• Binary black holes are the most powerful sources of gravity waves.
• Simulation is extraordinarily complex: the evolution disrupts the spacetime!
Heart Simulation
• Problem is to compute blood flow in the heart
• Approach:
- Modeled as an elastic structure in an incompressible fluid.
- The “immersed boundary method” due to Peskin and McQueen.
- 20 years of development in the model
- Many applications other than the heart: blood clotting, inner ear, paper making, embryo growth, and others
- Use a regularly spaced mesh (set of points) for evaluating the fluid
• Uses:
- Current model can be used to design artificial heart valves
- Can help in understanding effects of disease (leaky valves)
- Related projects look at the behavior of the heart during a heart attack
- Ultimately: real-time clinical work
Heart Simulation Calculation
• This involves solving the Navier-Stokes equations
- 64³ was possible on a Cray YMP, but 128³ is required for an accurate model (it would have taken 3 years).
- Done on a Cray C90 -- 100x faster and 100x more memory
- Until recently, limited to vector machines
• Needs more features:
- Electrical model of the heart, and details of muscles, e.g.,
- Chris Johnson
- Andrew McCulloch
- Lungs, circulatory systems
Heart Simulation
[Animation of the lower portion of the heart. Source: www.psc.org]
Parallel Computing in Data Analysis
• Finding information amidst large quantities of data
• General themes of sifting through large, unstructured data sets:
- Has there been an outbreak of some medical condition in a community?
- Which doctors are most likely involved in fraudulent charging to Medicare?
- When should white socks go on sale?
- What advertisements should be sent to you?
• Data collected and stored at enormous speeds (Gbyte/hour)
- remote sensor on a satellite
- telescope scanning the skies
- microarrays generating gene expression data
- scientific simulations generating terabytes of data
- NSA analysis of telecommunications
Performance on Linpack Benchmark (www.top500.org)
Why writing (fast) parallel programs is hard
Principles of Parallel Computing
• Finding enough parallelism (Amdahl’s Law)
• Granularity
• Locality
• Load balance
• Coordination and synchronization
• Performance modeling
All of these things make parallel programming even harder than sequential programming.
“Automatic” Parallelism in Modern Machines
• Bit-level parallelism
- within floating point operations, etc.
• Instruction-level parallelism (ILP)
- multiple instructions execute per clock cycle
• Memory system parallelism
- overlap of memory operations with computation
• OS parallelism
- multiple jobs run in parallel on commodity SMPs
There are limits to all of these -- for very high performance, the user must identify, schedule and coordinate parallel tasks
Finding Enough Parallelism
• Suppose only part of an application seems parallel
• Amdahl’s law:
- let s be the fraction of work done sequentially, so (1-s) is the fraction parallelizable
- let P = number of processors
Speedup(P) = Time(1)/Time(P) ≤ 1/(s + (1-s)/P) ≤ 1/s
• Even if the parallel part speeds up perfectly, performance is limited by the sequential part (see the sketch below)
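A minimal sketch of the bound, showing how even a 1% sequential fraction caps speedup at 1/s = 100:

```python
def speedup(s, p):
    """Amdahl bound: sequential fraction s, p processors."""
    return 1.0 / (s + (1.0 - s) / p)

for p in (10, 100, 1000, 10**6):
    print(p, round(speedup(0.01, p), 1))
# 10 -> 9.2, 100 -> 50.3, 1000 -> 91.0, 1000000 -> 100.0 (never exceeds 1/s)
```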
Overhead of Parallelism
• Given enough parallel work, this is the biggest barrier to getting desired speedup
• Parallelism overheads include:
- cost of starting a thread or process
- cost of communicating shared data
- cost of synchronizing
- extra (redundant) computation
• Each of these can be in the range of milliseconds (= millions of flops) on some systems
• Tradeoff: the algorithm needs sufficiently large units of work to run fast in parallel (i.e., large granularity), but not so large that there is not enough parallel work (a toy model follows below)
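A toy model of this tradeoff (all numbers hypothetical, chosen only to illustrate the shape of the curve):

```python
import math

def run_time(total_work, n_tasks, p, overhead):
    """Split total_work evenly into n_tasks, each paying a fixed
    per-task overhead; at most p tasks execute concurrently."""
    per_task = total_work / n_tasks + overhead
    waves = math.ceil(n_tasks / p)          # rounds of up to p tasks
    return waves * per_task

W, P, OV = 1e9, 100, 1e6                    # work, processors, per-task overhead
for n in (10, 100, 1000, 10000):
    print(n, f"{run_time(W, n, P, OV):.2e}")
# Too few tasks (n < P) leaves processors idle; too many pays the
# overhead over and over. The sweet spot here is n = P.
```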
Locality and Parallelism
• Large memories are slow; fast memories are small
• Storage hierarchies are large and fast on average
• Parallel processors, collectively, have large, fast caches
- the slow accesses to “remote” data are what we call “communication”
• The algorithm should do most work on local data
[Diagram: conventional storage hierarchy (Proc + cache, L2 Cache, L3 Cache, Memory) replicated across processors, with potential interconnects at each level]
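To see why locality matters even on a single processor, here is a small illustrative demo (not from the lecture; it assumes numpy is available): summing a row-major array along contiguous rows vs. strided columns.

```python
import time
import numpy as np

n = 4000
A = np.random.rand(n, n)          # C (row-major) order by default

t0 = time.time()
total = sum(A[i, :].sum() for i in range(n))   # unit-stride accesses
t_rows = time.time() - t0

t0 = time.time()
total = sum(A[:, j].sum() for j in range(n))   # stride-n accesses
t_cols = time.time() - t0

print(f"rows {t_rows:.3f}s, cols {t_cols:.3f}s")  # columns typically slower
```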
Processor-DRAM Gap (latency)
[Chart: performance vs. time, 1980-2000, log scale. Processor performance (“Moore’s Law”) improves ~60%/year while DRAM improves ~7%/year, so the processor-memory performance gap grows ~50%/year.]
Load Imbalance
• Load imbalance is the time that some processors in the system are idle due to
- insufficient parallelism (during that phase)
- unequal size tasks
• Examples of the latter:
- adapting to “interesting parts of a domain”
- tree-structured computations
- fundamentally unstructured problems
• Algorithm needs to balance load
Measuring Performance
Improving Real Performance
• Peak performance grows exponentially, a la Moore’s Law
- In the 1990s, peak performance increased 100x; in the 2000s, it will increase 1000x
• But efficiency (the performance relative to the hardware peak) has declined
- was 40-50% on the vector supercomputers of the 1990s
- now as little as 5-10% on the parallel supercomputers of today
• Close the gap through:
- mathematical methods and algorithms that achieve high performance on a single processor and scale to thousands of processors
- more efficient programming models and tools for massively parallel supercomputers
[Chart: peak vs. real performance in Teraflop/s, 1996-2004, showing a growing performance gap]
Performance Levels
• Peak advertised performance (PAP)
- You can’t possibly compute faster than this speed
• LINPACK
- The “hello world” program for parallel computing
- Solve Ax=b using Gaussian elimination, highly tuned
• Gordon Bell Prize winning applications performance
- The right application/algorithm/platform combination plus years of work
• Average sustained applications performance
- What one can reasonably expect for standard applications
When reporting performance results, these levels are often confused, even in reviewed publications
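As an illustration of the kind of measurement Linpack reports (a sketch using numpy's dense solver, not the benchmark itself; the 2n³/3 flop count is the standard leading-order count for LU factorization):

```python
import time
import numpy as np

n = 2000
A = np.random.rand(n, n)
b = np.random.rand(n)

t0 = time.time()
x = np.linalg.solve(A, b)          # LU factorization + triangular solves
elapsed = time.time() - t0

flops = 2.0 * n**3 / 3.0           # leading-order flop count for LU
print(f"{flops / elapsed / 1e9:.2f} Gflop/s sustained")
```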
Performance on Linpack Benchmark (www.top500.org)
Performance Levels (for example on NERSC-3)
• Peak advertised performance (PAP): 5 Tflop/s
• LINPACK (TPP): 3.05 Tflop/s
• Gordon Bell Prize winning applications performance : 2.46 Tflop/s
- Material Science application at SC01
• Average sustained applications performance: ~0.4 Tflop/s
- Less than 10% of peak!
Course Organization
Course Mechanics
• This class is listed as both a CS and Engineering class
• Normally a mix of CS, EE, and other engineering and science students
• This class seems to be about:
- 15 grads + 4 undergrads
- X% CS, EE, Other (BioPhys, BioStat, Civil, Mechanical, Nuclear)
• For final projects we encourage interdisciplinary teams
- This is the way parallel scientific software is generally built
Rough Schedule of Topics
• Parallel Programming Models and Machines
- Shared Memory and Multithreading
- Distributed Memory and Message Passing
- Data parallelism
• Parallel languages and libraries
- Shared memory threads and OpenMP
- MPI
- Languages (UPC)
• “Seven Dwarfs” of Scientific Computing
- Dense linear algebra
- Structured grids
- Particle methods
- Sparse matrices
- Unstructured grids
- Spectral methods (FFTs, etc.)
• Applications: biology, climate, combustion, astrophysics, …
• Work: 3 homeworks + 1 final project
Lecture Scribes
• Each student should scribe ~2 lectures
• Next lecture is on single processor performance (and architecture)
• Sign up on the list to choose 1 topic in the first part of the course
Reading Materials
• Some on-line texts:
- Demmel’s notes from CS267 Spring 1999, which are similar to 2000 and 2001. However, they contain links to html notes from 1996.
- http://www.cs.berkeley.edu/~demmel/cs267_Spr99/
- Simon’s notes from Fall 2002
- http://www.nersc.gov/~simon/cs267/
- Ian Foster’s book, “Designing and Building Parallel Programs”.
- http://www-unix.mcs.anl.gov/dbpp/
• Potentially useful texts:
- “Sourcebook for Parallel Computing”, by Dongarra, Foster, Fox, …
- A general overview of parallel computing methods
- “Performance Optimization of Numerically Intensive Codes” by Stefan Goedecker and Adolfy Hoisie
- This is a practical guide to optimization, mostly for those of you who have never done any optimization
• More pointers will be on the web page
What you should get out of the course
In depth understanding of:
• When is parallel computing useful?
• Understanding of parallel computing hardware options.
• Overview of programming models (software) and tools.
• Some important parallel applications and their algorithms
• Performance analysis and tuning
Administrative Information
• Instructors:
- Kathy Yelick, 777 Soda, [email protected]
- TA: Marghoob Mohiyuddin ([email protected])
- Office hours TBD
• Lecture notes are based on previous semester notes:
- Jim Demmel, David Culler, David Bailey, Bob Lucas, Kathy Yelick and Horst Simon
- Much material stolen from others (with sources noted)
• Most class material and lecture notes are at:
- http://www.cs.berkeley.edu/~yelick/cs267_spr07
Extra slides
Transaction Processing
• Parallelism is natural in relational operators: select, join, etc.
• Many difficult issues: data partitioning, locking, threading.
[Chart: transaction throughput (tpmC, 0-25,000) vs. number of processors (0-120), as of March 15, 1996; systems shown: Tandem Himalaya, IBM PowerPC, DEC Alpha, SGI PowerChallenge, HP PA, and others]
SIA Projections for Microprocessors
• Compute power ~ 1/(feature size)³
[Chart: feature size (microns) and millions of transistors per chip vs. year of introduction, 1995-2010; based on F.S. Preston, 1997]
Much of the Performance is from Parallelism
[Chart: performance growth over time, divided into successive eras of bit-level parallelism, instruction-level parallelism, and (next?) thread-level parallelism]
Performance on Linpack Benchmark
[Chart: Linpack Rmax in Gflops (log scale, 0.1-100,000) from June 1993 to June 2004, showing the max, mean, and min Rmax across TOP500 systems; milestones include ASCI Red, ASCI White, and the Earth Simulator. Nov 2004: IBM Blue Gene/L, 70.7 Tflops Rmax. Source: www.top500.org]