Lecture 1 (Overview)

Programming Multi-Core Processors based Embedded Systems: A Hands-On Experience on Cavium Octeon based Platforms


Transcript of Lecture 1 (Overview)

Page 1: Lecture 1 (Overview)

Programming Multi-Core Processors based Embedded Systems

A Hands-On Experience on Cavium Octeon based Platforms

Lecture 1 (Overview)

Page 2: Lecture 1 (Overview)


Course Objectives

A hands-on opportunity to learn:
Multi-core architectures
Programming multi-core systems

Emphasis on programming:
Using the multi-threading paradigm
Understand the complexities
Apply to generic computing/networking problems
Implement on a popular embedded multi-core platform

Page 3: Lecture 1 (Overview)

KICS, UET

Copyright © 2010 Cavium University Program 1-8

Grading Policy and Reference Books

Grading policy:
Lectures (40%)
Labs (50%)
Quizzes (daily) (10%)

Reference material:
Shameem Akhter and Jason Roberts, Multi-Core Programming, Intel Press, 2006
David E. Culler and Jaswinder Pal Singh, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann, 1998
Class notes

Page 4: Lecture 1 (Overview)


Course Outline

Introduction
Parallel architectures and terminology
Context for current interest: multi-core processors
Programming paradigms for multi-core
Octeon processor architecture
Multi-threading on multi-core processors
Applications for multi-core processors
Application layer computing on multi-core
Performance measurement and tuning

Page 5: Lecture 1 (Overview)

An Introduction to Parallel Computing in the Context of Multi-Core Architectures

Page 6: Lecture 1 (Overview)


Developing Software for Multi-Core: A Paradigm Shift

Application developers are typically oblivious of the underlying hardware architecture:
Sequential programs
Automatic/guaranteed performance benefit with each processor upgrade
No work required from the programmer

No "free lunch" with multi-core systems:
Multiple cores in modern processors
Parallel programs needed to exploit parallelism
Parallel computing is now part of the mainstream

Page 7: Lecture 1 (Overview)


Parallel Computing for Main-Stream: Old vs. New Programming Paradigms

Known tools and techniques:
High performance computing and communication (HPCC)
Wealth of existing knowledge about parallel algorithms, programming paradigms, languages and compilers, and scientific/engineering applications

Multi-threading for multi-core:
Common in desktop and enterprise applications
Exploits parallelism of multi-core, with its challenges

New realizations of old paradigms:
Parallel computing on PlayStation 3
Parallel computing on GPUs
Cluster computing for large-volume data

Page 8: Lecture 1 (Overview)


Dealing with the Challenge of Multi-Core Programming with Hands-On Experience

Our objective is two-fold:
Overview the known paradigms for background
Learn using state-of-the-art implementations

Choice of platform for hands-on experience:
Cavium Networks' Octeon processor based systems
Multiple cores (1 to 16)
Suitable for embedded products
Commonly used in networking products
Standard Linux based development environment

Page 9: Lecture 1 (Overview)


Agenda for Today

Parallel architectures and terminology: processor technology trends, architecture trends, taxonomy
Why multi-core architectures? Traditional parallel computing; transition to multi-core architectures
Programming paradigms: traditional, and recent additions

Introduction to Octeon processor based systems

Page 10: Lecture 1 (Overview)

Architecture and Terminology

Background on parallel architectures and commonly used terminology

Page 11: Lecture 1 (Overview)


Architectures and Terminology

Objectives of this section:
Understand the processor technology trends
Realize that parallel architectures evolve based on technology and architecture trends

Terminology used in parallel computing:
Von Neumann architecture
Flynn's taxonomy
Bell's taxonomy
Other commonly used terminology

Page 12: Lecture 1 (Overview)

Processor Technology Trends

Processor technology evolution, Moore's law, ILP, and current trends

Page 13: Lecture 1 (Overview)


Processor Technology Evolution

Increasing number of transistors on a chip
Moore's law: the number of transistors on a chip is expected to double every 18 months
Chip densities are reaching their physical limits
Technological breakthroughs have kept Moore's law alive

Increasing clock rates during the 1990s
Faster and smaller transistors, gates, and circuits on a chip
Clock rates of microprocessors increased by ~30% per year
Benchmark (e.g., SPEC suite) results indicate performance improvement with technology

Page 14: Lecture 1 (Overview)


Moore’s Law

Gordon Moore, co-founder of Intel
1965: since the integrated circuit was invented, the number of transistors per square inch in these circuits had roughly doubled every year, and this trend would continue for the foreseeable future
1975: revised to circuit complexity doubling every 18 months

This was simply a prediction, based on little data
However, it has defined the processor industry

Page 15: Lecture 1 (Overview)


Moore’s Original Law (2)

Source: ftp://download.intel.com/research/silicon/moorespaper.pdf

Page 16: Lecture 1 (Overview)


Moore’s Original Issues

Design cost: still valid
Power dissipation: still valid
What to do with all the functionality possible?

Source: ftp://download.intel.com/research/silicon/moorespaper.pdf

Page 17: Lecture 1 (Overview)


Moore’s Law and Intel Processors

From: http://www.intel.com/technology/silicon/mooreslaw/pix/mooreslaw_chart.gif

Page 18: Lecture 1 (Overview)


Good News: Moore’s Law isn’t done yet

Source: Webinar by Dr. Tim Mattson, Intel Corp.

Page 19: Lecture 1 (Overview)


Bad News: Single-Thread Performance is Falling Off

Page 20: Lecture 1 (Overview)


Worse News: Power (normalized to i486) Trend

Source: Webinar by Dr. Tim Mattson, Intel Corp.

Page 21: Lecture 1 (Overview)


Addressing Power Issues

Source: Webinar by Dr. Tim Mattson, Intel Corp.

Page 22: Lecture 1 (Overview)


Architecture Optimized for Power: a big step in the right direction

Source: Webinar by Dr. Tim Mattson, Intel Corp.

Page 23: Lecture 1 (Overview)


Long term solution: Multi-Core

Source: Webinar by Dr. Tim Mattson, Intel Corp.

Page 24: Lecture 1 (Overview)


Summary of Technology Trends

Moore's law is still relevant
Need to deal with related issues:
Design complexity
Power consumption
Uniprocessor performance is slowing down

Multiple processor cores resolve these issues:
Parallelism at the hardware level
End user is exposed to it
Added complexities related to programming such systems

Page 25: Lecture 1 (Overview)

Taxonomy for Parallel Architectures

Von Neumann, Flynn's, and Bell's taxonomies and other common terminology

Page 26: Lecture 1 (Overview)


Von Neumann Architecture Evolution

(Architecture evolution diagram, starting from the von Neumann architecture:)
Scalar sequential execution, then lookahead (I/E overlap)
Lookahead leads to functional parallelism: multiple functional units and pipelining
Pipelining leads to implicit and explicit vector processing
Explicit vector: memory-to-memory and register-to-register organizations
From there, SIMD (associative processors, processor arrays) and MIMD (multicomputers, multiprocessors), leading to today's massively parallel processors

Page 27: Lecture 1 (Overview)


Pipelining and Parallelism

Instruction prefetch to overlap execution
Functional parallelism supported by:
Multiple functional units
Pipelining

Pipelining:
Pipelined instruction execution
Pipelined arithmetic computations
Pipelined memory access operations
Pipelining is attractive for performing identical operations repeatedly over vector data strings

Page 28: Lecture 1 (Overview)


Flynn’s Classification

Michael Flynn classified architectures in 1972 based on instruction and data streams

Single Instruction stream over a Single Data stream (SISD):
Conventional sequential machines

Single Instruction stream over Multiple Data streams (SIMD):
Vector computers equipped with scalar and vector hardware

Page 29: Lecture 1 (Overview)


Flynn’s Classification (2)

Multiple Instruction streams over a Single Data stream (MISD):
The same data flows through a linear array of processors
Also known as systolic arrays, for pipelined execution of algorithms

Multiple Instruction streams over Multiple Data streams (MIMD):
Suitable model for general purpose parallel architectures

Page 30: Lecture 1 (Overview)


Bell’s Taxonomy for MIMD

Multicomputers:
Multiple address spaces
The system consists of multiple computers, called nodes
Nodes are interconnected by a message-passing network
Each node has its own processor, memory, NIC, and I/O devices

Multiprocessors:
Shared address space
Further classified based on how memory is accessed:
Uniform Memory Access (UMA)
Non-Uniform Memory Access (NUMA)
Cache-Only Memory Architecture (COMA)
Cache-Coherent Non-Uniform Memory Access (cc-NUMA)

Page 31: Lecture 1 (Overview)


Multicomputer Generations

First generation (1983-87):
Processor boards connected in a hypercube architecture
Software-controlled message switching
Examples: Caltech Cosmic Cube, Intel iPSC/1

Second generation (1988-92):
Mesh-connected architecture
Hardware message routing
Software environment for medium-grain distributed computing
Example: Intel Paragon

Third generation (1993-97):
Fine-grain multicomputers
Examples: MIT J-Machine and Caltech Mosaic

Page 32: Lecture 1 (Overview)


Multiprocessor Examples

Distributed memory (scalable):
Dynamic binding of addresses to processors (KSR)
Static binding, caching (Alliant, DASH)
Static program binding (BBN, Cedar)

Central memory (not scalable):
Cross-point or multi-stage (Cray, Fujitsu, Hitachi, IBM, NEC, Tera)
Simple multi-bus (DEC, Encore, NCR, Sequent, SGI, Sun)

Page 33: Lecture 1 (Overview)


Supercomputers

Supercomputers use vector processing and data parallelism

Classified into two categories:
Vector supercomputers
SIMD supercomputers

SIMD machines with massive data parallelism:
The instruction is broadcast to a large number of PEs
Examples: Illiac IV (64 PEs), MasPar MP-1 (16,384 PEs), and CM-2 (65,536 PEs)

Page 34: Lecture 1 (Overview)


Vector supercomputers

Machines with powerful vector processors
If a decoded instruction is a vector operation, it is sent to the vector unit

Register-to-register architecture: Fujitsu VP2000 series
Memory-to-memory architecture: Cyber 205
Pipelined vector supercomputers: Cray Y-MP

Page 35: Lecture 1 (Overview)


Dataflow Architectures

Represent computation as a graph of essential dependences
A logical processor at each node, activated by the availability of operands
Messages (tokens) carrying the tag of the next instruction are sent to the next processor
Tags are compared with others in a matching store; a match fires execution

Page 36: Lecture 1 (Overview)


Dataflow Architectures (2)

Example dataflow graph for: a = (b + 1) × (b − c), d = c × e, f = a × d
(Figure: a token-driven execution pipeline with token store, waiting/matching, instruction fetch, execute, form-token, and token-queue stages, plus a program store, connected to the interconnection network.)

Page 37: Lecture 1 (Overview)


Systolic Architectures

Replace single processor with array of regular processing elements

Orchestrate data flow for high throughput with less memory access

(Figure: a conventional organization, memory feeding a single PE, versus a systolic organization, memory feeding a chain of PEs.)

Page 38: Lecture 1 (Overview)


Systolic Architectures (2)

Different from pipelining:
Nonlinear array structure, multidirectional data flow, each PE may have (small) local instruction and data memory

Different from SIMD: each PE may do something different

Initial motivation: VLSI enables inexpensive special-purpose chips
Represent algorithms directly by chips connected in a regular pattern

Page 39: Lecture 1 (Overview)


Systolic Arrays (Cont'd)

Example: a systolic array for 1-D convolution
Practical realizations (e.g., iWARP) use quite general processors
Enable a variety of algorithms on the same hardware
But dedicated interconnect channels
Data transferred directly from register to register across channels
Specialized, and same problems as SIMD
General purpose systems work well for the same algorithms (locality etc.)

y(i) = w1·x(i) + w2·x(i+1) + w3·x(i+2) + w4·x(i+3)

(Figure: the x values stream through a chain of four cells holding weights w4, w3, w2, w1, while partial results y1, y2, y3 flow in the opposite direction. Each cell computes:
yout = yin + w·xin
xout = x
x = xin)
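For reference, the 4-tap convolution above can also be written as a plain sequential C loop; this is an illustrative sketch of the formula (weights and inputs are made-up values), not the systolic-array implementation:

#include <stdio.h>

#define TAPS 4
#define NX   8                    /* number of input samples (illustrative) */
#define NY   (NX - TAPS + 1)      /* outputs with full support */

int main(void)
{
    double w[TAPS] = {1.0, 2.0, 3.0, 4.0};     /* w1..w4 */
    double x[NX]   = {1, 2, 3, 4, 5, 6, 7, 8};
    double y[NY];

    /* y(i) = w1*x(i) + w2*x(i+1) + w3*x(i+2) + w4*x(i+3) */
    for (int i = 0; i < NY; i++) {
        y[i] = 0.0;
        for (int j = 0; j < TAPS; j++)
            y[i] += w[j] * x[i + j];
        printf("y[%d] = %g\n", i, y[i]);
    }
    return 0;
}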

Page 40: Lecture 1 (Overview)


Cluster of Computers

Started as a poor man's parallel system:
Inexpensive PCs
Inexpensive switched Ethernet
Run-time system to support message passing

Low performance for HPCC applications:
High network I/O latency
Low bandwidth

Suitable for high-throughput applications:
Data center applications
Virtualized resources
Independent threads or processes

Page 41: Lecture 1 (Overview)


Summary of Taxonomy

Multiple taxonomies:
Based on functional parallelism: the Von Neumann and Flynn taxonomies
Based on programming paradigm: Bell's taxonomy

Parallel architecture types:
Multi-computers (distributed address space)
Multi-processors (shared address space)
Multi-core
Multi-threaded
Others: vector, dataflow, systolic, and cluster

Page 42: Lecture 1 (Overview)

Why Multi-Core Architectures

Based on technology and architecture trends

Page 43: Lecture 1 (Overview)


Multi-Core Architectures

Traditional architectures:
Sequential: Moore's law = increasing clock frequency
Parallel: diminishing returns from ILP

Transition to multi-core:
Architecture similar to SMPs
Programming typically SAS

Challenges of the transition:
Performance = efficient parallelization
Selecting a suitable programming paradigm
Performance tuning

Page 44: Lecture 1 (Overview)

Traditional Parallel Architectures

Definition and development tracks

Page 45: Lecture 1 (Overview)


Defining a Parallel Architecture

A sequential architecture is characterized by:
A single processor
A single control flow path

A parallel architecture has:
Multiple processors with an interconnection network
Multiple control flow paths
Communication and synchronization

A parallel computer can be defined as a collection of processing elements that communicate and cooperate to solve large problems fast

Page 46: Lecture 1 (Overview)


Broad Issues in Parallel Architectures

Resource allocation:
How large a collection? How powerful are the elements? How much memory?

Data access, communication and synchronization:
How do the elements cooperate and communicate?
How are data transmitted between processors?
What are the abstractions and primitives for cooperation?

Performance and scalability:
How does it all translate into performance? How does it scale?

Page 47: Lecture 1 (Overview)


General Context: Multiprocessors

A multiprocessor is any computer with several processors

SIMD: single instruction, multiple data
Example: modern graphics cards

MIMD: multiple instructions, multiple data
Example: the Lemieux cluster, Pittsburgh Supercomputing Center

Page 48: Lecture 1 (Overview)


Architecture Development Tracks

Multiple-processor tracks: shared-memory track; message-passing track
Multivector and SIMD tracks
Multithreaded and dataflow tracks
Multi-core track

Page 49: Lecture 1 (Overview)


Shared-Memory Track

Starts with the C.mmp system developed at CMU in 1972:
UMA multiprocessor with 16 PDP-11/40 processors
Connected to 16 shared memory modules via a crossbar switch
Pioneering multiprocessor OS (Hydra) development effort

Illinois Cedar (1987)
IBM RP3 (1985)
BBN Butterfly (1989)
NYU Ultracomputer (1983)
Stanford DASH (1992)
Fujitsu VPP500 (1992)
KSR1 (1990)

Page 50: Lecture 1 (Overview)


Message-Passing Track

The Cosmic Cube (1981) pioneered message-passing computers

Intel iPSCs (1983) and Intel Paragon (1992): medium-grain multicomputers
nCUBE-2 (1990), Mosaic (1992), and MIT J-Machine (1992): fine-grain multicomputers

Page 51: Lecture 1 (Overview)


Multivector Track

CDC 7600 (1970)
Cray 1 (1978)
Cray Y-MP (1989)
Fujitsu, NEC, and Hitachi models
CDC Cyber 205 (1982)
ETA 10 (1989)

Page 52: Lecture 1 (Overview)


SIMD Track

Illiac IV (1968)
Goodyear MPP (1980)
DAP 610 (AMT, Inc., 1987)
CM-2 (TMC, 1990)
BSP (1982)
IBM GF/11 (1985)
MasPar MP-1 (1990)

Page 53: Lecture 1 (Overview)


Multi-Threaded Track

Each processor can execute multiple threads of control at the same time
Multi-threading helps hide long latencies when building large-scale multiprocessors
Multithreaded architecture was pioneered by Burton Smith in the HEP system (1978)
MIT Alewife (1989)
Tera (1990)

Simultaneous Multi-Threading (SMT):
Two hardware threads
Available in Intel processors

Page 54: Lecture 1 (Overview)


Dataflow Track

Instead of the control flow of the von Neumann architecture, dataflow architectures are based on a dataflow mechanism
The dataflow concept was pioneered by Jack Dennis (1974) with a static architecture
This concept later inspired dynamic dataflow:
MIT tagged-token (1980)
Manchester (1982)

Page 55: Lecture 1 (Overview)


Multi-Core Track

Intel dual-core processors
AMD Opteron
IBM Cell processor
Sun processors
Cavium processors
Freescale processors

Page 56: Lecture 1 (Overview)


A Multi-Core Processor is a Special Kind of Multiprocessor

All processors are on the same chip

Multi-core processors are MIMD:
Different cores execute different threads (multiple instructions)
Operating on different parts of memory (multiple data)

Multi-core is a shared memory multiprocessor:
All cores share the same address space

Page 57: Lecture 1 (Overview)


Summary: Why Parallel Architecture?

Increasingly attractive:
Economics, technology, architecture, application demand
Readily available multi-core processor based systems

Increasingly central and mainstream

Parallelism exploited at many levels:
Instruction-level parallelism
Multiple threads of software
Multiple cores
Multiple processors

Focus of this class:
Multiple cores and/or processors
Programming paradigms

Page 58: Lecture 1 (Overview)


Example: Intel Pentium Pro Quad

All coherence and multiprocessing glue in the processor module
Highly integrated, targeted at high volume
Low latency and bandwidth

(Figure: P-Pro processor modules, each containing a CPU, bus interface, and 256-KB L2 cache, sharing the P-Pro bus (64-bit data, 36-bit address, 66 MHz), together with an interrupt controller, PCI bridges with PCI buses and I/O cards, and a memory controller with 1-, 2-, or 4-way interleaved DRAM.)

Page 59: Lecture 1 (Overview)


Example: SUN Enterprise

16 cards of either type: processors + memory, or I/O
All memory accessed over the bus, so symmetric
Higher bandwidth, higher latency bus

(Figure: the Gigaplane bus (256-bit data, 41-bit address, 83 MHz) connects CPU/memory cards, each holding two UltraSPARC processors with their caches and a memory controller, and I/O cards with SBus slots, 2 FiberChannel, 100bT, and SCSI interfaces, all through bus interfaces/switches.)

Page 60: Lecture 1 (Overview)


Example: Cray T3E

Scales up to 1024 processors, 480 MB/s links
The memory controller generates communication requests for nonlocal references
No hardware mechanism for coherence (SGI Origin etc. provide this)

(Figure: each node contains a processor with cache, local memory, and a memory controller/network interface, attached to a switch in the X/Y/Z interconnect; external I/O.)

Page 61: Lecture 1 (Overview)


Example: IBM SP-2

Made out of essentially complete RS/6000 workstations
Network interface integrated on the I/O bus (bandwidth limited by the I/O bus)

(Figure: an IBM SP-2 node with a POWER2 CPU and L2 cache on the memory bus, a memory controller with 4-way interleaved DRAM, and a MicroChannel I/O bus carrying an i860-based NIC with DMA; nodes are joined by a general interconnection network formed from 8-port switches.)

Page 62: Lecture 1 (Overview)


Example: Intel Paragon

(Figure: an Intel Paragon node with two i860 processors and L1 caches, a memory controller with 4-way interleaved DRAM, a network interface, and DMA on a 64-bit, 50 MHz memory bus; nodes attach to a 2D grid network with 8-bit, 175 MHz, bidirectional links, one processing node per switch. Photo: Sandia's Intel Paragon XP/S-based supercomputer.)

Page 63: Lecture 1 (Overview)

Transition to Multi-Core

Selected multi-core processors

Page 64: Lecture 1 (Overview)


Single to Multiple Core Transition

Page 65: Lecture 1 (Overview)


Multi-Core Processor Architectures

Source: Michael Perrone of IBM

Page 66: Lecture 1 (Overview)


IBM POWER 4

Shipped in Dec. 2001
Technology: 180 nm lithography
Dual processor cores
8-way superscalar
Out-of-order execution
2 load/store units
2 fixed-point and 2 floating-point units
>200 instructions in flight
Hardware instruction and data prefetch

Page 67: Lecture 1 (Overview)


IBM POWER 5

Shipped in Aug. 2003
Technology: 130 nm lithography
Dual processor cores
8-way superscalar SMT core: up to 2 virtual cores per physical core
A natural extension of the POWER4 design

Page 68: Lecture 1 (Overview)


IBM Cell Processor

IBM, Sony, and Toshiba alliance in 2000
Based on the 64-bit IBM POWER architecture
9 cores, 10 threads:
1 dual-thread Power Processor Element (PPE) for control
8 Synergistic Processor Elements (SPEs) for processing
Up to 25 GB/s memory bandwidth
Up to 75 GB/s I/O bandwidth
>300 GB/s Element Interconnect Bus bandwidth

Source: Dr. Michael Perrone, IBM

Page 69: Lecture 1 (Overview)


Future Multi-Core Platforms

Page 70: Lecture 1 (Overview)


A Many-Core Example: Intel's 80-Core Test Chip

Page 71: Lecture 1 (Overview)

Programming the Multi-Core

Programming, OS interaction, applications, synchronization, and scheduling

Page 72: Lecture 1 (Overview)


Parallel Computing is Ubiquitous

Over the next few years, all computing devices will be parallel computers:
Servers
Laptops
Cell phones

What about software? Herb Sutter of Microsoft said in Dr. Dobb's Journal:
"The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software"
Software performance will no longer increase from one generation to the next as hardware improves ... unless it is parallel software

Page 73: Lecture 1 (Overview)


Interaction with OS

The OS perceives each core as a separate processor:
Same as for an SMP system
The Linux SMP kernel works on multi-core processors

The OS scheduler maps threads/processes to different cores:
Migration is possible
It is also possible to pin processes to processors

Most major OSes support multi-core today: Windows, Linux, Mac OS X, ...

Page 74: Lecture 1 (Overview)


What Applications Benefit from Multi-Core?

Database servers
Web servers (Web commerce)
Compilers
Multimedia applications
Scientific applications, CAD/CAM

In general, applications with thread-level parallelism (as opposed to instruction-level parallelism), where each thread can run on its own core

Page 75: Lecture 1 (Overview)


More examples

Editing a photo while recording a TV show through a digital video recorder
Downloading software while running an anti-virus program

"Anything that can be threaded today will map efficiently to multi-core"
BUT: some applications are difficult to parallelize

Page 76: Lecture 1 (Overview)


Programming for Multi-Core

Programmers must use threads or processes:
Threads are relevant for business/desktop apps
Multiple processes for complex systems or SPMD-based parallel applications
Spread the workload across multiple cores

The OS maps threads/processes to cores:
Transparent to the programmer

Write parallel algorithms:
True for scientific/engineering applications
The programmer needs to define the mapping to cores

Page 77: Lecture 1 (Overview)


Thread Safety is very important

Pre-emptive context switching: a context switch can happen AT ANY TIME
True concurrency, not just uniprocessor time-slicing
Concurrency bugs are exposed much faster with multi-core

Page 78: Lecture 1 (Overview)


However: Need to use synchronization even if only time-slicing on a uniprocessor

int counter = 0;

void thread1() { int temp1 = counter; counter = temp1 + 1; }

void thread2() { int temp2 = counter; counter = temp2 + 1; }

Page 79: Lecture 1 (Overview)


Need to use synchronization even if only time-slicing on a uniprocessor

Schedule 1 (gives counter = 2):
thread1: temp1 = counter;
thread1: counter = temp1 + 1;
thread2: temp2 = counter;
thread2: counter = temp2 + 1;

Schedule 2 (gives counter = 1):
thread1: temp1 = counter;
thread2: temp2 = counter;
thread1: counter = temp1 + 1;
thread2: counter = temp2 + 1;
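One standard remedy, not shown in the original slides, is to make the read-modify-write atomic with a lock. A minimal POSIX threads sketch of the same counter protected by a mutex:

#include <pthread.h>
#include <stdio.h>

int counter = 0;
pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

/* Each thread increments the shared counter inside a critical section,
   so the load and the store can no longer be interleaved. */
void *increment(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&counter_lock);
    int temp = counter;
    counter = temp + 1;
    pthread_mutex_unlock(&counter_lock);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, increment, NULL);
    pthread_create(&t2, NULL, increment, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %d\n", counter);   /* always 2 now */
    return 0;
}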

Page 80: Lecture 1 (Overview)


Assigning Threads to the Cores

Each thread/process has an affinity mask

Affinity mask specifies what cores the thread is allowed to run on

Different threads can have different masks

Affinities are inherited across fork()

Page 81: Lecture 1 (Overview)


Affinity Masks are Bit Vectors

Example: 4-way multi-core, without SMT

Mask 1011, one bit per core, ordered core 3, core 2, core 1, core 0

The process/thread is allowed to run on cores 0, 1, and 3, but not on core 2
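Reading such a mask in code is just a bit test. A minimal sketch (not from the course material), assuming the mask is kept in an unsigned long with bit i representing core i:

#include <stdio.h>

int main(void)
{
    unsigned long mask = 0xB;   /* binary 1011: cores 0, 1, 3 allowed; core 2 not */

    for (int core = 0; core < 4; core++)
        printf("core %d: %s\n", core,
               (mask & (1UL << core)) ? "allowed" : "not allowed");
    return 0;
}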

Page 82: Lecture 1 (Overview)


Affinity Masks when Multi-Core and SMT are Combined

Separate bits for each simultaneous (SMT) thread

Example: 4-way multi-core, 2 threads per core

Mask 11001011, with two bits (thread 1, thread 0) per core, ordered core 3, core 2, core 1, core 0:
Core 2 can't run the process (both of its bits are 0)
Core 1 can only use one simultaneous thread (only one of its bits is 1)

Page 83: Lecture 1 (Overview)


Default Affinities

The default affinity mask is all 1s: all threads can run on all processors
The OS scheduler then decides what threads run on what core
The OS scheduler detects skewed workloads, migrating threads to less busy processors

Page 84: Lecture 1 (Overview)


Process Migration is Costly

Need to restart the execution pipeline
Cached data is invalidated
The OS scheduler tries to avoid migration as much as possible: it tends to keep a thread on the same core
This is called soft affinity

Page 85: Lecture 1 (Overview)


Hard Affinities

The programmer can prescribe their own affinities (hard affinities)
Rule of thumb: use the default scheduler unless there is a good reason not to

Page 86: Lecture 1 (Overview)


When to Set your own Affinities

Two (or more) threads share data structures in memory:
Map them to the same core so that they can share the cache

Real-time threads:
Example: a thread running a robot controller must not be context-switched, or else the robot can go unstable
Dedicate an entire core just to this thread

Source: Sensable.com

Page 87: Lecture 1 (Overview)


Kernel Scheduler API

#include <sched.h>

int sched_getaffinity(pid_t pid, unsigned int len, unsigned long *mask);

Retrieves the current affinity mask of process 'pid' and stores it into the space pointed to by 'mask'
'len' is the size of the mask buffer in bytes, e.g. sizeof(unsigned long)

Page 88: Lecture 1 (Overview)


Kernel Scheduler API

#include <sched.h>

int sched_setaffinity(pid_t pid, unsigned int len, unsigned long *mask);

Sets the current affinity mask of process 'pid' to *mask
'len' is the size of the mask buffer in bytes, e.g. sizeof(unsigned long)

To query the affinity of a running process:
$ taskset -p 3935
pid 3935's current affinity mask: f
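For reference, current glibc wraps the same calls with the cpu_set_t type and the CPU_* macros. A minimal sketch (illustrative, with error handling kept short) that pins the calling process to core 0 and prints the resulting mask:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;

    CPU_ZERO(&set);          /* clear the mask    */
    CPU_SET(0, &set);        /* allow core 0 only */

    /* pid 0 means "the calling process" */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    /* read the mask back and print which cores are allowed */
    if (sched_getaffinity(0, sizeof(set), &set) == 0) {
        for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
            if (CPU_ISSET(cpu, &set))
                printf("allowed on cpu %d\n", cpu);
    }
    return 0;
}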

Page 89: Lecture 1 (Overview)


Windows Task Manager

(Screenshot: Windows Task Manager showing separate CPU usage graphs for core 1 and core 2.)

Page 90: Lecture 1 (Overview)


Summary

Reasons for multi-core based systems:
Processor technology trends
Architecture trends and diminishing returns
Application trends

Background for using multi-core systems:
Traditional parallel architectures
Experience with parallelization
The transition to multi-core is happening

Challenge: programming the multi-core

Page 91: Lecture 1 (Overview)

Programming Paradigms

Traditional as well as new additions

Page 92: Lecture 1 (Overview)


Programming Paradigms

In the context of multi-core systems:
The architecture is similar to SMPs
SAS programming works
However, message passing is also possible

Traditional paradigms and their use for parallelization:
Explicit message passing (MP)
Shared address space (SAS)
Multi-threading as a parallel computing model

New realizations of traditional paradigms:
Multi-threading on the Sony PlayStation 3
MapReduce
CUDA

Page 93: Lecture 1 (Overview)

Traditional Programming Models

David E. Culler and Jaswinder Pal Singh, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann, 1998

Page 94: Lecture 1 (Overview)


Parallel Programming Models

Message passing:
Each process uses its private address space
Access to non-local address spaces through explicit message passing
Full control of low-level data movement

Data parallel programming:
An extension of message passing
Parallelism based on data decomposition and assignment

Page 95: Lecture 1 (Overview)


Parallel Programming Models (2)

Shared memory programming:
Single address space (SAS)
Hardware manages low-level data movement
The primary issue is coherence

Multi-threading:
An extension of SAS
Commonly used on SMP as well as UP systems

Libraries and building blocks:
Often specific to architectures
Help reduce the development effort

Page 96: Lecture 1 (Overview)


Message Passing

Message passing operations:
Point-to-point unicast
Collective communication
Efficient collective communication can greatly enhance performance

Characteristics of message passing:
Programmer's control over interactions
Customizable to leverage hardware features
Tunable to enhance performance

Page 97: Lecture 1 (Overview)


Distributed Memory Issues

Load balancing:
The programmer has explicit control over load balancing

Data locality:
High performance depends on keeping processors busy
Required data should be kept locally (decomposition and assignment)

Data distribution:
Data distribution affects the computation-to-communication ratio

Low-level data movement and process control:
Data movement results from the algorithmic structure

Page 98: Lecture 1 (Overview)


Data Movement and Process Control

Types of low-level data movements:
Replication
Reduction
Scatter/gather
Parallel prefix, segmented scan
Permutation

Process control involves:
Barrier synchronization
Global conditionals

Data movement results in communication
High performance depends on efficient data movement
Overhead results from processes waiting for data

Page 99: Lecture 1 (Overview)


Message Passing Libraries

Motivation:
Portability: the application need not be reprogrammed on a new platform
Heterogeneity
Modularity
Performance

Implementations:
PVM, Express, P4, PICL, MPICH, etc.
MPI has become a de facto API standard

Page 100: Lecture 1 (Overview)


Data Parallel Programming

This is an extension of message-passing based programming
Instead of explicit message passing, parallelism is implemented implicitly in languages:
Parallelism comes from distributing data structures and control
Implemented through hints for compilers

Page 101: Lecture 1 (Overview)


Data Parallel Programming (2)

Motivations:
Explicit message-passing programs are difficult to write and debug
Data distribution results in computation on local data
Useful for massively parallel distributed-memory machines as well as clusters of workstations

Developed as extensions to existing languages:
Fortran, C/C++, Ada, etc.

Page 102: Lecture 1 (Overview)


High Performance Fortran (HPF)

HPF defines language extensions to Fortran to support data parallel programming
The compiler generates a message-passing program

Features include:
Data distribution and alignment
The purpose is to reduce inter-processor communication
DISTRIBUTE, ALIGN, and PROCESSORS directives

Page 103: Lecture 1 (Overview)


HPF (2)

Features (cont'd):
Parallel statements: provide the ability to express parallelism
FORALL, INDEPENDENT, and PURE directives
Extended intrinsic functions and standard library
EXTRINSIC procedures: a mechanism to use efficient machine-specific primitives

Limitations:
Useful only for Fortran code
Not useful outside of the HPCC domain

Page 104: Lecture 1 (Overview)


Shared Memory Programming

Single address space view of the entire memory:
Logical for the programmer
Easier to use

Parallelism comes from loops:
Requires loop-level dependence analysis
Iterations of a loop that are independent with respect to the loop index can be scheduled on different processors
Fine-grained parallelism

Requires hardware support for memory consistency to avoid overhead

Page 105: Lecture 1 (Overview)


Shared Memory Programming (2)

Performance tuning is non-trivial:
Data locality is critical
Increase the grain size by parallelizing the outer-most loop
Requires inter-procedural dependence analysis
Very few parallelizing compilers can analyze and parallelize even reasonably sized application codes

Simplifies parallel programming:
Sequential code is directly usable, with compiler hints for parallel loops
OpenMP standardizes the compiler directives for shared memory parallel programming
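As an illustration of the directive style that OpenMP standardizes (a sketch, not course code), a parallel loop and a reduction in C:

#include <omp.h>
#include <stdio.h>

#define N 1000

int main(void)
{
    static double a[N];
    double sum = 0.0;

    /* independent iterations: distributed across cores by the compiler/runtime */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * i;

    /* the reduction clause handles the shared accumulator safely */
    #pragma omp parallel for reduction(+ : sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %g\n", sum);
    return 0;
}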

Page 106: Lecture 1 (Overview)


Libraries and Building Blocks

Pre-implemented parallel code that can serve as kernels of larger, well-known parallel applications
Several numerical methods are reusable in various applications

Examples: linear equation solvers, partial differential equations, matrix based operations, etc.

Page 107: Lecture 1 (Overview)


Libraries and Building Blocks (2)

Low-level operations, such as data movement, can be used for efficient matrix operations:
Matrix transpose
Matrix multiply
LU decomposition

These building blocks can be implemented in distributed-memory numerical libraries
Examples:
ScaLAPACK: dense and banded matrices
Massively parallel LINPACK

Page 108: Lecture 1 (Overview)


Summary: Traditional Paradigms

Message-passing based programming:
Suitable for distributed-memory multicomputers
Multiple message-passing libraries exist
MPI is a de facto standard

Shared address space programming:
Suitable for SMPs
Based on low-granularity, loop-level parallelism
Implemented through compiler directives
OpenMP is a de facto standard

Page 109: Lecture 1 (Overview)

Threading as a parallel programming model

Types of applications and threading environments

Page 110: Lecture 1 (Overview)


Parallel Programming with Threads

Two types of applications:
Compute-intensive applications (HPCC systems)
Throughput-intensive applications (distributed systems)

Threads help in different ways:
Dividing data reduces the work per thread
Distributing work enables concurrency

Programming languages that support threads:
C/C++ or C#
Java
OpenMP directives for C/C++ code

Page 111: Lecture 1 (Overview)


Virtual Environments

VMs and platforms

Runtime virtualization

System virtualization

Page 112: Lecture 1 (Overview)


Systems with Virtual Machine Monitors

Page 113: Lecture 1 (Overview)


Summary: Threading for Parallelization

Threads can exploit parallelism on multi-core:
Scientific and engineering applications
High-throughput applications

Shared memory programming realized through:
Pthreads
OpenMP

Challenges of multi-threading:
Synchronization
Memory consistency is the dark side of shared memory

Page 114: Lecture 1 (Overview)

Octeon Processor Architecture

Target system processor and system architecture, and the programming environment

Page 115: Lecture 1 (Overview)


Cavium Octeon Processor

MIPS64 architecture
Up to 16 cores
Shared L2 cache
Security accelerator
Used in network security gateways

Source: http://www.caviumnetworks.com/OCTEON_CN38XX_CN36XX.html

Page 116: Lecture 1 (Overview)


Key Takeaways for Today’s Session

Multi-core systems are here to stay:
Processor technology trends
Application performance requirements

Programming is a challenge:
No "free lunch"
Programmers need to parallelize
Extra effort to tune the performance

Programming paradigms:
Message passing
Shared address space based programming
Multi-threading (the focus for this course)