
24 May, 2005 SUPERCHIP Evaluation Hearing

FP6-2004-IST-4 FET Proactive Initiative ACA

SUPERcomputing on a CHIP: SUPERCHIP
Proposal Number 26888

Kari Tiensyrjä, Senior Research Scientist, VTT

Jesper Larsson Träff, Senior Principal Researcher, NEC Europe

Ben Juurlink, Professor, Delft University of Technology

Ian Phillips, Prof., Principal Staff Engineer, ARM

VTT TECHNICAL RESEARCH CENTRE OF FINLAND

1. Paths to exploitation

• FET project with potential for application breakthroughs in a 10+ years horizon

• Industrial Partners (NEC, ARM, Intel) cover a wide spectrum of application domains and provide:
  • Steering of scientific and technological research
  • Transfer of knowledge and results to and interplay with company design groups
  • Proposition to standardization bodies, where relevant (B.3.6)

• Active promotion of results (T6.1 and T6.2):
  • High-profile scientific and applied conferences and journals
  • Organization of workshops
  • PhD courses and summer schools, incorporation into advanced curricula
  • Links to NoEs

• WP6 (led by Intel): dissemination and exploitation (also: B.3.3, B.4.1.7, and B.8.2.6)
  • T6.3 for technology transfer
  • T6.4 for exploitation

2. Target applications

• Wide range of applications with high computational requirements will be considered

• WP4 will analyse and identify applications, and selected sample applications will be implemented as proof-of-concept

• An initial set of applications considered:
  • Desktops and servers (versatility from high-performance/single-application to high-throughput application suites)
    • Streaming and DSP applications, e.g. video in bandwidth constrained active networks and embedded 3D graphics
    • Real-time speech recognition and videoconferencing
    • Database applications, string processing, geographical information processing
  • Supercomputer (high-performance)
    • Vectorised CFD Boltzmann automata
    • MPI-parallelised finite element methods
    • Quantum Chromodynamics
  • Mobile devices (energy-efficiency)
    • PDA, HDTV
    • Games, virtual reality

3. Leading contenders within the proposal

• Objectives: boost performance by 2-3 orders of magnitude (compared at the same transistor count), exploit parallelism at all levels, realise an easy-to-use, strong model of computing, and provide scalability, a wide application area and power saving techniques

Eclipse
- Scalable NOC with EREW PRAM model
- Simultaneous ILP-TLP exploitation
- Cacheless memory
- Regular structure

XMT
- CMP with PRAM-like but more asynchronous model
- SMT + synchronization mechanism
- On-chip caches

CMP
- Shared memory using caches + advanced cache coherency protocols

TTA/PISMA
- Tiled architecture with virtual shared memory communication
- Very simple and strongly decentralized organization

TRIPS
- Single chip reconfigurable processor / memory architecture
- Grids of ALUs connected via operand networks
- Static spatial scheduling

[Figure: block diagrams of candidate architectures. Legend: P = processor core, S = switch/router, I = instruction memory module, M = data memory module, I/O = input/output device. One diagram also labels Super-Tile Interface Units (STI Units).]

3. Leading contenders within the proposal (cont)

• Initial choice of architectures is partially guided by application requirements:
  • Eclipse and XMT: general purpose computing, embedded computing
  • Advanced CMP: high-throughput desktop and server machines
  • TTA/PISMA: streaming/DSP
  • TRIPS: HPC, streaming/DSP, threaded servers

• Procedure to choose the initial SUPERCHIP architecture:
  1. Develop an architecture evaluation framework (T1.1)
  2. Develop semi-analytical power/performance/cost models (T5.1)
  3. Develop/modify existing simulators for the architectures (T5.2)
  4. Design benchmark programs for the architectures (T4.1)
  5. Perform evaluation + identify strong/weak points + select (T1.1)

• Preliminary criteria:
  • Power, performance, cost (silicon area)
  • Estimated scalability, PRAM-like model support, ease of programming
  • Estimated coverage for aimed application area, TLP-ILP co-exploitation
  • Potential for solving the rest of the problems
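
The final selection step (evaluate, identify strong/weak points, select) could be organised as a weighted comparison over the preliminary criteria. The sketch below is only an illustration of that idea: the slides list the criteria but give no weights or scoring method, so the weights, the 0..1 score scale, the function names and the candidate entries are all hypothetical placeholders, not evaluation results.

```python
# Hypothetical weights: the proposal names the criteria but not their weights.
WEIGHTS = {
    "power": 0.2,
    "performance": 0.3,
    "cost": 0.2,
    "scalability": 0.15,
    "ease_of_programming": 0.15,
}


def weighted_score(scores, weights=WEIGHTS):
    """Combine normalised criterion scores (0..1, higher is better) into one number."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(weights[c] * scores[c] for c in weights) / total_weight


def rank(candidates, weights=WEIGHTS):
    """Return candidate names sorted from best to worst weighted score."""
    return sorted(
        candidates,
        key=lambda name: weighted_score(candidates[name], weights),
        reverse=True,
    )


# Placeholder inputs only: these numbers are NOT measurements of any architecture.
candidates = {
    "candidate_A": {"power": 0.5, "performance": 0.9, "cost": 0.4,
                    "scalability": 0.8, "ease_of_programming": 0.7},
    "candidate_B": {"power": 0.8, "performance": 0.5, "cost": 0.7,
                    "scalability": 0.5, "ease_of_programming": 0.4},
}

if __name__ == "__main__":
    for name in rank(candidates):
        print(name, round(weighted_score(candidates[name]), 3))
```

In practice the scores would come from the models and simulators of T5.1 and T5.2, and a single weighted sum is only one of several ways to combine them.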

4. Ensuring HW implementation technologies impact on choice of scalable architecture

• Scalability issues are observed in initial selection of candidate architectures
  • Mesh-like topologies (providing constant wire length links): Eclipse, CMP, TTA, TRIPS
  • Regular structures: Eclipse, CMP, TTA, TRIPS
  • No forwarding networks (Eclipse) or multistage forwarding networks (TRIPS)
  • No cache coherency mechanisms: Eclipse
  • Multithreading: Eclipse, XMT
  • Decentralized structure: Eclipse, CMP, TTA, TRIPS

• Semi-analytical modeling of the architectures and candidate techniques (T5.1)
  • Analytical parametric power/performance/cost estimation models
  • Hardware implementation parameters are extracted from
    • Technology roadmaps, e.g. ITRS
    • Pragmatic experience and knowledge of industrial partners
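
As an illustration of what an analytical parametric power/performance/cost model might look like, here is a minimal sketch. It is not taken from the proposal. It assumes the standard CMOS switching-power estimate (alpha * C * V^2 * f), Amdahl's law for speedup, and power and area that grow linearly with the number of tiles. All function names, parameter names and numeric values are hypothetical.

```python
def dynamic_power(activity, capacitance_f, vdd, freq_hz):
    """Standard CMOS switching-power estimate in watts: alpha * C * V^2 * f."""
    return activity * capacitance_f * vdd ** 2 * freq_hz


def amdahl_speedup(parallel_fraction, tiles):
    """Amdahl's law: speedup when only `parallel_fraction` of the work scales."""
    if not 0.0 <= parallel_fraction <= 1.0 or tiles < 1:
        raise ValueError("need 0 <= parallel_fraction <= 1 and tiles >= 1")
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / tiles)


def estimate(tiles, tile_area_mm2, tile_power_w, parallel_fraction):
    """Return (speedup, total power in W, total area in mm^2) for identical tiles.

    Assumes power and area grow linearly with the tile count, which ignores
    leakage, global wiring and off-chip I/O.
    """
    speedup = amdahl_speedup(parallel_fraction, tiles)
    return speedup, tiles * tile_power_w, tiles * tile_area_mm2


if __name__ == "__main__":
    # Placeholder technology parameters, not values from ITRS or any partner.
    tile_power = dynamic_power(activity=0.1, capacitance_f=1e-9, vdd=1.0, freq_hz=1e9)
    for tiles in (16, 64, 256):
        print(tiles, estimate(tiles, tile_area_mm2=2.0, tile_power_w=tile_power,
                              parallel_fraction=0.95))
```

The point of keeping the model parametric is that the technology inputs (capacitance, supply voltage, frequency, tile area) can be swapped for roadmap values without changing the model structure.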

4. Ensuring HW implementation technology impact on our choice of scalable architecture (cont)

• Architectural simulation (T5.2)
  • Develop/modify existing simulators
  • Benchmarks
  • Sample applications
  • Information on execution time, resource utilization and power consumption is extracted
  • Modeling of the critical parts of architectures

• Feasibility analysis of candidate architectures
  • Studies on fault tolerance, clocking schemes, on-chip/off-chip communication, power saving and other implementation related issues for the SUPERCHIP architecture (T5.3)
  • Detailed modeling and feasibility assessment of critical parts of the SUPERCHIP architecture (T5.4)

5. Evolvement of the PRAM model for the candidate architectures

• Architectural requirements:
  • Synchronization: implicit after each instruction
  • Bandwidth: high bisection to handle random communication
  • Latency: communication/memory access latency should be hidden

• For ease-of-programming the SUPERCHIP programming model will be based on a PRAM-like model, considering
  • Relaxed synchronization (BSP-like)
  • Strong memory semantics (CRCW-like, built-in operators)
  • Potential for locality exploitation (memory, Hierarchical-PRAM)

• SUPERCHIP will develop the necessary architectural support for this model

• SUPERCHIP will not investigate PRAM-implementation on distributed memory architectures in general

• Long-term research issue: Evolution of programming model and architecture to SUPERCHIP constellations
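
To make the requirement "synchronization: implicit after each instruction" concrete, the sketch below simulates a PRAM-style machine in plain Python. It is a toy model, not the SUPERCHIP programming model: every processor executes the same instruction against a snapshot of shared memory, and all writes become visible together before the next instruction starts. Concurrent writes to one address are rejected here (an exclusive-write assumption chosen for simplicity). The names `pram_step` and `prefix_sum` are hypothetical.

```python
def pram_step(memory, num_procs, instruction):
    """Run one instruction on every processor against the same memory snapshot.

    All processors read the state left by the previous step, and their writes
    are applied together afterwards. This models the implicit synchronization
    that a PRAM provides after each instruction.
    """
    snapshot = list(memory)
    writes = {}
    for pid in range(num_procs):
        result = instruction(pid, snapshot)
        if result is None:
            continue  # this processor is idle in this step
        addr, value = result
        if addr in writes:
            raise RuntimeError(f"concurrent write to address {addr}")
        writes[addr] = value
    for addr, value in writes.items():
        memory[addr] = value


def prefix_sum(values):
    """Inclusive prefix sum in about log2(n) synchronous steps, one processor per element."""
    memory = list(values)
    n = len(memory)
    distance = 1
    while distance < n:
        def add_left(pid, snap, d=distance):
            # Processor pid adds the element d positions to its left.
            if pid >= d:
                return pid, snap[pid] + snap[pid - d]
            return None
        pram_step(memory, n, add_left)
        distance *= 2
    return memory


if __name__ == "__main__":
    print(prefix_sum([1, 2, 3, 4, 5]))
```

The programmer never writes a barrier in `prefix_sum`; correctness relies entirely on the step-level synchronization, which is the ease-of-programming argument for a PRAM-like model. A BSP-like relaxation would synchronize after groups of instructions instead of after each one.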

5. Evolvement of the PRAM model for the candidate architectures (cont)

Candidate   Synchronization                                Bisection bandwidth   Latency hiding                   Initial model
Eclipse     synchronization wave, fast barrier mechanism   P/2                   super-pipelined multithreading   EREW PRAM
XMT         hardware synchronization                       ?                     caches                           PRAM-like
CMP         software synchronization                       sqrt(P)               caches                           NUMA
TTA/PISMA   software synchronization                       sqrt(P)               caches                           NUMA
TRIPS       software synchronization                       sqrt(P)               caches                           NUMA
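
The two bisection bandwidth figures in the table, P/2 and sqrt(P), scale very differently as the processor count P grows. The sketch below tabulates both, together with the share of bisection bandwidth available per processor. It assumes bandwidth is counted in units of one link, which the slide does not state; the function names are hypothetical.

```python
import math


def bisection_links(num_procs, scaling):
    """Bisection bandwidth in link units for the two scalings named in the table."""
    if scaling == "P/2":
        return num_procs / 2
    if scaling == "sqrt(P)":
        return math.sqrt(num_procs)
    raise ValueError(f"unknown scaling: {scaling}")


def per_processor_share(num_procs, scaling):
    """Bisection bandwidth divided by the number of processors."""
    return bisection_links(num_procs, scaling) / num_procs


if __name__ == "__main__":
    for p in (16, 64, 256, 1024):
        print(p,
              bisection_links(p, "P/2"), per_processor_share(p, "P/2"),
              bisection_links(p, "sqrt(P)"), per_processor_share(p, "sqrt(P)"))
```

With P/2 scaling the per-processor share stays constant at 0.5, while with sqrt(P) scaling it shrinks as 1/sqrt(P). This is why the requirement of high bisection bandwidth for random communication gets harder to meet as P grows.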

6. Validation and assessment of the performance scalability of the final choice of HW/SW architecture

• Analytically through parametric power/performance/cost models

• Empirically through simulations
  • Benchmark kernels and sample applications
    • Scalable benchmark suite for fine-grained shared memory architecture
    • Standard benchmark suites
    • Sample applications
  • Parametric architecture simulations

• By comparing to future alternative approaches (e.g. advanced CMPs) and theoretical machines (e.g. ideal PRAM) using the applications and benchmarks

7. Plan for identifying the requirements for the OS within the resources of the work plan

• Goal is to identify requirements and implement core OS services to demonstrate the validity of the architectural approach, but not to develop a full-fledged OS (as stated in B.4.1.5):
  • Requirements from underlying architecture and applications
  • Resource management (process, thread and memory)
  • Runtime functions and services for applications

• Input for identifying requirements will come from several other tasks including T1.2, T1.3, T2.2 and T3.3

• OS is not in charge of supporting distributed shared memory

• Certain OS functionality will be covered by the compiler's run-time system

• Task leader of OS task (T4.3, ULM) has developed a distributed operating system (Plurix) which provides an excellent basis

7. Plan for identifying the requirements for the OS within the resources of the work plan (cont)

• Preliminary anticipated OS requirements
  • Dynamic process/thread scheduling
  • Memory management (physical and virtual)
  • Synchronization including inter-process communication
  • Support for power management and IO

• Definition
  • A coarse-grain functional model of OS will be developed and validated through simulation
  • Definition of API in SUPERCHIP language (or pseudo-language in the early phase)

• Implementation
  • Using the SUPERCHIP language and compiler (from T2.2 and T3.3)
  • Testing with architecture simulation tools (from T5.2)

Feasible with the allocated resources and partners