Tile Processors: Many-Core for Embedded and Cloud Computing Richard Schooler VP Software Engineering...

35
Tile Processors: Many-Core for Embedded and Cloud Computing Richard Schooler VP Software Engineering Tilera Corporation [email protected]

Transcript of Tile Processors: Many-Core for Embedded and Cloud Computing Richard Schooler VP Software Engineering...

Page 1: Tile Processors: Many-Core for Embedded and Cloud Computing Richard Schooler VP Software Engineering Tilera Corporation rschooler@tilera.com.

Tile Processors: Many-Core for Embedded

and Cloud Computing

Richard SchoolerVP Software Engineering

Tilera [email protected]

Page 2: Tile Processors: Many-Core for Embedded and Cloud Computing Richard Schooler VP Software Engineering Tilera Corporation rschooler@tilera.com.

HPEC, 15 September 2010© 2010 Copyright Tilera Corporation. All Rights Reserved.2

Exploiting Natural Parallelism

High-performance applications have lots of parallelism!– Embedded apps:

• Networking: packets, flows• Media: streams, images, functional & data parallelism

– Cloud apps:• Many clients: network sessions• Data mining: distributed data & computation

Lots of different levels: – SIMD (fine-grain data parallelism)– Thread/process (medium-grain task parallelism)– Distributed system (coarse-grain job parallelism)

Page 3: Tile Processors: Many-Core for Embedded and Cloud Computing Richard Schooler VP Software Engineering Tilera Corporation rschooler@tilera.com.

HPEC, 15 September 2010© 2010 Copyright Tilera Corporation. All Rights Reserved.3

Every one is going Manycore, but can the architecture scale?

The computing world is ready for radical change

Time

#C

ore

s

2010

1

2006

2

4

32

20202005

Intel

Sun IBM Cell

nCores

Larrabee

Performance&

Performance/W

Gap

2014

Perfo

rman

ce

Page 4: Tile Processors: Many-Core for Embedded and Cloud Computing Richard Schooler VP Software Engineering Tilera Corporation rschooler@tilera.com.

HPEC, 15 September 2010© 2010 Copyright Tilera Corporation. All Rights Reserved.4

Key “Many-Core” Challenges: The 3 P’s

Performance challenge– How to scale from 1 to 1000 cores – the number of

cores is the new Megahertz

Power efficiency challenge– Performance per watt is the new metric – systems are

often constrained by power & cooling

Programming challenge– How to provide a converged many core solution in a

standard programming environment

Page 5: Tile Processors: Many-Core for Embedded and Cloud Computing Richard Schooler VP Software Engineering Tilera Corporation rschooler@tilera.com.

HPEC, 15 September 2010© 2010 Copyright Tilera Corporation. All Rights Reserved.5

“Problems cannot be solved by the samelevel of thinking that created them.”

Current technologies fail to deliver– Incremental performance increase– High power– Low level of Integration– Increasingly bigger cores

We need to have a new thinking to get– 10 x performance– 10 x performance per watt – Converged computing– Standard programming models

Page 6: Tile Processors: Many-Core for Embedded and Cloud Computing Richard Schooler VP Software Engineering Tilera Corporation rschooler@tilera.com.

HPEC, 15 September 2010© 2010 Copyright Tilera Corporation. All Rights Reserved.6

Stepping Back: How Did We Get Here?

Moore’s Conundrum:

More devices =>? More performance

Old answers: More complex cores; bigger caches– But power-hungry

New answers: More cores– But do conventional approaches scale?

Diminishing returns!

Page 7: Tile Processors: Many-Core for Embedded and Cloud Computing Richard Schooler VP Software Engineering Tilera Corporation rschooler@tilera.com.

HPEC, 15 September 2010© 2010 Copyright Tilera Corporation. All Rights Reserved.7

The Old Challenge: CPU-on-a-chip

20 MIPS CPUin 1987

Few thousand gates

Page 8: Tile Processors: Many-Core for Embedded and Cloud Computing Richard Schooler VP Software Engineering Tilera Corporation rschooler@tilera.com.

HPEC, 15 September 2010© 2010 Copyright Tilera Corporation. All Rights Reserved.8

What to do with all those transistors?

The Opportunity: Billions of Transistors

Old CPU:

Page 9: Tile Processors: Many-Core for Embedded and Cloud Computing Richard Schooler VP Software Engineering Tilera Corporation rschooler@tilera.com.

HPEC, 15 September 2010© 2010 Copyright Tilera Corporation. All Rights Reserved.9

ASICs have high performance and low power• Custom-routed, short wires• Lots of ALUs, registers, memories – huge on-chip parallelism

memmem

mem

mem

mem

Take Inspiration from ASICs

But how to build a programmable chip?

Page 10: Tile Processors: Many-Core for Embedded and Cloud Computing Richard Schooler VP Software Engineering Tilera Corporation rschooler@tilera.com.

HPEC, 15 September 2010© 2010 Copyright Tilera Corporation. All Rights Reserved.10

Replace Long Wires with Routed Interconnect

Ctrl

[IEEE Computer ’97]

Page 11: Tile Processors: Many-Core for Embedded and Cloud Computing Richard Schooler VP Software Engineering Tilera Corporation rschooler@tilera.com.

HPEC, 15 September 2010© 2010 Copyright Tilera Corporation. All Rights Reserved.11

ALU

ALU

ALU

ALU

ALU

ALU

ALU

ALU

ALU

ALU

ALU

ALU

ALU

ALU

ALU

ALUBypass Net

RF

From Centralized Clump of CPUs …

Page 12: Tile Processors: Many-Core for Embedded and Cloud Computing Richard Schooler VP Software Engineering Tilera Corporation rschooler@tilera.com.

HPEC, 15 September 2010© 2010 Copyright Tilera Corporation. All Rights Reserved.12

AL

UA

LU A

LU A

LU

AL

U

AL

UR

Scalar Operand Network (SON) [TPDS 2005]

… To Distributed ALUs, Routed Bypass Network

Page 13: Tile Processors: Many-Core for Embedded and Cloud Computing Richard Schooler VP Software Engineering Tilera Corporation rschooler@tilera.com.

HPEC, 15 September 2010© 2010 Copyright Tilera Corporation. All Rights Reserved.13

From a Large Centralized Cache…

Page 14: Tile Processors: Many-Core for Embedded and Cloud Computing Richard Schooler VP Software Engineering Tilera Corporation rschooler@tilera.com.

HPEC, 15 September 2010© 2010 Copyright Tilera Corporation. All Rights Reserved.14

…to a Distributed Shared Cache

ALU

ALU

ALU

ALU

ALU

ALU

R

$

[ISCA 1999]

Page 15: Tile Processors: Many-Core for Embedded and Cloud Computing Richard Schooler VP Software Engineering Tilera Corporation rschooler@tilera.com.

HPEC, 15 September 2010© 2010 Copyright Tilera Corporation. All Rights Reserved.15

Distributed Everything + Routed Interconnect Tiled Multicore

AL

UA

LU A

LU A

LU

AL

U

AL

U

R

$

Each tile is a processor, so programmable

Page 16: Tile Processors: Many-Core for Embedded and Cloud Computing Richard Schooler VP Software Engineering Tilera Corporation rschooler@tilera.com.

HPEC, 15 September 2010© 2010 Copyright Tilera Corporation. All Rights Reserved.16

Tiled Multicore Captures ASIC Benefits and is Programmable

Scales to large numbers of cores Modular – design and verify 1 tile Power efficient

– Short wires plus locality opts –

CV2f– Chandrakasan effect, more cores at

lower freq and voltage – CV2f

ProcessorCore

= TileCore + Switch

SCurrent Bus Architecture

Page 17: Tile Processors: Many-Core for Embedded and Cloud Computing Richard Schooler VP Software Engineering Tilera Corporation rschooler@tilera.com.

HPEC, 15 September 2010© 2010 Copyright Tilera Corporation. All Rights Reserved.17

Tilera processor portfolio Demonstrating the scale of many-core

200920082007 2010

TILEPro36TILEPro36

TILE64TILE64

TILEPro64TILEPro64

Gx64 & Gx100Up to 8x performance

Gx64 & Gx100Up to 8x performance

Gx16 & Gx362x the performance

Gx16 & Gx362x the performance

. . .

TILE-Gx100100 cores

TILE-Gx6464 cores

TILE-Gx3636 cores

TILE-Gx1616 cores

Page 18: Tile Processors: Many-Core for Embedded and Cloud Computing Richard Schooler VP Software Engineering Tilera Corporation rschooler@tilera.com.

HPEC, 15 September 2010© 2010 Copyright Tilera Corporation. All Rights Reserved.18

Memory Controller (DDR3)Memory Controller (DDR3) Memory Controller (DDR3)Memory Controller (DDR3)

Memory Controller (DDR3)Memory Controller (DDR3) Memory Controller (DDR3)Memory Controller (DDR3)

mP

IPE

mP

IPE

1.2GHz – 1.5GHz 32 MBytes total cache 546 Gbps peak mem BW 200 Tbps iMesh BW

80-120 Gbps packet I/O– 8 ports XAUI / 2 XAUI– 2 40Gb Interlaken– 32 ports 1GbE (SGMII)

80 Gbps PCIe I/O– 3 StreamIO ports (20Gb)

Wire-speed packet eng.– 120Mpps

MiCA engines:– 40 Gbps crypto– compress & decompress

FlexibleI/O

FlexibleI/O

UART x2, USB x2,JTAG, I2C, SPI

UART x2, USB x2,JTAG, I2C, SPI

MiCAMiCA

MiCAMiCA

Ser

De

sS

erD

es

PCIe 2.08-lane

PCIe 2.08-lane

Ser

De

sS

erD

es

PCIe 2.04-lane

PCIe 2.04-lane

Ser

De

sS

erD

es

PCIe 2.08-lane

PCIe 2.08-lane

Inte

rla

ken

Inte

rla

ken

Inte

rla

ken

Inte

rla

ken

10 GbEXAUI

10 GbEXAUI S

erD

es

Ser

De

s4x GbESGMII

10 GbEXAUI

10 GbEXAUI S

erD

es

Ser

De

s4x GbESGMII

10 GbEXAUI

10 GbEXAUI S

erD

es

Ser

De

s4x GbESGMII

10 GbEXAUI

10 GbEXAUI S

erD

es

Ser

De

s4x GbESGMII

10 GbEXAUI

10 GbEXAUI S

erD

es

Ser

De

s4x GbESGMII

10 GbEXAUI

10 GbEXAUI S

erD

es

Ser

De

s4x GbESGMII

10 GbEXAUI

10 GbEXAUI S

erD

es

Ser

De

s4x GbESGMII

10 GbEXAUI

10 GbEXAUI S

erD

es

Ser

De

s4x GbESGMII

TILE-Gx100™: Complete System-on-a-Chip with 100 64-bit cores

Page 19: Tile Processors: Many-Core for Embedded and Cloud Computing Richard Schooler VP Software Engineering Tilera Corporation rschooler@tilera.com.

HPEC, 15 September 2010© 2010 Copyright Tilera Corporation. All Rights Reserved.19

36 Processor Cores 866M, 1.2GHz, 1.5GHz clk 12 MBytes total cache

40 Gbps total packet I/O – 4 ports 10GbE (XAUI)– 16 ports 1GbE (SGMII)

48 Gbps PCIe I/O– 2 16Gbps Stream IO ports

Wire-speed packet engine– 60Mpps

MiCA engine:– 20 Gbps crypto– Compress & decompress

FlexibleI/O

FlexibleI/O

UARTx2, USBx2,JTAG, I2C, SPI

UARTx2, USBx2,JTAG, I2C, SPI

MiCAMiCA

mP

IPE

mP

IPE

10 GbEXAUI

10 GbEXAUI

Ser

Des

Ser

Des

4x GbESGMII

10 GbEXAUI

10 GbEXAUI

Ser

Des

Ser

Des

4x GbESGMII

10 GbEXAUI

10 GbEXAUI

Ser

Des

Ser

Des

4x GbESGMII

10 GbEXAUI

10 GbEXAUI

Ser

Des

Ser

Des

4x GbESGMII

Ser

Des

Ser

Des PCIe 2.0

8-lanePCIe 2.08-lane

Ser

Des

Ser

Des

PCIe 2.04-lane

PCIe 2.04-lane

Ser

Des

Ser

Des PCIe 2.0

4-lanePCIe 2.04-lane

Memory Controller (DDR3)Memory Controller (DDR3)

Memory Controller (DDR3)Memory Controller (DDR3)

TILE-Gx36™: Scaling to a broad range of applications

Page 20: Tile Processors: Many-Core for Embedded and Cloud Computing Richard Schooler VP Software Engineering Tilera Corporation rschooler@tilera.com.

HPEC, 15 September 2010© 2010 Copyright Tilera Corporation. All Rights Reserved.20

Full-Featured General Converged Cores

Processor– Each core is a complete computer– 3-way VLIW CPU– SIMD instructions: 32, 16, and 8-bit ops– Instructions for video (e.g., SAD) and

networking– Protection and interrupts

Memory– L1 cache and L2 Cache– Virtual and physical address space– Instruction and data TLBs– Cache integrated 2D DMA engine

Runs SMP Linux Runs off-the-shelf C/C++ programs Signal processing and general apps

Core

Cache

16K L1-I

64K L2

I-TLB

D-TLB

2DDMA8K L1-D

Register File

Three Execution Pipelines

TerabitSwitch

Page 21: Tile Processors: Many-Core for Embedded and Cloud Computing Richard Schooler VP Software Engineering Tilera Corporation rschooler@tilera.com.

HPEC, 15 September 2010© 2010 Copyright Tilera Corporation. All Rights Reserved.21

Software must complement the hardware Enable re-use of existing code-bases

Standards-based development environment– e.g. gcc, C, C++, Java, Linux– Comprehensive command-line & GUI-based tools

Support multiple OS models– One OS running SMP– Multiple virtualized OS’s with protection– Bare metal or “zero-overhead” with background OS environment

Support range of parallel programming styles– Threaded programming (pThreads, TBB)– Run-to-Completion with load-balancing– Decomposition & Pipelining– Higher-level frameworks (Erlang, OpenMP, Hadoop etc.)

Page 22: Tile Processors: Many-Core for Embedded and Cloud Computing Richard Schooler VP Software Engineering Tilera Corporation rschooler@tilera.com.

HPEC, 15 September 2010© 2010 Copyright Tilera Corporation. All Rights Reserved.22

Software Roadmap

Standards & open source integration– Compiler: gcc, g++ 4.4+– Linux:

• Kernel: Tile architecture integrated to 2.6.36• User-space: glibc, broader set of standard

packages Extended programming and runtime

environments– Java: porting OpenJDK– Virtualization: porting KVM

Page 23: Tile Processors: Many-Core for Embedded and Cloud Computing Richard Schooler VP Software Engineering Tilera Corporation rschooler@tilera.com.

HPEC, 15 September 2010© 2010 Copyright Tilera Corporation. All Rights Reserved.23

Tile architecture: The future of many-core computing

Multicore is the way forward– But we need the right architecture to utilize it

The Tile architecture addresses the challenges– Scales to 100’s of cores– Delivers very low power– Runs your existing code

Standards-based software– Familiar tools– Full range of standard programming environments

Page 24: Tile Processors: Many-Core for Embedded and Cloud Computing Richard Schooler VP Software Engineering Tilera Corporation rschooler@tilera.com.

Thank you!

Questions?

Page 25: Tile Processors: Many-Core for Embedded and Cloud Computing Richard Schooler VP Software Engineering Tilera Corporation rschooler@tilera.com.

HPEC, 15 September 2010© 2010 Copyright Tilera Corporation. All Rights Reserved.25

Research Vision to Commercial Product

2002

2007

100B transistors

100B transistors

2018

1B Transistors

in 2007

1996

The opportunity

CPUMem

1997

A blankslate

MIT Raw 16 cores Tile Processor

64 cores

Memory Controller (DDR3)Memory Controller (DDR3) Memory Controller (DDR3)

Memory Controller (DDR3)

Memory Controller (DDR3)Memory Controller (DDR3) Memory Controller (DDR3)

Memory Controller (DDR3)

mP

IPE

mP

IPE

FlexibleI/O

FlexibleI/O

UART x2, USB x2,JTAG, I2C, SPI

UART x2, USB x2,JTAG, I2C, SPI

MiCAMiCA

MiCAMiCA

Ser

Des

Ser

Des

PCIe 2.08-lane

PCIe 2.08-lane

Ser

Des

Ser

Des

PCIe 2.04-lane

PCIe 2.04-lane

Ser

Des

Ser

Des

PCIe 2.08-lane

PCIe 2.08-lane

Inte

rlak

enIn

terl

aken

Inte

rlak

enIn

terl

aken

10 GbEXAUI

10 GbEXAUI S

erD

esS

erD

es4x GbESGMII

10 GbEXAUI

10 GbEXAUI S

erD

esS

erD

es4x GbESGMII

10 GbEXAUI

10 GbEXAUI S

erD

esS

erD

es4x GbESGMII

10 GbEXAUI

10 GbEXAUI S

erD

esS

erD

es4x GbESGMII

10 GbEXAUI

10 GbEXAUI S

erD

esS

erD

es4x GbESGMII

10 GbEXAUI

10 GbEXAUI S

erD

esS

erD

es4x GbESGMII

10 GbEXAUI

10 GbEXAUI S

erD

esS

erD

es4x GbESGMII

10 GbEXAUI

10 GbEXAUI S

erD

esS

erD

es4x GbESGMII

TILE-Gx100100 cores

The future?

2010

Page 26: Tile Processors: Many-Core for Embedded and Cloud Computing Richard Schooler VP Software Engineering Tilera Corporation rschooler@tilera.com.

HPEC, 15 September 2010© 2010 Copyright Tilera Corporation. All Rights Reserved.26

Standard tools and programming model

Multicore Development Environment

Standard application stack

Standard programming SMP Linux 2.6 ANSI C/C++ Java, PHP

Integrated tools GCC compiler Standard gdb gprof Eclipse IDE

Innovative tools Multicore debug Multicore profile

Standards-based tools

Application layer Open source apps Standard C/C++ libs

Operating System layer 64-way SMP Linux Zero Overhead Linux Bare metal

environment

Hypervisor layer Virtualizes hardware I/O device drivers

Tile Processor

Tile Tile Tile Tile …

Hypervisor

Operating System

Applications

Virtualization and high speed I/O drivers

kernel drivers

Applications libraries

Page 27: Tile Processors: Many-Core for Embedded and Cloud Computing Richard Schooler VP Software Engineering Tilera Corporation rschooler@tilera.com.

HPEC, 15 September 2010© 2010 Copyright Tilera Corporation. All Rights Reserved.27

Standard Software Stack

Compiler, OSHypervisor

LanguageSupport

InfrastructureApps

Perl

gcc & g++

Commercial Linux Distribution

ManagementProtocols IPMI 2.0

Transcoding

NetworkMonitoring

Page 28: Tile Processors: Many-Core for Embedded and Cloud Computing Richard Schooler VP Software Engineering Tilera Corporation rschooler@tilera.com.

HPEC, 15 September 2010© 2010 Copyright Tilera Corporation. All Rights Reserved.28

High single core performanceComparable to Atom & ARM Cortex-A9 cores

- Data for TILEPro, ARM Cortex-A9, Atom N270 is available on the CoreMark website http://coremark.org/home.php - TILE-Gx and single thread Atom results were measured in Tilera labs- Single core, single thread result for ARM is calculated based on chip scores

Single-Core Single thread CoreMark™ Comparison

-

500

1,000

1,500

2,000

2,500

3,000

3,500

TileraTILEPro64866 MHz

TileraTILE-Gx361.25 GHz

ARMCortex-A9

1 GHz

IntelAtom N2701600 MHz

Co

reM

ark

Sc

ore

Page 29: Tile Processors: Many-Core for Embedded and Cloud Computing Richard Schooler VP Software Engineering Tilera Corporation rschooler@tilera.com.

HPEC, 15 September 2010© 2010 Copyright Tilera Corporation. All Rights Reserved.29 29

Significant value across multiple markets

Networking Multimedia Wireless Cloud• Classification• L4-7 Services• Load Balancing• Monitoring/QoS• Security

• Video Conferencing

• Media Streaming• Transcoding

• Base Station• Media Gateway• Service Nodes• Test Equipment

• Apache• Memcached• Web Applications• LAMP stack

High Performance

Low Power

Standard Programming

Over 100 customersOver 40 customers going into production

Tier 1 customers in all target markets

Page 30: Tile Processors: Many-Core for Embedded and Cloud Computing Richard Schooler VP Software Engineering Tilera Corporation rschooler@tilera.com.

HPEC, 15 September 2010© 2010 Copyright Tilera Corporation. All Rights Reserved.30

Targeting markets with highly parallel applications

Web ServingIn Memory Cache

Data Mining

Web

TranscodingVideo deliveryWireless media

Media delivery

Lawful interceptionSurveillance

Other

Government

Common ThemesHundreds and Thousands of servers running each application

thousands of parallel transactionsAll need better performance and power efficiency

Page 31: Tile Processors: Many-Core for Embedded and Cloud Computing Richard Schooler VP Software Engineering Tilera Corporation rschooler@tilera.com.

HPEC, 15 September 2010© 2010 Copyright Tilera Corporation. All Rights Reserved.31

200 Tbps on-chip bandwidth

2 Dimensional mesh network

Scales to large numbers of cores Modular: Design-and-verify 1 tile Power efficient:

– Short wires & locality optimize CV2f– Chandrakasan effect, more cores at

lower freq and voltage – CV2f

Core + Switch = Tile

Traditional Bus/Ring Architecture

S

ProcessorCore

The Tile Processor Architecture Mesh interconnect, power-optimized cores

Page 32: Tile Processors: Many-Core for Embedded and Cloud Computing Richard Schooler VP Software Engineering Tilera Corporation rschooler@tilera.com.

HPEC, 15 September 2010© 2010 Copyright Tilera Corporation. All Rights Reserved.32

Global 64-bit address space

Distributed caches Big centralized caches don’t scale– Contention– Long latency– High power

Distributed caches have numerous benefits– Lower power (less logic lit-up per access)

– Exploit locality (local L1 & L2)

– Can exploit various cache placement mechanisms to enhance performance

Distributed “everything” Cache, memory management, connectivity

CompletelyHW Coherent

CacheSystem

Page 33: Tile Processors: Many-Core for Embedded and Cloud Computing Richard Schooler VP Software Engineering Tilera Corporation rschooler@tilera.com.

HPEC, 15 September 2010© 2010 Copyright Tilera Corporation. All Rights Reserved.33

Highest compute density

2U form factor

4 hot pluggable modules

8 Tilera TILEPro processors

512 general purpose cores

1.3 trillion operations /sec

Page 34: Tile Processors: Many-Core for Embedded and Cloud Computing Richard Schooler VP Software Engineering Tilera Corporation rschooler@tilera.com.

HPEC, 15 September 2010© 2010 Copyright Tilera Corporation. All Rights Reserved.34

Power efficient and eco-friendly server

10,000 cores in a 8 Kilowatt rack 35-50 watts max per node Server power of 400 watts 90%+ efficient power supplies Shared fans and power supplies

Page 35: Tile Processors: Many-Core for Embedded and Cloud Computing Richard Schooler VP Software Engineering Tilera Corporation rschooler@tilera.com.

HPEC, 15 September 2010© 2010 Copyright Tilera Corporation. All Rights Reserved.35

Coherent distributed cache system

Globally Shared Physical Address Space– Full Hardware Cache Coherence– Standard shared memory programming model

Distributed cache– Each tile has local L1 and L2 caches– Aggregate of L2 serves as a globally shared L3– Any cache block can be replicated locally– Hardware tracks sharers, invalidates stale copies

Dynamic Distributed Cache (DDC™)– Memory pages distributed across all cores or

homed by allocating core

Coherent I/O– Hardware maintains coherence– I/O reads/writes coherent with tile caches– Reads/writes delivered by HW to home cache– Header/packet delivered directly to tile caches

Ser

De

sS

erD

es

GbE 0GbE 0

GbE 1GbE 1

FlexibleI/O

UARTJTAG

SPI, I2C

FlexibleI/O

UARTJTAG

SPI, I2C

DDR2 Controller 3DDR2 Controller 3

DDR2 Controller 0DDR2 Controller 0

DDR2 Controller 2DDR2 Controller 2

DDR2 Controller 1DDR2 Controller 1

PCIe 0PCIe 0

Ser

De

sS

erD

es

PCIe 1PCIe 1

XAUI 0XAUI 0

Ser

De

sS

erD

es

XAUI 1XAUI 1

Ser

De

sS

erD

es

CompletelyCoherentSystem