A Scalable 3D Heterogeneous Multi-Core Processor with Inductive ...

1
Cube: a scalable 3D heterogeneous multi-core processer with inductive-coupling thruchip interface A Scalable 3D Heterogeneous Multi-Core Processor with Inductive-Coupling ThruChip Interface Noriyuki Miura 1 , Yusuke Koizumi 1 , Eiichi Sasaki 1 , Yasuhiro Take 1 , Hiroki Matsutani 1 , Tadahiro Kuroda 1 , Hideharu Amano 1 , Ryuichi Sakamoto 2 , Mitaro Namiki 2 , Kimiyoshi Usami 3 , Masaaki Kondo 4 , Hiroshi Nakamura 5 1 Keio Univ., 2 Tokyo Univ. of Agri. And Tech., 3 Shibaura Institute of Technology, 4 University of Electro-Communications, 5 University of Tokyo Recent battery driven IT devices including smart phone and tablets require versatile functions and high performance with low energy consumption. On the other hand, the initial cost of LSI for design and mask development has increased rapidly, and development of an SoC (System-on-a Chip) for each product has become difficult. Although flexible reconfigurable architectures can be a solution, the performance scalability is also necessary to cope with the wide performance range of products. As a solution, heterogeneous multi-core system using a 3-D wireless inductive coupling interconnect is proposed. This system consists of a MIPS-R3000 compatible embedded CPU and reconfigurable accelerators. Since chips are connected with wireless inductive coupling channels, the number and types of accelerators can be tailored easily depending on the requirement of the product. Introduction 2 types of image processing on 2-stacked Cube and electric current of each chip will be demonstrated. The effect of power gating is also shown. Demonstration [1] Y. Koizumi, et al, “CMA-Cube: a scalable reconfigurable accelerator with 3-D wireless inductive coupling interconnect”, In Proceedings of FPL, Aug. 2012. [2] D. Ikebuchi, et al., “Geyser -1: A MIPS R3000 CPU core with Fine Grain Runtime Power Gating”, In Proceedings of ASSCC, Nov 2009, pp. 281284. [3] N. Ozaki, et al, “Cool Mega Arrays: Ultralow-Power Reconfigurable Accelerator Chips", Micro.IEEE, vol. 31, no. 6, pp. 618, Nov-Dec 2011. Reference Power Supply Circuit Serial Interface for monitoring the current and voltage Serial Interface for terminal Spartan-6 for the interface between Cube and Virtex-6. Virtex-6 for I/O management Compact Flash Memory for saving the application software CMA Core CMA Core CMA Core Geyser Core Geyser Core 3D Interconnect CMA Core CMA Core Geyser Core A Cube with 3-chips is available. Achievement MIPS-R3000 compatible CPU core 8KB instruction/data cache 16 entry TLB CoProcessor0 Function module of power gating 0 0.2 0.4 0.6 0.8 1 0 50 100 150 Layout of Geyser-Cube Geyser Core GeyserCube w PG GeyserCube w/o PG PE Array with no memory element (8 x 8) μ-controller for data management (14bits x 64 instruction) 25bits x 1K data memory (2 bank) The circuit for the wireless interconnect CMA Core Performance versus Array Voltage Applications PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE μ-Controller Data Memory Structure of CMA Core GeyserCube Leakage Reduction Layout of CMA-Cube The circuit for the wireless interconnect The number of accelerator chips, CMA-Cubes can be changed at low cost. Now, at most 3 CMA-Cubes is available. Routers for topology free NoC with bubble flow control Inductors for the wireless interconnect High speed communication Stable and reliable communication the delay time of the PE array the data handling time taken by the micro-controller In case that <, the supply voltage for the PE array is scaled to reduce the power consumption without degrading the performance. In case that >, wave-pipelining is applied to enhance performance of the PE array. CMA Core Controlling Method Bit Error Rate Router0 Network Interface Processor Core Router1 Router0 Network Interface Processor Core Router1 Router0 Network Interface Processor Core Router1 Router0 Network Interface Processor Core Router1 Data Link Clock Link Structure of NoC with wireless interconnect : Inductor for transmitting data : Inductor for receiving data Chips are stacked with a certain length of gap The gap can be used for bonding wires to supply voltage, ground and system clock A unidirectional ring packet switching network is formed Chip Stacking Method Network Interface Processor Core 3D Interconnect Generate packets Interpret packets Network Interface Packet Generator Packet Interpreter Overview of Network Interface Cube on Real Chip Photograph of Cube Specification of Cube Process Technology Fujitsu e-shuttle 65-nm 12-metal CMOS Chip Area 2.1mm x 4.2mm Core Area 1.5mm x 3.6mm Geyser-Cube Clock Freq. 200MHz Supply Voltage 1.2V I/O 3Gbps TCI 100MHz 32bit I/O CMA-Cube Clock Freq. 200MHz Supply Voltage PE Array: 0.5V - 1.2V Other: 1.2V I/O 3Gbps TCI The host CPU chip, Geyser-Cube is placed at the top for more bonding wire. Clock Freq.: 50MHz Execute following application α-Blending Sepia Filter Router0 Network Interface CMA Core Router1 Router0 Network Interface CMA Core Router1 Router0 Network Interface Geyser Core Router1 Data Link Clock Link : Inductor for transmitting data : Inductor for receiving data Now, practical application programs using 3- chip system is under development Structure of 3-stacked Cube

Transcript of A Scalable 3D Heterogeneous Multi-Core Processor with Inductive ...

Page 1: A Scalable 3D Heterogeneous Multi-Core Processor with Inductive ...

Cube: a scalable 3D heterogeneous multi-core processer with inductive-coupling thruchip interface

A Scalable 3D Heterogeneous Multi-Core Processor with

Inductive-Coupling ThruChip Interface

Noriyuki Miura1, Yusuke Koizumi1, Eiichi Sasaki1, Yasuhiro Take1, Hiroki Matsutani1, Tadahiro Kuroda1, Hideharu Amano1,

Ryuichi Sakamoto2, Mitaro Namiki2, Kimiyoshi Usami3, Masaaki Kondo4, Hiroshi Nakamura5 1Keio Univ., 2Tokyo Univ. of Agri. And Tech., 3Shibaura Institute of Technology, 4University of Electro-Communications, 5University of Tokyo

Recent battery driven IT devices including smart phone and tablets require versatile functions and high performance with low energy consumption. On the other

hand, the initial cost of LSI for design and mask development has increased rapidly, and development of an SoC (System-on-a Chip) for each product has

become difficult. Although flexible reconfigurable architectures can be a solution, the performance scalability is also necessary to cope with the wide

performance range of products. As a solution, heterogeneous multi-core system using a 3-D wireless inductive coupling interconnect is proposed. This system

consists of a MIPS-R3000 compatible embedded CPU and reconfigurable accelerators. Since chips are connected with wireless inductive coupling channels,

the number and types of accelerators can be tailored easily depending on the requirement of the product.

Introduction

2 types of image processing on 2-stacked Cube and electric current of each chip will be

demonstrated. The effect of power gating is also shown.

Demonstration

[1] Y. Koizumi, et al, “CMA-Cube: a scalable reconfigurable accelerator with 3-D wireless inductive coupling interconnect”, In Proceedings of FPL, Aug. 2012.

[2] D. Ikebuchi, et al., “Geyser-1: A MIPS R3000 CPU core with Fine Grain Runtime Power Gating”, In Proceedings of ASSCC, Nov 2009, pp. 281–284.

[3] N. Ozaki, et al, “Cool Mega Arrays: Ultralow-Power Reconfigurable Accelerator Chips", Micro.IEEE, vol. 31, no. 6, pp. 6–18, Nov-Dec 2011.

Reference

Power Supply Circuit

Serial Interface for monitoring

the current and voltage

Serial Interface

for terminal

Spartan-6 for the interface

between Cube and Virtex-6.

Virtex-6 for

I/O management

Compact Flash Memory for

saving the application software

CMA Core

CMA Core

CMA Core

Geyser Core

Geyser Core 3D Interconnect

CMA Core

CMA Core

Geyser Core

A Cube with 3-chips is available.

Achievement

MIPS-R3000 compatible CPU core

8KB instruction/data cache

16 entry TLB

CoProcessor0

Function module of power gating

0

0.2

0.4

0.6

0.8

1

0 50 100 150Layout of Geyser-Cube

Geyser Core

GeyserCube w PG

GeyserCube w/o PG

PE Array with no memory element (8 x 8)

μ-controller for data management (14bits x 64 instruction)

25bits x 1K data memory (2 bank)

The circuit for

the wireless

interconnect

CMA Core

Performance versus Array Voltage Applications

PE PE PE PE PE PE PE PE

PE PE PE PE PE PE PE PE

PE PE PE PE PE PE PE PE

PE PE PE PE PE PE PE PE

PE PE PE PE PE PE PE PE

PE PE PE PE PE PE PE PE

PE PE PE PE PE PE PE PE

PE PE PE PE PE PE PE PE

μ-Controller

Data Memory

Structure of CMA Core

GeyserCube Leakage Reduction

Layout of CMA-Cube

The circuit for

the wireless interconnect

The number of accelerator chips,

CMA-Cubes can be changed at low cost.

Now, at most 3 CMA-Cubes is available.

Routers for topology free NoC with bubble flow control

Inductors for the wireless interconnect

– High speed communication

– Stable and reliable communication

① the delay time of the PE array

② the data handling time taken by the

micro-controller

In case that ①<②,

the supply voltage for the PE array is

scaled to reduce the power consumption

without degrading the performance.

In case that ①>②,

wave-pipelining is applied to enhance

performance of the PE array.

CMA Core Controlling Method

Bit Error Rate

Router0

Network Interface

Processor Core

Router1

Router0

Network Interface

Processor Core

Router1

Router0

Network Interface

Processor Core

Router1

Router0

Network Interface

Processor Core

Router1

Data Link

Clock Link

Structure of NoC with wireless interconnect

: Inductor for transmitting data

: Inductor for receiving data

• Chips are stacked with a certain

length of gap

• The gap can be used for bonding

wires to supply voltage, ground

and system clock

• A unidirectional ring packet

switching network is formed

Chip Stacking Method

Network Interface

Processor Core

3D Interconnect

Generate packets Interpret packets

Network Interface

Packet Generator

Packet Interpreter

Overview of Network Interface

Cube on Real Chip

Photograph of Cube

Specification of Cube

Process

Technology Fujitsu e-shuttle 65-nm 12-metal CMOS

Chip Area 2.1mm x 4.2mm

Core Area 1.5mm x 3.6mm

Geyser-Cube

Clock Freq. 200MHz

Supply Voltage

1.2V

I/O 3Gbps TCI 100MHz 32bit I/O

CMA-Cube

Clock Freq. 200MHz

Supply Voltage

PE Array: 0.5V - 1.2V Other: 1.2V

I/O 3Gbps TCI

The host CPU chip, Geyser-Cube is placed

at the top for more bonding wire.

Clock Freq.: 50MHz

Execute following application

α-Blending

Sepia Filter

Router0

Network Interface

CMA Core

Router1

Router0

Network Interface

CMA Core

Router1

Router0

Network Interface

Geyser Core

Router1

Data Link

Clock Link

: Inductor for transmitting data

: Inductor for receiving data

Now, practical application programs using 3-chip system is under development

Structure of 3-stacked Cube