PetaCache: Data Access Unleashed


Tofigh Azemoon, Jacek Becla, Chuck Boeheim, Andy Hanushevsky, David Leith, Randy Melen, Richard P. Mount, Teela Pulliam, William Weeks

Stanford Linear Accelerator Center

September 3, 2007


Outline

• Motivation – Is there a Problem?

• Economics of Solutions

• Practical Steps – Hardware/Software

• Some Performance Measurements

Motivation


Storage In Research: Financial and Technical Observations

• Storage costs often dominate in research

– CPU per $ has fallen faster than disk space per $ for most of the last 25 years

• Accessing data on disks is increasingly difficult

– Transfer rates and access times (per $) are improving more slowly than CPU capacity, storage capacity or network capacity.

• The following slides are based on equipment and services that I* have bought for data-intensive science

* The WAN services from 1998 onwards were bought by Harvey Newman of Caltech


Price/Performance Evolution: My Experience

[Chart: price/performance per $M on a log scale (1 to 100,000,000), 1983–2007]

• Farm CPU box, KSi2000 per $M – doubling in 1.1 or 1.9 years

• RAID disk, GB per $M – doubling in 1.37 or 1.46 years

• Transatlantic WAN, kB/s per $M/yr – doubling in 8.4, 0.6, or 0.9 years

• Disk random access, accesses/s per $M

• Disk streaming access, MB/s per $M


Price/Performance Evolution: My Experience

[Chart: the same price/performance evolution, 1983–2007, on a log scale extended to 1,000,000,000,000, adding solid-state storage]

• Farm CPU box, KSi2000 per $M – doubling in 1.1 or 1.9 years

• RAID disk, GB per $M – doubling in 1.37 or 1.46 years

• Transatlantic WAN, kB/s per $M/yr – doubling in 8.4, 0.6, or 0.9 years

• Disk accesses/s per $M and disk MB/s per $M

• Flash accesses/s per $M and Flash MB/s per $M

• DRAM accesses/s per $M and DRAM MB/s per $M


Another View

• In 1997 $1M bought me:

~200-core CPU farm (~few × 10^8 ops/sec/core)

or

~1000-disk storage system (~2 × 10^3 ops/sec/disk)

• Today $1M buys me (you):

~2500-core CPU farm (~few × 10^9 ops/sec/core)

or

~2500-disk storage system (~2 × 10^3 ops/sec/disk)

• In 5–10 years?
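These numbers can be turned into a rough balance figure: CPU operations available per random disk access. The sketch below does the arithmetic with the values quoted above; taking "few" as 3 is purely an illustrative assumption.

```python
# CPU ops/s versus random disk accesses/s that $1M buys, per the figures above.
# "few x 10^8 / 10^9" is taken as 3e8 / 3e9 purely for illustration.

def ratio(cores, ops_per_core, disks, ops_per_disk):
    """CPU operations available per random disk access."""
    return (cores * ops_per_core) / (disks * ops_per_disk)

r_1997  = ratio(200,  3e8, 1000, 2e3)   # ~3e4 CPU ops per disk access
r_today = ratio(2500, 3e9, 2500, 2e3)   # ~1.5e6 CPU ops per disk access

print(f"1997: ~{r_1997:.0e} ops/access, today: ~{r_today:.0e} ops/access "
      f"({r_today / r_1997:.0f}x worse balance)")
```

By this rough measure the machine balance has shifted by a factor of about 50 in a decade: disks have fallen far behind the CPUs that want to read from them.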


Impact on Science

• Sparse or random access must be derandomized

• Define, in advance, the interesting subsets of the data

• Filter (skim, stream) the data to instantiate interest-rich subsets
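As a concrete, hypothetical illustration of the skimming step, the sketch below makes one sequential pass over a dataset and writes each predefined, interest-rich subset to its own contiguous file. The file format, selection cuts, and names are invented for illustration; they are not part of any PetaCache or BaBar software.

```python
# Minimal skimming sketch (hypothetical file layout and selection cuts).
# Instead of letting each analysis chase a sparse subset with random reads,
# scan the full dataset once, sequentially, and write every predefined
# "interest-rich" subset to its own contiguous skim file.

import json

# Illustrative selection predicates, defined in advance of the scan.
SKIMS = {
    "high_pt":    lambda ev: ev.get("max_pt", 0.0) > 50.0,
    "two_lepton": lambda ev: ev.get("n_leptons", 0) >= 2,
}

def skim(input_path: str, output_prefix: str) -> None:
    """One sequential pass over the input; events matching a skim's
    predicate are appended to that skim's output file."""
    outputs = {name: open(f"{output_prefix}.{name}.jsonl", "w") for name in SKIMS}
    try:
        with open(input_path) as f:          # sequential read of the full dataset
            for line in f:                   # one event per line (assumed format)
                event = json.loads(line)
                for name, keep in SKIMS.items():
                    if keep(event):
                        outputs[name].write(line)
    finally:
        for out in outputs.values():
            out.close()

# skim("events.jsonl", "skims/aod")
```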

Economics of Solutions


Economics of LHC Computing

• Difficult to get $10M additional funding to improve analysis productivity

• Easy to re-purpose $10M of computing funds if it would improve analysis productivity


Cost-Effectiveness

• DRAM memory:

– $100/gigabyte

– SLAC spends ~12% of its hardware budget on DRAM

• Disks (including servers):

– $1/gigabyte

– SLAC spends about 40% of its hardware budget on disk

• Flash-based storage (SLAC design):

– $10/gigabyte

– If SLAC had been spending 20% of its hardware budget on Flash, we would have over 100 TB today.
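A quick back-of-envelope check of what these price points buy (a sketch; the $1M figure is an arbitrary illustrative budget, chosen because $1M at $10/gigabyte is the ~100 TB of Flash mentioned above):

```python
# Capacity per $1M at the per-gigabyte prices quoted above (decimal units).
PRICE_PER_GB = {"DRAM": 100.0, "Flash (SLAC design)": 10.0, "Disk (incl. servers)": 1.0}

BUDGET = 1_000_000  # illustrative budget in dollars

for tier, price in PRICE_PER_GB.items():
    tb = BUDGET / price / 1000.0  # gigabytes bought, expressed in terabytes
    print(f"${BUDGET:,} of {tier:20s} -> {tb:7,.0f} TB")
# DRAM -> 10 TB, Flash -> 100 TB, Disk -> 1,000 TB
```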

Practical Steps

The PetaCache Project


PetaCache Goals

1. Demonstrate a revolutionary but cost-effective new architecture for science data analysis

2. Build and operate a machine that will be well matched to the challenges of SLAC/Stanford science


The PetaCache Story So Far

• We (BaBar, HEP) had data-access problems

• We thought and investigated

– Underlying technical issues

– Broader data-access problems in science

• We devised a hardware solution

– We built a DRAM-based prototype

– We validated the efficiency and scalability of our low-level data-access software, xrootd

– We set up a collaboration with SLAC’s electronics wizards (Mike Huffer and Gunther Haller) to develop a more cost-effective Flash-based prototype

• We saw early on that new strategies and software for data access would also be needed


DRAM-Based Prototype Machine (Operational early 2005)

[Diagram: prototype architecture]

• Data servers: 64 nodes, each a Sun V20z with 2 Opteron CPUs and 16 GB memory – 1 TB total memory, Solaris or Linux (mix and match) – behind a Cisco switch. (PetaCache: MICS + HEP-BaBar funding)

• Clients: up to 2000 nodes, each with 2 CPUs and 2 GB memory, running Linux, connected through Cisco switches. (Existing HEP-funded BaBar systems)


DRAM-Based Prototype


FLASH-Based Prototype (Operational Real Soon Now)

• 5 TB of Flash memory

• Fine-grained, high-bandwidth access


Building Blocks

[Diagram: Flash-storage building blocks – slide from Mike Huffer, Department of Particle & Particle Astrophysics]

• Host with client interface (SOFI), 1 of n, connected over 1 Gigabit Ethernet (~0.1 GByte/sec)

• Cluster Inter-Connect Module (CIM): application-specific host interconnect, 1 GByte/sec, PGP (Pretty Good Protocol)

• Network-attached storage: 8 x 10 G-Ethernet (8 GByte/sec)

• Slice Access Module (SAM), connected over 10 G-Ethernet

• Four Slice Module (FSM): 256 GByte of Flash


The "Chassis"

[Diagram: 8U chassis with passive backplane, DC power, X2 (XENPAK MSA) optics, 1U fan tray, 1U air inlet and outlet; supervisor card (8U), line card (4U) – slide from Mike Huffer, Department of Particle & Particle Astrophysics]

• 2 FSMs per card – 1/2 TByte

• 16 cards per bank – 8 TByte

• 2 banks per chassis – 64 SAMs, 1 CIM – 16 TByte

• 3 chassis per rack – 48 TByte
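The packaging numbers above compose multiplicatively into the 48 TByte rack; a short sanity-check sketch (treating the 256 GByte FSM as 0.25 TByte):

```python
# Roll-up of Flash capacity per card, bank, chassis, and rack from the slide.
FSM_TB            = 256 / 1024   # one Four Slice Module: 256 GByte = 0.25 TByte
FSMS_PER_CARD     = 2
CARDS_PER_BANK    = 16
BANKS_PER_CHASSIS = 2
CHASSIS_PER_RACK  = 3

card_tb    = FSM_TB * FSMS_PER_CARD          # 0.5 TByte per line card
bank_tb    = card_tb * CARDS_PER_BANK        # 8 TByte per bank
chassis_tb = bank_tb * BANKS_PER_CHASSIS     # 16 TByte per chassis (64 SAMs, 1 CIM)
rack_tb    = chassis_tb * CHASSIS_PER_RACK   # 48 TByte per rack

print(card_tb, bank_tb, chassis_tb, rack_tb)  # 0.5 8.0 16.0 48.0
```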


48 TByte Facility

[Diagram: facility layout – slide from Mike Huffer, Department of Particle & Particle Astrophysics]

• Catalyst 6500 switch (3 x 4 10GE, 2 x 48 1GE)

• SOFI host (1 x 96), xrootd servers

• 1 chassis


Commercial Product

• Violin Technologies

– 100s of GB of DRAM per box (available now)

– TB of Flash per box (available real soon now)

– PCIe hardware interface

– Simple block-level device interface

– DRAM prototype tested at SLAC

Some Performance Measurements


Latency (1): Ideal

[Diagram: Client Application reading directly from Memory]


Latency (2): Current Reality

[Diagram: request path – Client Application → Xrootd Data-Server-Client → OS → TCP Stack → NIC → Network Switches → NIC → TCP Stack → OS → Xrootd Data Server → OS → File System → Disk]


Latency (3): Immediately Practical Goal

[Diagram: the same client and network path as above, but the Xrootd Data Server serves data from Memory instead of traversing OS → File System → Disk]
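This layering is what the latency measurements on the next slide decompose. A minimal probe of the same kind can be sketched as follows; the request format and server are hypothetical stand-ins (this is not the xrootd protocol or the PetaCache test harness), but the idea is the same: time many small request/response round trips and see how much of the total is software stack rather than storage medium.

```python
# Minimal latency probe (a sketch): time small request/response round trips
# to a toy block server and report the median in microseconds.

import socket, statistics, time

def probe(host: str, port: int, request: bytes, nbytes: int, trials: int = 1000) -> float:
    """Return the median round-trip time (microseconds) for fetching nbytes per request."""
    with socket.create_connection((host, port)) as s:
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)  # avoid Nagle delays
        samples = []
        for _ in range(trials):
            t0 = time.perf_counter()
            s.sendall(request)
            remaining = nbytes
            while remaining > 0:                 # read the full reply
                chunk = s.recv(min(65536, remaining))
                if not chunk:
                    raise ConnectionError("server closed the connection")
                remaining -= len(chunk)
            samples.append((time.perf_counter() - t0) * 1e6)
    return statistics.median(samples)

# Hypothetical usage against a toy server that replies with nbytes per request:
# print(probe("dataserver.example.org", 1094, b"GET 4096\n", 4096))
```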


DRAM-Based Prototype: Latency (microseconds) versus data retrieved (bytes)

[Chart: stacked latency components versus bytes retrieved (100 to 8100 bytes); y-axis 0–250 microseconds. Components: server xrootd overhead, server xrootd CPU, client xroot overhead, client xroot CPU, TCP stack/NIC/switching, minimum transmission time]


DRAM-Based Prototype: Throughput Measurements

[Chart: transactions per second (0–100,000) versus number of clients for one server (1–50), for three configurations: Linux client – Solaris server, Linux client – Linux server, Linux client – Solaris server (bge)]

• 22 processor microseconds per transaction
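The "22 processor microseconds per transaction" figure implies a CPU-bound ceiling on server throughput. A quick check (assuming, as a sketch, that the per-transaction cost scales across both Opteron CPUs of the V20z data server described earlier):

```python
# Implied transaction ceiling from the measured per-transaction CPU cost.
US_PER_TRANSACTION = 22      # processor microseconds per transaction (from the slide)
CPUS_PER_SERVER    = 2       # each Sun V20z data server has 2 Opteron CPUs

per_cpu    = 1_000_000 / US_PER_TRANSACTION   # ~45,000 transactions/s per CPU
per_server = per_cpu * CPUS_PER_SERVER        # ~91,000 transactions/s per server

print(f"{per_cpu:,.0f} per CPU, {per_server:,.0f} per server (CPU-bound ceiling)")
```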


Throughput Tests

• ATLAS AOD Analysis

– 1 GB file size (a disk can hold 500 – 1000 of these)

– 59 xrootd client machines (up to 118 cores) running a top analysis and reading data from one server.

– The individual analysis jobs perform sequential access.

– Compare the time to completion when the server serves data from its disk with the time taken when it serves the same data from its memory.
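The metric on the next slide is the per-job ratio of these two completion times, collected into a histogram. A sketch of that bookkeeping, with placeholder timings rather than the measured ones:

```python
# Per-job disk/memory completion-time ratios, binned into a histogram
# (the quantity plotted on the next slide). Timings below are placeholders.

from collections import Counter

def ratio_histogram(disk_times_s, memory_times_s, bin_width=1.0):
    """Histogram of per-job disk/memory time ratios."""
    ratios = [d / m for d, m in zip(disk_times_s, memory_times_s)]
    return Counter(int(r / bin_width) * bin_width for r in ratios)

# Hypothetical per-job wall-clock times (seconds), one entry per client job:
disk_runs   = [820.0, 760.0, 905.0]
memory_runs = [410.0, 395.0, 430.0]
print(ratio_histogram(disk_runs, memory_runs))  # Counter({2.0: 2, 1.0: 1}) for these values
```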


DRAM-Based Prototype: ATLAS AOD Analysis

[Histogram: count of jobs versus disk/memory time ratio; x-axis ratio 0–10, y-axis counts 0–900]


Comments and Outlook

• Significant, but not revolutionary, benefits for high-load sequential data analysis – as expected.

• Revolutionary benefits expected for pointer-based data analysis – but not yet tested.

• The need to access storage in serial mode has become part of the culture of data-intensive science – why design a pointer-based analysis when its performance is certain to be abysmal?

• TAG-database-driven analysis?