PetaCache: Data Access Unleashed
Transcript of "PetaCache: Data Access Unleashed"
Tofigh Azemoon, Jacek Becla, Chuck Boeheim, Andy Hanushevsky, David Leith, Randy Melen, Richard P. Mount, Teela Pulliam, William Weeks
Stanford Linear Accelerator Center
September 3, 2007
March 7, 2007 · Richard P. Mount, SLAC
Outline
• Motivation – Is there a Problem?
• Economics of Solutions
• Practical Steps – Hardware/Software
• Some Performance Measurements
Storage In Research: Financial and Technical Observations
• Storage costs often dominate in research
– CPU per $ has fallen faster than disk space per $ for most of the last 25 years
• Accessing data on disks is increasingly difficult
– Transfer rates and access times (per $) are improving more slowly than CPU capacity, storage capacity or network capacity.
• The following slides are based on equipment and services that I* have bought for data-intensive science
* The WAN services from 1998 onwards were bought by Harvey Newman of Caltech
Price/Performance Evolution: My Experience
[Chart, repeated four times in the transcript: log-scale price/performance per $M from 1983 to 2007 for CPU, disk capacity, WAN, disk random access and disk streaming access. Series: Farm CPU box KSi2000 per $M, doubling in 1.1 or 1.9 years; RAID Disk GB per $M, doubling in 1.37 or 1.46 years; Transatlantic WAN kB/s per $M/yr, doubling in 8.4, 0.6 or 0.9 years; Disk Accesses/s per $M; Disk MB/s per $M.]
Price/Performance Evolution: My Experience
[Chart: the same log-scale series, 1983 to 2007, extended with Flash Accesses/s per $M, Flash MB/s per $M, DRAM Accesses/s per $M and DRAM MB/s per $M.]
Another View
• In 1997, $1M bought me:
– a ~200-core CPU farm (~few x 10^8 ops/sec/core), or
– a ~1000-disk storage system (~2 x 10^3 ops/sec/disk)
• Today, $1M buys me (or you):
– a ~2500-core CPU farm (~few x 10^9 ops/sec/core), or
– a ~2500-disk storage system (~2 x 10^3 ops/sec/disk)
• In 5 – 10 years?
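The gap these numbers imply can be made concrete with a back-of-envelope calculation. The per-core and per-disk rates below are the slide's approximate figures, with "a few" taken as 3 for illustration:

```python
# Approximate figures from the slide; "a few x 10^8" is taken as 3 x 10^8.
cpu_ops_1997 = 200 * 3e8      # ~200-core farm per $1M in 1997
disk_ops_1997 = 1000 * 2e3    # ~1000-disk system per $1M in 1997
cpu_ops_2007 = 2500 * 3e9     # ~2500-core farm per $1M today
disk_ops_2007 = 2500 * 2e3    # ~2500-disk system per $1M today

ratio_1997 = cpu_ops_1997 / disk_ops_1997   # CPU ops per disk access: 30,000
ratio_2007 = cpu_ops_2007 / disk_ops_2007   # CPU ops per disk access: 1,500,000
print(ratio_2007 / ratio_1997)              # the gap widened ~50x in a decade
```

Whatever value one assumes for "a few", it cancels in the comparison: per dollar, CPU cycles have grown about fifty times faster than disk random accesses over the decade.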
Impact on Science
• Sparse or random access must be derandomized
• Define, in advance, the interesting subsets of the data
• Filter (skim, stream) the data to instantiate interest-rich subsets
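A minimal sketch of what "derandomizing" looks like in practice (illustrative Python, not the experiments' actual framework): define the interesting selections in advance, then make one sequential pass that instantiates each interest-rich subset.

```python
# Illustrative skim: one sequential, streaming pass over the data replaces
# many random accesses later. Event structure and selection names are
# hypothetical examples.
def skim(events, selections):
    """One sequential pass writes out one subset per predefined selection."""
    subsets = {name: [] for name in selections}
    for event in events:                       # sequential, streaming read
        for name, predicate in selections.items():
            if predicate(event):
                subsets[name].append(event)
    return subsets

# Toy usage: the "dimuon" subset is defined before the pass begins.
events = [{"n_muons": 0}, {"n_muons": 2}, {"n_muons": 1}]
subsets = skim(events, {"dimuon": lambda ev: ev["n_muons"] >= 2})
```

The point of the design is that every byte is read exactly once, in order, so the cost is bounded by streaming bandwidth rather than by random-access rate.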
Economics of LHC Computing
• Difficult to get $10M additional funding to improve analysis productivity
• Easy to re-purpose $10M of computing funds if it would improve analysis productivity
Cost-Effectiveness
• DRAM memory: $100/gigabyte
– SLAC spends ~12% of its hardware budget on DRAM
• Disks (including servers): $1/gigabyte
– SLAC spends about 40% of its hardware budget on disk
• Flash-based storage (SLAC design): $10/gigabyte
– If SLAC had been spending 20% of its hardware budget on Flash, we would have over 100 TB today.
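The trade-off in these prices is easy to restate as capacity per million dollars (prices as quoted on the slide; decimal terabytes assumed):

```python
# Capacity per $1M at the slide's 2007 prices (decimal TB: 1 TB = 1000 GB).
price_per_gb = {"DRAM": 100.0, "Flash (SLAC design)": 10.0, "Disk incl. servers": 1.0}
budget_usd = 1_000_000
for medium, usd_per_gb in price_per_gb.items():
    terabytes = budget_usd / usd_per_gb / 1000
    print(f"{medium}: {terabytes:,.0f} TB per $1M")
# DRAM: 10 TB, Flash: 100 TB, Disk: 1,000 TB per $1M
```

Flash sits an order of magnitude from each neighbor: ten times the capacity of DRAM per dollar, at a tenth the capacity of disk, but with random-access rates far closer to DRAM than to disk.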
PetaCache Goals
1. Demonstrate a revolutionary but cost-effective new architecture for science data analysis
2. Build and operate a machine that will be well matched to the challenges of SLAC/Stanford science
The PetaCache Story So Far
• We (BaBar, HEP) had data-access problems
• We thought and investigated:
– Underlying technical issues
– Broader data-access problems in science
• We devised a hardware solution:
– We built a DRAM-based prototype
– We validated the efficiency and scalability of our low-level data-access software, xrootd
– We set up a collaboration with SLAC’s electronics wizards (Mike Huffer and Gunther Haller) to develop a more cost-effective Flash-based prototype
• We saw early on that new strategies and software for data access would also be needed
DRAM-Based Prototype Machine (operational early 2005)
• Data servers: 64 nodes, each a Sun V20z with 2 Opteron CPUs and 16 GB memory; 1 TB total memory; Solaris or Linux (mix and match); behind a Cisco switch (PetaCache: MICS + HEP-BaBar funding)
• Clients: up to 2000 nodes, each with 2 CPUs and 2 GB memory, running Linux, connected through Cisco switches (existing HEP-funded BaBar systems)
FLASH-Based Prototype (operational Real Soon Now)
• 5 TB of Flash memory
• Fine-grained, high bandwidth access
Building Blocks (slide from Mike Huffer, Department of Particle & Particle Astrophysics)
• Host / Client Interface (SOFI), 1 of n: application-specific, 1 Gigabit Ethernet (0.1 GByte/sec)
• Cluster Inter-Connect Module (CIM): host interconnect at 1 GByte/sec over PGP (Pretty Good Protocol)
• Network-attached storage: 8 x 10 G-Ethernet (8 GByte/sec)
• Slice Access Module (SAM): 10 G-Ethernet
• Four Slice Module (FSM): 256 GByte Flash
The "Chassis" (slide from Mike Huffer, Department of Particle & Particle Astrophysics)
• 8U chassis with passive backplane, DC power, X2 (XENPAK MSA) connectors, 1U fan tray, 1U air outlet and 1U air inlet; supervisor card (8U), line card (4U)
• 2 FSMs/card: 1/2 TByte
• 16 cards/bank: 8 TByte
• 2 banks/chassis: 64 SAMs, 1 CIM, 16 TByte
• 3 chassis/rack: 48 TByte
48 TByte Facility (slide from Mike Huffer, Department of Particle & Particle Astrophysics)
• 1 chassis
• Catalyst 6500 (3 x 4 10GE, 2 x 48 1GE)
• SOFI hosts (1 x 96) running xrootd servers
Commercial Product
• Violin Technologies
– 100s of GB of DRAM per box (available now)
– TB of Flash per box (available real soon now)
– PCIe hardware interface
– Simple block-level device interface
– DRAM prototype tested at SLAC
Latency (2): Current Reality
Request path: Client Application → Xrootd Data-Server-Client → OS → TCP Stack → NIC → Network Switches → NIC → TCP Stack → OS → Xrootd Data Server → OS → File System → Disk
Latency (3): Immediately Practical Goal
Request path: Client Application → Xrootd Data-Server-Client → OS → TCP Stack → NIC → Network Switches → NIC → TCP Stack → OS → Xrootd Data Server → Memory (File System and Disk behind the memory, out of the critical path)
DRAM-Based Prototype: Latency (microseconds) versus data retrieved (bytes)
[Chart: total latency from roughly 0 to 250 microseconds for retrievals of 100 to 8100 bytes, broken down into: Server xrootd overhead; Server xrootd CPU; Client xroot overhead; Client xroot CPU; TCP stack, NIC, switching; Min transmission time.]
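The shape of such a plot is consistent with a simple model: a fixed per-request cost from the software and network stacks plus a size-proportional transmission time. The 80 µs overhead below is an assumed illustrative constant, not a measured value from the prototype.

```python
def retrieval_latency_us(n_bytes, overhead_us=80.0, link_gbps=1.0):
    """Total latency = fixed per-request overhead + wire time.

    overhead_us is an assumed constant standing in for the xrootd, TCP-stack,
    NIC and switching components; link_gbps is the assumed link speed.
    """
    wire_us = n_bytes * 8 / (link_gbps * 1000)  # bits over a Gbit/s link, in us
    return overhead_us + wire_us
```

On a 1 Gbit/s link the wire time for even the largest plotted retrieval (8100 bytes) is only about 65 µs, so small retrievals are dominated by the fixed per-request overhead, which is exactly what makes removing stack layers worthwhile.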
DRAM-Based Prototype: Throughput Measurements
[Chart: transactions per second (up to ~100,000) versus number of clients for one server (1 to 50). Series: Linux client with Solaris server; Linux client with Linux server; Linux client with Solaris server (bge driver).]
22 processor microseconds per transaction
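That per-transaction cost bounds the server's throughput: at ~22 processor-microseconds per transaction, one processor saturates at roughly 45,000 transactions per second, so a two-CPU data server tops out on the order of 90,000/s.

```python
# Throughput ceiling implied by the measured per-transaction processor cost.
per_transaction_us = 22
tx_per_sec_per_cpu = 1e6 / per_transaction_us
print(round(tx_per_sec_per_cpu))        # ~45,455 transactions/s per processor
print(round(2 * tx_per_sec_per_cpu))    # ~90,909/s for a 2-CPU server
```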
Throughput Tests
• ATLAS AOD analysis
– 1 GB file size (a disk can hold 500 to 1000 of these)
– 59 xrootd client machines (up to 118 cores) performing a top analysis, getting data from one server
– The individual analysis jobs perform sequential access
– Compare the time to completion when the server serves from its disk with the time taken when it serves from its memory
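The measurement method can be sketched as follows (illustrative Python; the real test used the ATLAS analysis jobs over xrootd, not local stream reads):

```python
import io
import time

def time_sequential_read(stream, block_size=64 * 1024):
    """Time one sequential pass over an open binary stream."""
    start = time.perf_counter()
    while stream.read(block_size):
        pass
    return time.perf_counter() - start

# Stand-in for the comparison: the same bytes would be timed once served
# from disk and once from memory; the reported quantity is the
# disk-time / memory-time ratio per job.
data = b"\0" * (1024 * 1024)
t_memory = time_sequential_read(io.BytesIO(data))
```

A ratio near 1 means the disk was never the bottleneck for that job; ratios well above 1 measure how much the memory-resident server helped.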
DRAM-Based Prototype: ATLAS AOD Analysis
[Histogram: count of jobs versus disk/memory time ratio, binned from 0 to 10.]
Comments and Outlook
• Significant, but not revolutionary, benefits for high-load sequential data analysis – as expected.
• Revolutionary benefits expected for pointer-based data analysis – but not yet tested.
• The need to access storage in serial mode has become part of the culture of data-intensive science – why design a pointer-based analysis when its performance is certain to be abysmal?
• TAG database driven analysis?