PetaByte Storage Facility at RHIC
Razvan Popescu - Brookhaven National Laboratory
CHEP 2000 -- Padova
Who are we?
Relativistic Heavy Ion Collider (RHIC) @ BNL:
– Four experiments: Phenix, Star, Phobos, Brahms.
– 1.5PB per year.
– ~500MB/s aggregate bandwidth.
– >20,000 SpecInt95 of computing.
Startup in May 2000 at 50% capacity, ramping up to nominal parameters within one year.
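A quick back-of-envelope check (not from the slides): 1.5PB per year averages out to only about 50MB/s, so the ~500MB/s aggregate figure is driven mostly by reconstruction and analysis reads rather than by raw-data logging alone.

```python
# Illustrative arithmetic only: the average rate implied by 1.5 PB/year,
# compared with the ~500 MB/s aggregate bandwidth quoted for the facility.
PB = 1e15                      # bytes (decimal petabyte)
MB = 1e6
SECONDS_PER_YEAR = 365 * 24 * 3600

average_rate = 1.5 * PB / SECONDS_PER_YEAR / MB
print(f"average rate implied by 1.5 PB/yr: {average_rate:.0f} MB/s")   # ~48 MB/s
# Most of the facility's bandwidth budget goes to DST/mDST traffic: the
# Overview slide lists 200 and 400 MB/s for those streams vs. 50 MB/s for raw.
```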
Overview
Data Types:
– Raw: very large volume (1.2PB/yr.), average bandwidth (50MB/s).
– DST: average volume (500TB), large bandwidth (200MB/s).
– mDST: low volume (<100TB), large bandwidth (400MB/s).
Data Flow (generic)
[Diagram: data flows among RHIC, the Archive, the Reconstruction Farm (Linux), the Analysis Farm (Linux), and the DST/mDST File Servers. The labeled streams are raw, DST, and mDST data, with per-link rates of 35MB/s, 50MB/s, 10MB/s, 10MB/s, 200MB/s, and 400MB/s.]
The Data Store
HPSS (ver. 4.1.1, patch level 2):
– Deployed in 1998.
– After overcoming some initial growing pains, we consider the present implementation successful.
– One major/total reconfiguration to adapt to new hardware (and to improved system understanding).
– Flexible enough for our needs. One shortcoming: no preemptable priority scheme.
– Very high performance.
The HPSS Archive
Constraints: large capacity & high bandwidth.
– Two types of tape technology: SD-3 (best $/GB) & 9840 (best $ per MB/s).
– Two-tape-layer hierarchies: easy management of the migration.
Reliable and fast disk storage:
– FC-attached RAID disk.
Platforms compatible with HPSS:
– IBM, SUN, SGI.
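To make the $/GB versus $-per-MB/s tradeoff concrete, here is a rough sizing sketch. The drive rates are the ones quoted on the HPSS Performance slide later in this talk; the cartridge capacities are assumptions, not figures given here.

```python
# Rough sizing sketch for a two-technology tape hierarchy (illustrative only).
# Drive rates come from the HPSS Performance slide; the cartridge capacities
# below are assumptions and not figures from this talk.
import math

TAPE = {
    "SD-3 (Redwood)": {"rate_mb_s": 9,  "cartridge_gb": 50},   # capacity-oriented
    "9840 (Eagle)":   {"rate_mb_s": 10, "cartridge_gb": 20},   # access-oriented
}

def size_tier(volume_tb, bandwidth_mb_s, tech):
    """Transports and cartridges one technology would need for a given tier."""
    spec = TAPE[tech]
    transports = math.ceil(bandwidth_mb_s / spec["rate_mb_s"])
    cartridges = math.ceil(volume_tb * 1000 / spec["cartridge_gb"])
    return transports, cartridges

# Example: the raw-data stream from the Overview slide (~1.2 PB/yr at ~50 MB/s).
for tech in TAPE:
    t, c = size_tier(1200, 50, tech)
    print(f"{tech}: {t} transports, {c} cartridges/yr")
# SD-3 minimizes media and silo-slot count (its $/GB advantage); the 9840's
# strength is on the drive side ($ per MB/s), which this sketch does not price.
```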
Present Resources
Tape Storage:
– (1) STK Powderhorn silo (6000 cartridges).
– (11) SD-3 (Redwood) drives.
– (10) 9840 (Eagle) drives.
Disk Storage:
– ~8TB of RAID disk:
  • 1TB for HPSS cache.
  • 7TB of Unix workspace.
Servers:
– (5) RS/6000 H50/70 for HPSS.
– (6) E450 & E4000 for file serving and data mining.
Phenix Data Flow
[Diagram: Phenix-specific data flow among RHIC, HPSS (RAW), the Reconstruction Farm (? SpecInt95, ?00 processors), HPSS (DST), the Analysis Farm (? SpecInt95), and a File Server. Tape resources shown: Redwood (3), Redwood (0), and 9840 (2) transports. Disk buffers: 150GB @ 80MB/s (twice) and 3TB @ 100MB/s. Link rates labeled: 1, 2, 6, 10, 16, 18, 55, and 65MB/s, plus a calibration stream.]
HPSS Structure
(1) Core Server:
– RS/6000 Model H50.
– 4x CPU.
– 2GB RAM.
– Fast Ethernet (control).
– OS-mirrored storage for metadata (6 pv.).
HPSS Structure
(3) Movers:
– RS/6000 Model H70.
– 4x CPU.
– 1GB RAM.
– Fast Ethernet (control).
– Gigabit Ethernet (data), 1500 & 9000 MTU.
– 2x FC-attached RAID (300GB) disk cache.
– (3-4) SD-3 "Redwood" tape transports.
– (3-4) 9840 "Eagle" tape transports.
HPSS Structure
To guarantee availability of resources for a specific user group: separate resources, i.e. separate PVRs & movers per group.
One mover per user group means total exposure to single-machine failure.
To guarantee availability of resources for the Data Acquisition stream: separate hierarchies.
Result: 2 PVRs, 2 COS, and 1 mover per group (sketched below).
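A minimal sketch of how that per-group partitioning could be written down as configuration data; the group names, PVR names, and COS IDs are hypothetical, and only the 2-PVR / 2-COS / 1-mover pattern comes from the slide.

```python
# Hypothetical per-experiment resource map reflecting the "2 PVRs, 2 COS,
# 1 mover per group" layout described above. All names and IDs are invented.
RESOURCE_MAP = {
    "phenix": {"pvrs": ["pvr-phenix-sd3", "pvr-phenix-9840"],
               "cos": [111, 112],           # e.g. raw vs. DST hierarchy
               "movers": ["mover1"]},
    "star":   {"pvrs": ["pvr-star-sd3", "pvr-star-9840"],
               "cos": [121, 122],
               "movers": ["mover2"]},
}

def resources_for(group):
    """Return the dedicated PVRs, classes of service, and movers for a group."""
    return RESOURCE_MAP[group]

# One mover per group isolates the groups' traffic, but, as noted above, it
# also exposes each group completely to a single machine failure.
print(resources_for("phenix"))
```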
HPSS Topology
[Diagram: the Core server and movers M1-M3 share two networks: Net 1 - Data (1000baseSX Gigabit Ethernet) and Net 2 - Control (100baseT). The STK silo is attached via 10baseT, with N x PVR. Clients enter through pftpd, with routing between the networks.]
HPSS Performance
80MB/s for the disk subsystem.
~1 CPU per 40MB/s of TCP/IP Gigabit traffic @ 1500 MTU, or ~1 CPU per 90MB/s @ 9000 MTU.
>9MB/s per SD-3 transport.
~10MB/s per 9840 transport.
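The mover CPU cost of network data movement can be estimated directly from these figures; the 200MB/s target in the example below is arbitrary, chosen only to match the DST rate from the Overview slide.

```python
# Estimate mover CPUs consumed by TCP/IP Gigabit traffic, using the per-CPU
# rates quoted above: ~40 MB/s per CPU @ 1500 MTU, ~90 MB/s per CPU @ 9000 MTU.
import math

MB_PER_CPU = {1500: 40, 9000: 90}

def cpus_for(throughput_mb_s, mtu=1500):
    """CPUs needed to drive a given TCP/IP throughput at a given MTU."""
    return math.ceil(throughput_mb_s / MB_PER_CPU[mtu])

for mtu in (1500, 9000):
    print(f"{mtu} MTU: {cpus_for(200, mtu)} CPUs for 200 MB/s")
# Jumbo frames (9000 MTU) cut the CPU cost of data movement by more than half,
# which matters because each mover has only four CPUs.
```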
I/O Intensive Systems
Mining and analysis systems: high I/O & moderate CPU usage.
To avoid large network traffic, merge the file servers with the HPSS movers:
– Major problem: HPSS support on non-AIX platforms.
– Options: several (Sun) SMP machines or a large (SGI) modular system.
Problems
Short life cycle of the SD-3 heads:
– ~500 hours, i.e. less than 2 months at average usage (6 of 10 drives failed in 10 months).
– Built a monitoring tool to try to predict transport failure, based on soft-error frequency (see the sketch below).
Low-throughput interface (F/W) for SD-3: high slot consumption.
SD-3 production discontinued?! 9840 ???
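The failure-prediction tool itself is not described in the talk; the sketch below only illustrates the general idea of flagging a transport whose soft-error rate is abnormally high. The threshold and the input format are assumptions.

```python
# Minimal sketch of flagging tape transports by soft-error frequency.
# The real BNL monitoring tool is not described in the talk; the threshold
# and the (drive, GB moved, soft errors) input format are invented here.
from collections import defaultdict

SOFT_ERRORS_PER_GB_WARN = 0.05   # hypothetical warning threshold

def flag_suspect_drives(samples):
    """samples: iterable of (drive_id, gigabytes_moved, soft_errors) tuples."""
    totals = defaultdict(lambda: [0.0, 0])
    for drive, gb, errors in samples:
        totals[drive][0] += gb
        totals[drive][1] += errors
    suspects = []
    for drive, (gb, errors) in totals.items():
        rate = errors / gb if gb else 0.0
        if rate > SOFT_ERRORS_PER_GB_WARN:
            suspects.append((drive, round(rate, 3)))
    return suspects

samples = [("redwood-03", 120.0, 2), ("redwood-07", 95.0, 11)]
print(flag_suspect_drives(samples))   # only redwood-07 exceeds the threshold
```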
Issues
Tested the two-tape-layer hierarchies:
– Cartridge-based migration.
– Manually scheduled reclaim.
Work with large files: preferably ~1GB, tolerably >200MB (see the bundling sketch below).
– Is this still true with 9840 tape transports?
Don't even think about NFS; wait for DFS/GPFS?
– We use pftp exclusively.
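One common way to respect a "large files only" rule is to bundle small files into roughly 1GB archives on the client side before they are sent (for example via pftp) into HPSS. The talk does not say this is what RHIC does; the sketch below is only an illustration, with the size target taken from the numbers above.

```python
# Illustrative sketch: pack small files into ~1 GB tarballs before archiving,
# so the tape system only ever sees large files. Not a description of the
# actual RHIC procedure; the bundle size mirrors the "preferably ~1GB" rule.
import os
import tarfile

TARGET_BUNDLE_BYTES = 1_000_000_000   # aim for ~1 GB per bundle

def bundle_files(paths, out_prefix):
    """Group the given files into tarballs of roughly TARGET_BUNDLE_BYTES."""
    bundles, current, current_size = [], [], 0
    for path in paths:
        size = os.path.getsize(path)
        if current and current_size + size > TARGET_BUNDLE_BYTES:
            bundles.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        bundles.append(current)

    names = []
    for i, members in enumerate(bundles):
        name = f"{out_prefix}-{i:04d}.tar"
        with tarfile.open(name, "w") as tar:
            for member in members:
                tar.add(member)
        names.append(name)
    return names   # these ~1 GB tarballs are what would be transferred to HPSS
```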
Issues
Guarantee availability of resources for specific user groups:
– Separate PVRs & movers.
– But: total exposure to single-machine failure!
Reliability:
– Distribute resources across movers, i.e. share movers (acceptable?).
– Inter-mover traffic:
  • 1 CPU per 40MB/s of TCP/IP per adapter: expensive!!!
Inter-Mover Traffic - Solutions
Affinity:
– Limited applicability.
Diskless hierarchies (not for DFS/GPFS):
– Not for SD-3; not enough tests on 9840.
High-performance networking: the SP switch (this is your friend):
– IBM only.
Lighter protocol: HIPPI:
– Expensive hardware.
Multiply attached storage (SAN): most promising! See STK's talk. Requires HPSS modifications.
Summary
HPSS works for us.
Buy an SP2 and the SP switch:
– Simplified administration. Fast interconnect. Ready for GPFS.
Keep an eye on STK's SAN/RAIT.
Avoid SD-3 (not a risk anymore).
Avoid small-file access, at least for the moment.
Thank you!
Razvan Popescu
[email protected]