Forschungszentrum Karlsruhe in der Helmholtz-Gemeinschaft
File Systems for your Cluster
Selecting a storage solution for tier 2
Suggestions and experiences
Jos van Wezel, Institute for Scientific Computing
Karlsruhe, Germany
Overview
• Estimated sizes and needs
• GridKa today and roadmap
• Connection models
• Hardware choices
• Software choices
• LCG
Scaling the tiers
• Tier 0: 2 PB disk, 10 PB tape, 6000 kSi (data collection, distribution to Tier 1)
• Tier 1: 1 PB disk, 10 PB tape, 2000 kSi (data processing, calibration, archiving for Tier 2, distribution to Tier 2)
• Tier 2: 0.2 PB disk, no tape, 3000 kSi (data selections, simulation, distribution to Tier 3)
• Tier 3: location and/or group specific

1 Opteron today ~ 1 kSi
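To get a feel for what these figures mean in machine counts, here is a minimal sketch; the only input besides the per-tier numbers above is the slide's own rule of thumb that one current Opteron delivers about 1 kSi.

```python
# Rough tier sizing from the figures above (illustrative only).
# Assumption taken from this slide: one current Opteron ~ 1 kSi.
KSI_PER_OPTERON = 1

tiers = {
    # name: (disk_PB, tape_PB, compute_kSi)
    "Tier 0": (2.0, 10.0, 6000),
    "Tier 1": (1.0, 10.0, 2000),
    "Tier 2": (0.2, 0.0, 3000),
}

for name, (disk_pb, tape_pb, ksi) in tiers.items():
    cpus = ksi / KSI_PER_OPTERON
    print(f"{name}: {disk_pb} PB disk, {tape_pb} PB tape, "
          f"{ksi} kSi ~ {cpus:.0f} Opteron-class CPUs")
```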
GridKa growth
[Chart: GridKa growth from 2002 to 2009 through LCG Phases I-III, showing disk and tape capacity (TB) and compute power (kSI95 CPU).]
Storage at GridKa
[Diagram: GPFS servers (each 1 x 2 Gb FC, 1-2 x 1 Gb Ethernet) sit between the Fibre Channel SAN and the TCP/IP network to the cluster nodes; the SAN holds RAID 5 devices (32 x 2 Gb FC, 1120 disks: 120 TB). GPFS is exported via NFS to the nodes, dCache via dcap; 2 pool nodes (1 TB disks) connect over TCP/IP to a TSM server and the tape library.]
GridKa road map
• 2004-2005
  – expand and stabilize the GPFS / NFS combination
  – possibly install Lustre
  – integrate dCache
  – look for an alternative to TSM (only if really needed!)
  – try SATA disks
• 2004-2007
  – decide the path for a parallel FS and dCache
  – decide the tape backend
  – scale for LHC (200-300 MB/s continuous for some weeks)
Tier 2 targets (source: G. Quast / Uni-KA)

• 5 MB/s per node throughput
• 300 nodes
• 1000 MB/s
• 200 TB overall disk storage
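A quick sanity check of these targets, as a minimal sketch; it assumes the 5 MB figure is a per-node streaming rate in MB/s and that not all 300 nodes read at full rate at the same time.

```python
# Sanity check of the Tier 2 targets above (illustrative only).
nodes = 300
per_node_mb_s = 5       # assumed per-node streaming rate in MB/s
aggregate_mb_s = 1000   # planned aggregate throughput
disk_tb = 200           # overall disk storage

print(f"demand if all nodes stream at once: {nodes * per_node_mb_s} MB/s")
print(f"planned aggregate {aggregate_mb_s} MB/s -> "
      f"~{aggregate_mb_s // per_node_mb_s} concurrent full-rate readers")
hours = disk_tb * 1e6 / aggregate_mb_s / 3600
print(f"one full pass over {disk_tb} TB at {aggregate_mb_s} MB/s: ~{hours:.0f} h")
```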
Estimate your needs (1)
• Can you charge for the storage?
  – influences the choice between on-line and off-line (tape) storage
  – classification of data (volatile, precious, high IO, low IO)
• How many nodes will access the storage simultaneously?
  – absolute number of nodes
  – number of nodes that run a particular job
  – job classification to separate accesses
Estimate your needs (2)
• What kind of access (read/write/transfer sizes)?
  – ability to control the access pattern
    • pre-staging
    • software tuning
  – job classification to influence the access pattern
    • spread via the scheduler
• What size will the storage eventually have? (see the sketch below)
  – exploit random access via a large number of controllers
  – up to 4 TB or 100 MB/s per controller
  – need high-speed disks
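Scaling the per-controller figures above to the 200 TB Tier 2 target gives a rough feel for the controller count and the bandwidth it buys. A minimal sketch; the 200 TB target and the 4 TB / 100 MB/s per-controller figures come from these slides, everything else is illustrative.

```python
import math

# Per-controller figures from the bullets above; 200 TB is the Tier 2 target.
tb_per_controller = 4
mb_s_per_controller = 100
target_tb = 200

controllers = math.ceil(target_tb / tb_per_controller)
print(f"{target_tb} TB -> ~{controllers} controllers -> up to "
      f"~{controllers * mb_s_per_controller} MB/s if accesses are spread evenly")
```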
Disk technology keys

• Disk areal density is higher than tape's
  – disks are rigid
• The density growth rate for disks continues (but more slowly)
  – deviation from Moore's law (same for CPU)
• The superparamagnetic effect is not yet influencing progress
  – the end has been in sight for 20 years
• The convergence of disk and tape costs has stopped
  – still a factor of 4 to 5 difference

Disks and tape will both be around for at least another 10 years.
Disk areal density vs. head-media spacing

[Chart: areal density (Mb/in.²) versus head-to-media spacing (nm), from the IBM RAMAC (1956: 5 MB, ~2,000 b/in.²) to the Hitachi Deskstar 7K400 (2004: 400 GB, 61 Gb/in.²).]
To SATA or not, compared to SCSI/FC

• Up to 4 times cheaper (3 k€/TB vs. 10 k€/TB; see the cost sketch below)
• 2 times slower in a multi-user environment (access time)
• Not really for 24/7 operation (more failures)
• Larger capacity per disk: max 140 GB SCSI vs. 400 GB SATA (today)
• No large-scale experience
• Warranty of drives for only 1 or 2 years
• GridKa uses SCSI, SAN and expensive controllers
• Bad experiences with IDE NAS boxes (160 GB disks, 3Ware controllers)
• New products with SATA disks and expensive controllers
• IO operations are more important than throughput for most accesses
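To put the price gap in perspective, a minimal sketch that applies the per-TB prices above to the 200 TB Tier 2 target; prices and capacity are taken from these slides, the comparison itself is only illustrative.

```python
# Disk cost from the per-TB prices above, applied to the 200 TB Tier 2 target.
target_tb = 200
price_keur_per_tb = {"SATA": 3, "SCSI/FC": 10}

for tech, keur_per_tb in price_keur_per_tb.items():
    print(f"{tech}: {target_tb} TB ~ {target_tb * keur_per_tb} k EUR")
```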
Network attached storage
[Diagram: cluster nodes reach the file servers over an IP network (I/O path via the network); the file servers see their disks locally over Fibre Channel or SCSI (local I/O path).]
NAS example
• Server with 4 dual SCSI buses
  – more than 1 GB/s transfer
• 4 x 2 SATA RAID boxes (16 x 250 GB)
  – ~4 TB per bus
• 2 x 4 x 2 x 4 = 72 TB on a server
• Estimated 30 k€, or 35 k€ with point-to-point FC. Not that bad.
SAN

[Diagram: cluster nodes, each with a direct I/O path to the storage via the SAN (Fibre Channel) or iSCSI.]
SAN or Ethernet
• SAN has easier management
  – exchange of hardware without interruption
  – joining separate storage elements
• iSCSI needs a separate net (SCSI over IP)
• Very scalable performance
  – via switches or directors
• 1 SCSI bus maxes out at 320 MB/s
  – better than current FC, but FC is duplex
  – not a fabric
  – example follows
• ELVM for easier management
• Network block device
• Kernel 2.6: new 16 TB limit
• SAN is expensive (500 EUR per HBA, 1000 EUR per switch port)
• A direct-connection limitation can be partly compensated by a high-speed interconnect (InfiniBand, Myrinet, etc.)
• Tightly coupled cluster with InfiniBand; can be used for FC too, depending on the FS software
Combining FC and InfiniBand

[Diagram: one group of cluster nodes reaches the SAN disk collection over an FCP network, a second group over an InfiniBand network.]
Software to drive the hardware
• File systems
  – GPFS (IBM) (GridKa uses this, so does Uni-KA)
  – SAN-FS (IBM, $$) supports a range of architectures
  – Lustre (HP, $) (Uni-KA Rechenzentrum cluster)
  – PVFS (stability is rather low)
  – GFS (now Red Hat) or OpenGFS
  – NFS
    • the Linux implementation is messy, but RH EL 3.0 seems promising
    • NAS boxes reach impressive throughput, are stable, easy to manage and grow as needed (NetApp, Exanet)
  – Terragrid (very new)
• (Almost-POSIX) access via a library preload (see the sketch below)
  – write once / read many
  – changing a file means creating a new one and deleting the old
  – not usable for all software (e.g. no DBMS!)
  – examples: GridFTP (gfal), (x)rootd (rfio), dCache (dcap/gfal/rfio)
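To make the write-once/read-many point concrete, here is a minimal sketch of what "changing" a file looks like on such a store: the application writes a complete new file and removes the old one rather than updating in place. The function name and the `.v2` naming scheme are purely illustrative and not part of any of the libraries named above.

```python
import os

def replace_worm_file(path: str, new_content: bytes) -> None:
    """'Change' a file on a write-once/read-many store (illustrative sketch).

    In-place updates are not possible, so the new version is written as a
    separate file first and the old one is deleted afterwards.
    """
    new_path = path + ".v2"           # hypothetical naming scheme
    with open(new_path, "wb") as f:   # write the complete new version
        f.write(new_content)
    os.remove(path)                   # only then drop the old version
```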
GPFS
[Diagram: blocks of files A, B, C and D striped across several disks.]
• Stripes over n disks (see the sketch below)
• Linux and AIX, or combined
• Max FS size 70 TB
• HSM option
• Scalable and very robust
• Easy management
• SAN, IP+SAN, or IP only
• Add and remove storage on-line
• Vendor lock-in
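As a toy illustration of the "stripes over n disks" idea, here is a minimal round-robin block placement; this is not GPFS's actual on-disk layout, only the basic principle a striped file system applies.

```python
# Toy round-robin striping: map consecutive file blocks onto n disks.
# This only illustrates the idea; GPFS's real placement, replication and
# locking are far more involved.
def stripe_blocks(file_size: int, block_size: int, n_disks: int):
    blocks = (file_size + block_size - 1) // block_size
    return [(i, i % n_disks) for i in range(blocks)]   # (block, disk) pairs

# Example: a 1 MiB file in 256 KiB blocks over 4 disks -> one block per disk.
print(stripe_blocks(1 << 20, 256 << 10, 4))
```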
Accumulated throughput as a function of the number of nodes/RAID arrays (GPFS)

[Chart: aggregate read and write throughput (MB/s, axis up to 1200) versus the number of nodes/RAID arrays (1 to 10).]
SAN FS
[Diagram: Linux / Windows / Mac clients running the STFS file system talk to a metadata server cluster over the IP network (Storage Tank protocol over TCP or UDP) and access the file data volumes directly over the SAN (Fibre Channel / iSCSI); separate metadata volumes hold attributes and policies.]

• Metadata server failover
• Policy-based management
• Add and remove storage on-line
• $$$
LUSTRE
[Diagram: clients reach the metadata servers (active MDS with failover) and Linux OST servers with (SAN) disks over the IP network.]

• Object based
• LDAP config database
• Failover of OSTs
• Support for heterogeneous networks, e.g. InfiniBand
• Advanced security
• Open source
SRM: Storage Resource Manager

• Glue between the worldwide grid and local mass storage (SE)
• A storage element should offer:
  – GridFTP
  – an SRM interface
  – information publication via MDS
• LCG has SRM2 almost … ready; SRM1 is in operation
• SRM is built upon known MSS (CASTOR, dCache, Jasmine)
• dCache implements SRM v1
User SRM interaction
Legend:
  – LFN: logical file name
  – RMC: replication metadata catalog
  – GUID: grid unique identifier
  – RLC: replica location catalog
  – RLI: replica location index
  – RLC + RLI = RLS
  – RLS: replica location service
  – SURL: site URL
  – TURL: transfer URL

[Diagram: the user resolves an LFN to a GUID via the RMC, the GUID to a SURL via the RLS, and asks the SRM of the managed storage for a TURL; the file is then opened, read/written and closed through GFAL/dCache, with the SRM pinning the file on open() and releasing it on close().]
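A minimal, runnable toy model of that chain as it is drawn on the slide; the dictionaries stand in for the catalogs and the function name is made up, so none of this is the real GFAL/dcap/SRM client API.

```python
# Toy model of the chain drawn above: LFN -> GUID (RMC), GUID -> SURL (RLS),
# SURL -> TURL (SRM), then I/O through the preloaded library (GFAL/dcap).
# The catalogs are plain dictionaries here and the names are made up; the
# real services are networked components with their own client APIs.

rmc = {"lfn:/grid/expt/run42.root": "guid-0001"}                  # LFN  -> GUID
rls = {"guid-0001": "srm://se.example.org/pnfs/expt/run42.root"}  # GUID -> SURL

def srm_prepare_to_get(surl: str) -> str:
    """Toy SRM call: hand back a transfer URL (a real SRM would also pin the file)."""
    return surl.replace("srm://", "dcap://")                      # SURL -> TURL

lfn = "lfn:/grid/expt/run42.root"
guid = rmc[lfn]
surl = rls[guid]
turl = srm_prepare_to_get(surl)
print(" -> ".join([lfn, guid, surl, turl]))
# The application would now open the TURL, read or write, close it, and the
# SRM would release the pin, matching the open()/PIN ... close()/Release flow.
```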
In short
• Loosely coupled cluster: Ethernet
• Tightly coupled cluster: InfiniBand
• From 100 to 200 TB: locally attached storage, NFS and/or RFIO
• Above 200 TB: SAN, cluster file system and RFIO
• HSM via dCache
  – Grid SRM interface
  – tape: TSM / GSI solution?? or Vanderbilt Enstor
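Those rules of thumb can be read as a tiny decision table; a minimal sketch that only restates the bullets above (the thresholds and labels come from this slide, the function itself is illustrative).

```python
# Tiny decision helper restating the rules of thumb above (illustrative only).
def suggest_storage(capacity_tb: float, tightly_coupled: bool) -> str:
    interconnect = "InfiniBand" if tightly_coupled else "Ethernet"
    if capacity_tb <= 200:
        layout = "locally attached storage, NFS and/or RFIO"
    else:
        layout = "SAN with a cluster file system and RFIO"
    return f"{interconnect}; {layout}; HSM via dCache behind an SRM interface"

print(suggest_storage(150, tightly_coupled=False))
print(suggest_storage(400, tightly_coupled=True))
```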
Some encountered difficulties

• Prescribed chain of software revision levels
  – support is given only to those who live by the rules
  – disk -> controller -> HBA -> driver -> kernel -> application
• Linux limitations (see the sketch below)
  – block addressability < 2^31
  – number of LUs < 128
• NFS on Linux is a moving target
  – enhancements or fixes almost always introduce new bugs
  – limited experience in large (> 100 clients) installations
• Storage units become difficult to handle
  – exchanging 1 TB and rebalancing a live 5 TB file system takes 20 hrs
  – restoring a 5 TB file system can take up to a week
  – procurement needs 1 FTE per 10^6 €
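To see where the 2^31 block limit bites, and how it may relate to the 16 TB figure quoted earlier for kernel 2.6, a minimal sketch; the 512-byte sector and 4 KiB page sizes are assumptions on my part and are not stated on the slides.

```python
# Rough arithmetic behind the addressability limits mentioned above.
# Assumptions (not stated on the slides): 512-byte sectors for the 2^31
# limit, and 4 KiB pages behind the 16 TB figure quoted for kernel 2.6.
sector = 512
page = 4096

print(f"2^31 x 512 B = {(2**31) * sector / 2**40:.0f} TiB")   # ~1 TiB
print(f"2^32 x 4 KiB = {(2**32) * page / 2**40:.0f} TiB")     # ~16 TiB
```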
Thank you for your attention