Fermilab Mass Storage System


Transcript of Fermilab Mass Storage System

Page 1: Fermilab Mass Storage System

Fermilab Mass Storage System

Gene Oleynik, Integrated Administration, Fermilab

Page 2: Fermilab Mass Storage System

Introduction

Fermi National Accelerator Laboratory (FNAL)

• The premier, highest-energy particle accelerator in the world (for about two more years)
• Accelerates protons and antiprotons in a 4-mile circle and crashes them together in collision halls to study elementary particle physics
• Digitized data from the experiments are written into the FNAL Mass Storage System

This talk covers:

• Overview of the FNAL Mass Storage System
• Users and use patterns of the system
• Details on the MSS, with emphasis on features that enable scalability and performance

Page 3: Fermilab Mass Storage System

Introduction

• Multi-petabyte tertiary tape store for worldwide HEP and other scientific endeavors: THE site store
• High availability (24x7)
• On- and off-site (off-continent) access

[Plot: PB written per year, 2003-2009 projection (current: 2.6 PB)]

• Scalable hardware and software architecture
• Front-end disk caching
• Evolves to meet evolving requirements

[Plot: TB/day transferred per year, 2005-2009 projection (current: 25 TB/day)]

Page 4: Fermilab Mass Storage System

Users

• Local HEP experiments: CDF, D0, MINOS, MiniBooNE (> 1 PB/yr, 25 TB/day peak transfers)
• Remote HEP experiments: Tier 1 site for CMS via Grid from CERN, Switzerland (3.5 PB/yr from 2007 on)
• Other remote scientific endeavors: Sloan Digital Sky Survey (ships DLT tapes written at Apache Point Observatory, New Mexico), Auger (Argentina)
• Quite a few others: Lattice QCD theory, for example

Page 5: Fermilab Mass Storage System

The DZero Experiment
The CDF Experiment
The CMS Experiment
Sloan Digital Sky Survey
And many others: KTeV, MINOS, MiniBooNE, …

Page 6: Fermilab Mass Storage System

• Users write RAW data to the MSS, analyze/reanalyze it in real time on PC “farms”, then write results into the MSS
• ~3 bytes read for every byte written
• Distribution of cartridge mounts skewed to large numbers of mounts
• Lifetime of data on the order of 5-10 years or more

Page 7: Fermilab Mass Storage System

Plot of User Activity

Page 8: Fermilab Mass Storage System

Software

enstore
• In-house; manages files, tape volumes, and tape libraries
• End-user direct interface to files on tape

dCache
• Joint DESY and Fermilab; disk-caching front end
• End-user interface to read cached files and to write files to enstore indirectly via dCache

pnfs
• DESY file namespace server (nfs interface) used by both enstore and dCache

Page 9: Fermilab Mass Storage System

Current Facilities

• 6 StorageTek 9310 libraries (9940B, 9940A)
• 1 three-quadratower AML/2 library (LTO-1, LTO-2, DLT; 2 quads implemented)
• > 100 tape drives and associated “mover” computers
• 25 enstore server computers
• ~100 dCache nodes with ~225 terabytes of RAID disk
• Allocated among 3 instantiations of enstore:
  - cdfen (CDF experiment): 2 9310s + 40 drives + plenty of dCache
  - d0en (D0 experiment): 2 9310s + 40 drives, 1 AML/2 quad + 20 drives
  - stken (general purpose + CMS): 2 9310s + 23 drives, 1 AML/2 quad (DLTs) + 8 drives
• Over 25,000 tapes with 2.6 PB of data

Page 10: Fermilab Mass Storage System

Architecture

• File based; the user is presented with an nfs namespace (ls, rm, etc.), but the files are stubs
• Copy program to move files in and out of mass storage: encp
• Copy programs to move files in and out of mass storage via dCache: dccp, ftp, gridftp, SRM
• Details of the storage are all hidden behind the file system interface and the copy interfaces
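The slides give no code, but the stub-plus-copy model can be illustrated with a small Python toy: the namespace holds tiny stub entries that ordinary file tools can see, while the real bytes live in a separate store and move only through a copy interface. All names here are invented for illustration; this sketch does not invoke encp or dccp and is not enstore's actual layout.

```python
import os
import tempfile

class StubNamespace:
    """Toy model of a stub-based namespace (illustrative only)."""

    def __init__(self, root):
        self.root = root
        self.store = {}  # stand-in for the tape/dCache back end

    def write(self, name, data):
        """Copy in (encp-like): park the bytes, leave a metadata stub."""
        self.store[name] = data
        with open(os.path.join(self.root, name), "w") as f:
            f.write(f"stub size={len(data)}\n")  # stub holds metadata only

    def read(self, name):
        """Copy out (encp-like): fetch the real bytes from the store."""
        return self.store[name]

    def ls(self):
        return sorted(os.listdir(self.root))  # plain fs tools still work

with tempfile.TemporaryDirectory() as root:
    ns = StubNamespace(root)
    ns.write("run1234.raw", b"detector payload")
    print(ns.ls())                 # the stub is visible like a normal file
    print(ns.read("run1234.raw"))  # the copy interface returns real data
```

The point of the design, as the slide says, is that `ls` and `rm` work on cheap stubs while the heavy transfer goes through a dedicated copy program.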

Page 11: Fermilab Mass Storage System

Architecture

• Based on commodity PCs running Linux, typically dual 2.4 GHz Xeons
• Mover nodes for data transfer
• Server nodes for managing metadata: library, volume, file, pnfs namespace, logs, media changer; external RAID arrays; postgreSQL
• Administrative, monitor, door, and pool nodes for dCache

Page 12: Fermilab Mass Storage System

Architecture

Technology independent
• Media changers isolate library technology
• Movers isolate tape drive technology

Flexible policies for managing requests and accommodating rates
• Fair share for owners of drives
• Request priority
• Family width
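How fair share and request priority can combine is easy to sketch: serve the highest-priority request of whichever owner is furthest below their fair-share weight. This is a toy in Python, not enstore's actual scheduling algorithm; all names and weights are illustrative.

```python
from collections import defaultdict

def schedule(requests, shares, rounds):
    """Toy fair-share + priority scheduler (illustrative only).

    requests: {owner: [(priority, request_name), ...]}
    shares:   {owner: fair-share weight}; a higher weight buys more turns.
    Returns the order in which requests would be serviced."""
    usage = defaultdict(int)
    # each owner's queue, highest priority first
    pending = {o: sorted(rs, reverse=True) for o, rs in requests.items()}
    order = []
    for _ in range(rounds):
        candidates = [o for o in pending if pending[o]]
        if not candidates:
            break
        # pick the owner furthest below their fair share who has work
        owner = min(candidates, key=lambda o: usage[o] / shares[o])
        _prio, name = pending[owner].pop(0)
        usage[owner] += 1
        order.append(name)
    return order

reqs = {"cdf": [(5, "cdf-hi"), (1, "cdf-lo")],
        "d0":  [(3, "d0-a"), (3, "d0-b")]}
# cdf owns twice the drive share, so it gets two of the first three turns
print(schedule(reqs, shares={"cdf": 2, "d0": 1}, rounds=4))
```

Within an owner the priority field decides the order; across owners the share weights do, which matches the slide's pairing of "fair share for owners of drives" with "request priority".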

Page 13: Fermilab Mass Storage System

Architecture

• Number of nodes of any type is unrestricted
• Number of active users is unrestricted
• Scale:
  - storage, by building out libraries
  - transfer rates, by building out movers and tape drives
  - request load, by building out servers
• Clients’ needs are somewhat unpredictable; the system typically must scale on demand

Page 14: Fermilab Mass Storage System

Managing File & Metadata Integrity

File protection: procedural policies
• Files removed by users are not deleted
• Separation of roles: the owner is in the loop on recycles
• Physical protection against modification of filled tapes
• “Cloning” of problem tapes or tapes with large mount counts

File protection: software policies
• Extensive CRC checking throughout all transactions
• Automated periodic CRC checking of randomly selected files on tapes
• Flexible read-after-write policy
• Automated CRC checking of dCache files
• Automated disabling of access to tapes or tape drives that exhibit read or write problems
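The "periodic CRC checking of randomly selected files" policy can be sketched in a few lines of Python: recompute a CRC-32 over each sampled file and compare it against the value recorded at write time. This is illustrative only; enstore's actual CRC flavor and metadata layout may differ, and the catalog structure here is invented.

```python
import random
import zlib

def crc32_of(data, chunk=64 * 1024):
    """CRC-32 computed incrementally, as one would stream a large file."""
    crc = 0
    for i in range(0, len(data), chunk):
        crc = zlib.crc32(data[i:i + chunk], crc)
    return crc & 0xFFFFFFFF

def spot_check(catalog, sample_size, rng=random):
    """catalog: {name: (stored_crc, data)}. Returns names that mismatch."""
    picked = rng.sample(sorted(catalog), min(sample_size, len(catalog)))
    return [name for name in picked
            if crc32_of(catalog[name][1]) != catalog[name][0]]

good = b"raw detector data"
catalog = {"ok.raw":  (crc32_of(good), good),
           "bad.raw": (crc32_of(good), b"raw detector dat\x00")}  # bit rot
print(spot_check(catalog, sample_size=2))  # flags only the corrupted file
```

In the real system a flagged file would raise an alarm and, per the slide, could disable access to the offending tape or drive rather than silently continuing.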

Page 15: Fermilab Mass Storage System

Managing File & Metadata Integrity

Metadata protection
• Redundancy in database information (file, volume, pnfs)
• RAID
• Backups/journaling: 1-4 hr cyclic, archived to tape
• Automated database checks
• Replicated databases (soon)

Page 16: Fermilab Mass Storage System

Migration & Technology

• Migrate to denser media to free storage space
• Evolve to newer tape technologies
• A normal part of the workflow; users never lose access to files
• Built-in tools perform the migration, but staff are always in the loop

Recent efforts:
• 2004, CDF: 4457 60 GB 9940A cartridges to 200 GB 9940B cartridges; 3000 slots and tapes freed for reuse; only 42 file problems encountered, all fixed
• 2004-2005, stken: 1240 20 GB Eagle tapes to 200 GB 9940B tapes on the general-purpose stken system, freeing 1000 slots
• 2005: migration of 9940A to 9940B on stken about to start
• And on it goes: LTO-1 to LTO-2 still to migrate
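The shape of such a migration pass, verify-after-copy with staff kept in the loop, can be sketched as a Python toy. The function names and the dict-as-volume model are invented for illustration; this is not enstore's migration tool.

```python
import zlib

def migrate(src_volume, write_to_dst, read_back):
    """Toy migration pass: copy each file to the denser destination,
    verify a CRC-32 on read-back, and flag mismatches for staff.

    src_volume:   {name: bytes} on the old media
    write_to_dst: callable(name, data) writing to the new media
    read_back:    callable(name) -> bytes read from the new media
    Returns the list of problem files; nothing is deleted here --
    the source is freed for reuse only after an operator signs off."""
    problems = []
    for name, data in src_volume.items():
        write_to_dst(name, data)
        if zlib.crc32(read_back(name)) != zlib.crc32(data):
            problems.append(name)  # staff in the loop: flag, don't drop
    return problems

dst = {}
def flaky_write(name, data):
    # simulate one bad write on the destination media
    dst[name] = data if name != "run2.raw" else data[:-1]

src = {"run1.raw": b"aaaa", "run2.raw": b"bbbb"}
print(migrate(src, flaky_write, dst.get))  # the truncated copy is flagged
```

This mirrors the slide's numbers in spirit: thousands of files move automatically, and the handful that fail verification (42 in the 2004 CDF pass) surface as an explicit problem list for staff.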

Page 17: Fermilab Mass Storage System

Administration & Maintenance

• The most difficult aspect to scale
• Real-time, 24 hours/day operation requires diligent monitoring
• 4 administrators, 3 enstore developers, 3.5 dCache developers
• 24x7 vendor support on tape libraries and tape drives

Page 18: Fermilab Mass Storage System

Administrators

• Rotate 24x7 on-call primary and secondary
• Monitor and maintain 7 tape libraries with > 100 tape drives, 150 PCs, and 100 file servers
• Recycle volumes
• Monitor capacity vs. use
• Clone overused tapes
• Troubleshoot problems
• Install new hardware and software

Page 19: Fermilab Mass Storage System

Administrative Support

• Automatic generation of alarms by the Enstore and dCache software; these generate tickets and page administrators
• In-house helpdesk during working hours and off-hours call centers generate pages and tickets
• Extensive web-based plots, logs, status, statistics and metrics, alarms, and system information
• Animated graphical view of Enstore resources and data flows

Page 20: Fermilab Mass Storage System

Administration Monitoring

• States of the Enstore servers
• Amounts of Enstore resources, such as tape quotas and the number of working movers
• User request queues
• Plots of data movement, throughput, and tape mounts
• Volume information
• Generated alarms
• Completed transfers
• Computer uptime and accessibility
• Resource usage (memory, CPU, disk space)
• Status and space utilization on dCache systems

Page 21: Fermilab Mass Storage System

Admin Metrics, a Random Week

• 1 9940B, 1 9940A, 1 LTO-2 replaced
• 3 mover interventions
• 4 server interventions
• 2 tape drive interventions
• 2 file server (dCache) interventions
• 3 tape interventions
• 4 file interventions
• 40 tapes labeled/entered
• 2 tapes cloned
• 3 Enstore service requests
• 1 data integrity issue

Page 22: Fermilab Mass Storage System

Administration Observations

• We find that for every drive we own, we replace one per year
• With our usage patterns, we haven’t discerned cartridge/drive reliability differences between LTO and 9940
• Lots of tape and drive interventions
• A large distributed system requires complex correlation of information to diagnose problems

Page 23: Fermilab Mass Storage System

Performance

Page 24: Fermilab Mass Storage System

Conclusion

• Constructed a multi-petabyte tertiary store that can easily scale to meet storage, data integrity, transaction, and transfer-rate demands
• Can support different and new underlying tape and library technologies
• Current store is 2.6 PB, growing at 1 PB/yr and 20 TB/day peak; expect a fourfold increase in 2007 (CMS Tier 1 goes online)