Lessons Learned from Managing a Petabyte
Jacek Becla, Stanford Linear Accelerator Center (SLAC)
Daniel Wang, now University of California, Irvine; formerly SLAC
CIDR’05, Asilomar, CA
Roadmap
Who we are
Simplified data processing
Core architecture and migration
Challenges / surprises / problems
Summary
Don’t miss the “lessons”, just look for yellow stickers
Who We Are
Stanford Linear Accelerator Center
– DoE National Lab, operated by Stanford University
BaBar
– one of the largest High Energy Physics (HEP) experiments online
– in production since 1999
– over a petabyte of production data
HEP
– data-intensive science
– statistical studies
– needle-in-a-haystack searches
Simplified Data Processing
A Typical Day in the Life (SLAC only)
~8 TB accessed in ~100K files
~7 TB in/out of tertiary storage
2–5 TB in/out of SLAC
~35K jobs complete
– 2,500 running at any given time
– many long-running jobs (up to a few days)
Some of the Data-related Challenges
Finding perfect snowflake(s) in an avalanche
Volume
Organizing data
Dealing with I/O
– sparse reads
– random access
– small object size: O(100) bytes (see the clustering sketch below)
Providing data for many tens of sites
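The small-object problem above is usually attacked by clustering: packing many O(100)-byte objects that will be read together into one large container, so a single sequential read replaces thousands of random ones. A minimal C++ sketch of the idea follows; the EventHit record and container layout are illustrative, not BaBar's actual schema.

#include <cstdint>
#include <fstream>
#include <vector>

// Illustrative ~100-byte record; real BaBar objects differ.
struct EventHit {
    std::uint32_t detectorId;
    std::uint32_t channel;
    double        energy;
    double        time;
    char          payload[80];
};

// Append a batch of hits that will likely be read together into one
// container file: one large sequential write instead of thousands of
// tiny, randomly placed ones.
void appendCluster(std::ofstream& container, const std::vector<EventHit>& hits) {
    container.write(reinterpret_cast<const char*>(hits.data()),
                    static_cast<std::streamsize>(hits.size() * sizeof(EventHit)));
}

// Reading the cluster back is a single sequential read as well.
std::vector<EventHit> readCluster(std::ifstream& container, std::size_t nHits) {
    std::vector<EventHit> hits(nHits);
    container.read(reinterpret_cast<char*>(hits.data()),
                   static_cast<std::streamsize>(nHits * sizeof(EventHit)));
    return hits;
}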
More Challenges…
Data Distribution
~25 sites worldwide produce data
– many more use it
Distribution pros/cons
+ keeps data close to users
– makes administration tougher
+ works as a backup
Kill two birds with one stone: replicate for availability as well as backup
Core Architecture
Mass Storage (HPSS)
– tapes cost-effective & more reliable than disks
160 TB disk cache, 40+ data servers
Database engine: ODBMS (Objectivity/DB)
– scalable thin-dataserver thick-client architecture
– gives full control over data placement & clustering
– ODBMS later replaced by a system built within HEP
DB-related code hidden behind a transient-persistent wrapper (see the sketch below)
Consider all factors when choosing software and hardware
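The transient-persistent wrapper decouples physics code from the storage technology: analysis code only sees transient C++ objects, and a thin persistence interface maps them to whatever back end is in use, which is what made replacing the ODBMS feasible. A minimal sketch of the pattern follows; class and method names are illustrative, not BaBar's actual interfaces.

#include <string>

// Transient object: what physics code sees. No storage details leak in.
struct Event {
    long        eventId;
    double      beamEnergy;
    std::string triggerTag;
};

// Persistence interface: the only place that knows about the back end.
class EventStore {
public:
    virtual ~EventStore() = default;
    virtual void  write(const Event& ev)   = 0;
    virtual Event read(long eventId) const = 0;
};

// One concrete back end, e.g. an ODBMS; a file-based or HEP-built store
// would implement the same interface, so swapping it never touches clients.
class OdbmsEventStore : public EventStore {
public:
    void  write(const Event& ev) override   { /* map Event -> persistent object */ }
    Event read(long eventId) const override { /* map persistent object -> Event */
        return Event{eventId, 0.0, ""};
    }
};

// Physics code depends only on the abstract interface.
void analyze(const EventStore& store, long id) {
    Event ev = store.read(id);
    // ... use ev.beamEnergy, ev.triggerTag ...
}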
Reasons to Migrate
ODBMS not mainstream
– true for HEP and elsewhere
– uncertain long-term future
Locked into certain OSes/compilers
Unnecessary DB overhead
– e.g. transactions for immutable data
Maintenance at small institutes
Monetary cost
Build a flexible system, be prepared for non-trivial changes. Bet on simplicity.
xrootd Data Server
Developed in-house
– becoming the de facto HEP standard now
Numerous must-have features, some hard to add to the commercial server (redirection & load balancing sketched below):
– deferral
– redirection
– fault tolerance
– scalability
– automatic load balancing
– proxy server
Larger systems depend more heavily on automation
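Redirection plus automatic load balancing work together: a client asks a redirector which data server holds (or should serve) a file and is sent to the least-loaded live one; a dead server simply stops being chosen, which also gives fault tolerance. The sketch below only illustrates the idea and is not xrootd's actual protocol or API.

#include <optional>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical view of one data server as the redirector sees it.
struct DataServer {
    std::string                     host;
    double                          load = 0.0;   // e.g. recent requests/sec
    bool                            alive = true;
    std::unordered_set<std::string> files;        // files it can serve
};

// Pick the least-loaded live server that has the file; the client is then
// redirected there. Dead servers are skipped, giving basic fault tolerance.
std::optional<std::string> redirect(const std::vector<DataServer>& servers,
                                    const std::string& file) {
    const DataServer* best = nullptr;
    for (const auto& s : servers) {
        if (!s.alive || s.files.count(file) == 0) continue;
        if (best == nullptr || s.load < best->load) best = &s;
    }
    if (best == nullptr) return std::nullopt;  // defer/queue rather than fail outright
    return best->host;
}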
More Lessons…
Challenges, Surprises, Problems
Organizing & managing data
– Divide into mutable & immutable, separate out queryable data (see the sketch below)
    immutable data is easier to optimize, replicate & scale
– Decentralize metadata updates
    contention happens in unexpected places
    makes data mgmt harder
    still need some centralization
Fault tolerance
– Large systems likely to use commodity hardware
    fault tolerance essential
Single technology likely not enough to efficiently manage petabytes
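One way to read the mutable/immutable split: raw and reconstructed event data never change once written, so they can be checksummed, replicated and cached freely, while the small mutable part (catalogs, bookkeeping) is kept separate where updates and contention can be managed. A hypothetical sketch of that separation, with illustrative types:

#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Immutable side: event files are write-once. A replica only needs a
// checksum comparison to be trusted, so replication and caching are cheap.
struct EventFileReplica {
    std::string   site;       // e.g. "SLAC", "IN2P3"
    std::string   path;
    std::uint64_t checksum;   // identical everywhere, forever
};

// Mutable side: a small catalog mapping logical files to replicas.
// Updates (new replicas, new runs) are confined to this structure,
// so locking and contention never touch the bulk data itself.
class ReplicaCatalog {
public:
    void addReplica(const std::string& logicalFile, EventFileReplica r) {
        replicas_[logicalFile].push_back(std::move(r));
    }
    const std::vector<EventFileReplica>& lookup(const std::string& logicalFile) const {
        static const std::vector<EventFileReplica> kEmpty;
        auto it = replicas_.find(logicalFile);
        return it == replicas_.end() ? kEmpty : it->second;
    }
private:
    std::map<std::string, std::vector<EventFileReplica>> replicas_;
};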
Challenges, Surprises, Problems (cont…)
Main bottleneck: disk I/O
– underlying persistency less important than one would expect
– access patterns more important: must understand them to derandomize I/O (see the sketch below)
Job mgmt/bookkeeping
– better to stall jobs than to kill them
Power, cooling, floor weight
Admin
Hide disruptive events by stalling data flow
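Derandomizing I/O typically means collecting the object reads a job will need, sorting them by file and offset, and issuing them in storage order so the disk (or tape) streams instead of seeking. A minimal sketch of that reordering, with hypothetical request types:

#include <algorithm>
#include <cstdint>
#include <string>
#include <tuple>
#include <vector>

// One pending object read, as a (file, offset, length) triple.
struct ReadRequest {
    std::string   file;
    std::uint64_t offset;
    std::uint32_t length;
};

// Sort requests by file, then by offset, so each file is swept once
// front-to-back instead of being hit with random seeks.
void derandomize(std::vector<ReadRequest>& requests) {
    std::sort(requests.begin(), requests.end(),
              [](const ReadRequest& a, const ReadRequest& b) {
                  return std::tie(a.file, a.offset) < std::tie(b.file, b.offset);
              });
}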
On Bleeding Edge Since Day 1
Huge collection of interesting challenges…
– Increasing address space
– Improving server code
– Tuning and scaling the whole system
– Reducing lock collisions
– Improving I/O
– …many others
In summary
– we made it work (a big success), but…
– continuous improvements were needed for the first several years to keep up
When you push limits, expect many problems everywhere. Normal maxima are too small. Observe, refine, repeat.
Uniqueness of the Scientific Community
Hard to convince the scientific community to use commercial products
– BaBar: 5+ million lines of home-grown, complex C++
Continuously looking for better approaches
– system has to be very flexible
Most data immutable
Many smart people who can build almost anything
Specific needs of your community can impact everything, including the system architecture
DB-related Effort
~4–5 core DB developers since 1996
– effort augmented by many physicists, students and visitors
3 DBAs
– from the start of production until recently
– fewer than 3 now
    system finally automated and fault tolerant
Automation is the key to a low-maintenance, fault-tolerant system
Lessons Summary
Kill two birds with one stone: replicate for availability as well as backup
Consider all factors when choosing software and hardware
When you push limits, expect many problems everywhere. Normal maxima are too small. Observe, refine, repeat.
Specific needs of your community can impact everything, including the system architecture
Automation is the key to a low-maintenance, fault-tolerant system
Larger systems depend more heavily on automation
Hide disruptive events by stalling data flow
Single technology likely not enough to efficiently manage petabytes
Organize data (mutable, immutable, queryable, …)
Build a flexible system, be prepared for non-trivial changes. Bet on simplicity.
Problems @ Petabyte Frontier
just a few highlights…
How to cost-effectively back up a PB?
How to provide fault tolerance with 1000s of disks?
– RAID 5 is not good enough (rough numbers below)
How to build a low-maintenance system?
– "1 full-time person per 1 TB" does not scale
How to store the data? (tape, anyone?)
– consider all factors: cost, power, cooling, robustness
…YES, there are “new” problems beyond “known problems scaled up”
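A back-of-the-envelope calculation shows why RAID 5 runs out of steam at this scale: with thousands of disks, some array is almost always rebuilding, and reading an entire array during a rebuild makes an unrecoverable read error (and thus data loss) quite likely. The numbers below (array width, disk size, error rate) are illustrative assumptions, not measurements from the BaBar system.

#include <cmath>
#include <cstdio>

int main() {
    // Illustrative assumptions only.
    const double disksPerArray   = 8;        // one RAID 5 group
    const double diskSizeBytes   = 250e9;    // ~250 GB drives of that era
    const double uberPerBit      = 1e-14;    // unrecoverable bit error rate
    const double bitsReadRebuild = (disksPerArray - 1) * diskSizeBytes * 8;

    // Probability that one rebuild hits at least one unrecoverable read
    // error, computed in a numerically stable way.
    const double pRebuildFails =
        -std::expm1(bitsReadRebuild * std::log1p(-uberPerBit));
    std::printf("P(unrecoverable error during one rebuild) ~ %.0f%%\n",
                pRebuildFails * 100.0);

    // With thousands of disks, this per-rebuild risk is incurred over and
    // over, so single-parity protection alone is not good enough.
    return 0;
}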
The Summary
Great success
– ODBMS-based system, migration & 2nd generation
– Some DoD projects are being built on ODBMS
Lots of useful experience with managing (very) large datasets
– Would not be able to achieve all that with any RDBMS (today)
– Thin-server thick-client architecture works well
– Starting to help astronomers (LSST) manage their petabytes