Hyper-Scaling Xrootd Clustering

Hyper-Scaling Xrootd Clustering
Andrew Hanushevsky
Stanford Linear Accelerator Center, Stanford University
29-September-2005
http://xrootd.slac.stanford.edu
ROOT 2005 Users Workshop, CERN
September 28-30, 2005

Outline
Xrootd Single Server Scaling
Hyper-Scaling via Clustering
  Architecture
  Performance
Configuring Clusters
  Detailed relationships
  Example configuration
  Adding fault-tolerance
Conclusion

Latency Per Request (xrootd)

Capacity vs Load (xrootd)

xrootd Server Scaling
Linear scaling relative to load
  Allows deterministic sizing of the server: disk, NIC, network fabric, CPU, memory
Performance tied directly to hardware cost (see the sizing sketch below)
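Because both the latency and capacity curves are linear in load, sizing reduces to asking which resource saturates first. A back-of-the-envelope sketch (every symbol below is an illustrative assumption, not a figure from the talk):

$$N_{\mathrm{clients}} \approx \min\left(\frac{B_{\mathrm{disk}}}{b},\ \frac{B_{\mathrm{NIC}}}{b},\ \frac{C_{\mathrm{CPU}}}{c},\ \frac{M_{\mathrm{RAM}}}{m}\right)$$

where B_disk, B_NIC, C_CPU, M_RAM are the server's resource budgets and b, c, m are the measured per-client bandwidth, CPU, and memory costs. Linearity means the binding term, and with it the hardware cost of a target load, can be read off directly.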

Hyper-Scaling
xrootd servers can be clustered
  Increases access points and available data (complete scaling)
  Allows for automatic failover (comprehensive fault-tolerance)
The trick is to do so in a way that
  Cluster overhead (human & non-human) scales linearly
  Allows deterministic sizing of the cluster
  Cluster size is not artificially limited
  I/O performance is not affected

Basic Cluster Architecture
Software cross-bar switch
  Allows point-to-point connections
  Client and data server I/O performance not compromised, assuming switch overhead can be amortized
Scale interconnections by stacking switches
  Virtually unlimited connection points
  Switch overhead must be very low

Single Level Switch
[Diagram: a cluster with a redirector (head node) and data servers A, B, C]
The client sends "open file X" to the redirector, which asks its data servers "Who has file X?". Server C answers "I have", so the redirector replies "go to C" and the client opens file X directly on C. Redirectors cache file locations, so a 2nd open of X is sent straight to C with no further query.
Client sees all servers as xrootd data servers. (A conceptual code sketch of this flow follows.)
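To make the redirect flow concrete, below is a minimal, self-contained C++ sketch of a single-level redirector. It is illustrative only: the names (Redirector, Server, locate, queryServers) are invented for this example and are not the xrootd/olbd implementation, and the "Who has file X?" broadcast is stubbed out.

```cpp
// Illustrative sketch only: names and helpers are invented, not xrootd APIs.
#include <string>
#include <unordered_map>
#include <vector>

struct Server { std::string host; int port; };

class Redirector {
public:
    explicit Redirector(std::vector<Server> dataServers)
        : servers_(std::move(dataServers)) {}

    // Handle "open file X" from a client: return the server to go to.
    Server locate(const std::string& path) {
        // 2nd open of X: answered from the location cache, no broadcast.
        auto hit = cache_.find(path);
        if (hit != cache_.end()) return hit->second;

        // 1st open: ask the data servers "Who has file X?".
        // Servers without the file stay silent; holders answer "I have".
        std::vector<Server> holders = queryServers(path);
        Server chosen = holders.front();   // e.g. the first (or least-loaded) holder
        cache_[path] = chosen;             // remember the location
        return chosen;                     // client is told "go to <chosen>"
    }

private:
    // Stand-in for the real broadcast on the control network; this stub
    // simply pretends every server holds every file.
    std::vector<Server> queryServers(const std::string& /*path*/) const {
        return servers_;
    }

    std::vector<Server> servers_;
    std::unordered_map<std::string, Server> cache_;
};

int main() {
    Redirector redirector({{"A", 1094}, {"B", 1094}, {"C", 1094}});
    Server first  = redirector.locate("/store/fileX.root");  // broadcast, cache, redirect
    Server second = redirector.locate("/store/fileX.root");  // cache hit: same server
    return first.host == second.host ? 0 : 1;
}
```

The location cache is what makes the head node cheap to revisit: the second locate() for the same path never triggers another broadcast.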

Two Level Switch
[Diagram: a redirector (head node), a supervisor (sub-redirector), and data servers A, B, C and D, E, F]
The client sends "open file X" to the redirector, which asks "Who has file X?". The query is forwarded through the supervisor to its data servers D, E, F; F answers "I have" and the reply propagates back up. The client is redirected step by step ("go to C", then "go to F") until it opens file X on F.
Client sees all servers as xrootd data servers.

Making Clusters Efficient
Cell size, structure, & search protocol are critical
Cell size is 64
  Limits direct inter-chatter to 64 entities
  Compresses incoming information by up to a factor of 64
  Can use very efficient 64-bit logical operations (see the sketch below)
Hierarchical structures usually most efficient
  Cells arranged in a B-Tree (i.e., a B64-Tree)
  Scales as 64^h (where h is the tree height)
  Client needs h-1 hops to find one of 64^h servers (2 hops for 262,144 servers)
  Number of responses is bounded at each level of the tree
Search is a directed broadcast, query/rarely-respond protocol
  Provably the best scheme if fewer than 50% of the servers have the wanted file
  Generally true if the number of files >> cluster capacity
  Cluster protocol becomes more efficient as cluster size increases
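A short sketch of the two arithmetic points above, assuming nothing about the real olbd wire protocol (the helper names are invented for illustration): the "I have" answers from one 64-member cell fit in a single 64-bit word, and a B-64 tree of height h reaches 64^h servers in h-1 client hops.

```cpp
// Illustrative only: shows why a 64-entity cell pairs well with 64-bit words
// and how a B-64 tree scales; this is not the olbd protocol implementation.
#include <cstdint>
#include <cstdio>

// Record that cell member `slot` (0..63) answered "I have file X".
std::uint64_t addHolder(std::uint64_t mask, unsigned slot) {
    return mask | (std::uint64_t{1} << slot);
}

// Combining what two sources report is one logical OR; testing whether anyone
// below this cell has the file is one comparison.
std::uint64_t mergeReports(std::uint64_t a, std::uint64_t b) { return a | b; }
bool anyHolder(std::uint64_t mask) { return mask != 0; }

// Capacity of a B-64 tree of height h: 64^h data servers, found in h-1 hops.
std::uint64_t capacity(unsigned height) {
    std::uint64_t n = 1;
    for (unsigned i = 0; i < height; ++i) n *= 64;
    return n;
}

int main() {
    std::uint64_t cell = 0;
    cell = addHolder(cell, 2);                        // member #2 reported "I have"
    cell = mergeReports(cell, addHolder(0, 5));       // fold in another member's report
    std::printf("cell has a holder: %s\n", anyHolder(cell) ? "yes" : "no");
    std::printf("height 2: %llu servers, 1 hop\n",
                (unsigned long long)capacity(2));     // 4,096
    std::printf("height 3: %llu servers, 2 hops\n",
                (unsigned long long)capacity(3));     // 262,144
    return 0;
}
```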

Cluster Scale Management
Massive clusters must be self-managing
  Scales as 64^n, where n is the height of the tree
  Scales very quickly (64^2 = 4,096; 64^3 = 262,144)
  Well beyond direct human management capabilities
Therefore clusters self-organize
  Single configuration file for all nodes
  Uses a minimal spanning tree algorithm
  280 nodes self-cluster in about 7 seconds; 890 nodes in about 56 seconds
  Most overhead is wait time to prevent thrashing

Clustering Impact
Redirection overhead must be amortized
  This is a deterministic process for xrootd
  All I/O is via point-to-point connections
  Can trivially use single-server performance data
Clustering overhead is non-trivial
  100-200 µs additional per open call
  Not good for very small files or very short "open" times
  However, compatible with HEP access patterns (see the arithmetic below)
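A rough illustration with assumed (not measured) numbers: 200 µs of extra open latency amortized over a file that stays open for 60 s of analysis is a negligible overhead, but the same 200 µs added to a 1 ms open-read-close of a tiny file is a 20% penalty:

$$\frac{200\,\mu\mathrm{s}}{60\,\mathrm{s}} \approx 3\times 10^{-6} \qquad \text{vs.} \qquad \frac{200\,\mu\mathrm{s}}{1\,\mathrm{ms}} = 20\%$$

Typical HEP jobs sit on the left-hand side of this comparison, which is why the redirection cost is acceptable.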

Detailed Cluster Architecture
A cell is 1 to 64 entities (servers or cells) clustered around a cell manager.
The cellular process is self-regulating and creates a B-64 Tree.
[Diagram: cells of up to 64 members stacked into a tree under the manager M at the head node]

The Internal Details
[Diagram: data clients connect to redirectors and data servers; every manager (M), supervisor (S), and data server node runs an xrootd daemon paired with an olbd daemon]
Control network (olbd): managers, supervisors & servers exchange resource info and file location.
Data network (xrootd): redirectors steer clients to data; data servers provide the data.
Inside each xrootd: the ofs, odc, and olb components.

Schema Configuration
Redirectors (head node):
  olb.role manager
  olb.port port
  olb.allow hostpat
  ofs.redirect remote
  odc.manager host port
Supervisors (sub-redirector):
  olb.role supervisor
  olb.subscribe host port
  olb.allow hostpat
  ofs.redirect remote
  ofs.redirect target
Data Servers (end node):
  olb.role server
  olb.subscribe host port
  ofs.redirect target

Example: SLAC Configuration
[Diagram: client machines kan01, kan02, kan03, kan04, ..., kanxx connect to the redirector hosts kanrdr01 and kanrdr02, addressed collectively as kanrdr-a]
Hidden Details

Configuration File
if kanrdr-a+
   olb.role manager
   olb.port 3121
   olb.allow host kan*.slac.stanford.edu
   ofs.redirect remote
   odc.manager kanrdr-a+ 3121
else
   olb.role server
   olb.subscribe kanrdr-a+ 3121
   ofs.redirect target
fi

Potential Simplification?
Current:
  if kanrdr-a+
     olb.role manager
     olb.port 3121
     olb.allow host kan*.slac.stanford.edu
     ofs.redirect remote
     odc.manager kanrdr-a+ 3121
  else
     olb.role server
     olb.subscribe kanrdr-a+ 3121
     ofs.redirect target
  fi

Proposed:
  olb.port 3121
  all.role manager if kanrdr-a+
  all.role server if !kanrdr-a+
  all.subscribe kanrdr-a+
  olb.allow host kan*.slac.stanford.edu

Is the simplification really better? We're not sure; what do you think?

Adding Fault Tolerance
[Diagram: a tree of paired xrootd/olbd daemons: a manager (head node), supervisors (intermediate nodes), and data servers (leaf nodes)]
Manager (head node): fully replicate
Supervisor (intermediate node): hot spares
Data Server (leaf node): data replication/restaging, proxy search*
*xrootd has builtin proxy support today; discriminating proxies will be available in a near-future release.

Conclusion
High performance data access systems are achievable
  The devil is in the details
High performance and clustering are synergetic
  Allows unique performance, usability, scalability, and recoverability characteristics
Such systems produce novel software architectures
Challenges
  Creating applications that capitalize on such systems
Opportunities
  Fast, low-cost access to huge amounts of data to speed discovery

Acknowledgements
Fabrizio Furano, INFN Padova: client-side design & development
Principal Collaborators: Alvise Dorigo (INFN), Peter Elmer (BaBar), Derek Feichtinger (CERN), Geri Ganis (CERN), Guenter Kickinger (CERN), Andreas Peters (CERN), Fons Rademakers (CERN), Gregory Sharp (Cornell), Bill Weeks (SLAC)
Deployment Teams: FZK, DE; IN2P3, FR; INFN Padova, IT; CNAF Bologna, IT; RAL, UK; STAR/BNL, US; CLEO/Cornell, US; SLAC, US
US Department of Energy: Contract DE-AC02-76SF00515 with Stanford University