Xrootd usage @ LHC

Transcript
Page 1: Xrootd usage @ LHC


Xrootd usage @ LHC

An up-to-date technical survey of xrootd-based storage solutions

Page 2: Xrootd usage @ LHC


Outline

• Intro
  – Main use cases in the storage arena
• Generic Pure Xrootd @ LHC
  – The Atlas@SLAC way
  – The ALICE way
• CASTOR2
• Roadmap
• Conclusions

Page 3: Xrootd usage @ LHC


Introduction and use cases

Page 4: Xrootd usage @ LHC


The historical problem: data access

• Physics experiments rely on rare events and statistics
  – A huge amount of data is needed to get a significant number of events
    • The typical data store can reach 5-10 PB… now
    • Millions of files, thousands of concurrent clients
  – The transaction rate is very high
    • O(10³) file opens/sec per cluster is not uncommon
      – Average, not peak
    • Traffic sources: local GRID site, local batch system, WAN
    • Up to O(10⁴) clients per server!
  – If these requirements are not met, the outcome is:
    • Crashes, instability, workarounds, “need” for crazy things

• Scalable, high-performance direct data access (see the sketch below)
  – No imposed limits on performance, size or connectivity
  – Higher performance, supports direct data access over the WAN
  – Avoids worker-node (WN) under-utilization
  – No need for inefficient local copies when they are not needed
    • Do we fetch entire websites to browse one page?
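
To make the "direct access instead of local copies" point above concrete, here is a minimal ROOT/C++ sketch (not from the slides; the server name, path and tree name are hypothetical) of a client reading a remote file straight over the xrootd protocol, fetching only what it needs rather than staging a full local copy:

```cpp
// Minimal sketch: direct remote data access via a root:// URL.
// Server name, path and tree name are hypothetical placeholders.
#include "TFile.h"
#include "TTree.h"
#include <iostream>

void direct_access()
{
   // No local copy: TFile::Open understands root:// URLs and talks
   // to the remote xrootd server directly.
   TFile *f = TFile::Open("root://xrootd-server.example//store/data/run123/events.root");
   if (!f || f->IsZombie()) {
      std::cerr << "Remote open failed" << std::endl;
      return;
   }

   TTree *t = nullptr;
   f->GetObject("Events", t);   // hypothetical tree name
   if (t)
      std::cout << "Entries: " << t->GetEntries() << std::endl;

   f->Close();
}
```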

Page 5: Xrootd usage @ LHC


The Challenges

• LHC User Analysis
  – Boundary conditions
    • GRID environment: GSI authentication, user-space deployment
    • CC environment: Kerberos, admin deployment
  – High I/O load, moderate namespace load
  – Many clients: O(1000-10000)
  – Batch data access through a remote access protocol (RAP): root, dcap, rfio, ...
  – Basic analysis (today): RAW, ESD – mostly sequential file access
  – Advanced analysis (tomorrow): ESD, AOD, Ntuples, histograms – sparse file access

• T0/T3 @ CERN
  – Interactive data access; the preferred interface is MFS (mounted file systems)
    • Easy, intuitive, fast response, standard applications
  – Moderate I/O load
  – High namespace load: compilation, software startup, searches
  – Fewer clients: O(#users)

Page 6: Xrootd usage @ LHC


Main requirement

• Data access has to work reliably at the desired scale
  – This also means:
    • It must not waste resources

Page 7: Xrootd usage @ LHC


A simple use case

• I am a physicist, waiting for the results of my analysis jobs
  – Many bunches of jobs, several outputs
    • They will be saved, e.g., to an SE at CERN
  – My laptop is configured to show histograms etc. with ROOT
  – I leave for a conference; the jobs finish while I am on the plane
  – Once there, I want to simply draw the results from my home directory
  – Once there, I want to save my new histograms in the same place
  – I have no time to lose tweaking things to get a copy of everything; I lose copies in the confusion
  – I want to leave things where they are; I know nothing about things to tweak

What can I expect? Can I do it?
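
As a hedged illustration of what this user scenario looks like in practice (not taken from the slides; the SE host, paths and object names are invented), a ROOT macro can draw a result directly from the remote storage element and write a new histogram back to the same remote location, leaving everything where it is:

```cpp
// Sketch: draw a result from a remote SE and save a new histogram back,
// leaving everything where it is. URLs and object names are hypothetical.
#include "TFile.h"
#include "TH1F.h"

void remote_histos()
{
   // Read a result produced by the grid jobs, directly over the WAN.
   TFile *in = TFile::Open("root://se.example.cern.ch//user/p/physicist/results/job_0042.root");
   if (!in || in->IsZombie()) return;

   TH1F *h = nullptr;
   in->GetObject("hMass", h);     // hypothetical histogram name
   if (h) h->Draw();              // keep 'in' open while the canvas displays h

   // Save a new histogram into the same remote "home directory".
   TFile *out = TFile::Open("root://se.example.cern.ch//user/p/physicist/results/summary.root",
                            "RECREATE");
   if (out && !out->IsZombie()) {
      TH1F sum("hSummary", "Summary distribution", 100, 0., 10.);
      sum.SetDirectory(nullptr);  // keep ownership with the macro, not the file
      // ... fill the summary histogram ...
      out->cd();
      sum.Write();                // written into the remote file
      out->Close();
   }
}
```

The only change with respect to working on local files is the root:// URL.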

Page 8: Xrootd usage @ LHC


Another use case

• ALICE analysis on the GRID
• Each job reads ~100-150 MB from ALICE::CERN::SE
  • These are conditions data accessed directly, not file copies
    – I.e. VERY efficient: one job reads only what it needs
    – It just works, no workarounds
  – At 10-20 MB/s it takes 5-10 secs (the most common case)
  – At 5 MB/s it takes 20 secs
  – At 1 MB/s it takes 100 secs
• Sometimes the data are accessed elsewhere
  – AliEn can save a job by making it read its data from a different site, with very good performance
• Quite often the results are written/merged elsewhere

Page 9: Xrootd usage @ LHC


Pure Xrootd

Page 10: Xrootd usage @ LHC


xrootd Plugin Architecture

[Diagram: the xrootd plugin architecture. Pluggable components:
• Protocol driver (XRD)
• Protocol, 1 of n (xrootd)
• File system (ofs, sfs, alice, etc.)
• Authentication (gsi, krb5, etc.)
• Authorization (name based)
• lfn2pfn (prefix encoding)
• Storage system (oss, drm/srm, etc.)
• Clustering (cmsd)]

Page 11: Xrootd usage @ LHC


The client side

• Fault tolerance in data access
  – Meets WAN requirements, reduces job mortality
• Connection multiplexing (authenticated sessions)
  • Up to 65536 parallel r/w requests at once per client process
  • Up to 32767 open files per client process
  – Opens bunches of up to O(1000) files at once, in parallel
  – Full support for huge bulk prestages
• Smart r/w caching
  – Supports normal readahead and “Informed Prefetching”
• Asynchronous background writes
  – Boosts writing performance over LAN/WAN
• Sophisticated integration with ROOT (see the sketch below)
  – Reads the “right” chunks in advance while the application processes the preceding ones
  – Boosts read performance over LAN/WAN (up to the same order of magnitude)
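
As a sketch of the ROOT integration mentioned in the last bullet, the snippet below (file, tree and branch names are hypothetical) enables ROOT's TTreeCache so that the client prefetches the chunks the analysis will need next while the current entries are being processed:

```cpp
// Sketch: let ROOT's TTreeCache prefetch the "right" chunks over xrootd
// while the application processes the preceding entries.
// Server, tree and branch names are hypothetical.
#include "TFile.h"
#include "TTree.h"

void cached_read()
{
   TFile *f = TFile::Open("root://xrootd-server.example//store/analysis/ntuple.root");
   if (!f || f->IsZombie()) return;

   TTree *t = nullptr;
   f->GetObject("Events", t);
   if (!t) return;

   t->SetCacheSize(30 * 1024 * 1024);  // 30 MB read-ahead cache
   t->AddBranchToCache("muon_pt");     // prefetch only the branches we read
   t->AddBranchToCache("muon_eta");

   const Long64_t n = t->GetEntries();
   for (Long64_t i = 0; i < n; ++i) {
      // GetEntry triggers bulk/vector reads of the cached branch baskets
      t->GetEntry(i);
      // ... compute on this entry while the next chunks are being fetched ...
   }
   f->Close();
}
```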

Page 12: Xrootd usage @ LHC


The Xrootd “protocol”

• The XRootD protocol is a good one
  • Efficient, clean, supports fault tolerance, etc.
  – It doesn't do any magic, however
    • It does not multiply your resources
    • It does not overcome hardware bottlenecks
    • BUT it allows the true usage of the hardware resources
  – One of the aims of the project is still software quality
    • In the carefully crafted pieces of software that come with the distribution
  – What makes the difference with Scalla/XRootD is:
    • The Scalla/XRootD implementation details (performance + robustness)
      – And bad performance can hurt robustness (and vice versa)
    • The Scalla software architecture (scalability + performance + robustness)
      • Designed to fit the HEP requirements
      • You need a clean design into which to insert it
      • Born with efficient direct access in mind
        – But with the requirements of high-performance computing
        – Copy-like access becomes a particular case

Page 13: Xrootd usage @ LHC


Pure Xrootd @ LHC

Page 14: Xrootd usage @ LHC


The Atlas@SLAC way with XROOTD

• Pure Xrootd + an Xrootd-based “filesystem” extension
• Adapters to talk to the BeStMan SRM and GridFTP

• More details in A.Hanushevsky’s talk @ CHEP09

[Diagram: an xrootd/cmsd/cnsd cluster exposed through FUSE adapters; GridFTP and the SRM sit behind the firewall and reach the cluster via FUSE.]

Page 15: Xrootd usage @ LHC


The ALICE way with XROOTD

• Pure Xrootd + the ALICE strong authz plugin. No difference among T1/T2 (only size and QoS)
• WAN-wide globalized deployment, very efficient direct data access
• CASTOR at Tier-0 serving data, Pure Xrootd serving conditions data to the GRID jobs
• “Old” DPM+Xrootd in several Tier-2s

[Diagram: xrootd sites (GSI, CERN, any other) are federated into a globalized cluster by the ALICE global redirector (xrootd + cmsd). Local clients work normally at each site; a smart client could also point directly at the global redirector. Missing a file? Ask the global redirector, get redirected to the right collaborating cluster, and fetch it. Immediately. This is a Virtual Mass Storage System built on data globalization.]

More details and complete info in “Scalla/Xrootd WAN globalization tools: where we are.” @ CHEP09
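
From a client's point of view, the globalization sketched above is just one more redirection: a "smart client" can point at the global redirector and let it locate the site that actually holds the file. A minimal illustration follows (the redirector host name and the path are hypothetical; in production the lookup is normally done site-side through the cmsd subscription described above):

```cpp
// Sketch: a client pointing at the ALICE global redirector; the redirector
// bounces the request to whichever collaborating site holds the file.
// Host name and file path are hypothetical.
#include "TFile.h"
#include <iostream>

void global_open()
{
   TFile *f = TFile::Open(
      "root://alice-global-redirector.example//alice/cern.ch/user/a/analysis/cond.root");
   if (!f || f->IsZombie()) {
      std::cerr << "File not found anywhere in the global namespace" << std::endl;
      return;
   }
   std::cout << "Opened via redirection, size = " << f->GetSize() << " bytes" << std::endl;
   f->Close();
}
```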

Page 16: Xrootd usage @ LHC


CASTOR2

Putting everything together @ Tier0/1s

Page 17: Xrootd usage @ LHC


The CASTOR way

• The client connects to a redirector node
• The redirector asks CASTOR where the file is
• The client then connects directly to the node holding the data
• CASTOR handles tapes in the back-end (see the sketch after the figure)

[Diagram: the client asks the redirector to open file X; the redirector asks CASTOR "Where is X?", gets the answer "On C", and replies "Go to C"; the client then reads from disk server C directly, while the tape back-end triggers migrations and recalls.]

Credits: S.Ponce (IT-DM)
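
The redirection in the figure is transparent to the client: it opens the file against the redirector and ends up reading from the disk server CASTOR selected. The sketch below shows this with the present-day XrdCl C++ client API (an assumption on our part; the client of that era was XrdClient), and the redirector host and CASTOR path are hypothetical:

```cpp
// Sketch: open a CASTOR file through the xroot redirector with XrdCl.
// The redirect to the disk server holding the data happens transparently.
#include <XrdCl/XrdClFile.hh>
#include <cstdint>
#include <iostream>
#include <vector>

int main()
{
   XrdCl::File file;
   // Hypothetical redirector host and CASTOR path.
   XrdCl::XRootDStatus st = file.Open(
      "root://castor-redirector.example//castor/cern.ch/user/p/physicist/data.root",
      XrdCl::OpenFlags::Read);
   if (!st.IsOK()) {
      std::cerr << "Open failed: " << st.ToString() << std::endl;
      return 1;
   }

   std::vector<char> buffer(1024 * 1024);
   uint32_t bytesRead = 0;
   // Read the first MB directly from the disk server we were redirected to.
   st = file.Read(0, static_cast<uint32_t>(buffer.size()), buffer.data(), bytesRead);
   if (st.IsOK())
      std::cout << "Read " << bytesRead << " bytes" << std::endl;

   file.Close();
   return 0;
}
```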

Page 18: Xrootd usage @ LHC


CASTOR 2.1.8: Improving Latency - Read

• 1st focus on file (read) open latencies

[Chart: read-open latencies in ms (log scale, 1-1000), October 2008. Measured values for Castor 2.1.7 (rfio) and Castor 2.1.8 (xroot), an estimate for Castor 2.1.9 (xroot), and the network latency limit.]

Credits: A.Peters (IT-DM)

Page 19: Xrootd usage @ LHC


CASTOR 2.1.8: Improving Latency - Metadata Read

• Next focus on metadata (read) latencies

[Chart: stat latencies in ms (log scale, 1-1000), October 2008. Measured values for Castor 2.1.7 and 2.1.8, an estimate for Castor 2.1.9, and the network latency limit.]

Credits: A.Peters (IT-DM)

Page 20: Xrootd usage @ LHC


Prototype Architecture: XCFS Overview (xroot + FUSE)

[Diagram: XCFS architecture. Client side: a generic application performs POSIX access to /xcfs; calls pass through glibc, the VFS and /dev/fuse to libfuse (FUSE low-level implementation) and the xcfsd daemon, which is built on the XROOT POSIX library (libXrdPosix) and the XROOT client library (libXrdClient). Server side: xrootd server daemons speak the remote access protocol (ROOT plugs in here) with libXrdSec<plugin> strong-authentication plugins; the metadata server (namespace provider, metadata filesystem, libXrdCatalogFs, libXrdCatalogAuthz) issues capabilities, and the disk servers (libXrdCatalogOfs, libXrdSecUnix, XFS filesystem) serve the data.]

Credits: A.Peters (IT-DM)
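
Because XCFS exposes the xrootd namespace as an ordinary mount point, a "generic application" needs nothing beyond plain POSIX calls; the minimal sketch below (the /xcfs mount point and the file path are assumptions, as in the diagram) shows the client-side view:

```cpp
// Sketch: plain POSIX access to an XCFS (xroot + FUSE) mount point.
// Mount point and path are hypothetical.
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main()
{
   const char *path = "/xcfs/user/p/physicist/notes.txt";

   int fd = open(path, O_RDONLY);   // goes through VFS -> FUSE -> xcfsd -> xrootd
   if (fd < 0) {
      perror("open");
      return 1;
   }

   char buf[4096];
   ssize_t n = read(fd, buf, sizeof(buf));
   if (n >= 0)
      std::printf("Read %zd bytes through the FUSE mount\n", n);

   close(fd);
   return 0;
}
```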

Page 21: Xrootd usage @ LHC


Early Prototype Evaluation: Metadata Performance

• File creation*: ~1,000/s
• File rewrite: ~2,400/s
• File read: ~2,500/s
• Rm: ~3,000/s
• Readdir/stat access: Σ = 70,000/s

*These values were measured by executing shell commands on 216 mount clients. Creation performance decreases as the namespace fills on a spinning medium. Using an XFS filesystem over a DRBD block device in a high-availability setup, file-creation performance stabilizes at 400/s (20 million files in the namespace).

Credits: A.Peters (IT-DM)

Page 22: Xrootd usage @ LHC


Network usage (or waste!)

• Network traffic is an important factor: it has to match the ratio IO(CPU server) / IO(disk server)
  – Too much unneeded traffic means fewer clients can be supported (a serious bottleneck: 1 client works well, 100-1000 clients do not at all)
  – Lustre doesn't disable readahead during forward-seeking access and transfers the complete file if the reads hit the buffer cache (the readahead window starts at 1 MB and scales up to 40 MB)
• The XCFS/Lustre/NFS4 network volume without read-ahead is based on 4 KiB pages in Linux
  – Most requests are not page aligned and cause additional pages to be transferred (avg. read size 4 KiB), hence they transfer twice as much data (but XCFS can skip this now!) - see the sketch below
  – A second execution plays no real role for analysis, since datasets are usually bigger than the client buffer cache

Credits: A.Peters (IT-DM) – ACAT2008
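
To make the page-alignment argument concrete, here is a small self-contained sketch (an illustration of the arithmetic, not code from any of the prototypes): with 4 KiB pages, an average 4 KiB read at an unaligned offset touches two pages, so roughly 8 KiB crosses the network, which is the factor of two mentioned in the slide.

```cpp
// Sketch: bytes actually transferred for page-based remote I/O.
// A 4 KiB read at an unaligned offset touches two 4 KiB pages.
#include <cstdint>
#include <cstdio>

constexpr std::uint64_t kPage = 4096;   // Linux page size

// Number of pages a read of `size` bytes starting at `offset` touches.
std::uint64_t pagesTouched(std::uint64_t offset, std::uint64_t size)
{
   std::uint64_t first = offset / kPage;
   std::uint64_t last  = (offset + size - 1) / kPage;
   return last - first + 1;
}

int main()
{
   // Aligned 4 KiB read: 1 page = 4 KiB on the wire.
   std::printf("aligned:   %llu bytes transferred\n",
               (unsigned long long)(pagesTouched(0, 4096) * kPage));

   // Unaligned 4 KiB read (offset 100): 2 pages = 8 KiB on the wire,
   // i.e. twice the requested data, as noted in the slide.
   std::printf("unaligned: %llu bytes transferred\n",
               (unsigned long long)(pagesTouched(100, 4096) * kPage));
   return 0;
}
```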

Page 23: Xrootd usage @ LHC


CASTOR 2.1.8-6: Cross-Pool Redirection

Why is that useful?
• Users can access data by LFN without specifying the stager
• Users are automatically directed to 'their' pool with write permissions

[Diagram, example configuration: a meta-manager sits in front of the T3 and T0 stagers, each running xrootd and cmsd (cluster management) on its manager and server nodes over the shared CASTOR name space. The T3 pool is subscribed r/w for /castor/user and /castor/cms/user/; the T0 pool is subscribed read-only for /castor and /castor/cms/data.]

There are even more possibilities if a part of the namespace can be assigned to individual pools for write operations.

Credits: A.Peters (IT-DM)

Page 24: Xrootd usage @ LHC


Towards a Production Version: Further Improvements - Security

• A GSI/VOMS authentication plugin prototype has been developed, based on pure OpenSSL
  – additionally uses code from mod_ssl & libgridsite
  – significantly faster than the GLOBUS implementation

• After the security workshop with A. Hanushevsky, a Virtual Socket Layer was introduced into the xrootd authentication plugin base to allow socket-oriented authentication over the xrootd protocol layer
  – The final version should be based on OpenSSL and the VOMS library


Page 25: Xrootd usage @ LHC


The roadmap

Page 26: Xrootd usage @ LHC


XROOT Roadmap @CERN

• XROOT is strategic for scalable analysis support with CASTOR at CERN / T1s

• Other file access protocols will be supported until they become obsolete

• CASTOR
  – Secure RFIO has been released in 2.1.8
    • The deployment impact in terms of CPU may be significant
  – Secure XROOT is the default in 2.1.8 (Kerberos or X509)
    • Expect a lower CPU cost than rfio thanks to the session model
    • No plans to provide unauthenticated access via XROOT

Page 27: Xrootd usage @ LHC


XROOTD Roadmap

• CASTOR
  – Secure RFIO has been released in 2.1.8
  – The deployment impact in terms of CPU may be significant
  – Secure XROOT is the default in 2.1.8 (Kerberos or X509)
    • Expect a lower CPU cost than rfio thanks to the session model
    • No plans to provide unauthenticated access via XROOT

• DPM
  – Support for authentication via xrootd is scheduled to start certification at the beginning of July

• dCache
  – Relies on a custom, full re-implementation of the XROOTD protocol
  – The protocol docs have been updated by A. Hanushevsky
  – In contact with the CASTOR/DPM teams to add authentication/authorisation on the server side
  – Evaluating a common client plug-in / security protocol

Page 28: Xrootd usage @ LHC


Conclusion

• A very dense roadmap
• Many, many technical details
• Heading for:
  – Solid, high-performance data access
    • For production and analysis
  – More advanced user-analysis scenarios
  – The need to match existing architectures, protocols and workarounds

Page 29: Xrootd usage @ LHC


Thank you

Questions?