PROOF/Xrootd for a Tier3
-
PROOF/Xrootd for a Tier3
Mengmeng Chen, Michael Ernst, Annabelle Leung, Miron Livny, Bruce Mellado, Sergey Panitkin, Neng Xu and Sau Lan Wu
BNL/Wisconsin
Special thanks to Gerri Ganis, Jan Iwaszkiewicz, Fons Rademakers, Andy Hanushevsky, Wei Yang, Dan Bradley, Sridhara Dasu, Torre Wenaus and the BNL team
Tools meeting, SLAC, 11/28/07
-
Outline
Introduction
PROOF benchmarks
Our views on Tier3
A Multilayer Condor System
PROOF and Condor's COD
The I/O Queue
Data Redistribution in Xrootd
Outlook and Plans
-
PROOF/Xrootd
When the data arrive, it will not be possible for a physicist to do the analysis with ROOT on a single node, given the large data volumes. We need to move to a model that allows parallel processing for data analysis, i.e. distributed analysis.
As far as software for distributed analysis goes, US ATLAS is going for the Xrootd/PROOF system.
Xrootd is a set of tools for serving data, maintained by SLAC, proven to support up to 1000 nodes with no scalability problems within that range.
PROOF (the Parallel ROOT Facility, CERN) is an extension of ROOT allowing transparent analysis of large sets of ROOT files in parallel on compute clusters or multi-core computers.
See Sergey Panitkin's talk at the PROOF workshop at CERN on Thursday for an overview of ATLAS efforts and experience.
-
PROOF in a Slide
PROOF: a dynamic approach to end-user HEP analysis on distributed systems, exploiting the intrinsic parallelism of HEP data
Analysis Facility, Tier3
-
The End Point: Scalability
Courtesy of the PROOF team
-
Some Technical Details
Structure of the PROOF pool: redirector, workers, supervisor
Procedure of a PROOF job:
The user submits the PROOF job
The redirector finds the exact location of each file
The workers validate each file
The workers process the ROOT files
The master collects the results and sends them to the user
The user makes the plots
Packetizers (they work like job schedulers):
TAdaptivePacketizer (default, with dynamic packet size)
TPacketizer (optional, with fixed packet size)
TForceLocalPacketizer (special, no network traffic between workers; workers only deal with files stored locally)
To be optimized for the Tier3
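For orientation, here is a minimal PyROOT sketch of the job flow above: open a PROOF session, optionally override the default packetizer through the PROOF_Packetizer parameter, and process a chain with a user selector. The host name, tree name and selector file are placeholders, not names taken from these slides.

```python
# Hedged sketch of a PROOF session; host, tree and selector names are placeholders.
import ROOT

proof = ROOT.TProof.Open("redirector.example.edu")      # connect to the PROOF master
# Optionally switch from the default TAdaptivePacketizer to the fixed-size one:
proof.SetParameter("PROOF_Packetizer", "TPacketizer")

chain = ROOT.TChain("CollectionTree")                    # tree name is an assumption
chain.Add("root://redirector.example.edu//data/benchmark/*.root")
chain.SetProof()                                         # route Process() through PROOF
chain.Process("MySelector.C+")                           # user TSelector, compiled with ACLiC
```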
-
Xrootd test farm at the ACF (BNL)
10 machines allocated so far for the Xrootd test farm
Two dual-core Opteron CPUs at 1.8 GHz per node
8 GB RAM per node
4x 500 GB SATA drives per node, configured as a 2 TB partition
Gigabit network
5-node configuration used for tests: 1 redirector + 4 data servers
20 CPU cores
~10 TB of available disk space
Behind the ACF firewall, i.e. visible from the ACF only
2 people involved in setup, installation, configuration, etc.: ~0.25 FTE
-
Xrootd/PROOF Tests at BNL
Evaluation of Xrootd as a data-serving technology
Comparison to dCache and NFS servers
Athena single-client performance with AODs
I/O optimization for dCache and Xrootd
Athena TAG-based analysis performance studies
Athena scalability studies with AODs
Evaluation of Xrootd/PROOF for ROOT-based analyses
Proof-of-principle tests (factor-of-N scaling)
Real analyses (Cranmer, Tarrade, Black, Casadei, Yu, ...)
HighPtView, Higgs
Started evaluation of different PROOF packetizers
Evaluation and tests of the monitoring and administrative setup
Integration with pathena and ATLAS DDM (T. Maeno)
Disk I/O benchmarks, etc.
Sergey Panitkin
-
Integration with ATLAS DDM
Tested by Tadashi Maeno (see demonstration tomorrow)
Sergey Panitkin
-
PROOF test farms at GLOW-ATLAS
Big pool: 1 redirector + 86 computers
47 with 4x AMD 2.0 GHz cores, 4 GB memory
39 with Pentium 4, 2x 2.8 GHz, 2 GB memory
We use just the local disk for performance tests
Only one PROOF worker runs on each node
Small pool A: 1 redirector + 2 computers
4x AMD 2.0 GHz cores, 4 GB memory, 70 GB disk
Best performance with 8 workers running on each node
Small pool B: 1 redirector + 2 computers
8x Intel 2.66 GHz cores, 16 GB memory, 8x 750 GB on RAID 5
Best performance with 8 workers running on each node; used mainly for high-performance tests
-
Xrootd/PROOF Tests at GLOW-ATLAS (jointly with the PROOF team)
Focused on the needs of a university-based Tier3
Dedicated farms for data analysis, including detector calibration and performance, and physics analysis with high-level objects
Various performance tests and optimizations
Performance in various hardware configurations
Response to different data formats, volumes and file multiplicities
Understanding the system with multiple users
Developing new ideas with the PROOF team
Tests and optimization of packetizers
Understanding the complexities of the packetizers
-
PROOF test webpage
http://www-wisconsin.cern.ch/~nengxu/proof/
-
The Data Files
Benchmark files:
Big benchmark files (900 MB)
Medium benchmark files (400 MB)
Small benchmark files (100 MB)
ATLAS format files:
EV0 files (50 MB)
The ROOT version
URL: http://root.cern.ch/svn/root/branches/dev/proof
Repository UUID: 27541ba8-7e3a-0410-8455-c3a389f83636
Revision: 21025
-
The Data Processing Settings
Benchmark files (provided by the PROOF team):
With ProcOpt.C (read 25% of the branches)
With Pro.C (read all the branches)
ATLAS format files (H DPD):
With EV0.C
Memory refresh:
After each PROOF job, the Linux kernel keeps the data cached in physical memory. When we process the same data again, PROOF reads it from memory instead of disk. In order to see the real disk I/O in the benchmark, we have to clean up the memory after each test.
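One common way to implement such a memory refresh is to flush and drop the Linux page cache between runs; a minimal sketch follows (it assumes root privileges and is not necessarily the exact procedure used in these tests).

```python
# Drop the Linux page cache so the next PROOF run really reads from disk.
# Requires root privileges; the value 3 drops page cache, dentries and inodes.
import subprocess

def memory_refresh():
    subprocess.check_call(["sync"])                  # flush dirty pages to disk first
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3\n")

if __name__ == "__main__":
    memory_refresh()
```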
-
What can we see from the results?
How many resources PROOF jobs need:
CPU
Memory
Disk I/O
How PROOF jobs use those resources:
How do they use a multi-core system?
How much data do they load into memory?
How fast do they load it into memory?
Where does the data go after processing? (cached memory)
-
[Plots: disk I/O (KB/s), memory usage (MB), cached memory and CPU usage (%) vs. number of workers (1-10)]
Benchmark files, big size, read all the data. The jobs were running on a machine with an Intel 8-core 2.66 GHz CPU, 16 GB DDR2 memory and 8 disks on RAID 5.
-
[Plots: disk I/O (KB/s), memory usage (MB), cached memory and CPU usage (%) vs. number of workers (1-4)]
Benchmark files, big size, read all the data. The jobs were running on a machine with an Intel 8-core 2.66 GHz CPU, 16 GB DDR2 memory and a SINGLE DISK.
-
[Plots: disk I/O (KB/s), memory usage (MB), cached memory and CPU usage (%) vs. number of workers (1-10)]
Benchmark files, big size, read all the data, without memory refresh. The jobs were running on a machine with an Intel 8-core 2.66 GHz CPU, 16 GB DDR2 memory and 8 disks on RAID 5.
-
An Overview of the Performance rate
All tests on the same machine, using the default packetizer
Using the Xrootd preload function seems to work well
One should not start more than 2 workers on a single disk
[Plot: average processing speed (events/sec) vs. number of workers]
-
Our Views on a Tier3 at GLOW
Putting PROOF into Perspective
-
Main Issues to Address
Network traffic
Avoiding empty CPU cycles
Urgent need for CPU resources
Bookkeeping, management and processing of large amounts of data
Core Technologies
Condor: job management
MySQL: bookkeeping and file management
Xrootd: storage
PROOF: data analysis
-
One Possible Way to Go...
Computing pool: computing nodes with small local disks; CPUs are idle most of the time
Storage pool: centralized storage servers (NFS, Xrootd, dCache, CASTOR); heavy I/O load
Dedicated PROOF pool: CPU cores + big disks
Batch system: normally Condor, PBS, LSF, etc.
The gatekeeper: takes the production jobs from the Grid and submits them to the local pool
The users: submit their own jobs to the local pool
-
The Way We Want to Go...
Xrootd pool: CPU cores + big disks
Pure computing pool: CPU cores + small local disks
Storage pool: very big disks; less I/O load
The gatekeeper: takes the production jobs from the Grid and submits them to the local pool
Local job submission: users' own jobs go to the whole pool
PROOF job submission: users' PROOF jobs go to the Xrootd pool
-
A Multi-layer Condor System
Production Queue: no preemption, covers all the CPUs, maximum 3 days, no limit on the number of jobs. Suspension of Athena jobs is well tested; currently testing suspension of PanDA jobs.
Local Job Queue: for private jobs; no limit on the number of jobs, no run-time limit, covers all the CPUs, higher priority.
I/O Queue: for I/O-intensive jobs; no limit on the number of jobs, no run-time limit, covers the CPUs in the Xrootd pool, higher priority.
Fast Queue: for high-priority private jobs; no limit on the number of jobs, run-time limited, covers all the CPUs, half with suspension and half without, highest priority.
PROOF Queue (Condor's COD?): for PROOF jobs; covers all the CPUs, no effect on the Condor queues, jobs get the CPU immediately.
The gatekeeper: takes the production jobs from the Grid and submits them to the local pool.
Local job submission: users' own jobs go to the whole pool.
PROOF job submission: users' PROOF jobs go to the Xrootd pool.
-
PROOF + Condor's COD Model
[Diagram: long production or local Condor jobs and PROOF jobs flow through the Condor master and the Xrootd redirector into the Condor + Xrootd + PROOF pool via COD requests and PROOF requests; each machine provides local storage]
Use Condor's Computing-on-Demand (COD) to free up nodes running long jobs in the local Condor system (in ~2-3 sec)
A lot of discussion with the PROOF team and Miron about the integration of PROOF and Condor scheduling; we may not need COD in the end
-
Xrootd File Tracking System Framework (to be integrated into the LRC DB)
[Diagram: Local_xrd.sh and Xrootd_sync.py scripts collect file information (DATA) from the pool and feed it into the tracking database]
Database fields: fileid, path, type, user, md5sum, fsize, time, dataserver, status
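As an illustration of the fields listed above, a possible MySQL table for the tracking database might look like the following; the column types, table name, database name and host are assumptions, not the actual schema.

```python
# Hypothetical schema for the file-tracking table; types and names are assumptions.
import MySQLdb

conn = MySQLdb.connect(host="dbserver.example.edu", user="xrdtrack",
                       passwd="secret", db="xrootd_tracking")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS file_catalog (
        fileid     INT AUTO_INCREMENT PRIMARY KEY,
        path       VARCHAR(512) NOT NULL,    -- logical path in the Xrootd namespace
        type       VARCHAR(32),              -- file/dataset type
        user       VARCHAR(64),
        md5sum     CHAR(32),
        fsize      BIGINT,
        time       DATETIME,
        dataserver VARCHAR(128),             -- host currently holding the file
        status     VARCHAR(16)               -- e.g. 'ok', 'copying', 'lost'
    )
""")
conn.commit()
conn.close()
```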
-
The I/O Queue
0. The tracking system provides file locations in the Xrootd pool.
1. The submission node asks the MySQL database for the input file location.
2. The database provides the location of the file and also its validation info.
3. The submission node adds the location to the job requirements and submits to the Condor system.
4. Condor sends the job to the node where the input file is stored.
5. The node runs the job and puts the output file on the local disk.
[Diagram: submitting node, MySQL database server, Condor master and the Xrootd pool (CPU cores + big disks), with steps 0-5 as above]
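To make steps 1-4 concrete, here is a hedged sketch of what a submission helper could do: look up the data server holding the input file in the tracking database, pin the Condor job to that machine through a Requirements expression, and submit. Table, column, script and host names follow the hypothetical schema sketched earlier and are assumptions.

```python
# Hedged sketch of the I/O-queue submission flow; names are assumptions.
import subprocess
import MySQLdb

def submit_io_job(lfn):
    conn = MySQLdb.connect(host="dbserver.example.edu", user="xrdtrack",
                           passwd="secret", db="xrootd_tracking")
    cur = conn.cursor()
    cur.execute("SELECT dataserver FROM file_catalog WHERE path=%s AND status='ok'", (lfn,))
    row = cur.fetchone()
    conn.close()
    if row is None:
        raise RuntimeError("file not registered: %s" % lfn)
    dataserver = row[0]

    # Pin the job to the node that holds the input file (step 3), then submit (step 4).
    submit_description = """universe     = vanilla
executable   = run_analysis.sh
arguments    = %s
requirements = (Machine == "%s")
queue
""" % (lfn, dataserver)
    with open("io_job.sub", "w") as f:
        f.write(submit_description)
    subprocess.check_call(["condor_submit", "io_job.sub"])
```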
-
I/O Queue Tests
Direct access:
Jobs go to the machines where the input files reside
Accesses ESD files directly and converts them to CBNTAA files
Copies the output file to Xrootd on the same machine using xrdcp
Each file has 250 events
xrdcp:
Jobs go to any machine, not necessarily one that has the input files
Copies input and output files via xrdcp to/from the Xrootd pool
Converts the input ESD file to CBNTAA
cp_nfs:
Jobs go to any machine
Copies input and output files to/from NFS
Converts the input ESD file to CBNTAA
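For the xrdcp mode, the job wrapper could look roughly like this; the redirector host, file paths, job options and output file name are placeholders, and the actual scripts used in these tests may differ.

```python
# Hedged sketch of the "xrdcp" test mode: stage in, convert ESD -> CBNTAA, stage out.
import subprocess

REDIRECTOR = "root://redirector.example.edu/"

def run_xrdcp_job(input_lfn, output_lfn, job_options="ESDtoCBNT_jobOptions.py"):
    subprocess.check_call(["xrdcp", REDIRECTOR + input_lfn, "input.pool.root"])  # stage in
    subprocess.check_call(["athena.py", job_options])        # assumed to write ntuple.root
    subprocess.check_call(["xrdcp", "ntuple.root", REDIRECTOR + output_lfn])     # stage out
```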
-
I/O Queue Test Configuration
Input file (ESD) size: ~700 MB
Output file (CBNTAA) size: ~35 MB
Each machine has ~10 ESD files
42 running nodes, 168 CPU cores
-
I/O Queue Test Results: Direct Access vs Xrootd
Time saved per job: ~230 sec

Number of jobs | I/O (direct access) | XRDCP
20  | 709 | 977
50  | 719 | 1019
100 | 765 | 972
150 | 726 | 961
200 | 788 | 982
(average running time in seconds)
-
I/O Queue Test Results: Xrootd vs NFS

Number of jobs | I/O (direct access) | XRDCP | NFS
20  | 709 | 977  | 2090
50  | 719 | 1019 | 4679
100 | 765 | 972  | 13827
150 | 726 | 961  | 15976
200 | 788 | 982  | 14576
(average running time in seconds)
-
Data Redistribution in Xrootd
When and why do we need data redistribution?
Case 1: One of the data servers dies and all the data on it is lost; we replace it with a new data server.
Case 2: When we extend the Xrootd pool, we add new data servers. Because of Xrootd's load-balancing function, all newly arriving data goes to the new servers. The problem is that if we run PROOF jobs on the new data, all the PROOF jobs will read from these new servers.
[Diagram: a data server that is down being replaced by a new machine]
-
An Example of Xrootd File Distribution
[Plot: number of files per computer node in the Xrootd pool]
All the files were copied through the Xrootd redirector
One machine was down; another node happened to be filled with most of the files of one dataset
-
PROOF Performance on this Dataset
[Plot, annotated: "Here is the problem"]
-
After File Redistribution
[Plot: number of files per computer node in the Xrootd pool]
-
Before File Redistribution / After File Redistribution
[Plots: running time vs. number of workers accessing files, before and after the redistribution]
-
PROOF Performance after Redistribution
-
The Implementation of DBFR
We are working on a MySQL+Python based system
We are trying to integrate this system into the LRC database
Hopefully this system can be implemented at the PROOF level, because PROOF already works with datasets
-
Summary
Xrootd/PROOF is an attractive technology for ATLAS physics analysis, especially for the post-AOD phase
The work of understanding this technology is in progress at BNL and Wisconsin
Significant experience has been gained
Several ATLAS analysis scenarios were tested, with good results
Tested the machinery on HighPtView, CBNT, and EV for Higgs
Integration with DDM was tested
Monitoring and farm-management prototypes were tested
Scaling performance is under test
We think PROOF is a viable technology for a Tier3
Testing Condor's multi-layer system and COD, Xrootd file tracking and data redistribution, and the I/O queue
Need to integrate the developed DB and the LRC
Need to resolve the issue of multi-user utilization of PROOF
-
Additional Slides
-
The Basic Idea of DBFR
Register the location of all the files in every dataset in the database (MySQL)
With this information we can easily get the file distribution of each dataset
Calculate the average number of files each data server should hold
Get a list of files which need to move out
Get a list of machines which have fewer files than the average
Match these two lists and move the files
Register the new location of those files
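A minimal sketch of this rebalancing logic, assuming the per-dataset file locations have already been read from the MySQL catalog; the move itself is represented by a plain xrdcp, and the helper and host names are assumptions.

```python
# Hedged sketch of the DBFR rebalancing idea; names and the copy mechanism are assumptions.
import itertools
import subprocess

def rebalance(dataset_files):
    """dataset_files: dict mapping data-server host -> list of file paths of one dataset."""
    total = sum(len(files) for files in dataset_files.values())
    average = total // len(dataset_files)

    # Files that need to move out of servers holding more than the average.
    surplus = []
    for server, files in dataset_files.items():
        while len(files) > average:
            surplus.append((server, files.pop()))

    # Servers holding fewer files than the average receive them in turn.
    targets = itertools.cycle([s for s, f in dataset_files.items() if len(f) < average])
    for (src, path), dst in zip(surplus, targets):
        subprocess.check_call(["xrdcp", "root://%s/%s" % (src, path),
                               "root://%s/%s" % (dst, path)])
        dataset_files[dst].append(path)
        # ...and re-register the new location of 'path' in the MySQL catalog here.
```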
-
[Plots: disk I/O (KB/s), memory usage (MB), cached memory and CPU usage (%) vs. number of workers (1-9)]
Benchmark files, big size, read all the data. The jobs were running on a machine with an Intel 8-core 2.66 GHz CPU, 16 GB DDR2 memory and 8 disks on RAID 5, with Xrootd preload. 11/29/2007
-
Performance Rate
Xrootd preloading doesn't change the disk throughput much
Xrootd preloading helps to increase the top speed by ~12.5%
When we use Xrootd preload, the disk I/O reaches ~60 MB/sec
CPU usage reached 60%
The best performance is achieved when the number of workers is less than the number of CPUs (6 workers give the best performance)
[Plot: average processing speed (events/sec) vs. number of workers]
-
[Plots: disk I/O (KB/s), memory usage (MB), cached memory and CPU usage (%) vs. number of workers (1-9)]
Benchmark files, big size, read 25% of the data. The jobs were running on a machine with an Intel 8-core 2.66 GHz CPU, 16 GB DDR2 memory and 8 disks on RAID 5.
-
Performance Rate
Disk I/O reaches ~60 MB/sec, which is 5 MB/sec more than when reading all the data
CPU usage reached 65%
8 workers give the best performance
[Plot: average processing speed (events/sec) vs. number of workers]