“DAE GRID” (Grid Computing Activities in
Department of Atomic Energy, India)
ALHAD .G. APTE
HEAD, COMPUTER DIVISION,
BHABHA ATOMIC RESEARCH CENTER
MUMBAI - INDIA
• INTEGRATED PROBLEM SOLVING ENVIRONMENT AT
BARC, MUMBAI
• DAE GRID AND REGIONAL DAE-WLCG
• CHALLENGING ISSUES IN DAE GRID
• INTERNATIONAL COLLABORATION
• PRODUCTS DEVELOPED AND DEPLOYED AT LCG
• PARTICIPATION IN EU-INDIA GRID
• CONCLUSIONS
PRESENTATION OUTLINE
INTEGRATED PROBLEM SOLVING ENVIRONMENT at BARC, Mumbai
• HIGH PERFORMANCE COMPUTING SYSTEMS
• PARALLEL FILE SYSTEM
• HIGH RESOLUTION CLUSTER BASED VISUALIZATION
SYSTEMS
• SCATTERED RESOURCES BELONGING TO DIFFERENT
ADMIN DOMAINS
• GOAL TO PROVIDE SEAMLESS ACCESS TO ALL THESE
RESOURCES TO ENSURE OPTIMAL UTILIZATION
INTEGRATED PROBLEM SOLVING ENVIRONMENT at BARC
• ANUPAM – Ameya
512 (256×2) CPUs, 3.6 GHz, Gigabit Ethernet network
HPL Benchmark: 1.73 TFLOPS
• ANUPAM – Ajeya
1152 (288×4) cores, 2.66 GHz, InfiniBand (4x DDR = 20 Gbps)
HPL Benchmark: 9 TFLOPS
HPC Clusters
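The two HPL figures above can be put in perspective against theoretical peak. The sketch below assumes 4 double-precision FLOPs per cycle per core, a typical figure for Xeons of that era but an assumption here, since the slide does not state it.

```python
# Sketch: rough HPL efficiency for the two ANUPAM clusters listed above.
# Assumes 4 double-precision FLOPs per cycle per core (typical for the
# Xeon generations of that era); actual per-core throughput may differ.

def peak_tflops(cores: int, ghz: float, flops_per_cycle: int = 4) -> float:
    """Theoretical peak performance in TFLOPS."""
    return cores * ghz * flops_per_cycle / 1000.0

def hpl_efficiency(measured_tflops: float, cores: int, ghz: float) -> float:
    """Measured HPL result as a fraction of theoretical peak."""
    return measured_tflops / peak_tflops(cores, ghz)

# ANUPAM-Ameya: 512 CPUs at 3.6 GHz over Gigabit Ethernet, 1.73 TFLOPS HPL
ameya = hpl_efficiency(1.73, 512, 3.6)
# ANUPAM-Ajeya: 1152 cores at 2.66 GHz over InfiniBand, 9 TFLOPS HPL
ajeya = hpl_efficiency(9.0, 1152, 2.66)
print(f"Ameya ~{ameya:.0%} of peak, Ajeya ~{ajeya:.0%} of peak")
```

Under this assumption Ajeya's efficiency is roughly three times Ameya's, consistent with the move from Gigabit Ethernet to a 20 Gbps InfiniBand interconnect.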
Monitoring & Accounting Tools
Rendering Cluster
• 1 Master Client
• 18 Servers
• 3.2 GHz dual processor, 2GB DDR-II RAM
• Graphics Cards
High Resolution Display
• Tiled 6 x 6 LCD Panels
• 47 million pixel resolution
Rendering Cluster & High Resolution Display
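The quoted 47-million-pixel figure can be sanity-checked from the 6 × 6 tiling. The per-panel resolution is not stated on the slide, so the SXGA (1280 × 1024) value below is an assumption.

```python
# Sketch: sanity-check of the quoted "47 million pixel" wall resolution.
# Assumes each of the 6 x 6 LCD panels runs at 1280 x 1024 (SXGA); the
# per-panel resolution is not stated on the slide, so this is an assumption.

PANELS_X, PANELS_Y = 6, 6
PANEL_W, PANEL_H = 1280, 1024        # assumed per-panel resolution

total_pixels = PANELS_X * PANELS_Y * PANEL_W * PANEL_H
print(f"{PANELS_X * PANELS_Y} panels -> {total_pixels / 1e6:.1f} Mpixels")
```

With SXGA panels the 36 tiles give 47.2 million pixels, matching the figure on the slide.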
Diagram: a master client (Win32/Xlib) renders through Chromium OpenGL and its graphics hardware, distributing over the network to rendering servers 1 to 36, each driving a projector/monitor through its own graphics hardware.
– Cross platform
– Scalar / Vector / Tensor / Volume Visualization
– Streamlines
– Contours/Isosurface
– Auto Fill Geometry
– Additional Component Generation
– Geometry Extraction
– Animation
– Scripting
– Image / Movie support
– CGNS / Plot3D / VTK / and many more data format compatibility
AnuVi is a scientific visualization framework and tool set for simulation.
AnuVi
PILOT DAE-GRID
of
DEPARTMENT OF ATOMIC ENERGY, INDIA
• DAE units involved in collaborative activities with needs to share expertise and resources
• DAE-Grid project initiated by BARC to provide the grid infrastructure to meet the demanding computing needs of scientific researchers
– Enables organizations to share their hardware and software resources.
• Four major DAE institutes
– BARC-Mumbai
– RRCAT-Indore
– IGCAR-Kalpakkam
– VECC-Kolkata
• LCG/gLite is the grid middleware used: LCG-2.4 initially, now gLite-3.1. Around 350 processors are connected through DAE-Grid.
• Network bandwidth is low owing to high link costs, so usage is limited to batch job submission applications.
The DAE-Grid
Diagram: the four sites are connected by 4 Mbps links; BARC contributes computing, IGCAR data and software repositories, CAT storage, and VECC scientific instruments. Resource sharing and coordinated problem solving across dynamic, multiple R&D units; uses WLCG tools.
DAE-Grid Setup
Diagram: the interface for using the Grid, with the middleware services:
UI: User Interface
SE: Storage Element
CE: Computing Element
WN: Worker Node (multiple)
Resource Broker + MyProxy Server + Top BDII (workload management, proxy renewal, information system)
LFC: File Catalog
Certifying Authority: certificates
VOMS: Virtual Organization Membership Server
Middleware Services
• Central services are deployed only at BARC:
  – Certification Authority (CA)
  – Virtual Organization Membership Service (VOMS)
  – Resource Broker (RB) + MyProxy Server + Top BDII
  – LCG File Catalogue (LFC)
  – Monitoring & Accounting Server
• All sites deploy the site services, namely:
  – Computing Element (one CE for every cluster)
  – Worker Nodes
  – User Interface
Diagram of a site deployment: User Interface; Worker Nodes (PBS client); certificates; Monitoring & Accounting Server (FMON server); Storage Element (GridFTP, RFIO and FMON agent); Computing Element (Gatekeeper, PBS, GRIS, Site BDII, information providers, FMON agent).
Site Services
• Computing Element
  – Gatekeeper Service (accepts job requests from the Grid)
  – PBS (cluster resource management system)
  – GRIS, Site BDII (information system)
  – FMON (monitoring agents)
• Worker Nodes
  – PBS Client
• Storage Element
  – DPM Services
  – GridFTP, RFIO Services
  – FMON Server (monitoring server)
• User Interface
DAE Grid Site Services
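The site-service layout above lends itself to a simple deployment check. The sketch below is illustrative only: the service labels are descriptive strings, not actual gLite package or daemon names.

```python
# Sketch: a minimal check that a site deployment covers the gLite site
# services listed above. Service names here are illustrative labels,
# not actual package or daemon names.

REQUIRED_SITE_SERVICES = {
    "CE": {"gatekeeper", "pbs", "gris", "site-bdii", "fmon-agent"},
    "WN": {"pbs-client"},
    "SE": {"dpm", "gridftp", "rfio", "fmon-server"},
    "UI": set(),  # the User Interface has no mandatory sub-services here
}

def missing_services(deployed):
    """Return, per node type, the required services a site has not deployed."""
    gaps = {}
    for node, required in REQUIRED_SITE_SERVICES.items():
        miss = required - deployed.get(node, set())
        if miss:
            gaps[node] = miss
    return gaps

# Example: a site that forgot the information system on its CE
site = {
    "CE": {"gatekeeper", "pbs", "fmon-agent"},
    "WN": {"pbs-client"},
    "SE": {"dpm", "gridftp", "rfio", "fmon-server"},
    "UI": set(),
}
print(sorted(missing_services(site)["CE"]))  # ['gris', 'site-bdii']
```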
Grid Portal
• One CA (LCG/gLite uses GSI) is set up at BARC.
• The roles defined:
  a) CA Administrator  b) Registration Authority (RA)
  c) Site Manager  d) User
• Two servers, namely an offline CA and an online CA.
• The online CA provides a web interface for users to upload a CSR (Certificate Signing Request) and download the signed certificate.
• The offline CA is used for issuing and revoking certificates and for generating CRLs (Certificate Revocation Lists).
• DAE-Grid presently has one VO, "DAEGRID".
DAE-Grid Certification Authority (CA)
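The offline/online split described above can be sketched as follows. This is a toy model: signing is mocked with a hash (real deployments use OpenSSL/GSI X.509 signatures), and all class and method names are illustrative.

```python
# Sketch: the offline/online CA split described above, with signing mocked
# by a hash. Real deployments use OpenSSL/GSI; names here are illustrative.
import hashlib

class OfflineCA:
    """Air-gapped server: signs CSRs, revokes certificates, maintains the CRL."""
    def __init__(self):
        self.crl = set()                      # revoked certificate serials

    def sign(self, csr: str) -> str:
        # Stand-in for a real X.509 signature
        return hashlib.sha256(csr.encode()).hexdigest()

    def revoke(self, serial: str):
        self.crl.add(serial)

class OnlineCA:
    """Web-facing server: accepts CSR uploads, hands out signed certificates."""
    def __init__(self):
        self.pending = {}                     # user -> uploaded CSR
        self.issued = {}                      # user -> signed certificate

    def upload_csr(self, user: str, csr: str):
        self.pending[user] = csr

    def publish(self, user: str, cert: str):  # result carried from offline CA
        self.issued[user] = cert
        self.pending.pop(user, None)

offline, online = OfflineCA(), OnlineCA()
online.upload_csr("alice", "CSR-for-alice")
cert = offline.sign(online.pending["alice"])  # signing happens out-of-band
online.publish("alice", cert)
print("alice" in online.issued)               # True
```

Keeping the signing key on a machine that never faces the network is the point of the two-server design; only CSRs and signed certificates cross the gap.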
• An in-house developed monitoring service gives the complete state of the grid on a single page (services, file system, PBS MOM alerts, etc.)
• Gives the status of the jobs in the queue.
• Job records are collected, and graphs are generated by the server, from APEL (Accounting Processor for Event Logs), which runs on every cluster.
Monitoring & Accounting Server
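The accounting step above, collecting per-cluster job records and aggregating them for graphing, can be sketched as below. The record fields are illustrative; real APEL records carry many more attributes.

```python
# Sketch: the accounting step described above -- job records collected
# from each cluster (as APEL does) are aggregated for graphing.
# Record fields are illustrative; real APEL records carry far more detail.
from collections import defaultdict

job_records = [
    {"site": "BARC", "vo": "DAEGRID", "cpu_hours": 12.5},
    {"site": "VECC", "vo": "DAEGRID", "cpu_hours": 4.0},
    {"site": "BARC", "vo": "DAEGRID", "cpu_hours": 7.5},
]

def usage_by_site(records):
    """Total CPU hours per site -- the kind of series a usage graph plots."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec["site"]] += rec["cpu_hours"]
    return dict(totals)

print(usage_by_site(job_records))  # {'BARC': 20.0, 'VECC': 4.0}
```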
• An RB service failure has the following impact:
  – New jobs cannot be submitted
  – The status of existing jobs cannot be queried
  – Jobs which have finished will not be shown as completed until the RB service has been recovered
  – Output data from jobs may be lost, since the jobs cannot copy their results to the output sandbox on the RB (a job retries a few times, waiting a random period between attempts, and then gives up)
High Availability RB
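The retry behaviour noted above, a few attempts with random waits before the output is abandoned, can be sketched as follows. The copy function is a stand-in; the real transfer goes to the RB's output sandbox.

```python
# Sketch: the retry behaviour described above -- a job tries a few times
# to copy its results to the RB's output sandbox, waiting a random period
# between attempts, then gives up. copy_fn is a stand-in for the transfer.
import random

def copy_with_retry(copy_fn, attempts: int = 3, max_wait_s: float = 60.0):
    """Return True if any attempt succeeds, False once all attempts fail."""
    for _ in range(attempts):
        if copy_fn():
            return True
        wait = random.uniform(0, max_wait_s)  # random back-off before retry
        # a real job wrapper would time.sleep(wait) here; omitted in sketch
    return False

# With the RB down, every attempt fails and the output is lost:
print(copy_with_retry(lambda: False))  # False
```

This is exactly why an unrecovered RB means lost output, motivating the standby broker described next.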
Diagram: a master RB and a standby RB, connected through a switch, exchange HA status packets and shared state while serving broker.barc.daegrid.gov.in.
Diagram: resource brokers and clusters at RRCAT (Indore, M.P.), VECC (Kolkata, W.B.), IGCAR (Kalpakkam) and BARC (Mumbai); inter-site links not being used.
Multiple Resource Brokers
• The current gLite version has inherent support for using multiple resource brokers for a single VO.
• A user job is directed to the resource broker that is up and running at the time of submission.
• Currently queued jobs need to be shifted:
  – The Resource Broker maintains its state in a MySQL database, and Condor-G maintains the queue
  – Putting the jobs directly into the backup RB's Condor-G queue is not possible
  – Instead, take the state of the jobs and the sandbox directories on the main RB and give them to the backup RB
  – The backup RB copies the sandbox directories and separately maintains the state of the jobs that were initially submitted to the main RB
• The DNS mapping for the main RB needs to be changed.
• A client (backup RB) to server (CEs) method is used by the backup RB to regularly update the status of these jobs
• Job management commands such as job-status and job-cancel need to take this change into account automatically.
Issues in preparing the backup broker..
• When the main RB comes up:
  – Get the state of all the jobs (shifted from this RB) from the backup RB
  – Update the MySQL state tables and sandbox directories accordingly
• Remove the DNS mapping from the main RB
• This has been tested and is very useful in keeping jobs running smoothly when a Resource Broker has to be switched off for scheduled air-conditioning or electrical maintenance.
Issues in automating the above solution (we are currently working on this)
• Maintaining the same global user DN -> local user mapping across RBs is difficult.
• Re-thinking is needed to completely automate this process (there is no ready mechanism for it, unlike DNS mapping).
Issues in preparing the backup broker..
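The hand-over described above, capturing the main RB's job state and sandbox directories and replaying them into the backup RB's separate tables, can be sketched as below. The data structures are illustrative; the real RB keeps its state in MySQL and its queue in Condor-G.

```python
# Sketch: the RB hand-over described above. The main RB's job state and
# sandbox directories are captured and absorbed by a backup RB, which
# tracks the migrated jobs separately from its own queue. Structures are
# illustrative; the real RB uses MySQL state tables and Condor-G.

def snapshot(main_rb: dict) -> dict:
    """Capture job states and sandbox paths from the main RB."""
    return {
        "jobs": dict(main_rb["jobs"]),            # job id -> status
        "sandboxes": dict(main_rb["sandboxes"]),  # job id -> sandbox dir
    }

def absorb(backup_rb: dict, snap: dict):
    """Backup RB keeps migrated jobs in separate tables, not its own queue."""
    backup_rb.setdefault("migrated_jobs", {}).update(snap["jobs"])
    backup_rb.setdefault("migrated_sandboxes", {}).update(snap["sandboxes"])

main = {"jobs": {"j1": "Running", "j2": "Scheduled"},
        "sandboxes": {"j1": "/var/sandbox/j1", "j2": "/var/sandbox/j2"}}
backup = {"jobs": {}, "sandboxes": {}}

absorb(backup, snapshot(main))
print(backup["migrated_jobs"])  # {'j1': 'Running', 'j2': 'Scheduled'}
```

When the main RB returns, the same snapshot/absorb step runs in reverse, after which the migrated tables on the backup can be cleared.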
Regional DAE-WLCG Tier-2 in India
Diagram: TIFR hosts the CMS Tier-2 and VECC/SINP the ALICE Tier-2, connected by 4 Mbps links, a 100 Mbps link, and a 0.3/1 Gbps link to CERN/GEANT (34 Mbps to GEANT over IPLC/NLC); Tier-3 sites connect at 2 Mbps. BARC, IOPB and 14 universities have been operational since 2007.
Regional DAE-WLCG Tier-2 in India
Tier2: LCG - IndiaCMS-TIFR, Mumbai, India
Network
Network bandwidth recently upgraded to 1 Gbps.
The link between TIFR and CERN has been tested successfully and commissioned; full utilization is underway.
Storage
Current: 50 TB (raw) disk space on an HP EVA 8000 system.
DPM with SRM is used to provide storage services.
Running Services
Bandwidth Status
Service               Processor                  RAM
VOBOX (ALIEN, LFC)    Dual Intel Xeon 2.4 GHz    4 GB
CE                    Dual Intel Xeon 3.0 GHz    4 GB
SE                    Dual Intel Xeon 3.0 GHz    4 GB
13 WN                 Dual Intel Xeon 3.0 GHz    4 GB
The 100 Mbps link has been running fine since 14/01/2008 and will soon be upgraded to 155 Mbps.
ALICE TIER-2 Centre at KOLKATA
Participated in development & deployment of Tools in LCG
LEMON Architecture
GRIDVIEW
QUATTOR Architecture
SHIVA
CC Tracker
GridView Architecture
GRIDVIEW: VO-Wise Data Transfer
GridView Screen
GRIDVIEW: Job Status
DAE participation in EU-India Grid
• Quickly set up Grid infrastructure
  – Use the production Grid (WLCG) and connect Indian Grids
  – Interoperate with EGEE Euro-Grids
  – Contribute to Grid standardisation efforts
• Support applications from diverse communities
  – High Energy Physics ............ DAE units
  – Condensed Matter Physics ....... Pune Univ., TIFR, BARC
  – Bio-Sciences ................... NCBS
  – Earth Sciences ................. CDAC-Pune
  – Pilot clusters to users ........ INFN & VECC
• Business
  – E-governance interested business partners ... NIC
+ Disseminate knowledge about the Grid through training
EU-India Grid Objectives
• CLOSE INVOLVEMENT IN PROJECT PROCESSES
• COLLABORATION WITH CERN FOR LCG
• DEVELOPMENT IN GRID COMPUTING WITH CONFORMANCE TO gLite
• LEADING ROLE IN EUROPE-INDIA CONNECTIVITY
• COORDINATING WITH GOVERNMENT OF INDIA AGENCIES IN THE DECISION MAKING PROCESS
DAE AS A PARTNER IN EU-INDIAGRID
• PARTNER IN INDIA’S NATIONAL GRID “GARUDA”
To SUMMARISE
• DAE HAS BEEN ACTIVE IN GRID COMPUTING
SINCE 2004
• A PILOT DAE GRID IS OPERATIONAL AND IS
BEING ENHANCED
• PARTICIPATION IN WLCG HAS GENERATED MANPOWER
EXPERTISE: PRODUCTS DESIGNED, DEVELOPED AND
DEPLOYED AT LCG, CERN
• CLOSE PARTICIPATION IN EU-INDIAGRID with
Deputy Director from DAE
• IN FUTURE, WE EXPECT TO ACTIVELY PARTICIPATE
IN HIGH-END APPLICATION DEVELOPMENT AND TO
TAKE UP MIDDLEWARE PROBLEMS
Acknowledgements
P. S. DHEKNE, DY. DIR., EU-INDIAGRID PROJECT; EX-ASSOCIATE DIRECTOR (E&IG), BARC
& THE DRIVE
R. S. MUNDADA, B. S. JAGADEESH, S. K. BOSE, K. RAJESH
TEAM LEADERS IN SUPERCOMPUTING, GRID COMPUTING & VISUALISATION
R. SHARMA, K. BHATT, PHOOLCHAND, C.S.R.C. MURTHY, DINESH, SONAVANE, VAIBHAV, VINOD
YOUNG DEVELOPERS
LINKED SLIDES
LEMON architecture
Configuration Management Infrastructure
Node(Cluster) Management
QUATTOR ARCHITECTURE
Diagram: the SAM framework runs SAM tests against service nodes and publishes the results through a publishing web service; SAM test results, RB job logs and GridFTP logs (from RBs and SEs) enter the SAM DB and GridView DB via the R-GMA archiver, web-service archiver and SAM XSQL export modules, with a WS client carrying availability metrics (HTTP/XML) from the fabric monitoring system at each site (LEMON/Nagios); a GOCDB sync module, a data analysis & summarization module and a visualization module produce the graphs and reports.
GridView Architecture