CERN database services for the LHC computing grid

M Girone 2008 J. Phys.: Conf. Ser. 119 052017
doi:10.1088/1742-6596/119/5/052017


CERN Database Services for the LHC Computing Grid

M Girone
CERN IT Department, CH-1211 Geneva 23, Switzerland
E-mail: maria.girone@cern.ch

Abstract. Physics meta-data stored in relational databases play a crucial role in the Large Hadron Collider (LHC) experiments and also in the operation of the Worldwide LHC Computing Grid (WLCG) services. A large proportion of non-event data, such as detector conditions, calibration, geometry and production bookkeeping, relies heavily on databases. Also, the core Grid services that catalogue and distribute LHC data cannot operate without a reliable database infrastructure at CERN and elsewhere. The Physics Services and Support group at CERN provides database services for the physics community. With an installed base of several TB-sized database clusters, the service is designed to accommodate growth for data processing generated by the LHC experiments and LCG services. During the last year, the physics database services went through a major preparation phase for LHC start-up and are now fully based on Oracle clusters on Intel/Linux. Over 100 database server nodes are deployed today in some 15 clusters, serving almost 2 million database sessions per week. This paper details the architecture currently deployed in production and the results achieved in the areas of high availability, consolidation and scalability. Service evolution plans for the LHC start-up are also discussed.

1. Introduction
Relational databases play a crucial role in the LHC experiments' offline and online applications, namely for detector conditions, calibration, geometry and production bookkeeping. Also, the core grid services for cataloguing, monitoring and distributing LHC data cannot operate without a robust database infrastructure at CERN and elsewhere. Relational databases are used at each level of the physics data handling process and are therefore of high importance for the smooth operation of the LHC experiments and of the Worldwide LHC Computing Grid (WLCG) services. The Physics Services and Support group at the CERN IT Department provides database services for the physics community [1].

The main challenges that the service addresses are: providing high availability for mission-critical applications; ensuring the performance and scalability of several multi-terabyte applications; and containing hardware and database administration complexity and overall costs. Moreover, the CERN database services for physics are part of a distributed database network connecting several large scientific institutions (Tier-1 sites) for applications, such as conditions data and file catalogues, which require synchronization between the master Tier-0 database and the read-only Tier-1 databases. Solid procedures for backup and recovery also need to be in place for such a service, which, given its importance, is required to be operated 24 hours per day, seven days per week (24/7).

2. Service Architecture
Oracle 10g Real Application Clusters (RAC) and Automatic Storage Management (ASM) on Linux have been chosen as the main platform to deploy the database services for physics. Each database cluster consists of several processing nodes accessing data shared on a storage area network (SAN). The cluster nodes use a dedicated network to share cached data blocks, which minimizes the number of disk operations. A public network connects the database cluster to client applications, which may execute queries in parallel on several nodes. The setup provides important flexibility to expand the database server resources (CPU and storage independently) according to user needs. This is particularly important during the early phases of LHC operation, since several applications are still under development and data volumes and access patterns may still change.

Figure 1. Oracle 10g RAC and ASM achieve high availability by building clusters with full redundancy and no single point of failure. In addition, clusters can be expanded to meet growth.

In addition to its intrinsic scalability, the cluster architecture also significantly increases database availability. In case of individual node failures, applications can fail over to one of the remaining cluster nodes, and many regular service interventions can be performed node by node without database downtime.
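As an illustration of how clients exploit this, an Oracle Net alias of the following form spreads new sessions across the cluster nodes and falls back to a surviving node when one fails. This is a minimal sketch with hypothetical host and service names, not the actual CERN configuration:

    # tnsnames.ora entry (hypothetical names): connect-time load balancing
    # across two RAC nodes plus transparent application failover for queries.
    PHYSDB =
      (DESCRIPTION =
        (ADDRESS_LIST =
          (LOAD_BALANCE = on)
          (FAILOVER = on)
          (ADDRESS = (PROTOCOL = TCP)(HOST = rac-node1.example.org)(PORT = 1521))
          (ADDRESS = (PROTOCOL = TCP)(HOST = rac-node2.example.org)(PORT = 1521))
        )
        (CONNECT_DATA =
          (SERVICE_NAME = physdb.example.org)
          (FAILOVER_MODE = (TYPE = SELECT)(METHOD = BASIC))
        )
      )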

Oracle ASM is a volume manager and specialized cluster file-system which allows the use of low-cost storage arrays to build scalable and highly available storage solutions for Oracle 10g. Oracle clusters are composed of a relatively large number of cluster nodes and storage elements with homogeneous hardware characteristics, all sharing the database storage ('shared everything' clustering technology).

A pool of servers, storage arrays, switches and network devices is used as a set of 'standard' building blocks. This has the additional advantage of simplifying hardware provisioning and service growth. At the same time, a homogeneous software configuration is adopted for the operating system and database software (Red Hat Enterprise Linux RHEL4 and Oracle 10g release 2), which simplifies installation, administration and troubleshooting. This has allowed the service to quadruple its hardware resources within two years with the same database administrator team.

The database architecture deployed by the CERN database services for physics team embodies the key ideas of grid computing as implemented by Oracle 10g. It uses the following key elements:

• Oracle 10g RAC on Linux to scale out the Oracle workload (e.g. CPU power)
• Fibre Channel SAN network
  o SAN multi-pathing for resilience and load balancing
  o Storage can be configured online using SAN zoning
• Storage built with mid-range storage arrays with FC controllers and SATA HDs
• Oracle ASM used as the volume manager and cluster file-system for Oracle
  o Storage striping and mirroring for performance and high availability (RAID 10)

A pictorial representation of a RAC cluster with ASM on Oracle 10g at CERN's database service for physics can be found in figure 1. There are currently more than 110 dual-CPU nodes, more than 400 GB of RAM and 300 TB of raw storage deployed by the database services for physics at CERN using this architecture. Based on information presented at Oracle Special Interest Groups [2] and on feedback from customer visits to CERN, we believe that this is one of the biggest Oracle database installations worldwide (unfortunately, very few sites make such information available).

3. Hardware configuration
Oracle RAC clusters are typically deployed in production over 8 nodes. Each node is a server with two x86 CPUs running at 3 GHz and with 4 GB of RAM. Two redundant cluster interconnects are deployed on Gbps Ethernet; public networks are also on Gbps Ethernet. The servers mount dual-ported host bus adaptors connected to redundant SAN switches. Access to the storage is configured via two (redundant) switched FC networks (2 Gbps and 4 Gbps networks are used on different clusters), as shown in figure 2.

The disk arrays contain 'consumer-quality' SATA disks but have dual-ported Fibre Channel controllers. Mirroring and striping are done at the software level with Oracle ASM (host-based mirroring). This has the advantage of allowing mirroring across two different storage arrays (and therefore across storage controllers). Storage arrays are dual-ported, with each port connected to a different SAN switch. The typical configuration of the storage arrays used in production is 16 SATA disks with a raw capacity of 6 TB per array. Storage is partitioned in two halves under Linux: the outer (faster) half of each disk is used to build the data disk group, while the inner half is used to build the flash recovery area disk group. Disk groups are created with ASM and used to allocate Oracle files; data disk groups hold, among others, data files, control files and redo logs.
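As a rough sketch of how this host-based mirroring across two arrays can be expressed in ASM, a disk group with normal redundancy and one failure group per storage array could be created as follows (disk group, failure group and device names are hypothetical, not the actual CERN configuration):

    -- Run in the ASM instance: a mirrored (normal redundancy) disk group whose
    -- two failure groups sit on two different storage arrays, so the loss of
    -- one array or controller still leaves a complete copy of the data.
    CREATE DISKGROUP data1 NORMAL REDUNDANCY
      FAILGROUP array1 DISK '/dev/mapper/arr1_lun1', '/dev/mapper/arr1_lun2'
      FAILGROUP array2 DISK '/dev/mapper/arr2_lun1', '/dev/mapper/arr2_lun2';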

Figure 2. Schematic view of an 8-node RAC cluster. Two redundant Gbps Ethernet switches are used for the cluster interconnects. Access to the storage is configured via two redundant fibre channel switches.


4. Service levels and release cycles
Three different levels of service are deployed to support Oracle database applications: a development, a validation/pre-production and a production service. Applications at different stages of their development move from the first to the last depending on their maturity and needs. In addition, the validation/pre-production services are also used to test new Oracle server software releases and security update patches, as described in section 7.

This structure has been in place for a few years and has proven very effective in ensuring that only good-quality applications are deployed in production, with a clear benefit for the stability of the production services.

The validation/pre-production and the production services are deployed on Oracle 10g Real Application Clusters (RAC) on Linux. Both the validation and production services run on similar hardware with the same configuration, OS and Oracle server versions. Typically, 2-node clusters are dedicated to the LHC experiments for validating their applications. At the production level, RAC clusters consisting of six or eight nodes are assigned.

4.1. The development service
The development service is a public service meant for pure software development. It is characterized by working-hours-only (8/5) monitoring and availability, including backups and database administrator consultancy. A limited amount of database space is available. Once the application code is stable enough and needs larger-scale tests, it moves to the next service level, as described below.

4.2. The validation/pre-production service
The validation service is characterized by a two-node RAC dedicated to each LHC experiment and grid community, 8/5 monitoring and availability, and DBA consultancy for the optimization and quality assurance of key experiment applications. A time slot of a couple of weeks is granted to the developer to make full use of the cluster resources and to test her/his application under a realistic workload, with the help of a database administrator from the team.

On demand, another two-node RAC cluster per experiment is set up to host applications that need a longer test phase or which exceed the limits imposed on the shared development service in terms of data volume and workload. We refer to this service as the pre-production service. It should be noted that no backups are taken for the validation/pre-production service.

4.3. The production service
The production service is characterized by 24/7 monitoring and availability, backups as described in section 6, and scheduled interventions for software upgrades as described in section 7. To deploy an application in production it is necessary to have validated it or, in case it did not go through the validation procedure, to define the use of reader/writer accounts and roles, the expected data volume, the number of simultaneous connections and the total number of connections. Best practices, tips and guidelines are provided by the DBA team to help database developers. In addition, the experiments' database coordinators receive a weekly report that summarizes the database activity in terms of sessions, CPU, and physical and logical reads and writes.
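For instance, the separation into reader and writer accounts mentioned above can be expressed with database roles along these lines (schema, table, role and account names are hypothetical and shown only to illustrate the pattern):

    -- Writer role: DML on the owner schema for the production application account.
    CREATE ROLE cond_writer;
    GRANT SELECT, INSERT, UPDATE, DELETE ON cond_owner.conditions TO cond_writer;
    GRANT cond_writer TO cond_app_writer;

    -- Reader role: read-only access for analysis and monitoring accounts.
    CREATE ROLE cond_reader;
    GRANT SELECT ON cond_owner.conditions TO cond_reader;
    GRANT cond_reader TO cond_app_reader;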

5. Monitoring and operations
The databases for physics team operates a mission-critical service underlying many key services of the WLCG and of the LHC experiments. Therefore, a 24/7 production-quality service is necessary. Six database administrators are involved in this effort, according to a weekly shift rota on a "best effort" basis. All procedures are in line with those of the WLCG services [3].

The hardware is deployed in the IT computer centre, and the production database clusters are connected to critical power (UPS and diesel generators). The database service for physics at CERN relies on OS-level monitoring and support from the computer centre operators, the system administrators and the network teams. At the database level, monitoring is achieved via Oracle Enterprise Manager and in-house developed scripts for ASM and Oracle services. This includes a phone, Short Messaging System (SMS) and email alerting system, which allows a prompt reaction from the DBA team.
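A simple example of the kind of check performed by such in-house scripts (a sketch only, not the actual CERN scripts; the mail/SMS alerting wrapper is omitted) is a query against the ASM instance that flags disk groups running low on usable mirrored space:

    -- Flag ASM disk groups with less than 10% usable free space.
    SELECT name,
           state,
           ROUND(usable_file_mb / 1024)              AS usable_gb,
           ROUND(100 * usable_file_mb / total_mb, 1) AS pct_usable
    FROM   v$asm_diskgroup
    WHERE  usable_file_mb < 0.10 * total_mb;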

6. Backup and recovery policies
The database services for physics implement a backup strategy based on Oracle RMAN 10g. This strategy allows for backups to tape but also to disk (flash backup), thus substantially reducing the recovery time for many common recovery scenarios.

The backup to tape uses IBM Tivoli Storage Manager (TSM). Oracle RMAN has dedicated drivers to connect to TSM and manage backups stored on tape in a special backup format. Different kinds of backups are done to tape:

• Full backup (level 0) – a complete copy of the database and control files
• Differential backup (level 1) – copy of the blocks changed since the latest level 0 or level 1 backup
• Cumulative backup (level 1) – copy of the blocks changed since the latest level 0 backup
• Archive logs – copy of the archive logs (files containing the log of the operations performed in the database)

For the backup on disk, Oracle RMAN copies all data files into the flash recovery area (typically on a different storage array than the data files); all subsequent backups are then differential. We use a technique called Incrementally Updated Backups to maintain this type of backup.

Oracle's block change tracking feature is used to significantly reduce the latency and cost of the incremental database backups (with this optimization, only changed database blocks are read during a backup).
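These two mechanisms can be sketched with the following commands; the tag and the change-tracking file location are hypothetical, chosen only to illustrate the pattern. The nightly run first rolls the previous incremental into the image copy kept in the flash recovery area and then takes a new level 1 incremental:

    -- SQL (run once, as SYSDBA): enable block change tracking so that
    -- incremental backups read only the blocks changed since the last backup.
    ALTER DATABASE ENABLE BLOCK CHANGE TRACKING USING FILE '+RECO1';

    -- RMAN: incrementally updated backup of the whole database on disk.
    RUN {
      RECOVER COPY OF DATABASE WITH TAG 'disk_copy';
      BACKUP INCREMENTAL LEVEL 1 FOR RECOVER OF COPY WITH TAG 'disk_copy' DATABASE;
    }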

The tape backup retention policy is set to 31 days. This guarantees that a copy of the database can be recovered within a time window of 31 days. In addition, we propose that a full backup be systematically performed and kept before any Oracle software upgrade. The schedule for the backup to tape is as follows:

• Full – every 2 weeks
• Incremental (differential or cumulative) – daily
• Archive logs – every 30 minutes
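An RMAN sketch corresponding to the tape policy above, assuming a Tivoli Storage Manager media-management channel (the TDPO option file path and channel settings are hypothetical), could look as follows:

    -- Keep 31 days of backups and route them through the TSM media manager.
    CONFIGURE RETENTION POLICY TO RECOVERY WINDOW OF 31 DAYS;
    CONFIGURE CHANNEL DEVICE TYPE 'SBT_TAPE'
      PARMS 'ENV=(TDPO_OPTFILE=/usr/tivoli/tsm/client/oracle/bin64/tdpo.opt)';

    -- Every two weeks: full (level 0) backup to tape.
    BACKUP DEVICE TYPE SBT INCREMENTAL LEVEL 0 DATABASE;

    -- Daily: level 1 (differential, or CUMULATIVE if requested) backup to tape.
    BACKUP DEVICE TYPE SBT INCREMENTAL LEVEL 1 DATABASE;

    -- Every 30 minutes: archived redo logs not yet backed up.
    BACKUP DEVICE TYPE SBT ARCHIVELOG ALL NOT BACKED UP 1 TIMES;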

The disk backup retention is set to 2 days and allows for database recoveries within this time frame. The schedule for the backup on disk is as follows:

• Full – at database creation
• Incremental – daily

A dedicated system for recoveries is available. It allows for periodic test recoveries, point-in-time recoveries (from disk, within a two-day window) and tape-backup based disaster recoveries. At present, a typical recovery proceeds at 30 MB/s, assuming that a channel for tape reading is available. In addition, an overhead of a couple of hours is estimated for the DBA to understand and analyse the problem. To give an idea, the recovery of a 300 GB database would take a total of about 5 hours (3 hours for reading from tape plus 2 hours of overhead).

7. Security and software update policies
Critical Patch Updates (CPUs) are the primary means of releasing security fixes for Oracle products. They are released on the Tuesday closest to the 15th day of January, April, July and October.

The physics database services follow the practice of regularly applying those patches in a timely manner, depending on the severity of the security vulnerabilities addressed. The proposed date for this planned intervention is announced to and negotiated with the user community.


Oracle recommends that each CPU is first deployed on a test system. In order to achieve timely deployment on the production systems, critical patch updates are typically applied within two weeks of the publishing date, after a validation period of one week on the validation/pre-production services.

Oracle software upgrades are typically performed once or twice per year. The new version is installed on the validation RAC and tested by the application owners and the Tier-1 sites, typically over a period of one month.

8. Replication set-up for the 3D project
The physics database services at CERN are heavily involved in the Distributed Database Deployment (3D) project [4]. In order to allow LHC data to flow through this distributed infrastructure, an asynchronous replication technique (Oracle Streams) is used to form a database backbone between online and offline and between Tier-0 and the Tier-1 sites. New or updated data from the online or offline database systems are detected from the database logs and then queued for transmission to all configured destination databases. Only once data has been successfully applied at all destination databases is it removed from the message queues at the source. A dedicated downstream capture set-up is used for replication over the wide area network, to further insulate the source database from problems in connecting to the Tier-1 replicas.
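As an illustration of how such a Streams set-up is typically watched from the source side (a simple sketch; the actual 3D monitoring shown in figure 4 is more elaborate), the state of the capture processes and of the propagations towards the Tier-1 destination queues can be read from the standard dictionary and dynamic views:

    -- State of the (downstream) capture processes.
    SELECT capture_name, state, total_messages_captured
    FROM   v$streams_capture;

    -- Status of the propagations towards each destination queue.
    SELECT propagation_name, destination_dblink, status
    FROM   dba_propagation;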

At present, all 10 Tier-1 sites involved in the project are in production, as shown in figure 4. Currently, the database service teams at CERN and at the Tier-1 sites provide 8/5 coverage for interventions. An archive-log retention of five days is set up to cover any replication problem which may occur outside working hours.

Figure 4. Streams dataflow and topology in the 3D streams monitor.


Figure 5. PhEDEx transfer rates using a 6-node (dual-CPU Xeon) RAC (left) and a dual-CPU quad-core server (right) as the database backend.

9. Service upgrade in 2008
The database services for physics have conducted extensive tests on a latest-generation server based on dual-CPU quad-core Xeon processors with 16 GB of FB-DIMM memory. The tests focused on CPU and memory access performance using an Oracle-generated workload. Results have been compared with a system representative of the infrastructure currently deployed in production, i.e. a 6-node RAC cluster of dual-CPU PIV Xeon servers with 4 GB of DDR2-400 memory each.

The results of the performance tests show that the quad-core servers deliver approximately a 5-fold performance increase over the existing production set-up for memory-intensive workloads (with negligible physical I/O in the workload). In other words, for memory- and CPU-intensive jobs a single quad-core server performs like a five-node RAC of our current set-up. Moreover, the tests have shown that the quad-core servers can also increase the performance of single-threaded CPU-intensive operations, thanks to improved CPU-to-memory access.

Figure 5 compares the performance of the dual-CPU quad-core server with that of the servers currently deployed at CERN, using PhEDEx (Physics Experiment Data Export), a highly distributed and transaction-oriented application used by CMS to control file replication between LCG grid sites. During the test, the software was managing artificial file transfers using a 6-node RAC from our current set-up (left) and the quad-core server (right) as the database backend. In the latter configuration, PhEDEx was able to handle transfers with an average rate of about 250 GB/s, compared to the 200 GB/s of the current set-up.

We have therefore decided to base our next order on dual-CPU quad-core servers running 64-bit Oracle server software, with RAC clusters of typically four nodes.

10. Conclusions
The database services for physics at CERN run production and integration services fully based on Oracle 10g RAC/Linux clusters. The architectural choice has been driven by the needs of the WLCG user community for reliability, performance and scalability.

We have deployed in production one of the biggest Oracle database cluster installations worldwide, well sized to match the needs of the experiments in 2008. This set-up is also connected to 10 Tier-1 sites hosting synchronized databases, with whom we share security update and backup policies and procedures, all in line with the WLCG service procedures.

11. References
[1] CERN IT PSS 2006 Oracle physics database services support levels (https://twiki.cern.ch/twiki/pub/PSSGroup/PhysicsDatabasesSection/service_levels.doc)
[2] The Oracle Real Application Clusters Special Interest Group (RAC SIG), http://wiki.oracle.com/page/Oracle+RAC+SIG?t=anon
[3] Shiers J, Lessons Learned From WLCG Service Deployment, to appear in the Proceedings of the International Conference on Computing in High Energy Physics (CHEP'07), Victoria BC, September 2007
[4] Duellmann D, Production Experience with Distributed Deployment of Databases for the LHC Computing Grid, to appear in the Proceedings of the International Conference on Computing in High Energy Physics (CHEP'07), Victoria BC, September 2007

International Conference on Computing in High Energy and Nuclear Physics (CHEPrsquo07) IOP PublishingJournal of Physics Conference Series 119 (2008) 052017 doi1010881742-65961195052017

7

CERN Database Services for the LHC Computing Grid

M Girone CERN IT Department CH-1211 Geneva 23 E-mail mariagironecernch Abstract Physics meta-data stored in relational databases play a crucial role in the Large Hadron Collider (LHC) experiments and also in the operation of the Worldwide LHC Computing Grid (WLCG) services A large proportion of non-event data such as detector conditions calibration geometry and production bookkeeping relies heavily on databases Also the core Grid services that catalogue and distribute LHC data cannot operate without a reliable database infrastructure at CERN and elsewhere The Physics Services and Support group at CERN provides database services for the physics community With an installed base of several TB-sized database clusters the service is designed to accommodate growth for data processing generated by the LHC experiments and LCG services During the last year the physics database services went through a major preparation phase for LHC start-up and are now fully based on Oracle clusters on IntelLinux Over 100 database server nodes are deployed today in some 15 clusters serving almost 2 million database sessions per week This paper will detail the architecture currently deployed in production and the results achieved in the areas of high availability consolidation and scalability Service evolution plans for the LHC start-up will also be discussed

1 Introduction Relational databases play a crucial role in the LHC experiments offline and online applications namely for detector conditions calibration geometry and production bookkeeping Also the core grid services for cataloguing monitoring and distributing LHC data cannot operate without a robust database infrastructure at CERN and elsewhere Relational databases are used at each level of the physics data handling process and are therefore of high importance for the smooth operation of the LHC experiments and the World LHC Computing Grid (WLCG) services The Physics Services and Support group at the CERN IT Department provides database services for the physics community [1]

The main challenges that the service is addressing are provide high availability for mission critical applications performance and scalability of several multi-terabyte applications contain hardware and database administration complexity and overall costs Moreover the CERN database services for physics are part of a distributed database network connecting several large scientific institutions (Tier1 sites) for applications such as conditions and file catalog which require synchronization between the master Tier0 and the read-only Tier1 databases Solid procedures for backup and recovery also need to be in place for such a service which is required for its importance to be operated with a 24 hours per day seven days per week (247) schedule

2 Service Architecture Oracle 10g Real Application Clusters (RAC) and Automatic Storage Manager (ASM) on Linux has been chosen as the main platform to deploy the database services for physics Each database cluster

International Conference on Computing in High Energy and Nuclear Physics (CHEPrsquo07) IOP PublishingJournal of Physics Conference Series 119 (2008) 052017 doi1010881742-65961195052017

ccopy 2008 IOP Publishing Ltd 1

Figure 1 Oracle 10g RAC and ASM allow to achieve high availability by building clusters with full redundancy and no single point of failure In addition clusters can be expanded to meet growth consists of several processing nodes accessing data shared in a storage area network (SAN) The cluster nodes use a dedicated network to share cached data blocks to minimize the number of disk operations A public network connects the database cluster to client applications which may execute queries in parallel on several nodes The setup provides important flexibility to expand the database server resources (CPU and storage independently) according to user needs This is particularly important during the early phases of LHC operation since several applications are still under development and data volume and access patterns may still change

In addition to its intrinsic scalability the cluster also significantly increases the database availability In case of individual node failures applications can failover to one of the remaining cluster nodes and many regular service intervention can be performed without database downtime node by node

Oracle ASM is a volume manager and specialized cluster file-system which allows the use of low-cost storage array to build scalable and highly available storage solution for Oracle 10g Oracle clusters are composed of a relatively large number of cluster nodes and storage elements with hardware homogeneous characteristics in a shared database storage (shared everything clustering technology)

A pool of servers storage arrays switches and network devices are used as lsquostandardrsquo building blocks This has the additional advantage of simplifying hardware provisioning and service growth At the same time an homogeneous software configuration is adopted for the operating system and database software Red Hat Enterprise Linux RHEL4 and Oracle 10g release 2 which simplifies the installation administration and troubleshooting This has allowed the service to quadruplicate its hardware resources within two years with the same database administrator team

The database architecture deployed at the CERN database services for physics team embodies the key ideas of grid computing as implemented by Oracle10g It uses the following key elements

bull Oracle 10g RAC on Linux to scale out Oracle workload (eg CPU power) bull Fiber Channel SAN network

o SAN multi-pathing for resilience and load balancing

Server

ss

SAN

Storage

International Conference on Computing in High Energy and Nuclear Physics (CHEPrsquo07) IOP PublishingJournal of Physics Conference Series 119 (2008) 052017 doi1010881742-65961195052017

2

o Storage can be configured online using SAN zoning bull Storage built with mid-range storage array with FC controllers and SATA HD bull Oracle ASM used as the volume manager and cluster file-system for Oracle

Storage striping and mirroring for performance and high availability (RAID 10)

A pictorial representation of a RAC cluster with ASM on Oracle 10g at CERNrsquos database service for physics can be found in figure 1 There are currently more than 110 dual process nodes more than 400 GB of RAM and 300TB of raw storage deployed by the database services for physics at CERN using this architecture Based on information presented at Oracle Special Interest Groups [2] and from feedback from customer visits to CERN we believe that this translates into one of the biggest Oracle database installations worldwide (Unfortunately very few sites make such information available)

3 Hardware configuration Oracle RAC clusters are typically deployed in production over 8 nodes Each node is a server with two x86 CPUs running 3GHz and with 4GB RAM Two redundant cluster interconnects are deployed with Gbps Ethernet Public networks are also on Gbps Ethernet The servers mount host bus adaptors that are dual-ported and connected to redundant SAN switches Access to the storage is configured via two (redundant) switch FC networks (2Gbps and 4 Gbps networks are used on different clusters) as shown in figure 2

The disk arrays contain lsquoconsumer-qualityrsquo SATA disks but have dual-ported Fiber Channel controllers Mirroring and striping are done at the software level with Oraclersquos (host based mirroring with ASM) This has the advantage to allow mirroring across two different storage arrays (and therefore storage controllers) Storage arrays are dual-ported where each port is connected to a different SAN switch The typical characteristics of the storage arrays used in production are 16 SATA disks with a raw capacity of 6 TB per array Storage is partitioned in two halves under Linux the outer (faster) half of the disk is used to build data disk group while the inner half is used to build the flash recovery area disk group Disk groups are created with ASM and used to allocate Oracle files data disk groups are used among others for data-files control-files and redo-logs

Figure 2 Schematic view of a 8-node RAC cluster Two redundant Gbps Ethernet switches are used for cluster interconnects Access to the storage is configured via two redundant fibre channel switches

International Conference on Computing in High Energy and Nuclear Physics (CHEPrsquo07) IOP PublishingJournal of Physics Conference Series 119 (2008) 052017 doi1010881742-65961195052017

3

4 Service levels and release cycles Three different levels of services are deployed to support Oracle database applications a development a validationpre-production and a production service Applications at different stages of their development move from the first to the last depending on their maturity and needs In addition the validationpre-production services are used also to test new Oracle server software releases and security update patches as described in section 6

This structure is in place since a few years and has proven to be very effective for assuring that only good quality applications be deployed in production with a clear benefit for the stability of the production services

The validationpre-production and the production services are deployed on Oracle 10g Real Application Clusters (RAC) on Linux Both the validation and production services are deployed on similar hardware with the same configuration OS and Oracle server versions Typically 2-node clusters are dedicated to the LHC experiments for validating their applications At the production level RAC clusters consisting of six or eight nodes are assigned

41 The development service The development service is a public service meant for pure software development It is characterized by a working hours only (85) monitoring and availability including backups and database administrator consultancy A limited database space is available Once the application code is stable enough and needs larger-scale tests it moves to the next service level as described below

42 The validationpre-production service The validation service is characterized by a two-node RAC dedicated to each LHC experiment and grid community 85 monitoring and availability DBA consultancy for optimization and quality-insurance of key experiment applications A time slot of a couple of weeks is granted to the developer to make full use of the cluster resources and to test herhis application under a realistic workload with the help of a database administrator from the team

On demand another two-node RAC cluster per experiment is set-up to host applications that need a longer phase of tests or which exceed the limits imposed on the shared development service in terms of data volume and workload We refer to this service as the pre-production service It should be noted that no backups are taken for the validationpre-production service

43 The production service The production service is characterized by 247 monitoring and availability backups as described in section 5 scheduled interventions for software upgrades as described in section 6 For deploying an application in production it is necessary to have validated it or define the use of readerwriter accountsroles the expected data volume number of simultaneous connections and total number of connections in the case it did not go through the validation procedure Best practices tips and guidelines are provided by the DBA team to help database developers In addition the experiments database coordinators receive a weekly report that summarizes the database activities in terms of sessions CPU physical and logical reads and writes

5 Monitoring and operations The databases for physics team deploys a mission critical service underlying many key services from the WLCG and from the LHC experiments Therefore a 247 production quality service is necessary Six database administrators are involved in this effort according to a weekly shift rotor on a ldquobest effortrdquo base All procedures are inline with the WLCG services ones [3]

The hardware is deployed at the IT computer centre the production database clusters are connected to critical power (UPS and diesel generators) The database service for physics at CERN rely on OS level monitoring and support from the computer centre operators the sys-administrators and network

International Conference on Computing in High Energy and Nuclear Physics (CHEPrsquo07) IOP PublishingJournal of Physics Conference Series 119 (2008) 052017 doi1010881742-65961195052017

4

teams At database level monitoring is achieved via Oracle Enterprise Manager and in-house developed scripts for ASM and oracle services This includes phone Short Messaging System (SMS) and email alerting system which allows a prompt reaction from the DBA team

6 Backup and recovery policies The database services for physics implement a backup strategy based on ORACLE RMAN 10g This strategy allows for backups to tape but also to disk (flash backup) thus reducing substantially the recovery time for many common recovery scenarios

The backup to tape uses the IBM Tivoli Storage Manager (TSM) Oracle RMAN has dedicated drivers to connect to TSM and manage backups stored in tapes in a special backup format There are different kinds of backups done to tape

bull Full backup (level 0) - a complete copy of the database and control files bull Differential backup (level 1) - copy of blocks changed since latest level 0 or level 1 backup bull Cumulative backup (level 1) - copy of blocks changed since latest level 0 backup bull Archive logs - copy of the archive logs (files containing the log of the operations done in the

database)

For the backup on disk Oracle RMAN copies all data files into the flash recovery area (typically on a different storage array than the data files) then all the subsequent backups are differential We use a technique called Incrementally Updated Backups to maintain this type of backup

The Oracles block change tracking feature is used to significantly reduce the latency and weight of the incremental DB backups (only changed DB blocks are read during a backup with this optimization)

The backup on tape retention policy is set to 31 days This guarantees that a copy of the database can be recovered in a time window of 31 days In addition we propose that a full backup is systematically performed and kept before any ORACLE software upgrade The schedule for the backup to tape is as follows

bull Full - every 2 weeks bull Incremental (differential or cumulative) ndash daily bull Archive logs - every 30 minutes

The backup on disk retention is set to 2 days and allows for database recoveries in this time frame The schedule for the backup on disk is as follows

bull Full - at database creation bull Incremental ndash daily

A dedicated system for recoveries is available The system allows for periodic test recoveries

point-in-time recoveries (from disk within a two days latency) and tape-backup based disaster recoveries At present a typical recovery takes 30MBsec assuming that a channel for tape reading is available In addition an overhead of a couple of hours is estimated for the DBA to understand and analyse the problem To give an idea the recovery of a 300GB database would take a total of 5 hours (3 for reading from tape plus 2 hours overhead)

7 Security and software updates policies Critical Patch Updates (CPU) are the primary means of releasing security fixes for Oracle products They are released on the Tuesday closest to the 15th day of January April July and October

The physics database services follow the practice of regularly applying those patches in a timely manner depending on the security vulnerability The proposed date for this planned intervention is announced and negotiated with the user community

International Conference on Computing in High Energy and Nuclear Physics (CHEPrsquo07) IOP PublishingJournal of Physics Conference Series 119 (2008) 052017 doi1010881742-65961195052017

5

Oracle recommends that each CPU is first deployed on a test system In order to achieve timely deployment on the production systems critical patch updates are applied typically within two weeks from the publishing date and after a validation period of one week on the validationpre-production services

Oracle software upgrades are typically performed once or twice per year The new version will be installed on the validation RAC and tested by the application owners and Tier 1 sites typically over a period of one month

8 Replication set-up for the 3D project The physics database services at CERN are heavily involved in the distributed database deployment (3D) project [4] In order to allow LHC data to flow through this distributed infrastructure an asynchronous replication technique (Oracle Streams) is used to form a database backbone between online and offline and between Tier-0 and the Tier-1 sites New or updated data from the online or offline databases systems are detected from database logs and then queued for transmission to all configured destination databases Only once data has been successfully applied at all destination databases it is removed from message queues at the source A dedicated downstream capture setup is used for replication via the wide area network to further insulate the source database in case of problems in connecting to Tier-1 replicas

At present all the 10 Tier1 sites involved in the project are in production as shown in figure 4 Currently the database service team at CERN and at the Tier1 sites provide a 85 coverage for interventions An archive-log retention of five days is set-up to cover any replication problem which may occur outside working hours

Figure 4 Streams dataflow and topology in the 3D streams monitor

International Conference on Computing in High Energy and Nuclear Physics (CHEPrsquo07) IOP PublishingJournal of Physics Conference Series 119 (2008) 052017 doi1010881742-65961195052017

6

Figure 5 PhEDEx transfer rates using a 6-node (dual-CPU Xeon) RAC (left) and a dual-CPU quad-core server as database backend

9 Service upgrade in 2008 The database services for physics has conducted extensive tests on a latest-generation server based on dual-CPU quad-core Xeon processors with 16 GB of FB-DIMM memory The tests have focused on CPU and memory access performance using a Oracle-generated workload Results have been compared with a system representative of the infrastructure currently deployed in production ie a 6-node RAC cluster of dual-CPU PIV Xeon servers with 4GB of DDR2 400 memory each

The results of the performance tests show that quad-core servers display approximately a 5 fold performance increase compared to the existing production for memory-intensive workload (negligible physical IO in the workload) In other words for memory and CPU intensive jobs a single-node quad-core server performs as five-node RAC of our current set-up Moreover tests have shown that the quad-core can also increase the overall performance of single-threaded CPU-intensive operations thanks to improved CPU-to-memory access

Figure 5 shows the dual-CPU quad-core server performance with respect to those currently deployed at CERN with PhEDEx (Physics Experiment Data Export) a highly distributed and transaction oriented application used by CMS to control file replication between LCG GRID sites During the test the software was managing artificial file transfers using a 6-node RAC from our current set-up (left) and the quad-core server (right) as the database backend In the latter configuration PhEDEx was able to handle transfers with a average rate of about 250 GBs compared to the 200 GBs of the current set-up

We have therefore decided to base our next order on dual-CPU quad-core servers running 64 bit Oracle server software with RAC clusters of typically four nodes

10 Conclusions The Database Services for physics at CERN run production and integration services fully based on Oracle 10g RACLinux clusters The architectural choice has been driven by the needs of the WLCG user community for reliability performance and scalability

We have deployed in production one of the biggest Oracle database cluster installations worldwide well sized to match the needs of the experiments in 2008 This set-up is also connected to 10 Tier1 sites for synchronized databases with whom we share security updates and backup polices and procedures all in line with the WLCG service procedures

11 References [1] CERN IT PSS 2006 Oracle physics database services support levels (httpstwikicernchtwikipubPSSGroupPhysicsDatabasesSectionservice_levelsdoc) [2] The Oracle Real Application Clusters Special Interest Group (RAC SIG) httpwikioraclecompageOracle+RAC+SIGt=anon [3] Lessons Learned From WLCG Service Deployment Jamie Shiers to appear in the Proceedings of the International Conference on Computing in High Energy Physics Victoria BC September 2007 [4] Production Experience with Distributed Deployment of Databases for the LHC Computing Grid Dirk Duellmann to appear in the Proceedings of the International Conference on Computing in High Energy Physics Victoria BC September 2007

International Conference on Computing in High Energy and Nuclear Physics (CHEPrsquo07) IOP PublishingJournal of Physics Conference Series 119 (2008) 052017 doi1010881742-65961195052017

7

Figure 1 Oracle 10g RAC and ASM allow to achieve high availability by building clusters with full redundancy and no single point of failure In addition clusters can be expanded to meet growth consists of several processing nodes accessing data shared in a storage area network (SAN) The cluster nodes use a dedicated network to share cached data blocks to minimize the number of disk operations A public network connects the database cluster to client applications which may execute queries in parallel on several nodes The setup provides important flexibility to expand the database server resources (CPU and storage independently) according to user needs This is particularly important during the early phases of LHC operation since several applications are still under development and data volume and access patterns may still change

In addition to its intrinsic scalability the cluster also significantly increases the database availability In case of individual node failures applications can failover to one of the remaining cluster nodes and many regular service intervention can be performed without database downtime node by node

Oracle ASM is a volume manager and specialized cluster file-system which allows the use of low-cost storage array to build scalable and highly available storage solution for Oracle 10g Oracle clusters are composed of a relatively large number of cluster nodes and storage elements with hardware homogeneous characteristics in a shared database storage (shared everything clustering technology)

A pool of servers storage arrays switches and network devices are used as lsquostandardrsquo building blocks This has the additional advantage of simplifying hardware provisioning and service growth At the same time an homogeneous software configuration is adopted for the operating system and database software Red Hat Enterprise Linux RHEL4 and Oracle 10g release 2 which simplifies the installation administration and troubleshooting This has allowed the service to quadruplicate its hardware resources within two years with the same database administrator team

The database architecture deployed at the CERN database services for physics team embodies the key ideas of grid computing as implemented by Oracle10g It uses the following key elements

bull Oracle 10g RAC on Linux to scale out Oracle workload (eg CPU power) bull Fiber Channel SAN network

o SAN multi-pathing for resilience and load balancing

Server

ss

SAN

Storage

International Conference on Computing in High Energy and Nuclear Physics (CHEPrsquo07) IOP PublishingJournal of Physics Conference Series 119 (2008) 052017 doi1010881742-65961195052017

2

o Storage can be configured online using SAN zoning bull Storage built with mid-range storage array with FC controllers and SATA HD bull Oracle ASM used as the volume manager and cluster file-system for Oracle

Storage striping and mirroring for performance and high availability (RAID 10)

A pictorial representation of a RAC cluster with ASM on Oracle 10g at CERNrsquos database service for physics can be found in figure 1 There are currently more than 110 dual process nodes more than 400 GB of RAM and 300TB of raw storage deployed by the database services for physics at CERN using this architecture Based on information presented at Oracle Special Interest Groups [2] and from feedback from customer visits to CERN we believe that this translates into one of the biggest Oracle database installations worldwide (Unfortunately very few sites make such information available)

3 Hardware configuration Oracle RAC clusters are typically deployed in production over 8 nodes Each node is a server with two x86 CPUs running 3GHz and with 4GB RAM Two redundant cluster interconnects are deployed with Gbps Ethernet Public networks are also on Gbps Ethernet The servers mount host bus adaptors that are dual-ported and connected to redundant SAN switches Access to the storage is configured via two (redundant) switch FC networks (2Gbps and 4 Gbps networks are used on different clusters) as shown in figure 2

The disk arrays contain lsquoconsumer-qualityrsquo SATA disks but have dual-ported Fiber Channel controllers Mirroring and striping are done at the software level with Oraclersquos (host based mirroring with ASM) This has the advantage to allow mirroring across two different storage arrays (and therefore storage controllers) Storage arrays are dual-ported where each port is connected to a different SAN switch The typical characteristics of the storage arrays used in production are 16 SATA disks with a raw capacity of 6 TB per array Storage is partitioned in two halves under Linux the outer (faster) half of the disk is used to build data disk group while the inner half is used to build the flash recovery area disk group Disk groups are created with ASM and used to allocate Oracle files data disk groups are used among others for data-files control-files and redo-logs

Figure 2 Schematic view of a 8-node RAC cluster Two redundant Gbps Ethernet switches are used for cluster interconnects Access to the storage is configured via two redundant fibre channel switches

International Conference on Computing in High Energy and Nuclear Physics (CHEPrsquo07) IOP PublishingJournal of Physics Conference Series 119 (2008) 052017 doi1010881742-65961195052017

3

4 Service levels and release cycles Three different levels of services are deployed to support Oracle database applications a development a validationpre-production and a production service Applications at different stages of their development move from the first to the last depending on their maturity and needs In addition the validationpre-production services are used also to test new Oracle server software releases and security update patches as described in section 6

This structure is in place since a few years and has proven to be very effective for assuring that only good quality applications be deployed in production with a clear benefit for the stability of the production services

The validationpre-production and the production services are deployed on Oracle 10g Real Application Clusters (RAC) on Linux Both the validation and production services are deployed on similar hardware with the same configuration OS and Oracle server versions Typically 2-node clusters are dedicated to the LHC experiments for validating their applications At the production level RAC clusters consisting of six or eight nodes are assigned

41 The development service The development service is a public service meant for pure software development It is characterized by a working hours only (85) monitoring and availability including backups and database administrator consultancy A limited database space is available Once the application code is stable enough and needs larger-scale tests it moves to the next service level as described below

42 The validationpre-production service The validation service is characterized by a two-node RAC dedicated to each LHC experiment and grid community 85 monitoring and availability DBA consultancy for optimization and quality-insurance of key experiment applications A time slot of a couple of weeks is granted to the developer to make full use of the cluster resources and to test herhis application under a realistic workload with the help of a database administrator from the team

On demand another two-node RAC cluster per experiment is set-up to host applications that need a longer phase of tests or which exceed the limits imposed on the shared development service in terms of data volume and workload We refer to this service as the pre-production service It should be noted that no backups are taken for the validationpre-production service

43 The production service The production service is characterized by 247 monitoring and availability backups as described in section 5 scheduled interventions for software upgrades as described in section 6 For deploying an application in production it is necessary to have validated it or define the use of readerwriter accountsroles the expected data volume number of simultaneous connections and total number of connections in the case it did not go through the validation procedure Best practices tips and guidelines are provided by the DBA team to help database developers In addition the experiments database coordinators receive a weekly report that summarizes the database activities in terms of sessions CPU physical and logical reads and writes

5 Monitoring and operations The databases for physics team deploys a mission critical service underlying many key services from the WLCG and from the LHC experiments Therefore a 247 production quality service is necessary Six database administrators are involved in this effort according to a weekly shift rotor on a ldquobest effortrdquo base All procedures are inline with the WLCG services ones [3]

The hardware is deployed at the IT computer centre the production database clusters are connected to critical power (UPS and diesel generators) The database service for physics at CERN rely on OS level monitoring and support from the computer centre operators the sys-administrators and network

International Conference on Computing in High Energy and Nuclear Physics (CHEPrsquo07) IOP PublishingJournal of Physics Conference Series 119 (2008) 052017 doi1010881742-65961195052017

4

teams At database level monitoring is achieved via Oracle Enterprise Manager and in-house developed scripts for ASM and oracle services This includes phone Short Messaging System (SMS) and email alerting system which allows a prompt reaction from the DBA team

6 Backup and recovery policies The database services for physics implement a backup strategy based on ORACLE RMAN 10g This strategy allows for backups to tape but also to disk (flash backup) thus reducing substantially the recovery time for many common recovery scenarios

The backup to tape uses the IBM Tivoli Storage Manager (TSM) Oracle RMAN has dedicated drivers to connect to TSM and manage backups stored in tapes in a special backup format There are different kinds of backups done to tape

bull Full backup (level 0) - a complete copy of the database and control files bull Differential backup (level 1) - copy of blocks changed since latest level 0 or level 1 backup bull Cumulative backup (level 1) - copy of blocks changed since latest level 0 backup bull Archive logs - copy of the archive logs (files containing the log of the operations done in the

database)

For the backup on disk Oracle RMAN copies all data files into the flash recovery area (typically on a different storage array than the data files) then all the subsequent backups are differential We use a technique called Incrementally Updated Backups to maintain this type of backup

The Oracles block change tracking feature is used to significantly reduce the latency and weight of the incremental DB backups (only changed DB blocks are read during a backup with this optimization)

The backup on tape retention policy is set to 31 days This guarantees that a copy of the database can be recovered in a time window of 31 days In addition we propose that a full backup is systematically performed and kept before any ORACLE software upgrade The schedule for the backup to tape is as follows

bull Full - every 2 weeks bull Incremental (differential or cumulative) ndash daily bull Archive logs - every 30 minutes

The backup on disk retention is set to 2 days and allows for database recoveries in this time frame The schedule for the backup on disk is as follows

bull Full - at database creation bull Incremental ndash daily

A dedicated system for recoveries is available The system allows for periodic test recoveries

point-in-time recoveries (from disk within a two days latency) and tape-backup based disaster recoveries At present a typical recovery takes 30MBsec assuming that a channel for tape reading is available In addition an overhead of a couple of hours is estimated for the DBA to understand and analyse the problem To give an idea the recovery of a 300GB database would take a total of 5 hours (3 for reading from tape plus 2 hours overhead)

7 Security and software updates policies Critical Patch Updates (CPU) are the primary means of releasing security fixes for Oracle products They are released on the Tuesday closest to the 15th day of January April July and October

The physics database services follow the practice of regularly applying those patches in a timely manner depending on the security vulnerability The proposed date for this planned intervention is announced and negotiated with the user community


Oracle recommends that each CPU is first deployed on a test system. In order to achieve timely deployment on the production systems, critical patch updates are typically applied within two weeks of the publishing date, after a validation period of one week on the validation/pre-production services.

Oracle software upgrades are typically performed once or twice per year. The new version is installed on the validation RAC and tested by the application owners and the Tier 1 sites, typically over a period of one month.

8 Replication set-up for the 3D project
The physics database services at CERN are heavily involved in the distributed database deployment (3D) project [4]. In order to allow LHC data to flow through this distributed infrastructure, an asynchronous replication technique (Oracle Streams) is used to form a database backbone between online and offline databases and between the Tier-0 and the Tier-1 sites. New or updated data from the online or offline database systems are detected from the database logs and then queued for transmission to all configured destination databases. Only once the data have been successfully applied at all destination databases are they removed from the message queues at the source. A dedicated downstream capture set-up is used for replication via the wide area network, to further insulate the source database in case of problems in connecting to the Tier-1 replicas.
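The key guarantee in this scheme is that a change leaves the source queue only after every destination has applied it. The toy model below illustrates that rule only; it is not Oracle Streams, and the class, method and site names are invented for the illustration.

    # Toy model of the propagation rule used by the replication backbone:
    # a change is dropped from the source queue only once every destination
    # has applied it. Not Oracle Streams; names are illustrative only.
    from collections import deque

    class SourceQueue:
        def __init__(self, destinations):
            self.destinations = set(destinations)     # names of replica sites
            self.queue = deque()                      # pending logical changes
            self.acks = {}                            # change id -> set of acks

        def enqueue(self, change_id, payload):
            self.queue.append((change_id, payload))
            self.acks[change_id] = set()

        def acknowledge(self, change_id, destination):
            self.acks[change_id].add(destination)
            # Trim the head of the queue only while it is applied everywhere.
            while self.queue and self.acks[self.queue[0][0]] == self.destinations:
                done_id, _ = self.queue.popleft()
                del self.acks[done_id]

    if __name__ == "__main__":
        q = SourceQueue(["site_a", "site_b", "site_c"])
        q.enqueue(1, "conditions update")
        q.acknowledge(1, "site_a")
        q.acknowledge(1, "site_b")
        print(len(q.queue))   # 1: one destination has not applied the change yet
        q.acknowledge(1, "site_c")
        print(len(q.queue))   # 0: safe to remove from the source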

At present all 10 Tier-1 sites involved in the project are in production, as shown in figure 4. Currently the database service teams at CERN and at the Tier-1 sites provide 8x5 coverage for interventions. An archive-log retention of five days is set up to cover any replication problem which may occur outside working hours.

Figure 4. Streams dataflow and topology in the 3D streams monitor.


Figure 5. PhEDEx transfer rates using a 6-node (dual-CPU Xeon) RAC (left) and a dual-CPU quad-core server (right) as the database backend.

9 Service upgrade in 2008
The database services for physics have conducted extensive tests on a latest-generation server based on dual-CPU quad-core Xeon processors with 16 GB of FB-DIMM memory. The tests focused on CPU and memory access performance, using an Oracle-generated workload. The results have been compared with a system representative of the infrastructure currently deployed in production, i.e. a 6-node RAC cluster of dual-CPU PIV Xeon servers with 4 GB of DDR2-400 memory each.

The results of the performance tests show that the quad-core server delivers approximately a 5-fold performance increase over the existing production hardware for memory-intensive workloads (with negligible physical I/O in the workload). In other words, for memory- and CPU-intensive jobs a single quad-core node performs like a five-node RAC of our current set-up. Moreover, the tests have shown that the quad-core server can also increase the overall performance of single-threaded, CPU-intensive operations, thanks to improved CPU-to-memory access.

Figure 5 shows the performance of the dual-CPU quad-core server with respect to the servers currently deployed at CERN, using PhEDEx (Physics Experiment Data Export), a highly distributed and transaction-oriented application used by CMS to control file replication between LCG Grid sites. During the test the software was managing artificial file transfers, using a 6-node RAC from our current set-up (left) and the quad-core server (right) as the database backend. In the latter configuration PhEDEx was able to handle transfers at an average rate of about 250 GB/s, compared to 200 GB/s with the current set-up.

We have therefore decided to base our next order on dual-CPU quad-core servers running 64-bit Oracle server software, with RAC clusters of typically four nodes.

10 Conclusions
The Database Services for physics at CERN run production and integration services fully based on Oracle 10g RAC/Linux clusters. The architectural choice has been driven by the needs of the WLCG user community for reliability, performance and scalability.

We have deployed in production one of the biggest Oracle database cluster installations worldwide, well sized to match the needs of the experiments in 2008. This set-up is also connected to 10 Tier-1 sites for synchronized databases, with whom we share security updates and backup policies and procedures, all in line with the WLCG service procedures.

11 References
[1] CERN IT PSS 2006 Oracle physics database services support levels (https://twiki.cern.ch/twiki/pub/PSSGroup/PhysicsDatabasesSection/service_levels.doc)
[2] The Oracle Real Application Clusters Special Interest Group (RAC SIG) http://wiki.oracle.com/page/Oracle+RAC+SIG?t=anon
[3] J Shiers, Lessons Learned From WLCG Service Deployment, to appear in the Proceedings of the International Conference on Computing in High Energy Physics, Victoria, BC, September 2007
[4] D Duellmann, Production Experience with Distributed Deployment of Databases for the LHC Computing Grid, to appear in the Proceedings of the International Conference on Computing in High Energy Physics, Victoria, BC, September 2007
