CERN Computing Infrastructure Evolution
Tim Bell, Helge Meinhard (Tim.Bell (at) cern.ch, Helge.Meinhard (at) cern.ch)
DESY HH Computing Symposium
17th December 2012
Agenda
• Challenge to address
• Ongoing projects
  – Data centre expansion
  – Configuration management
  – Infrastructure as a Service
  – Monitoring
• Timelines
• Summary
Challenge to Address
The Challenge
• LHC will be upgraded from 8 TeV to 14 TeV
  – The accelerator will be shut down from February 2013 to the end of 2014
• Data rates are expected to increase significantly
  – Luminosity expected above the design value of 10^34 cm^-2 s^-1
  – Not yet clear whether the 14 TeV machine will run with 50 ns or 25 ns bunch spacing
  – Trigger rates are expected to double
• Computing requirements will need to match
  – Moore's law alone may not be enough
• The experience with LEP can help to understand the trends
LEP Computing Trends Hint at the LHC Needs
[Chart: CPU growth at CERN to the end of LEP. X-axis: week (1987 to 2000); y-axis: CPU capacity in SpecInt2000 hours/week, 0 to 50,000. Series: total capacity and CPU time delivered.]
Data Centre by Numbers
• Hardware installation & retirement
  – ~7,000 hardware movements/year; ~1,800 disk failures/year
[Pie charts: processor mix of the server fleet (Xeon L5520 33%, Xeon E5410 16%, Xeon E5345 14%, Xeon 5160 10%, Xeon L5420 8%, Xeon E5335 7%, Xeon E5405 6%, Xeon 3 GHz 4%, Xeon 5150 2%) and disk vendor mix (Western Digital 59%, Hitachi 23%, Seagate 15%, Fujitsu 3%, HP 0%, Maxtor 0%, Other 0%)]
High-speed routers (640 Mbps → 2.4 Tbps): 24
Ethernet switches: 350
10 Gbps ports: 2,000
Switching capacity: 4.8 Tbps
1 Gbps ports: 16,939
10 Gbps ports: 558
Racks: 828
Servers: 11,728
Processors: 15,694
Cores: 64,238
HEPSpec06: 482,507
Disks: 64,109
Raw disk capacity (TiB): 63,289
Memory modules: 56,014
Memory capacity (TiB): 158
RAID controllers: 3,749
Tape drives: 160
Tape cartridges: 45,000
Tape slots: 56,000
Tape capacity (TiB): 73,000
IT power consumption: 2,456 kW
Total power consumption: 3,890 kW
Data Storage Growth
• 6 GB/s average, 25 GB/s peaks
• 1.2 PB/week to tape
• ~35 PB expected to be recorded in 2012
• >20 years retention
45,000 tapes holding 80 PB of physics data
On-going Projects: Data Centre Expansion
New Data Centre to Expand Capacity
• The data centre in Geneva is at the limit of its electrical capacity of 3.5 MW
• Additional 2.7 MW of usable power
• Hands-off facility
• Deploying from 2013, with 2 × 100 Gbps circuits to CERN
Wigner Data Centre Location
• Within easy reach of the airport and city centre
• Huge area within the fence
• Highly secure area
• Area reserved for research institutions
Artist's Impression
Technical Overview
• New facility due to be ready at the end of 2012
• 1,100 m² (725 m²) in an existing building, but new infrastructure
• 4 (3) blocks of 275 m², each with six rows of 21 racks, an average power density of 10 kW and A+B feeds to all racks
• 2 independent HV lines to the site
• Full UPS and diesel coverage for all IT load (and cooling)
  – 3 UPS systems per block (A+B for IT and one for cooling and infrastructure systems)
• Maximum 2.7 MW
• In-row cooling units with N+1 redundancy per row (N+2 per block)
• N+1 chillers with free-cooling technology (under 18 °C*)
• Well-defined team structure for support
• Fire detection and suppression for IT and electrical rooms
• Multi-level access control: site, DC area, building, room
• Estimated PUE of 1.5
Data Centre Layout & Ramp-up
WAN Routing
On-going Projects: Configuration Management
Time to Change Strategy
• Rationale
  – Need to manage twice as many servers as today
  – No increase in staff numbers
  – Tools currently in use are becoming increasingly brittle and will not scale as-is
• Approach
  – We are no longer a special case for compute
  – Adopt a tool-chain model using existing open-source tools
  – If we have special requirements, challenge them again and again
  – If useful, make generic and contribute back to the community
  – Address the urgent needs first, in
    • Configuration management
    • Monitoring
    • Infrastructure as a Service
Building Blocks
[Diagram: the Agile Infrastructure tool chain, built from existing components: Puppet, Foreman, PuppetDB, mcollective, yum, git, OpenStack Nova, AIMS/PXE, Koji, Mock, yum repositories (Pulp), Bamboo, JIRA, Lemon/Hadoop, a hardware database and Active Directory/LDAP]
Configuration: Choose Puppet
• The tool space has exploded in the last few years
  – In configuration management and operations
• Puppet and Chef are the clear leaders for 'core tools'
• Many large organisations now use Puppet
  – Its declarative approach fits what we're used to at CERN
  – Large installations; a friendly, broad-based community
  – You can buy books on it
  – You can employ people who know it well
Puppet Experience
• Excellent: basic Puppet is easy to set up and scales up well
• Well documented; configuring services with it is easy
• Handles our cluster diversity and dynamic clouds well
• Lots of resources ("modules") online, though of varying quality
• Large, responsive community to help
• Lots of nice tooling for free
  – Configuration version control and branching: integrates well with git
  – Dashboard: we use the Foreman dashboard
• We have started to deploy production services with Puppet
• We are moving all our production services over in 2013
Additional Puppet-Related Functionality
• Hiera provides a simple way to do conditional configuration
  – e.g. NTP server = NTP1 if the machine is in Geneva, NTP2 if it is in Budapest
• PuppetDB provides a configuration data warehouse
  – "Show me all the machines which have less than 4 GB of memory" (see the sketch below)
• Mcollective
  – "Run this command on any machine which has a RAID card"
• Other OS support
  – Testing on Mac and Linux desktops
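To illustrate the kind of query the configuration data warehouse supports, here is a minimal sketch (not a CERN tool) that asks PuppetDB's HTTP query API for a memory fact and lists nodes below a threshold. The host name is hypothetical, and the API version and fact name (memorysize_mb) are assumptions that may differ between PuppetDB and Facter releases.

```python
# Hedged sketch: list nodes whose 'memorysize_mb' fact is below 4 GB,
# using PuppetDB's HTTP query API. Host, port, API version and fact name
# are assumptions; adjust to the local installation. (Python 2 era code.)
import json
import urllib2

PUPPETDB = "http://puppetdb.example.cern.ch:8080"  # hypothetical endpoint

def nodes_below(threshold_mb=4096):
    # /v2/facts/<name> returns one entry per node reporting that fact:
    # [{"certname": "...", "name": "memorysize_mb", "value": "..."}, ...]
    reply = urllib2.urlopen(PUPPETDB + "/v2/facts/memorysize_mb")
    facts = json.loads(reply.read())
    return [f["certname"] for f in facts if float(f["value"]) < threshold_mb]

if __name__ == "__main__":
    for host in sorted(nodes_below()):
        print(host)
```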
On-going Projects: Infrastructure as a Service
Prepare the Move to the Clouds
• Improve operational efficiency
  – Machine reception and testing
  – Hardware interventions with long-running programs
  – Demand for multiple operating systems
• Improve resource efficiency
  – Exploit idle resources, especially those waiting for tape I/O
  – Highly variable load, such as interactive or build machines
• Improve responsiveness
  – Self-service
  – Coffee-break response time
Public Procurement Purchase Model

Step | Time (days) | Elapsed (days)
User expresses requirement | | 0
Market survey prepared | 15 | 15
Market survey for possible vendors | 30 | 45
Specifications prepared | 15 | 60
Vendor responses | 30 | 90
Test systems evaluated | 30 | 120
Offers adjudicated | 10 | 130
Finance committee | 30 | 160
Hardware delivered | 90 | 250
Burn-in and acceptance | 30 typical (380 elapsed worst case) | 280
Total | | 280+ days
Service Model
• Pets are given names like pussinboots.cern.ch
  – They are unique, lovingly hand-raised and cared for
  – When they get ill, you nurse them back to health
• Cattle are given numbers like vm0042.cern.ch
  – They are almost identical to other cattle
  – When they get ill, you get another one
• Future application architectures tend towards cattle, but pets with configuration management are also viable
What is OpenStack?
• OpenStack is a cloud operating system that controls large pools of compute, storage, and networking resources throughout a datacentre, all managed through a dashboard that gives administrators control while empowering their users to provision resources through a web interface.
Principles
• Open Source
  – Apache 2.0 license, NO 'enterprise' version
• Open Design
  – Open Design Summit; anyone can help define the core architecture
• Open Development
  – Anyone can get involved in the development process via Launchpad & GitHub
• Open Community
  – OpenStack Foundation established in 2012; now 190+ companies, 3,000+ developers, 6,000+ members
Foundation Governance
OpenStack Cloud Components
• Nova – compute layer, like Amazon EC2
• Swift – object store, like Amazon S3
• Quantum – networking, such as SDN or load balancing
• Horizon – dashboard GUI
• Cinder – block storage, like Amazon EBS
• Keystone – identity
• Glance – image management

• Each component has an API and is pluggable (see the sketch below)
• Other non-core projects, such as load balancers and usage tracking, interact with these components
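As an illustration of "each component has an API", the sketch below boots a single VM through the Nova API using the python-novaclient bindings of that era. The credentials, Keystone endpoint, image and flavor names are placeholders, not CERN values.

```python
# Hedged sketch: boot one VM via the Nova API with python-novaclient
# (the v1_1 bindings current at the time). All names below are placeholders.
from novaclient.v1_1 import client

nova = client.Client("myuser", "mypassword", "mytenant",
                     "http://keystone.example.cern.ch:5000/v2.0",
                     service_type="compute")

image = nova.images.find(name="SLC6 Server")   # hypothetical image name
flavor = nova.flavors.find(name="m1.small")    # standard small flavor

# Create the instance; Glance supplies the image, Keystone the token,
# and the Nova scheduler picks a hypervisor.
server = nova.servers.create("vm0042", image, flavor)
print("Requested %s, status %s" % (server.name, server.status))
```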
Many OpenStack Components to Configure
[Diagram: interactions between the OpenStack components – Nova (compute, scheduler, network, volume), Glance (image, registry), Keystone and Horizon]
When Communities Combine…
• OpenStack's many components and options make configuration complex out of the box
• Puppet Forge module from Puppet Labs (thanks, Dan Bode)
• The Foreman adds a user kiosk for OpenStack provisioning
Scaling up with Puppet and OpenStack
• Use LHC@Home, based on BOINC, to simulate the magnets guiding particles around the LHC
• Naturally, there is a Puppet module, puppet-boinc
• 1,000 VMs spun up with Puppet, Foreman and OpenStack to stress-test the hypervisors (a sketch of such a loop follows below)
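The 1,000-VM test was driven through the same Nova API; below is a rough, hypothetical sketch of such a loop (credentials, image and flavor names are made up), with Puppet and the puppet-boinc module configuring each VM as a BOINC worker once it boots.

```python
# Hedged sketch of a hypervisor stress test: request many identical "cattle"
# VMs in a loop. Credentials, image and flavor names are placeholders.
from novaclient.v1_1 import client

nova = client.Client("myuser", "mypassword", "mytenant",
                     "http://keystone.example.cern.ch:5000/v2.0",
                     service_type="compute")

image = nova.images.find(name="SLC6-boinc")   # assumed image; Puppet applies
flavor = nova.flavors.find(name="m1.small")   # the puppet-boinc module on boot

for i in range(1000):
    # Each VM registers with Foreman/Puppet and joins LHC@Home when it starts.
    nova.servers.create("boinc-worker-%04d" % i, image, flavor)
```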
Opportunistic Clouds in Online Experiment Farms
• The CERN experiments have farms of thousands of Linux servers close to the detectors to filter the 1 PB/s down to the 6 GB/s to be recorded to tape
• When the accelerator is not running, these machines are currently idle
  – The accelerator has regular maintenance slots of several days
  – Long shutdown due from March 2013 to early 2015
• One of the experiments has deployed OpenStack on its farm
  – Simulation (low I/O, high CPU)
  – Analysis (high I/O, high CPU, high network)
Federated European Clouds
• Two significant European projects around federated clouds
  – European Grid Initiative Federated Cloud: a federation of grid sites providing IaaS
  – HELiX Nebula: a European Union funded project to create a scientific cloud based on commercial providers
EGI Federated Cloud Sites
CESGA, CESNET, INFN, SARA, Cyfronet, FZ Jülich, SZTAKI, IPHC, GRIF, GRNET, KTH, Oxford, GWDG, IGI, TCD, IN2P3, STFC
On-going Projects: Monitoring
Monitoring Evolution
• Motivation
  – Several independent monitoring activities in IT
  – Based on different tool chains but sharing the same limitations
  – High-level services are interdependent
  – Combination of data and complex analysis necessary
    • Quickly answering questions you hadn't thought of when the data were recorded
• Challenge
  – Find a shared architecture and tool-chain components
  – Adopt existing tools and avoid home-grown solutions
  – Aggregate monitoring data in a large data store
  – Correlate monitoring data and make it easy to access
Monitoring
[Diagram: monitoring architecture – sensors and publishers feed an aggregation layer (Apollo message broker); data flows into storage and analysis back-ends (Hadoop, Oracle, Splunk, Lemon) and out through storage, alarm and custom feeds to a portal, alarm handling and reports, with application-specific monitoring layered on top]
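As a concrete illustration of the publisher-to-broker leg of the architecture sketched above, here is a minimal sketch of a sensor publishing one metric to a STOMP-capable broker such as Apollo, using the stomp.py client library. The broker host, credentials, destination name and message layout are assumptions, and the stomp.py call signatures vary somewhat between library versions.

```python
# Hedged sketch: publish one monitoring metric to a STOMP broker (e.g. Apollo).
# Host, credentials, destination and payload layout are assumptions; the
# stomp.py API shown here follows its 4.x releases.
import json
import socket
import time

import stomp  # third-party STOMP client library

conn = stomp.Connection([("apollo.example.cern.ch", 61613)])  # hypothetical broker
conn.start()
conn.connect("sensor", "secret", wait=True)

metric = {
    "host": socket.getfqdn(),
    "metric": "load_one",
    "value": 0.42,                 # value a real sensor would measure
    "timestamp": int(time.time()),
}

# Consumers (aggregation, storage and alarm feeds) subscribe to this destination.
conn.send(destination="/topic/monitoring.metrics", body=json.dumps(metric))
conn.disconnect()
```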
Final Remarks, Timeline, Conclusions
Collaboration and Communities
• CERN's Agile Infrastructure project contributes back to the open-source communities
  – Thus, much of the collaboration focus is world-wide and industry-independent
  – Public mailing lists have more experts
  – There is much interest in our use cases and problems
• However, some problems are specific to our communities
  – Grid-specific software
  – Batch systems
• Collaborations already starting around
  – OpenStack with BNL, IN2P3, NECTaR (Australia), IHEP (China), Atlas/CMS trigger teams
  – Puppet with LBNL, BNL, Atlas Online/Offline; results shared at http://github.com/cernops and [email protected]
Ongoing Activities
• Preparation for production
  – Hardening infrastructure for high availability
• Configuration management
  – Migrate to Puppet 3.0
  – Assist experiments to convert from Quattor to Puppet
• Cloud
  – Accounting and quota management, in collaboration with BARC and Rackspace
  – Investigating Gluster, NetApp and Ceph integration for block storage
  – Load balancing
  – Federation with online trigger farms
• Prepare for scale
  – 15,000 hypervisors running 100,000-300,000 VMs by 2015
Timelines
Year | What | Actions
2012 | | Establish IaaS in Meyrin data centre; monitoring implementation; early adopters to Agile Infrastructure
2013 | Long Shutdown 1; new data centre | Production Agile Infrastructure; extend IaaS to remote CC; business continuity; support experiment re-work; migrate existing virtualisation solutions; general migration to Agile with SLC6
2014 | To the end of LS1 | Phase out Quattor/CDB/…; remote data centre fully operational; towards 15K hypervisors and over 100K VMs
Conclusions
• CERN's computing needs continue to grow
• The second data centre in Budapest provides the facilities
• A move to popular open-source tools used widely elsewhere provides a framework to deliver these new capabilities while maintaining staff numbers
• Enabling cloud computing offers many opportunities to users and service providers
• Collaboration with other sites is welcome
Questions?
CHEP 2012 paper on Agile Infrastructure: http://cern.ch/go/N8wp
CERN Puppet configurations: http://github.com/cernops
HELiX Nebula: http://helix-nebula.eu/
EGI Cloud Taskforce: https://wiki.egi.eu/wiki/Fedcloud-tf