Alex Ninaber - Science and Technology Facilities Council · PDF file• End-to-End...


ClusterVision Engineer Innovate Integrate

100% Focus on HPC Cluster Solutions

Alex Ninaber

Technical Manager, ClusterVision


Overview

About ClusterVision

ClusterVision Technologies:
• Remote Administration
• Power Saving HPC Chassis
• Application Resource Balancing

Other Technologies:
• Bright Cluster Manager

ClusterVision Machine Evaluation Workshop 2012


About ClusterVision

• Specialists in Compute Clusters
• End-to-End solutions hardware & software: advice, design, onsite implementation, support
• ClusterVision: 38 employees
• Bread & Butter: European tenders
• 4 focus regions: United Kingdom, France, Germany, Benelux
• Active in whole of EMEA & Asia: customers in Saudi Arabia, India, Qatar, Norway, Sweden, Finland, Spain, Switzerland, Ireland, Italy, Austria, etc.
• ISO9001:2008 & ISO14001 certified
• Financially strong & growing
• More than 400 projects and 250 customers in 10 years
• Direct flights between Amsterdam and United Kingdom: Leeds, London, Bristol, Durham, Newcastle, Glasgow, Manchester, Liverpool, Edinburgh, Southampton, Birmingham, Belfast, Aberdeen, Cardiff, Exeter, Humberside, Dundee, Isle of Man, Guernsey, Norwich, Kent
• UK: Growth of development, project and (pre)-sales team


ClusterVision: End-to-End HPC Cluster Solutions

Diagram labels: Capability Assessment; Benchmarking; New System Design; Replacement & Upgrades; Assembly; Configuration; POC; Racking; Provisioning; Compatibility; Certification; Project Management; Application Analytics; Support & Maintenance; Remote Administration; Education; Maintenance; GPU/Accelerators; Infiniband; Parallel Filesystems; Oil, Direct, Backdoor Cooling; Application Tuning; Hardware System Design; HPC, BigData & Cloud; Software Development; HPC software


Integrating Leading Brand Manufacturers

Benefit
• Trusted brands
• High-quality
• Proven capability
• Compatible components
• Reliable & Robust
• Active Warranty ..


ClusterVision Benchmark Applications include ..

Manufacturing/CAE
• Abaqus/Simulia (Explicit FEA)
• LS-DYNA (ESI) (Crash/Impact)
• MAGMASOFT (Casting)
• Nastran/Dytran/Marc (Explicit FEA)
• PAMCRASH (Crash/Impact)

Fluid Dynamics (CFD)
• Fluent/CFX (ANSYS)
• Star-CCM+ (CD-adapco)
• OpenFOAM/FOAMPro (SGI/Icon)
• NUMECA
• SWAN (Wave Modelling)
• ROMS (Ocean Modelling)

Chemistry
• CASTEP (Atomic Modelling)
• CHARMM (Molecular Mechanics)
• CPMD (Molecular Dynamics)
• GAUSSIAN (Molecular Electronics)
• GROMACS (Molecular Dynamics)
• MOLPRO (Quantum Chemistry)
• NAMD (Biomolecular Modelling)
• NWChem (Molecular Mechanics)
• Quantum ESPRESSO
• SCM ADF (Electro-Structure)
• TURBOMOLE (Quantum Chemistry)
• VASP (Molecular Dynamics)

Physics & Astronomy
• FLASH (Astrophysics)
• HEP-SPEC (CERN Benchmarking)
• Monte-Carlo Particle (Quantum Physics)

Mathematics
• MATLAB (Mathworks)
• MAPLE/MAPLESim (MapleSoft)
• Mathematica (Symbolic Calculation)
• Monte-Carlo Particle (Quantum Physics)

BioInformatics
• Amber 11 (Biochemistry)
• BLAST (DNA Sequencing)
• ClustalW (DNA Sequencing)
• COPIA (Pattern Detection/DNA)
• EMBOSS (Molecular Biology)
• FASTA (Protein Analysis)
• HMMER 3 (HMM Protein Sequencing)
• MrBayes (Phylogeny/Evolution)
• PatternHunter (Genome Analysis)
• PROSPECT (Protein Evaluation)
• RasMol (DNA Structure)

Climatology/Exploration
• ALADIN (European Weather Prediction)
• Hadley Centre Model (MetOffice)
• HIRLAM (European Weather Prediction)
• WRF (Weather Prediction)
• ECLIPSE (Reservoir Simulation)
• ROXAR (Oil&Gas Recovery)


Remote System Administration

Features

Professional service packages
• User Accounts & Support
• Environment Monitoring & Reporting
• Systems & Sub-systems
• Tools & Applications
• Audit & Security

Default: entry level base package
• In combination with Bright Health-Checking
• Offer to all large clusters by default
• Full end-to-end monitoring: cooling, PDU, UPS, servers, switches, Infiniband, MCEs, core switches, cables, services: Q-system, Bright, NFS, LMGRD etc.
• Know instantly when something is wrong
• Software updates
• Remote and onsite repair
• Notification and explanation for actions
• Management reporting


RSA Service Packages


ClusterVision Chassis design for HPC

• Project for German market / specific for Max Planck
• Open Compute Facebook: “Build one of the most efficient computing infrastructures at the lowest possible cost.”
• Optimised for HPC: Use repetition to cut cost where possible
• Easy to maintain, possible Customer repair
• Reusable, 5-8 year life-span, stay COTS where possible
• Half-width ATX standard boards: Intel, ASUS, SMC, Tyan, Flextronics, MSI etc.
• COTS Components (fans, power supplies, etc.)
• Low power, big case fans, keeping air pressure on node failure
• Independent fan, power and temperature monitoring, Pi?
• Front cables = high room temperature
• Water cooled rack door, direct CPU/GPU cooling, oil cooling?
• When is the saving worth it? >10% purchasing, >10% power, >15% next upgrade?


Cooling options for Chassis

Water Cooled Doors
• UK manufactured by Usystems
• 42U/48U, up to 45 kW per rack
• 14°C, 21°C and new 24°C input temperatures
• Almost free-air cooling
• Remote monitoring, twice a year onsite checks

ASETEK: Direct CPU/GPU Water cooling
• Cooling direct to the CPU/GPU and heat-exchange in chassis
• Can be fitted to existing racks
• Funded & produced in Denmark
• Chassis fans reduced to minimum, 5 W CPU/GPU pumps
• Full closed enclosure, full self cooling. No CRAC-units required
• Water input ~40°C, full free-air cooling 24/7 365 days
• COTS: 40,000 units p/m produced, originally desktop product

Green Revolution Cooling: Oil Cooling
• No chassis, just blades
• No fans
• Just Oil pumps
• Water input ~40°C, full free-air cooling 24/7 365 days


ClusterVision Application Analytics

Developing a Queuing User Analytics system based on Slurm.

• Estimate: 30%-50% of cluster resources are under-utilised because of bad codes
• Define BAD: wrong compiler/MPI/math library usage, faulty multi-core/node scaling, bad programming, bad cluster configuration
• Goal: automatic, non-intrusive profiling of all applications
• Using information gathered during the application run: CPU performance counters, memory usage, Infiniband, GPU efficiency, I/O usage, power consumption
• Generic applications & applications with known profile
  o Existing: Linpack on Intel SandyBridge > 90% efficient
  o Generic: monitor balances between Infiniband, FPU usage, memory etc.
• Inform User at end of the run, inform Administrator of badly running codes
• Research engineering project! A lot to learn ….

AIM: increase overall true cluster utilisation by 20%
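The Application Analytics slide describes comparing a job's counters against a known profile (Linpack at more than 90% efficiency) or a generic expectation. A minimal Python sketch of that comparison follows. It is an illustration only, not ClusterVision's implementation: the `JobSample` fields, the function names and the 5% generic floor are all hypothetical, and the counter values are assumed to have been collected elsewhere.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class JobSample:
    """Counters for one finished job (hypothetical fields, for illustration)."""
    job_id: str
    achieved_gflops: float  # assumed to come from CPU performance counters
    peak_gflops: float      # theoretical peak of the cores allocated to the job
    expected_efficiency: Optional[float] = None  # known profile, e.g. 0.90 for Linpack


def efficiency(job: JobSample) -> float:
    """Fraction of the theoretical peak that the job actually achieved."""
    if job.peak_gflops <= 0:
        raise ValueError("peak_gflops must be positive")
    return job.achieved_gflops / job.peak_gflops


def is_badly_running(job: JobSample, generic_floor: float = 0.05) -> bool:
    """Flag a job that misses its known profile, or the generic floor otherwise.

    The 5% generic floor is an arbitrary assumption, not a figure from the slides.
    """
    if job.expected_efficiency is not None:
        target = job.expected_efficiency
    else:
        target = generic_floor
    return efficiency(job) < target


def report(jobs: List[JobSample]) -> List[Tuple[str, float, bool]]:
    """Return (job_id, efficiency, flagged) rows for an administrator summary."""
    return [(j.job_id, round(efficiency(j), 3), is_badly_running(j)) for j in jobs]


if __name__ == "__main__":
    jobs = [
        JobSample("linpack", 300.0, 332.8, expected_efficiency=0.90),
        JobSample("slow-code", 2.0, 332.8),
    ]
    for row in report(jobs):
        print(row)
```

A real system would weigh several signals (Infiniband traffic, memory bandwidth, I/O, power) rather than floating-point efficiency alone. The single ratio is used here only to keep the sketch short.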


Bright Cluster Manager

Advanced HPC cluster management made easy


Bright Cluster Manager

Feature comparison: Bright, Rocks(+), PCM, xCAT
• Integrated Cloud Bursting
• Cluster health checking
• Automatic OS Failover
• Monitoring & Actions
• Tight Node-Switch integration
• vSMP auto-configuration
• CLI to configure all & Command line
• Yum updates: Compilers, MPI, Bright etc.

Come to our Stand for Video Demo!


ClusterVision Service Portal

On-line Community Resource Library

• Personal User Login
• Whitepapers Library
• Troubleshooting help
• Digitalized site survey
• Open support calls
• Code examples
• Application setup
• Common Procedures

www.clustervision.com



Remote System Administration

Benefits

• Efficient use of in-house resources
• Cost-effective service solution
• Remote access to expertise
• Rapid response service
• Professional quality
• Accountable
• Secure & Confidential
• Comprehensive coverage
• Scalable: Service Packages/Credits
• Enhances existing services
• Single point of contact …


Delivering Skills & Resources for Maximised ROI

• Capability Assessment
• New & Upscale Design
• Detailed Specification
• On-site Assembly
• Configuration
• Application Tuning
• QA & Burn-in Testing
• HPL Benchmarking
• Certification
• Remote Administration
• User Support
• SA & User Training
• Maintenance & Repair ..


Pre-Delivery Professional Services

• Capability Assessment
• New & Upscale Design
• Detailed Specification
• Proof of Concept

Benefit
• Impartial, informed review of customer history, requirements & constraints
• Benefit from ClusterVision’s knowledge & connections to maximise ROI
• Independent selection from best-in-class technology, compatibility assured
• Open access to existing reference installations
• Advise/ensure optimised performance of systems and user applications ..


Point-of-Delivery Professional Services

• Professional ISO Quality Assembly
• Cable Management
• Provisioning & Configuration
• Burn-in Testing
• Industry Standard Certification
• HPL Benchmarking

Benefit
• Reduced time/effort for SAs, specialist skills/resources required
• Consistent, quality installation, collaborative customisation to site needs
• Quality assurance, diagnostics & consistency in acceptance process
• Identifies non-compliance, ensures compatibility, future-proofing
• Establishes/documents performance, allows diagnostics & tuning ..


Energy Efficient HPC Cluster at the Technical University Ilmenau, Germany

TU Ilmenau, Germany
• 49 Dell PowerEdge R815 servers
• AMD Opteron™ 6134 Processors
• 192 terabytes storage capacity
• Reduced Consumption by 10-15%

“Together, Dell and ClusterVision offered the specialised expertise we needed, and were able to provide a customised, detailed proposal.” Hennig Schwanbeck, IT Manager of Datacentre Administration


38.8 Tflops HPC Cluster at the University of Bordeaux, France

University of Bordeaux
• 528 Intel Xeon X5675 Processors
• 3168 cores
• Dell PowerEdge C6100 servers
• QLogic QDR Infiniband

“This new cluster is a huge step in the long story of supercomputers in Bordeaux. We have made a powerful system for the whole scientific community and small and medium enterprises of Aquitaine.” Jean-Christophe Soetens, Scientific Management of the MCIA.
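The Bordeaux figures can be cross-checked with the usual theoretical-peak formula: cores × clock × floating-point operations per cycle. The Python sketch below does that arithmetic. The processor and core counts come from the slide. The 3.06 GHz clock and 4 double-precision FLOPs per cycle per core are assumptions about the Xeon X5675 that the slide does not state, and `peak_tflops` is a hypothetical helper name.

```python
def peak_tflops(sockets: int, cores_per_socket: int,
                clock_ghz: float, flops_per_cycle: int) -> float:
    """Theoretical peak in TFLOPS: cores * clock (GHz) * FLOPs per cycle / 1000."""
    cores = sockets * cores_per_socket
    return cores * clock_ghz * flops_per_cycle / 1000.0


# From the slide: 528 processors and 3168 cores, i.e. 6 cores per processor.
SOCKETS = 528
CORES_PER_SOCKET = 3168 // SOCKETS

# Assumptions (not on the slide): 3.06 GHz clock, 4 DP FLOPs per cycle per core.
bordeaux = peak_tflops(SOCKETS, CORES_PER_SOCKET, clock_ghz=3.06, flops_per_cycle=4)

if __name__ == "__main__":
    print(f"{bordeaux:.1f} TFLOPS theoretical peak")
```

Under those assumptions the result rounds to the 38.8 Tflops in the slide title, which suggests the headline number is a theoretical peak rather than a measured HPL result.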


Dortmund LIDO Cluster, Driving Research at the Virtual Numerics Laboratory

University of Dortmund
• 420 AMD Opteron Processors
• 1.3 TBytes memory
• 26 TBytes storage
• 1 TFLOPS peak performance

“We decided on ClusterVision because of their excellent reputation, and the outstanding price/performance ratio. ClusterVision also best understood and fulfilled our requirements”, Jorg Gehrke, Division Leader Server & HPC


Top500 Cluster at Ghent University is the Fastest Academic Facility in Belgium

University of Ghent, Belgium
• 196 IBM Blade Servers
• 60 TBytes Disk
• 20GB/s Infiniband Network
• 15.7 TFLOPS peak performance

“ClusterVision have successfully delivered the fastest academic Supercomputer in Belgium” Danny Schellemans, Director of ICT Ghent University


Germany's Fastest & Europe’s Most Efficient Commodity Supercomputer

Goethe University Frankfurt
• 530 Dual core AMD Opteron
• Myrinet/Ethernet
• 4.2 TFLOPS performance

“The new dual-core installation from ClusterVision is an excellent match to our requirements” Prof. Stefan Schramm, Director of Centre for Scientific Computing


Driving Research in Manufacturing Technology at the University of Cambridge

University of Cambridge, UK
• 1152 Intel Xeon Processors
• 576 Dell PowerEdge Servers
• 60 TByte Storage
• QLogic Infiniband Network
• 27 TFLOP performance

“ClusterVision’s role has been key in rapidly turning the Dell-supplied hardware into a usable and manageable cluster, ready for Top500 benchmarks” Dr. Paul Calleja, Director of High Performance Computing


“Blue Crystal” Predicting Global Climate Change at the University of Bristol

University of Bristol, UK
• 3360 Intel Xeon cores
• 420 IBM x3450 servers
• QLogic Infiniband network
• 200 TBytes storage
• 40 TFLOPS peak performance

“ClusterVision’s flexibility and efficient deployment allowed us to have a far more capable service within our budget and schedule” Dr. Ian Stewart, Director Advanced Computing Research Centre


Human Genome Discovery with LEGION at UCL, London’s Global University

University College London, UK
• 2500 Processor Cores
• Infiniband Network
• 24 TFLOP performance

“ClusterVision’s professional on-site engineering team completed the installation of a highly distributed HPC system in a challenging environment” Clare Gryce, Research Computing Manager UCL