Ceph Day Berlin: Scaling an Academic Cloud


Transcript of Ceph Day Berlin: Scaling an Academic Cloud

Page 1: Ceph Day Berlin: Scaling an Academic Cloud

Scaling an Academic Cloud with Ceph

28.04.2015 | Berlin, Germany

Ceph Day Berlin

Christian Spindeldreher Enterprise Technologist Dell EMEA

Page 2: Ceph Day Berlin: Scaling an Academic Cloud

The Cloud


Page 3: Ceph Day Berlin: Scaling an Academic Cloud

The Software-Defined Datacenter


Page 4: Ceph Day Berlin: Scaling an Academic Cloud

Defining “software-defined”

The capabilities

• Compute

• Storage/availability

• Networking/security & management

The benefits

• Automated & simplified

• Unlimited agility

• Maximum efficiency

The basics (diagram): the software-defined building blocks SDC, SDS, SDN and SDE. A traditional system couples the data plane and control plane in purpose-built hardware & software; in the software-defined model the data plane runs on general-purpose hardware, the control plane is driven through open standards (e.g., OpenFlow), and purpose-built functions are virtualized on general-purpose hardware and delivered as a service (the next-gen compute block).

Page 5: Ceph Day Berlin: Scaling an Academic Cloud


The Cloud Operating System – Manage the Resources…

Page 6: Ceph Day Berlin: Scaling an Academic Cloud


Ceph and OpenStack

Page 7: Ceph Day Berlin: Scaling an Academic Cloud

Ceph in Academia & Research


Page 8: Ceph Day Berlin: Scaling an Academic Cloud

CLIMB project


picture from http://westcampus.yale.edu

• Collaboration between 4 universities: Birmingham, Cardiff, Swansea & Warwick

• Ceph environment across the 4 sites
– part of an HPC cloud to deploy virtual resources for microbial bioinformatics (e.g. DNA sequencer output, …)
– shared data across the sites
– robust solution with a low €/TB ratio for mid/long-term storage
– Ceph solution by OCF, Inktank* & Dell (* now Red Hat)
– more information: http://www.climb.ac.uk

Page 9: Ceph Day Berlin: Scaling an Academic Cloud

CLIMB project

• 4 Ceph clusters
– 6.9 PB raw capacity (total)
– 3 replicas, at least 1 remote: 2.3 PB usable capacity (see the arithmetic below)
– server infrastructure (per site)
› 5 MON nodes, 2 gateway nodes – R420, 4x 10GbE
› 27 OSD nodes – R730xd, 16x 4 TB HDD, 2 SSDs, 2x 10GbE
– network infrastructure
› Brocade VDX6740T switches – 48x 10GbE, 4x 40GbE
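The usable figure follows directly from the replication factor. A minimal sketch of the arithmetic in Python, assuming the 3-way replication stated above and ignoring journal, filesystem and free-space overhead:

    # Usable capacity under N-way replication (journal, filesystem and
    # free-space overhead are ignored in this rough estimate).
    raw_capacity_pb = 6.9   # total raw capacity across the 4 CLIMB clusters
    replica_count = 3       # 3 replicas, at least 1 on a remote site

    usable_pb = raw_capacity_pb / replica_count
    print("usable capacity: %.1f PB" % usable_pb)   # -> 2.3 PB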


Page 10: Ceph Day Berlin: Scaling an Academic Cloud

S3IT − Central IT, University of Zurich (UZH)

• UZH – some interesting facts
– 26,000 enrolled students – Switzerland's largest university
– member of the League of European Research Universities (LERU)
– internationally renowned in medicine, immunology, genetics, neuroscience, structural biology, economics, …
› 12 UZH scholars have been awarded the Nobel Prize

• Scale-out storage for a scientific cloud (based on OpenStack)
– based on Ceph
– commodity components
– Ethernet network
– good balance between performance, capacity & cost


picture: http://www.hausarztmedizin.uzh.ch/index.html

Page 11: Ceph Day Berlin: Scaling an Academic Cloud

S3IT − Central IT, University of Zurich (UZH)

• Requirements for the high-capacity tier
– 4.2 PB raw capacity (1st batch)
› cinder volumes, glance images, ephemeral disks of VMs, radosgw (S3-like object storage; see the example below)
› replication, erasure coding & cache tiering
– R630 + 2x MD1400 JBOD
› 24x 4 TB NL-SAS
› 6x 800 GB SSD (in the R630)

• Requirements for the high-performance tier
– 112 TB raw capacity (1st batch)
› block access
› SSD pool, replicated
– R630
› 8x 1.6 TB SSD

• Network
– scale-out 40GbE backbone: 2x Z9500 (132x 40GbE in 3 RU)
– ToR: S4810 (48x 10GbE, 4x 40GbE)
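Because the capacity tier exposes S3-like object storage through radosgw, researchers can talk to it with any S3 client. A minimal sketch using the boto (v2) library; the endpoint host, credentials and bucket name are placeholders, not values from this deployment:

    import boto
    import boto.s3.connection

    # Placeholder endpoint and credentials for a radosgw instance.
    conn = boto.connect_s3(
        aws_access_key_id='ACCESS_KEY',
        aws_secret_access_key='SECRET_KEY',
        host='radosgw.example.org',
        is_secure=False,
        calling_format=boto.s3.connection.OrdinaryCallingFormat(),
    )

    # Create a bucket and upload a small object through the S3 API.
    bucket = conn.create_bucket('demo-bucket')
    key = bucket.new_key('hello.txt')
    key.set_contents_from_string('hello from radosgw')

    # List all buckets owned by these credentials.
    for b in conn.get_all_buckets():
        print("%s\t%s" % (b.name, b.creation_date))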


Page 12: Ceph Day Berlin: Scaling an Academic Cloud

Requirements in Academia, Science & Research today – What we see…

• Ceph stand-alone vs. OpenStack-related

• Large-scale environments
– 5 PB / 20 PB / 100 PB target capacity
– usually object storage

• Multi-site environments
– cross-site replication
– unified object space
– searchable metadata
› out of scope for Ceph?!


Page 13: Ceph Day Berlin: Scaling an Academic Cloud

Design Considerations


Page 14: Ceph Day Berlin: Scaling an Academic Cloud

Infrastructure Considerations – Storage Nodes

• Form factors
– small nodes vs. big nodes vs. super-nodes
– node count (see the sketch below)
– Ethernet-based drives

• Use of SSDs
– journaling
– cache tiering
– SSD-only pools
– check new SSD types
› PCIe, form factors (1.8" size), write endurance, …
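Node count matters because each node is a failure domain: the bigger the node, the larger the share of the cluster that must be re-replicated when it fails. A minimal sketch of that trade-off; the 27x 64 TB case mirrors the CLIMB per-site layout (27 OSD nodes with 16x 4 TB), the 9x 192 TB case is a hypothetical "big node" alternative:

    # Fraction of cluster data to re-replicate when one node is lost.
    def rebuild_share(node_count, node_raw_tb):
        cluster_tb = node_count * node_raw_tb
        return node_raw_tb / float(cluster_tb)   # equals 1.0 / node_count

    # Many small nodes vs. few big nodes for the same ~1.7 PB raw capacity.
    for nodes, tb in [(27, 64), (9, 192)]:
        print("%2d nodes x %3d TB -> %.1f%% of the cluster to rebuild on one node failure"
              % (nodes, tb, 100.0 * rebuild_share(nodes, tb)))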


Page 15: Ceph Day Berlin: Scaling an Academic Cloud

Infrastructure Considerations – Storage Node Example

• Storage node: R730xd
– 2 RU
– 1 or 2 CPUs
– local drives
› 16x 3.5" HDD slots (+ 2x 2.5" for boot) – up to 6 TB per drive today (96 TB total)
› 24x 2.5" HDD slots (+ 2x 2.5" for boot)
› 8x 3.5" HDD slots + 18x 1.8" SSDs (+ 2x 2.5" for boot)
– highly flexible system
– JBOD expansion optional


Page 16: Ceph Day Berlin: Scaling an Academic Cloud

Infrastructure Considerations – Storage Node Example

• Head node: R630
– 1 RU
– 1 or 2 CPUs
– local drives
› 10x 2.5" HDD slots or
› 24x 1.8" SSDs
› could host write journaling, cache tiering or SSD-only pools (then without a JBOD)

• JBOD: MD3060e
– 4 RU
– SAS attach to the head node
– 60x 3.5" HDD slots
› up to 6 TB per drive today (360 TB total)

• VoC (example): "Write journal on SSD has no real impact with 60 HDDs" (see the sketch below)
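A rough throughput comparison makes the quoted observation plausible. The drive and SSD figures below are illustrative assumptions, not measurements from this configuration:

    # Illustrative, assumed throughput figures (not measurements).
    hdd_count = 60      # drives behind one MD3060e
    hdd_mb_s = 150      # assumed sequential throughput per HDD
    ssd_count = 2       # assumed number of journal SSDs in the head node
    ssd_mb_s = 400      # assumed sustained write throughput per SATA SSD

    hdd_aggregate = hdd_count * hdd_mb_s   # ~9000 MB/s of raw spindle bandwidth
    ssd_aggregate = ssd_count * ssd_mb_s   # ~800 MB/s of journal bandwidth

    # Every FileStore write hits the journal first, so a small SSD journal in
    # front of 60 spindles caps the node well below what the HDDs could stream.
    print("HDD aggregate: %d MB/s, SSD journal aggregate: %d MB/s"
          % (hdd_aggregate, ssd_aggregate))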


Page 17: Ceph Day Berlin: Scaling an Academic Cloud

Infrastructure Considerations – Network

• Client-facing vs. Cluster-internal IO – be aware of replication traffic

• ToR – 1x or 2x 10GbE Switch

› failure domain?!

– 40GbE Uplinks

• Distributed Core – Scale-Out Core-Switch Design – 40/50/100GbE Mesh – Virtual Link Trunking (VLT) for HA/Load-

Balancing
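A quick way to size the cluster network relative to the client-facing network; a minimal sketch assuming a 3-way replicated pool (the ingest figure is a hypothetical example):

    # Back-of-the-envelope replication traffic per client write.
    client_write_mb_s = 1000   # hypothetical client ingest on the public network
    replicas = 3               # replicated pool, size = 3

    # The primary OSD forwards each write to (replicas - 1) peers over the
    # cluster network, so backend replication traffic is roughly:
    cluster_replication_mb_s = client_write_mb_s * (replicas - 1)

    print("public network:  %d MB/s" % client_write_mb_s)
    print("cluster network: %d MB/s (replication)" % cluster_replication_mb_s)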


Page 18: Ceph Day Berlin: Scaling an Academic Cloud

Infrastructure Considerations – the Site/DC…

• Power & cooling
– high density has some impact
– example for 1 rack (42 RU), 8x R630 & MD3060e building blocks (see the sketch below):
› input power: 21 kW
› weight: ~1000 kg
› raw capacity: 2.9 PB

• Fresh Air technology
– use higher air temperature for cooling
– 25°C vs. 30°C vs. 40°C

Pictures: "High Density: TACC Stampede Cluster" and "Dell Fresh Air Hot House, Round Rock TX"
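The rack-level raw capacity follows from the building block. A minimal sketch of the arithmetic, assuming fully populated MD3060e enclosures with 6 TB drives as stated above:

    # Rack density for 8x (R630 + MD3060e) building blocks, 6 TB drives assumed.
    building_blocks = 8
    drives_per_md3060e = 60
    drive_tb = 6

    raw_tb = building_blocks * drives_per_md3060e * drive_tb
    print("raw capacity per rack: %d TB (~%.1f PB)" % (raw_tb, raw_tb / 1000.0))  # ~2.9 PB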

Page 19: Ceph Day Berlin: Scaling an Academic Cloud


Dell|Inktank (now Red Hat) Ceph Reference Architecture

HW + SW + Services

Hardware
• HW reference architecture: R730xd servers (storage and compute), Dell S/Z-Series switches
• Configuration: minimum of 6 nodes – 3x MON + 3x data

Software
• Software: Inktank ICE platform, optional OpenStack cloud software
• Operating system: RHEL; also SUSE, Ubuntu, …
• Access: object & block (today)

Services
• Deployment: onsite HW install, onsite SW install, whiteboard session & training
• Support: HW – Dell ProSupport; SW – OpenStack support

Solution based on (e.g.):
• Server nodes: R730xd, … – fully populated drives
• Dell F10 10/40GbE switches
• Modules are flexible

Page 20: Ceph Day Berlin: Scaling an Academic Cloud

Dell Solution Centers

• 30-90 minute briefings

• 1-4 hour Design Workshops

• 5-10 day Proofs-of-Concept for hands-on “prove-it”


Page 21: Ceph Day Berlin: Scaling an Academic Cloud

Thank You!

[email protected]