Red Hat Storage Day - When the Ceph Hits the Fan

Dr. Wolfgang Schulze, Director, Global Storage Consulting Practice, Red Hat
October 20, 2016



CAN THE CEPH EVEN HIT THE FAN?

•  After all…
    •  Architecture has no single point of failure
    •  Code base is very solid and had many years to mature
    •  Designed from the ground up to accommodate failures
    •  Supposed to be self-healing and self-managing
    •  It simplifies day-to-day data center operations


WHAT IS “HITTING THE FAN”, ANYWAYS?

•  Example scenarios:
    •  Heavy storm takes out data center, cluster fails to restart automatically
    •  Increased workload makes cluster unstable
    •  Performance is fine when cluster is empty to moderately filled, but when getting close to physical capacity, write performance drops
    •  Nearly full cluster has become unresponsive and denies writes
    •  Bulk deletion of objects takes so long that the client application times out
    •  Rebalancing after a partial electric outage impacts clients with slow/blocked requests
•  Result in each case: customer files
    •  Sev 1: Production is down
    •  Sev 2: Production is impacted


TICKET QUEUE IN RED HAT SUPPORT

Real screenshot, dated 2016-10-19
Customer names removed
Many of these tickets could have been avoided if best practices had been followed


A SAD, BUT TRUE STORY

•  Customer bought Red Hat Ceph Storage subscriptions
•  They were sure they had enough experience on their team and specifically declined offers for training and consulting
•  They designed and deployed Ceph cluster without guidance
•  Originally for feasibility study, but everything seemed to work fine, so they put it into production
•  Nobody noticed that the journal size was configured to only 100 MB instead of best practice size of 5 GB
•  A couple of months later, after a power failure, the Ceph cluster failed to recover
•  Support ticket went on for several weeks, at the end some permanent data loss
•  End result: Partial data loss, unhappy management, unhappy customers
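The journal-size mistake in this story is the kind of setting a simple pre-production check can catch. A minimal sketch in Python, assuming a FileStore-era `ceph.conf` in plain INI form where `osd journal size` is given in megabytes. The function name and the sample text are hypothetical, and the 5 GB threshold is the best-practice figure quoted in the story, not a universal rule:

```python
import configparser

# Assumption: 5 GB, the best-practice journal size quoted in the story.
RECOMMENDED_JOURNAL_MB = 5 * 1024


def journal_size_warnings(conf_text, minimum_mb=RECOMMENDED_JOURNAL_MB):
    """Return one warning per section whose 'osd journal size' is below minimum_mb."""
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(conf_text)
    warnings = []
    for section in parser.sections():
        for key, value in parser.items(section):
            # ceph.conf accepts either spaces or underscores in option names.
            if key.replace("_", " ").strip() == "osd journal size":
                size_mb = int(value.split()[0])
                if size_mb < minimum_mb:
                    warnings.append(
                        f"[{section}] osd journal size = {size_mb} MB, "
                        f"below {minimum_mb} MB"
                    )
    return warnings


if __name__ == "__main__":
    # Hypothetical configuration resembling the one in the story.
    sample = "[global]\nfsid = example\n\n[osd]\nosd journal size = 100\n"
    for warning in journal_size_warnings(sample):
        print(warning)
```

A check like this only covers one setting; it is meant to illustrate how cheap it is to verify a value before a feasibility cluster goes into production.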


SOME COMMON MISCONCEPTIONS

•  The new tools make Ceph easy to set up
•  You don’t need detailed planning or architecture design
•  Ceph works on any hardware, and you can mix & match hardware
•  Storage infrastructure people will know how to handle the product
•  Server people will know how to handle the product
•  Ceph community bits are just fine (“We use a stable release”)
•  Using community bits is more “cutting edge”


COMMON TROUBLE #1 UPSTREAM BITS FOR PRODUCTION SYSTEMS

Observation
•  User is running upstream bits
•  This happens even with users who are paying for a Red Hat Support subscription
•  People misinterpret the phrase “stable release” in community release notes

Problem
•  Red Hat Support won’t be able to help
•  Red Hat only supports long term stable releases
•  What could be a safe and fully documented upgrade to a newer LTS version suddenly becomes a “migration” with risks and pitfalls

Mitigation
•  Use supported bits, stay informed about roadmap, get involved


COMMON TROUBLE #2 USE OF UNSUPPORTED FEATURES

Observation
•  User deploys system into production using features which are not (yet) supported
•  Examples: CephFS, BlueStore

Problem
•  Red Hat Support won’t be able to help
    •  Unless you have a support exception, the conversation may end quickly
    •  Red Hat Engineering will not build hotfixes for you

Mitigation
•  Try to get a support exception from Red Hat
•  Don’t use the feature


COMMON TROUBLE #3 USE OF UNSUPPORTED CONFIGURATIONS

Observation
•  Users deploy Ceph in a way that is not approved and has not been tested
•  Examples:
    •  Running Ceph on unsupported operating system versions (e.g. Gentoo, Debian)
    •  Deploying

Problem
•  Red Hat Support won’t be able to help
    •  Unless you have a support exception, the conversation may end quickly
    •  Red Hat Engineering will not build hotfixes for you

Mitigation
•  Read documentation, consider health check before go-live


COMMON TROUBLE #4 POORLY MANAGED CLUSTER GROWTH

Observation
•  Adding disks (or even entire nodes) to clusters of relatively small total capacity
•  Backfill/recovery starves client I/O

Problem
•  In older versions of Ceph, default configuration values are not ideal for this (osd_max_backfills, osd_recovery_max_active, osd_recovery_op_priority)
•  If you fail to adjust these before you change the physical configuration, you will indeed have huge impact

Mitigation
•  Know your stuff, think ahead, estimate impact, gradually weigh in
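The two mitigation ideas on this slide, throttling backfill/recovery and weighing new OSDs in gradually, can be sketched as command builders. This is a minimal illustration in Python, not a recommended procedure: the functions only build argument lists for the `ceph tell ... injectargs` and `ceph osd crush reweight` commands and never run them, the function names are hypothetical, and the values (1 for each throttle, five weight steps) are illustrative assumptions to be tuned per cluster.

```python
def throttle_recovery_command(max_backfills=1, recovery_max_active=1,
                              recovery_op_priority=1):
    """Build (not run) a command that lowers backfill/recovery pressure on all OSDs."""
    args = (
        f"--osd-max-backfills {max_backfills} "
        f"--osd-recovery-max-active {recovery_max_active} "
        f"--osd-recovery-op-priority {recovery_op_priority}"
    )
    return ["ceph", "tell", "osd.*", "injectargs", args]


def weigh_in_steps(osd_id, target_weight, steps=5):
    """Build (not run) reweight commands that raise a new OSD's CRUSH weight in stages."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    commands = []
    for i in range(1, steps + 1):
        weight = round(target_weight * i / steps, 4)
        commands.append(
            ["ceph", "osd", "crush", "reweight", f"osd.{osd_id}", str(weight)]
        )
    return commands


if __name__ == "__main__":
    print(" ".join(throttle_recovery_command()))
    # Hypothetical OSD id 12 with a final CRUSH weight of 1.0.
    for command in weigh_in_steps(12, 1.0, steps=4):
        print(" ".join(command))
```

In practice an operator would apply the throttle first, then issue one reweight step at a time and wait for the cluster to settle before the next one.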


COMMON TROUBLE #5 POOR SKILLS AND OPERATIONAL PRACTICES

Observations
•  Subject matter experts who brought Ceph to the organization were hired guns, or employees who have since left
•  Team that ends up managing cluster considers it some sort of black art

Problem
•  Operators who don’t know what they are doing put your data at risk
•  The built-in safety/durability may be compromised

Mitigation
•  Make sure users receive proper training, and avoid staff SPOF
•  Conduct controlled emergency drills to practice for outages
•  Maintain separate cluster with same version for experiments and dry run, or learn how to do it with a cloud based environment


COMMON TROUBLE #6 RISKY CONFIGURATION CHOICES

Observations
•  Users read somewhere that mounting XFS OSDs with the ‘nobarrier’ option will result in performance gains

Problem
•  While the performance gets noticeably better, you are introducing a risk for data corruption during power outages
•  The built-in safety/durability may be compromised

Mitigation
•  Do not use ‘nobarrier’ mount option unless you understand fully what hardware you have, and unless you know what you are doing
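Whether any OSD filesystem is mounted with ‘nobarrier’ can be audited from the mount table. A minimal sketch in Python, assuming the Linux `/proc/mounts` layout (device, mount point, filesystem type, comma-separated options); the function name and the sample paths are hypothetical:

```python
def xfs_mounts_with_nobarrier(mounts_text):
    """Return mount points of XFS filesystems mounted with the 'nobarrier' option."""
    flagged = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _device, mount_point, fs_type, options = fields[:4]
        if fs_type == "xfs" and "nobarrier" in options.split(","):
            flagged.append(mount_point)
    return flagged


if __name__ == "__main__":
    # On a real Linux host the input would come from /proc/mounts.
    with open("/proc/mounts") as handle:
        for mount_point in xfs_mounts_with_nobarrier(handle.read()):
            print(f"nobarrier in use on {mount_point}")
```

The check reports only what is mounted right now; entries in `/etc/fstab` or deployment tooling would need a separate review.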


COMMON TROUBLE #7 POOR NETWORK CONFIGURATION

Observations
•  Users don’t pay enough attention to network configuration
•  Network inconsistencies (e.g. Jumbo Frames) and bottlenecks go undetected … until Ceph performs poorly.

Problem
•  Troubleshooting networking issues is difficult and experts hard to find
•  Ceph heavily relies on proper configuration

Mitigation
•  Invest in your team and network maintenance skills
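Jumbo-frame inconsistencies of the kind mentioned on this slide can be looked for in two ways: comparing the configured MTU across hosts, and sending a full-size packet that must not be fragmented. A minimal sketch in Python, assuming IPv4 and the Linux iputils `ping` (whose `-M do` option forbids fragmentation); the function names, host names, and the 9000-byte MTU are illustrative assumptions, and the commands are built but not executed:

```python
from collections import Counter

IPV4_HEADER_BYTES = 20
ICMP_HEADER_BYTES = 8


def jumbo_ping_command(host, mtu=9000, count=3):
    """Build (not run) a ping that only succeeds if the path carries full-MTU frames."""
    payload = mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES
    return ["ping", "-M", "do", "-s", str(payload), "-c", str(count), host]


def mtu_mismatches(mtu_by_host):
    """Return the hosts whose MTU differs from the most common MTU in the mapping."""
    if not mtu_by_host:
        return {}
    expected, _count = Counter(mtu_by_host.values()).most_common(1)[0]
    return {host: mtu for host, mtu in mtu_by_host.items() if mtu != expected}


if __name__ == "__main__":
    # Hypothetical inventory: one node was left at the default MTU.
    inventory = {"osd-node-1": 9000, "osd-node-2": 9000, "osd-node-3": 1500}
    print(mtu_mismatches(inventory))
    print(" ".join(jumbo_ping_command("osd-node-3")))
```

Matching host MTUs do not prove the switches in between pass jumbo frames, which is why the end-to-end ping is worth running between every pair of nodes.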


WHAT TO DO WHEN THINGS WENT WRONG

1.  Stay calm and don’t make it worse!
    •  Poorly skilled operators may turn a problem into a catastrophe
2.  Contact Red Hat Support immediately
    •  Sev 1 and Sev 2 issues are handled with top priority
    •  Chances are that they will be able to help right away and get your cluster humming again
3.  Contact your trusted Red Hat Services or Sales contacts
    •  If problems persist or you feel you need extra help, you might want to get a Ceph expert from Red Hat Professional Services


GOOD PRACTICES TO AVOID PROBLEMS

1.  Don’t stumble into implementation/deployment without careful planning
    •  Capture and document requirements, do a POC, do an actual design
    •  Engage experts early to help with cluster design and hardware choices
2.  Unless you love to take risks, use supported bits
3.  Stay close to the recommended reference architectures from Red Hat partners
4.  Make sure your staff receives proper training
    •  Red Hat Global Learning provides excellent training for Gluster and Ceph
5.  Plan for growth
6.  Don’t let things linger. Ceph does not like it when the cluster is 90% full
7.  Have an expert perform regular Storage Health Checks to detect problems while they are still small
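Points 5 and 6 can be turned into simple arithmetic. A minimal sketch in Python, assuming the commonly documented Ceph defaults of 0.85 for the near-full ratio and 0.95 for the full ratio (check the values configured on your own cluster) and equally sized nodes; the function names and sample numbers are hypothetical:

```python
def fullness_state(used_bytes, total_bytes, nearfull_ratio=0.85, full_ratio=0.95):
    """Classify raw utilization against the near-full and full thresholds."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    ratio = used_bytes / total_bytes
    if ratio >= full_ratio:
        return "full"
    if ratio >= nearfull_ratio:
        return "nearfull"
    return "ok"


def ratio_after_node_loss(used_bytes, total_bytes, nodes):
    """Utilization once one of `nodes` equally sized nodes is lost and its data re-replicated."""
    if nodes < 2:
        raise ValueError("need at least two nodes to survive losing one")
    remaining_capacity = total_bytes * (nodes - 1) / nodes
    return used_bytes / remaining_capacity


if __name__ == "__main__":
    # Hypothetical cluster: 5 nodes, 100 TB raw, 70 TB used.
    tb = 10 ** 12
    print(fullness_state(70 * tb, 100 * tb))
    print(ratio_after_node_loss(70 * tb, 100 * tb, nodes=5))
```

The second function shows why a cluster that looks comfortable at 70% can cross the near-full threshold after a single node failure, which is the practical argument for planning growth early.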


STORAGE DESIGN CONSULTING

•  Specialists from Red Hat Consulting will help plan your Ceph deployment
•  Start: Storage Discovery Session
•  We can help discover requirements and design a storage solution that matches
•  You will receive a detailed Storage Solution architecture document which will articulate design choices and lay out a step-by-step plan for implementation


STORAGE HEALTH CHECKS

•  Standard 3-day engagement done by Red Hat storage experts
•  Comprehensive top-to-bottom analysis of your software-defined storage platform
•  Six focus areas
    1.  Lifecycle
    2.  Configuration
    3.  Organization
    4.  Use Case
    5.  Hardware
    6.  Operational
•  Clear read-out of issues
•  Actionable recommendations


POSITIVE NOTE

•  I asked my consultants for feedback on this presentation. Here is one comment


WHERE TO GO NEXT

RED HAT SUBSCRIPTIONS https://access.redhat.com/subscription-value
Evaluation, Pre-production, and Production subscriptions available

CONSULTING http://www.redhat.com/en/services/consulting/storage

TRAINING https://www.redhat.com/en/services/training

TEST DRIVE http://red.ht/cephtestdrive

To engage a Territory Service Manager in your area, ask for a local Red Hat Storage sales professional at: NORTH AMERICA: 1 (888) REDHAT-1; LATIN AMERICA: 54 (11) 4329-7300; EMEA: 00800 7334 2835; APJ: 65 6490 4200; Brazil: 55 (11) 3529-6000; Australia: 1800 733 428; New Zealand: 0800 733 428