Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad Carlile

41
Copyright © 2016, Oracle and/or its affiliates. All rights reserved. | Spark SQL: Another 16x faster aFer Tungsten SPARC processor has drama1c advantages over x86 on Apache Spark Brad Carlile Sr. Director SoluKon Architecture Engineering SAE Room 311, Wednesday, Feb 8, 2:40 pm–2:55 pm Feburary 7, 2017 Addi$onal info on Proof points: hXp://blogs.oracle.com/bestperf

Transcript of Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad Carlile

Copyright©2016,Oracleand/oritsaffiliates.Allrightsreserved.|

SparkSQL:Another16xfasteraFerTungstenSPARCprocessorhasdrama1cadvantagesoverx86onApacheSpark

BradCarlileSr.DirectorSoluKonArchitectureEngineeringSAERoom311,Wednesday,Feb8,2:40pm–2:55pmFeburary7,2017Addi$onalinfoonProofpoints:hXp://blogs.oracle.com/bestperf

Copyright©2016,Oracleand/oritsaffiliates.Allrightsreserved.|

SafeHarborStatementThefollowingisintendedtooutlineourgeneralproductdirecKon.ItisintendedforinformaKonpurposesonly,andmaynotbeincorporatedintoanycontract.Itisnotacommitmenttodeliveranymaterial,code,orfuncKonality,andshouldnotberelieduponinmakingpurchasingdecisions.Thedevelopment,release,andKmingofanyfeaturesorfuncKonalitydescribedforOracle’sproductsremainsatthesolediscreKonofOracle.

2

SPARCDAXinApacheSparkisProof-of-conceptandnotproduct

Copyright©2016,Oracleand/oritsaffiliates.Allrightsreserved.|

SparkisExciKng–Lotsofso=wareinnova$onSpark’sWholeEcosystemtotallyappealstotheperformanceexpertinme

• GreatEcosystemforAnalyKcs– SparkformsaconKnuumwithothertechnologies(Ka`a,Solr,…)– LotsofOracle’scustomerusingSpark

•  Sparkisfast&scalablebecauseofcleandesign– Spark’sin-memoryfocus

•  SparkspeaksSQL&DataFrames– PerfectmarriageforDataScienKsts&HardwareAcceleraKon

– …butwhatisthestatusofprocessorperformance?3

Copyright©2016,Oracleand/oritsaffiliates.Allrightsreserved.|

ModernAnalyKcsisaboutYourData&ExternalData

• AnalyKcsisnowyourdataPLUSexternaldata– OracleDatabaseIn-MemoryFastestonSPARC– ConKnuouslyanalyzingsamedataindifferentways–Fastestifstoredatain-memory

• BigDatacanbeusedtodescribeexternaldata– Socialnetworks,publicdata:senKments,weather,traffic,events,raKngs,trends,IoTsensors,behaviors,…

– LotsofETL/filteringneededtofindusefuldata– BigDataSQL

ConfidenKal–OracleInternal/Restricted/HighlyRestricted 4

BeBerinforma1onwhenyoucananalyzeallrelevantdata

Weather

Tweets

ExternalData

Order

Customer

YourData

SensorsSocial

Copyright©2016,Oracleand/oritsaffiliates.Allrightsreserved.|

0.8x

1.0x

1.2x

1.4x

1.6x

1.8x

2.0x

2012 2013 2014 2015 2016

CorePerform

ance

vs.x86E5v2

WhyNOTUpHERE?

Isitreallylatency?

X86percoreperformanceflatfor4genera1ons

ISTHERECOMPUTEBOTTLENECK?REALLY?

MITTechReview,“ChipmakerIntelhassignaledaslowingofMoore’sLaw,…whichplayedaroleinjustabouteverymajoradvanceinengineeringandtechnologyfordecades.”

5

x86Innova1onsinPerCorePerformancehasStalled

"percore=(serverperformance)/(servercorecount)"

E5 E5v2 E5v3 E5v4

Copyright©2016,Oracleand/oritsaffiliates.Allrightsreserved.|

0.8x

1.0x

1.2x

1.4x

1.6x

1.8x

2.0x

2012 2013 2014 2015 2016

CorePerform

ance

vs.x86E5v2

x86:flatforfourgenera1ons

SPARC’sInnovaKonsareConKnuallyIncreasingCorePerformance

•  SPARChasfastercoresforDatabase,Java,Apps,...•  SPARChasmorecoresperchipwhichalsodrivesefficiency

6

Innova1veHWDesigngetsaround“x86boBleneck”

SPARCM7 S7

SPARCS7

SPARC:DB

&Java/JVM

S7

"percore=(serverperformance)/(servercorecount)"

Copyright©2016,Oracleand/oritsaffiliates.Allrightsreserved.|

0.8x

1.0x

1.2x

1.4x

1.6x

1.8x

2.0x

2012 2013 2014 2015 2016

CorePerform

ance

vs.x86E5v2

Java-JVMOLTPDBMeoryBandwidthGB/s

SPARCS7is1.6xto2.1xfasterthanx86

It’smoreimportanthowyouusetransistors,thanthenumbertransistorsyoumake(Moore’sLaw)

7

Innova1veHWDesigngetsaround“x86boBleneck”,x86percoreperfstalled

x86indashedlines

SPARCins

olidlines

E5E5v2

E5v3 E5v4

SPARCT5

SPARCM7S7

SPARCS7

"percore=(serverperformance)/(servercorecount)" "percore=(serverperformance)/(servercorecount)"

Copyright©2016,Oracleand/oritsaffiliates.Allrightsreserved.|

SPECjbb2015MulKJVMSPARCS7

•  InnovaKonsinSPARCprocessorinnovaKonsimproveJavaandJVMperformance– x86performanceisflatgeneraKon-to-generaKon•  TradeoffbetweentuningforMaxjOpsorCritjOpsonx86

8

SPARCS7core1.5xto1.9xfasterthanx86E5v4coreonMaxjOps/core

0

1,000

2,000

3,000

4,000

5,000

6,000

S7-2 Huawei HP HP Cisco HuaweiLenovo

CritjOps/coreMaxjOps/core

x86E5v4 x86E5v3

"percore=(serverperformance)/(servercorecount)”Seedisclosureslide

Processor chip,core

MaxjOPS

CritjOPS

SPARCS7-2 4.27GHzSPARCS7 2,16 65,790 35,812

HuaweiRH2288Hv3 2.2GHzE5v422c 2,44 121,381 38,595CiscoC220 2.2GHzE5v422c 2,44 94,667 71,951HuaweiRH2288Hv3 2.3GHzE5v318c 2,36 98,673 28,824Lenovox240M5 2.3GHzE5v318c 2,36 80,889 43,654

Copyright©2016,Oracleand/oritsaffiliates.Allrightsreserved.|

Java(JVM)istheLanguageoftheCloud&FOSS

JVM=JavaVirtualMachine-Run$meORACLECONFIDENTIAL-HighlyRestricted

Java&theunderlyingJVMisthedesigntargetforSPARCBigData&DevOps Descrip1on WriBentoOracleFusion EnterpriseApps Java

ApacheSpark In-memoryAnaltyics JVMApacheCassandra NoSQLdatabase Java

ApacheHadoop DisktodiskMapReduce Java

ApacheHDFS filesystem Java

ApacheLucene DoctextIndexing Java

ApacheSolr text/doc/websearch Java

Jersey/Grizzly(REST) RESTInterface Java

ApacheHive SQLonHDFS Java

Neo4j Graph Java

Scala Language JVM

BigData&DevOps Descrip1on WriBentoAkka(REST) RESTforBigData JVM

ApacheAccumulo key/valuestore Java

ApacheFlink Batch&StreamProcessing Java(JVM)

ApacheKa`a StreamingMessageBroker JVM

Apachelog4j Logeventmanager Java

ApacheSamza Streamprocessing JVM

ApacheStorm Streamprocessing Java&Clojure

ApacheYarn Clusterjobmanager Java

ApacheZookeeper Config&groupservices Java

H20.ai MLAppslibraries Java,Python,R

ApacheHbase NoSQLdatabase Java

Copyright©2016,Oracleand/oritsaffiliates.Allrightsreserved.|

CassandraNoSQL2.2.6(YahooCloudServingBenchmark)SPARCcoreis2.0xfasterthanE5v3core

•  SPARCS72.0xfasterpercorethanE5v3Haswell•  SPARC’sJava/JVMadvantagesbenefitCassandraperformance– YCSB–YahooCloudServingBenchmark–300M• MixedLoadWorkloadA(50%Read/50%Update)• Moreinfoat:hXp://blogs.oracle.com/bestperf

10

Cassandra Chip,core Processor AveOps/s KOps/s

percoreSPARCpercoreAdvantage

SPARCS7-2 2,16 4.27SPARCS7 69,858 4.4K 2.0xE5v3Haswell 2,36 2.3E5v3 90,184 2.5K 1.0x

Copyright©2016,Oracleand/oritsaffiliates.Allrightsreserved.|

OracleDatabasealsoOpKmizedforAnalyKcQueriesApacheSparkneedstoimplementtheseop$miza$ons!

• ColumnStoreisopKmalforAnalyKcs– AnalyKcsLooksacrosstransacKonsatspecificcolumns– DatabasecolumnsareMachineLearningfeatures

• Columnsfarmoreefficientwhenanalyzingfeatures(efficientbandwidth&CPUneeds)• CanexploitdatacharacterisKcs>20xbenefit

•  StorageopKmizaKons–datarepeatedlyanalyzed– VectorprocessingofDicKonary-encodedcomparesneedstobeaddedtoinSparkpersisKngDataFramesinmemory

–  layeredcompressionmeans10’sTBofdatafitintoTB’sofmemory•  ProcessingopKmizaKons– BloomFilterjoinprocessing,operatorpushdown,metadataopKmize

11

SQLop1mizedforanaly1cqueries BigFeatureRevoluIon

#featurescollectedisExploding10’sto100’s

Order F1 F2 … F150

Customer F1 F2 … F78 … F100

Copyright©2016,Oracleand/oritsaffiliates.Allrightsreserved.|

OracleDB

OracleDatabaseIn-MemoryAtScale&SPARCDAX

• OLTPusesprovenrowformat• AnalyKcsusein-memoryColumn

•  10xfasteranalyKcsduetosoFware•  Columnarcompressionmeanshugedatabasescannowfitin-memory

• OracledatabasestoresBOTHrowandcolumnformats•  SimultaneouslyacKveandtransacKonallyconsistent

• 10xto20xFasterSPARCDAX(DataAccelerator)HWinnovaKon

12

RowFormat ColumnFormat

CachedOLTP

TransacKonsWholeDBInMemory

Compressed

WholeDBInMemory

Compressed

10-100’sTBytesRowstore

Copyright©2016,Oracleand/oritsaffiliates.Allrightsreserved.|

SPARCDramaKcallyFasterIn-MemorySQLAnalyKcsinDB

•  SPARCS79.4xfasterpercorethanx86E5v4– SPARCS7-2422query/mvs2-chipx8656query/m•  S7-2(18-corestotal),2-chipx86E5v4(20-corestotal)

– x86percoreperformanceflatfor4generaKons

• DAXoffloadsIn-MemoryScans– DAXoffloadallowscorestoincreasethroughput

•  160GBinmemoryrepresents1TBDBondisk•  6.2xcompressionwithOracle12.2In-memory

13

SPARCismul1genera1onalleadinperformance:14Billion-row/secpercore

In-MemorySQLAnaly1cs

9.4xfasterSPARCcoreadvantage

E5v4x86

SPARCOffloadDAXfrees

Coresforotherprocessing

6xCompression

6xIMcapacity

Copyright©2016,Oracleand/oritsaffiliates.Allrightsreserved.|

SPARC’sMulK-generaKonalLeapfroginPerformance

•  IntegratedOffload– DataAnalyKcsAcceleraKon(DAX)– EncrypKon&Security

•  It’smoreimportanthowyouusetransistors,thanMoore’sLaw(thenumbertransistorsyoumake)

RadicalInnova1on:IntegratedOffloadoffers10xfasterperformance!

OthervendorsfocusedonDetachedGPUs•  DetachedGPUsarepoorlydesignedforSQL•  SlowGPUinterconnec$onrobsperformance

Memory

HalfBW

Memory

X86IBMPower

100%U1lizedNOOFFLOAD!NOOPENCores!

Band-Width

SPARCS7

DAXOFFLO

AD O

FFLOADDAX

14

Copyright©2016,Oracleand/oritsaffiliates.Allrightsreserved.|

GraphicProcessingUnits(GPUs)GPUsOnlyHelpVeryComputeIntensiveAlgorithmsComplexML/StaIsIcsoRennotfastonGPUs…“Canonlycomputeasfastasmovedata”• DetachedGPUshavelimitedbandwidth– GPUsfitforlotsofsimplegraphics•  GraphicshaveKnydatamovement:Coords&texturesin,videoframeout

– BigDataMLmuchmoredemanding:GPUshavelimitedapplicaKons• MLLearning&Scoring– SomeMLLearningcanbecomputeintensive,butonlydoneiniKallytocreateModel

•  NvidiaMathLibrariesonlyhaveBLAS3rouKnes– BLAS2andBLAS1doNOThavecomputeintensitythereforearenotinlibrary

– ComplicatedMLhasmixofBLAS1,BLAS2,BLAS3 BandwidthlinestoscaleGPUcomputeoRenstalledbylackofbandwidth

Memory

Chip

Local169GB/s

Local57GB/s

E5v422core

Local57GB/s

E5v422core

GPU GPU

CurrentPCIe FutureNVlink

SPARCM732core

Copyright©2016,Oracleand/oritsaffiliates.Allrightsreserved.|

SQL&theManyFlavorsofDSL–AllpotenKalsforDAX

•  SQL:– SELECTcount(*)frompersonWHEREci$zen.age>18

• ApacheSparkSQL(DSL)– valvoters:Int=ci$zen.filter($”ci$zen.age”>18).count()

•  JavaStreams– intvoters=arrayList.parallelStream().filter(ciKzen->ciKzen.olderThan(18)).count();

• ApacheEclipse(GoldmanSachsCollecKons)– intvoters=fastList.count(ciKzen->ciKzen.olderThan(18));

ORACLECONFIDENTIAL-HighlyRestricted

SQLcanbewriBeninDomainSpecificLanguages(akaLanguageIntegratedQueries)

ApacheSparkfeedsbothformatsintoSQLopImizerJoinscanbewrinen&op$mized

Copyright©2016,Oracleand/oritsaffiliates.Allrightsreserved.|

SPARCDAXIntegratedwithJavaJDK8Streams

•  JavaStreamsprovideSQL-stylecodeinJava– JavaStreamsisaperfectfitforDAXacceleraKon

• DAX&JavaStreams– IntegerStreamfilter,allMatch,anyMatch,noneMatch,map(ternaryoperator),toArray&countfuncKons– SimplecodechangestouseDAX•  importcom.Oracle.Streaminsteadofimportjava.uKl.Stream•  DaxIntStreaminsteadofIntStream

• Publicaccess:hXp://SWiSdev.oracle.com/DAX

ORACLECONFIDENTIAL-HighlyRestricted

DAXacceleratesJavaStreams:3.6xto21.8xfaster

3.6x 4.0x 4.1x

10.7x

21.8x

Outlier %Kle Top-N Filter allMatch

SpeedupwithDAX10MillionRows

JavaStreamAPI:filter2_data.parallelStream().filter(w->w.temp<100).count()

Copyright©2016,Oracleand/oritsaffiliates.Allrightsreserved.|

ApacheSpark:SQLandDSLBothOpKmizedbyCatalystOpKmizer

18

Ex:“SelectallbooksbyauthorsbornaRer1980named‘Paulo’frombooks&authors”SQLSELECT*FROMauthoraJOINbookbONa.id=b.author_idWHEREa.year_of_birth>1980ANDa.first_name='Paulo’ORDERBYb.Ktle

QueryDSL(DomainSpecificLanguage)valjoinDF=author.as(‘a).join(book.as(‘b),$”a.id”===$”b.author_id”).filter($”a.year_of_birth”>1980).filter($”a.first_name”=“Paulo”)).orderby(‘b.Ktle)

CatalystOp1mizerOp$mizedPlan

(canIreordervariousopera$ons?)Whole-StageCodeGen

newinSpark2.0,thisisProjectTungsten

Copyright©2016,Oracleand/oritsaffiliates.Allrightsreserved.|

ApacheSpark2.1Whole-stageCodeGenPerformance

•  Let’sevaluaKonperformanceonFullTableScanof600Mrows– TimetoScan=0.16secImpressive:104.2MillionRows/secpercore

19

2-chipx86E5v3(Haswell),total36cores,72threads/VCPU

Copyright©2016,Oracleand/oritsaffiliates.Allrightsreserved.|

ByEvaluaKngDeliveredBandwidth,WecanseewearesKllNotYetEfficient

•  EvaluaKonperformanceonFullTableScanof600Mrows– TimetoScan=0.16secImpressive:104.2MillionRows/secpercore

• Goingbackto1stprinciples:Scanningisaboutdatamovement– Whatisbandwidthof2.1in-memoryScan? 14GB/s(Spark2.0.1)– Whatisthesystemmemorybandwidth? 114GB/s(StreamTriad)– 7.6xmorebandwidthavailable!

• OtherQueriesdeliverless:“SELECTcount(*)FROMlineorderWHERElo_quanKtyBETWEEN10and20”

– 19xmorebandwidthavailable!--BUTonlydelivering6GB/sonthesystem

20

2-chipx86E5v3(Haswell),total36cores,72threads/VCPU

Copyright©2016,Oracleand/oritsaffiliates.Allrightsreserved.|

AFerSpark2.1:Another10xinPerformanceSKllPossibleSelectcount(*)fromstore_saleswheress_item_sk>100andss_item_sk<1000

1.  VolcanoIteratormodel:planinterpretaKon–  Open;Nextdataelement;performpredicate;Close

2.  “CollegeFreshman”Java:Whole-StageCodeGenpipelinedcode–  for(ss_item_skinstore_sales){if(ss_item_sk>100andss_item_sk<1000){count+=1}}

3.  TunedtruevectorizaKonlibrary:highly-tunedcodeoperatesonconKguouscolumn–  vectorRangeFilter(n,VECTOR_OP_GT,1000,VECTOR_OP_LT,1000,store_sales,result,result_cnt)

4.   HardwareAccelera1onfurtheracceleratesscanning–  vectorRangeFilter(n,VECTOR_OP_GT,1000,VECTOR_OP_LT,1000,store_sales,result,result_cnt)•  x86AVX2(GraphicsInstrucKon)– deliver70GB/sperchip•  Oracle’sSPARCM7processor– deliver140GB/sperchip

21

ProjectTungsten

SPARCDAX

Copyright©2016,Oracleand/oritsaffiliates.Allrightsreserved.|

SPARCDAXAcceleraKngApacheSpark2.1.0

• Proof-of-ConceptPrototypeonSPARCS7:ApacheSpark2.1POC:DAX&in-memorycolumnar– 2-chipSPARCS7DAX:9.8Billionrows/secon2-predicatescan•  SPARCS7-2(16totalcores) 0.061sec•  2-chipx86(20totalcores)0.760seclatestgenera$onE5v4Broadwell

– TheseadvantagesareoverandaboveTungsten’simprovements

•  SPARCDAXoffloadscriKcalfuncKons– SQL:filtering,dicKonaryencoding,1&2predicate,joinprocessing,addiKonalcompression,etc.

•  LibdaxopenAPI:hXp://SWiSdev.oracle.com/DAX

22

SparkSQLonSPARCS7-2DAXover15.6xfasterpercorethanx86

x86 SPARCDAX

15.6xfasterpercorethanx86

SparkOn

SPARCS7

Copyright©2016,Oracleand/oritsaffiliates.Allrightsreserved.|

In-MemoryColumnarTwo-levelCompression:OracleDBSametechniqueneedstobeaddedtoApacheSpark

•  EfficientInMemoryColumnarTable• DicKonaryencodinghugecompression–  50USonlyneed6bits(<1byte)vs.“SouthDakota”needs12charactersor192unicodebits=32xsmaller

–  Ozip/RLEcompressionisthenappliedontopofdicKonaryencoding

•  Innova1onsthatApacheSparks1llneeds–  Directlyscandic$onaryencodeddata!–  Rangescans(2-predicate)inonestep–  Ozipdecompressionwithoutwri$ngtomemory–  SaveMin&MaxforscaneliminaKon–  CanusedicKonaryfor“featurizaKon”ofdataforML

23

0132023

ColumnMin:SouthDakotaMax:Utah

DicIonary

Dictencode

Columnvaluelist

SouthDakota

Tennessee

Utah

Texas

SouthDakota

Texas

Utah

DicIonaryVALUEID

SouthDakota0Tennessee1Texas2Utah3

OZip+

RLE)

Capa-cityLow(Ozip)

Verizonre-inventedthisaswell(SparkSummit2017Boston):RealKmeAnalyKcalQueryProcessing&PredicKveModelDebasishDas(Verizon)

Copyright©2016,Oracleand/oritsaffiliates.Allrightsreserved.|

libdaxOpenAPIforFree&OpenSourceSoFware(FOSS)

• DesignedtoacceleratewidevarietyofOracle&FOSSsoFware– Examples:• OracleDBIn-memory9.5xfasterwithDAX• ApacheSparkSQLover16xfasterwithDAX• AcKveViam6xto8xfasterwithDAX•  JavaStreams3.6xto21.8xfasterwithDAX

• OpenAPIforlibdax–signupforfreesystemstoactuallydevelop/trycode– Publishedsamplecodes– hXps://swisdev.oracle.com

Libdaxdesignedforkeyscan&dic1onaryopera1onstoaccelerateavarietyofsoyware

24

Copyright©2016,Oracleand/oritsaffiliates.Allrightsreserved.|

SPARCProcessorisfastestforStaKsKcs&ML(MachineLearning)SPARCisFastestforSparkApplica1onsSPARCDAXcandrama1callyaccelerateSparkSQL

ConfidenKal–OracleInternal/Restricted/HighlyRestricted 25

Copyright©2016,Oracleand/oritsaffiliates.Allrightsreserved.|

TheBasicAnalyKcsFlow

• Datacancomefrommanysources– Databases,NoSQL,csv,feeds…

• Weneedtoprepareit– DataMungingofallsorts!

• Analyzethedata– Findthe“rightway”toanalyzeit• ML,Graph,SQL…

ConfidenKal–OracleInternal/Restricted/HighlyRestricted 26

Alotof1mespentinDataMunging

DataMunging

RESULTS

Analy1cs

NoSQL,Search.…

StreamingKaza,Storm…

Databases

Copyright©2016,Oracleand/oritsaffiliates.Allrightsreserved.|

ConKnuousAnalyKcsCycleResultsOFenEnrichTransacKonsAnaly1csismorethanonepipelinestream

• ConKnuousiteraKonsaroundthedataanalyKcswheel– Save,catalog,andre-useallthings:data,SQL,code,andanalyKcs

•  In-memoryadvantagesateachstage– SPARC’sDAX&leadingbandwidthiskey

• Manysourcesofdata– Internalproprietary,publicdata,externalstreaming,archives

OracleConfidenKal–HighlyRestricted 27

StreamingKaza,Storm,…

EnhanceTransacIons

Reports

NoSQL

ETL(SQL)

ETL(SQL)

In-Memory:Reusedata&analysis

ML&Graph(FP)

Results(SQL)

FeatureExtract,Generate&Transform

(SQL)

Databases

Copyright©2016,Oracleand/oritsaffiliates.Allrightsreserved.|

ComputaKonalGraphAlgorithms:PageRank&Single-SourceShortestPath(SSSP)

• GraphcomputaKonsacceleratedbySPARC’smemorybandwidth– Bellman-Ford/SSSP(single-sourceshortestpath)–opKmalrouteorconnecKon– PageRank-measuringwebsiteimportance

Graph:SPARCM7upto1.5xfasterpercorethanx86GraphAlgorithm Workload

Size4-chip

X86E5v34-chip

SPARCT7-4SPARCperchipAdvantage

SPARCpercoreAdvantage

SSSPBellman-Ford

448MverKces,17.2Bedges 39.2s 14.7s 2.7x 1.5x233MverKces,8.6Bedges 21.3s 8.5s 2.5x 1.4x

PageRank448MverKces,17.2Bedges 136.7s 62.6s 2.2x 1.2x233MverKces,8.6Bedges 72.1s 27.6s 2.6x 1.5x

28"percore=(serverperformance)/(servercorecount)"

Copyright©2016,Oracleand/oritsaffiliates.Allrightsreserved.|

OracleDatabaseAdvancedAnalyKcsOpKonMachineLearningonSPARC

•  OracleAdvancedAnalyKcsinOracleDatabase12.2–  SPARCM7fasterpercoreontraining64-bitfloaKngpointintensive–  In-memory640millionrecords,AirlineOn-Kmedataset

MLtrainingSPARCM7upto2.0xfasterpercorethanx86

SGD(StochasKcGradientDescent)IPM(InteriorPointMethod)

Training:CreaIngModelfromdata

ABri-butes

X5-44-chip

T7-44-chip

SPARCperchipAdvantage

SPARCpercoreAdvantage

Supervised

SVMIPMSolver 900 1442s 404s 3.6x 2.0xGLMClassificaKon 900 331s 154s 2.1x 1.2xSVMSGDSolver 9000 157s 84s 1.9x 1.1xGLMRegression 900 78s 55s 1.4x 0.8x

ClusterModelExpectaKonMaximizaKon 9000 763s 455s 1.7x 0.9xK-Means 9000 232s 161s 1.4x 0.8x

29"percore=(serverperformance)/(servercorecount)"

Predict Train

Copyright©2016,Oracleand/oritsaffiliates.Allrightsreserved.|

OracleDatabaseAdvancedAnalyKcsOpKonMachineLearningonSPARC

•  OracleAdvancedAnalyKcsinOracleDatabase12.2–  SPARCM7muchfasterpercoreonscoringbandwidth-intensive–  In-memory1Billionrecords,AirlineOn-Kmedataset

MLpredic1onSPARC3.0x-4.8xfasterpercorethanx86

SGD(StochasKcGradientDescent)IPM(InteriorPointMethod)

Predic1onUsingmodelon1Brecords

ABri-butes

X5-44-chip

T7-44-chip

SPARCperchipAdvantage

SPARCpercoreAdvantage

Supervised

SVMIPMSolver 900 206s 24s 8.6x 4.8xGLMRegression 900 166s 25s 6.6x 3.7xGLMClassificaKon 900 156s 25s 6.2x 3.5xSVMSGDSolver 9000 132s 24s 5.5x 3.1x

ClusterModelK-Means 9000 222s 35s 6.3x 3.6xExpectaKonMaximizaKon 9000 243s 40s 6.1x 3.4x

30"percore=(serverperformance)/(servercorecount)"

Predict Train

Copyright©2016,Oracleand/oritsaffiliates.Allrightsreserved.|

SparkMLDataFrameAlgorithmsuseBLAS1&someBLAS2

•  ALS(AlternaKngLeastSquaresmatrixfactorizaKon–  BLAS2:_SPR,_TPSV–  BLAS1:_AXPY,_DOT,_SCAL,_NRM2

•  LogisKcregressionclassificaKon–  BLAS2:_GEMV–  BLAS1:_DOT,_SCAL

•  Generalizedlinearregression–  BLAS1:_DOT

•  Gradient-boostedtreeregression–  BLAS1:_DOT

•  GraphXSVD++•  BLAS1:_AXPY,_DOT,_SCAL

•  NeuralNetMulK-layerPerceptron•  BLAS3:_GEMM•  BLAS2:_GEMV

31

Lowercomputeintensityof0.66to2.0,BLAS1/BLAS2limitsMLperformance,BLAS3faster

SparseLinearAlgebrahasEvenlowercomputeintensity

“Onecanonlycomputeasfastasonecanmovedata”MaxFlops=Compute-intensity*(Memory-BW/bytes-element)Compute-Intensity=Flops/elementSingleFP=4bytes/elementDoubleFP=8bytes/element

HINT:SparkDevelopersneedtowritesub-blockmatriximplementa$onstogetmoreBLAS3usageinSparkML…Why?BLAS3aremoreefficiently,becausetheyuselessmemorybandwidthandreducenumberofmethodcallsFastestLAPACKsolversuseBLAS3

Copyright©2016,Oracleand/oritsaffiliates.Allrightsreserved.|

GPUsOnlyAccelerateSomeComputeKernelsDetachedGPUmemorybandwidthisboBleneckformostcomputa1ons

Rou1neLevel Opera1on Flops MemComputeIntensity

TeslaK80peak

MaxPossibleGflops

@12GB/sblock=100

PCIeGPU

%peak

MaxPossibleGflops

@80GB/sblock=100

Nvlink(future)GPU

%peak

BLAS1(vector) y=ax+y 2n 3n 2/3 935Gflops 1Gflops 0.1% 7Gflops 0.7%

BLAS2(matrix-vector) y=Ax+y 2n^2+2n n^2 2 935Gflops 3Gflops 0.3% 20Gflops 2.1%

BLAS3(matrix-matrix) C=AB+C 2n^3 4n^2 n/2 935Gflops 75Gflops 8.0% 500Gflops 53.5%

FFT C=FFT(A) nlog(n) 2n Log(n)/2 935Gflops 239Gflops1M1D-FFT 25% 935Gflops

1M1D-FFTNearpeak

OracleConfidenKal–HighlyRestricted 32

Only6BLAS3rouKnetypesinNvidiaLibrary:gemm,syrk,syr2k,trsm,trmm,symmCanwritecodesusingBLAS2/1inGPUsbutresulKngcomputeintensitymustbelargetoaccelerate

Copyright©2016,Oracleand/oritsaffiliates.Allrightsreserved.|

SPARCAdvantagesonAnalyKcsAcrosstheSpectrumKeyunderlyingtechnologiesprovidedrama1cperformanceadvantages

SPARC’s Java & JVM advantages

SQL DAX Offload

SPARC DAX offload acceleration Oracle Developer Perflib math libraries

NoSQL Machine Learning Graph & Spatial

•  Oracle Database •  Big Data with DAX

•  Cassandra NoSQL •  Oracle NoSQL

•  Spark MLlib •  Advanced Analytics

•  Spark GraphX •  Oracle Spatial&Graph

upto10xfasterpercore

upto5xfasterpercore

upto1.5xfasterpercore

upto1.9xfasterpercore

33

Copyright©2016,Oracleand/oritsaffiliates.Allrightsreserved.|

RequiredBenchmarkDisclosureStatement•  AddiKonalInfo:hXp://blogs.oracle.com/bestperf•  Copyright2016,Oracle&/oritsaffiliates.Allrightsreserved.Oracle&JavaareregisteredtrademarksofOracle&/oritsaffiliates.OthernamesmaybetrademarksoftheirrespecKveowners•  SPECandthebenchmarknameSPECjEnterpriseareregisteredtrademarksoftheStandardPerformanceEvaluaKonCorporaKon.Resultsfromwww.spec.orgasof7/6/2016.SPARC

S7-2,14,400.78SPECjEnterprise2010EjOPS(unsecure);SPARCS7-2,14,121.47SPECjEnterprise2010EjOPS(secure)OracleServerX6-2,27,803.39SPECjEnterprise2010EjOPS(unsecure);IBMPowerS824,22,543.34SPECjEnterprise2010EjOPS(unsecure);IBMx3650M5,19,282.14SPECjEnterprise2010EjOPS(unsecure).

•  SPECandthebenchmarknameSPECjbbareregisteredtrademarksofStandardPerformanceEvaluaKonCorporaKon(SPEC).ResultsfromhXp://www.spec.orgasof6/29/2016.SPARCS7-2(16-core)65,790SPECjbb2015-MulKJVMmax-jOPS,35,812SPECjbb2015-MulKJVMcriKcal-jOPS;IBMPowerS812LC(10-core)44,883SPECjbb2015-MulKJVMmax-jOPS,13,032SPECjbb2015-MulKJVMcriKcal-jOPS;SPARCT7-1(32-core)120,603SPECjbb2015-MulKJVMmax-jOPS,60,280SPECjbb2015-MulKJVMcriKcal-jOPS;HuaweiRH2288Hv3(44-core)121,381SPECjbb2015-MulKJVMmax-jOPS,38,595SPECjbb2015-MulKJVMcriKcal-jOPSHPProLiantDL360Gen9(44-core)120,674SPECjbb2015-MulKJVMmax-jOPS,29,013SPECjbb2015-MulKJVMcriKcal-jOPS;HPProLiantDL380Gen9(44-core)105,690SPECjbb2015-MulKJVMmax-jOPS,52,952SPECjbb2015-MulKJVMcriKcal-jOPS;;CiscoUCSC220M4(44-core)94,667SPECjbb2015-MulKJVMmax-jOPS,71,951SPECjbb2015-MulKJVMcriKcal-jOPS;HuaweiRH2288HV3(36-core)98,673SPECjbb2015-MulKJVMmax-jOPS,28,824SPECjbb2015-MulKJVMcriKcal-jOPs;Lenovox240M5(36-core)80,889SPECjbb2015-MulKJVMmax-jOPS,43,654SPECjbb2015-MulKJVMcriKcal-jOPS;SPARCT5-2(32-core)80,889SPECjbb2015-MulKJVMmax-jOPS,37,422SPECjbb2015-MulKJVMcriKcal-jOPS;SPARCS7-2(16-core)66,612SPECjbb2015-Distributedmax-jOPS,36,922SPECjbb2015-DistributedcriKcal-jOPS;HPProLiantDL380Gen9(44-core)120,674SPECjbb2015-Distributedmax-jOPS,39,615SPECjbb2015-DistributedcriKcal-jOPS;HPProLiantDL360Gen9(44-core)106,337SPECjbb2015-Distributedmax-jOPS,55,858SPECjbb2015-DistributedcriKcal-jOPS;HPProLiantDL580Gen9(96-core)219,406SPECjbb2015-Distributedmax-jOPS,72,271SPECjbb2015-DistributedcriKcal-jOPS;LenovoFlexSystemx3850X6(96-core)194,068SPECjbb2015-Distributedmax-jOPS,132,111SPECjbb2015-DistributedcriKcal-jOPS.

•  SPECandthebenchmarknamesSPECfpandSPECintareregisteredtrademarksoftheStandardPerformanceEvaluaKonCorporaKon.ResultsasofOctober25,2015fromwww.spec.organdthisreport.1chipresultsSPARCT7-1:1200SPECint_rate2006,1120SPECint_rate_base2006,832SPECfp_rate2006,801SPECfp_rate_base2006;SPARCT5-1B:489SPECint_rate2006,440SPECint_rate_base2006,369SPECfp_rate2006,350SPECfp_rate_base2006;FujitsuSPARCM10-4S:546SPECint_rate2006,479SPECint_rate_base2006,462SPECfp_rate2006,418SPECfp_rate_base2006.IBMPower710Express:289SPECint_rate2006,255SPECint_rate_base2006,248SPECfp_rate2006,229SPECfp_rate_base2006;FujitsuCELSIUSC740:715SPECint_rate2006,693SPECint_rate_base2006;NECExpress5800/R120f-1M:474SPECfp_rate2006,460SPECfp_rate_base2006.

•  Two-KerSAPSalesandDistribuKon(SD)standardapplicaKonbenchmarks,SAPEnhancementPackage5forSAPERP6.0asof5/16/16:SPARCM7-8,8processors/256cores/2048threads,SPARCM7,4.133GHz,130000SDUsers,713480SAPS,Solaris11,Oracle12cSAP,CerKficaKonNumber:2016020,SPARCT7-2(2processors,64cores,512threads)30,800SAPSDusers,2x4.13GHzSPARCM7,1TBmemory,OracleDatabase12c,OracleSolaris11,Cert#2015050.HPEIntegritySuperdomeX(16processors,288cores,576threads)100,000SAPSDusers,16x2.5GHzIntelXeonProcessorE7-8890v34096GBmemory,SQLServer2014,WindowsServer2012R2DatacenterEdiKon,Cert#2016002.IBMPowerSystemS824(4processors,24cores,192threads)21,212SAPSDusers,4x3.52GHzPOWER8,512GBmemory,DB210.5,AIX7,Cert#201401.DellPowerEdgeR730(2processors,36cores,72threads)16,500SAPSDusers,2x2.3GHzIntelXeonProcessorE5-2699v3256GBmemory,SAPASE16,RHEL7,Cert#2014033.HPProLiantDL380Gen9(2processors,36cores,72threads)16,101SAPSDusers,2x2.3GHzIntelXeonProcessorE5-2699v3256GBmemory,SAPASE16,RHEL6.5,Cert#2014032.SAP,R/3,regTMofSAPAGinGermanyandothercountries.Moreinfowww.sap.com/benchmarks.

MustbeinSPARCS7&M7Presenta1onswithBenchmarkResults

Copyright©2016,Oracleand/oritsaffiliates.Allrightsreserved.|

BackupSlides

35

Copyright©2016,Oracleand/oritsaffiliates.Allrightsreserved.|

WhyApacheSparkisbestonSPARC

•  SPARC’sleadershipinfastJVMperformancespeedsScala– SPARCdesignfocusonJava&JVMbenefitsallAppsthatuseJVM

•  SPARC’shugeleadindeliveredmemorybandwidth– SPARCacceleratesSpark’sin-memorydesign–Bigdataappsneedsfastdatamovement

•  SPARCDAXtoprovidehugeoffloadacceleraKoninApacheSpark– ContribuKngcodetoApacheSparkSQL/DataFramestouseDAX

•  SPARC’sleadingfloaKng-pointperformanceforMachineLearning– DesignedforfastFloaKngPoint(FP)onsparse&advancedalgorithm

Designinnova1onsateverylevel

36

Copyright©2016,Oracleand/oritsaffiliates.Allrightsreserved.|

SPARCAdvantagesforEveryFormofAnalyKcs

AnalyIcs In-memoryColumnarSW&SoRwareinSilicon(SPARCDAX)

SPARC’sefficientCPU&memorydesign

DatabaseAnaly1csAnaly$cSQLQueriesfordatabase ✔ ✔JavaAnaly1csJavaCodethatprobesdataSQL-likecode ✔ ✔MachineLearning(ML)Automa$callyfindinghiddenpanerns&correla$ons ✔ ✔GraphAutoma$callyfindpanernsinSocialnetworks,etc. ✔NoSQLFastaccesstosemi-structured&unstructureddata ✔Spa1alFastthroughputongeometricdata ✔

37

SPARCis1.3xtoover10xfasterpercoreonAnaly1csthanx86

Featuregenera$on&Selec$onforMLiso=enSQL

Copyright©2016,Oracleand/oritsaffiliates.Allrightsreserved.|

0.0

50.0

100.0

150.0

200.0

2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

Billion

sRow

s/sec

Bitstoencodedata:log2(cardinality)

RangeScanUsedinSQL“Between”

S7-1DAX8-coreS7-2DAX16-coreE5v3AVX218-core

SPARCDAXScanPerformanceversusx86E5v3(AVX2)DAXoffloadscans:SPARCS7onlyuses25%coresandx86uses50%cores

38

18-corex86E5v3

8-coreSPARCS7

OraclehasopenAPIforDAX:hXps://swisdev.oracle.com/

16-coresSPARCS7-2

Copyright©2016,Oracleand/oritsaffiliates.Allrightsreserved.|

“Big”Features:AXributes/QualiKestoAnalyze

• OrganizingfeaturesiscriKcaltotheanalyKcscycleConfidenKal–OracleInternal/Restricted/HighlyRestricted 39

BigDataisalsoBigFeatures:10’sto100’sfeaturestochose,create,&manage

StaKsKcsML

(MachineLearning)

ResultsReport

FeatureDBSQL

JoinsSQL

ETLSQL

(Extract,transform,&load)

FeatureSelect&

FeaturizaIon(behaviors)

SQL

RAWDATAPublic

&private

Copyright©2016,Oracleand/oritsaffiliates.Allrightsreserved.|

MachineLearning(ML):Scoring/PredicKonversusTraining/LearningCharacterisKcsPredic1on/Scoringoperatesonhugeamountsofdatawithlowcomputeintensity

MLScore/Predic1on

MLLearn/Train

%ofacKvity MostData *IniKal

ComputaKon O(n^2)Matrix-vector

O(n^3)Matrix-matrix

Data O(n^2) O(n^2)

ComputeIntensity(Compute/Data) Lowconstant O(n)

SPARCAdvantageDuetoMemoryBandwidth&design

3xto6xpercore

Upto1.3xpercore

OracleConfidenKal–HighlyRestricted 40

TrainingSet

Scoring/PredicKon

Results

Cantrainononeserverthenmovemodeltopredic$onserver(ex:StubHub)Modelslive1week!

Training/Learning

DatatoEvaluate

*Ini$alandthenoccasionallyupdatemodelsdatasamples

Model

Every4weeks