Spark at Bloomberg: Dynamically Composable Analytics

Spark @ Bloomberg: Dynamic Composable Analytics Partha Nageswaran Sudarshan Kadambi BLOOMBERG L.P.

Transcript of Spark at Bloomberg: Dynamically Composable Analytics

Page 1: Spark at Bloomberg:  Dynamically Composable Analytics

Spark @ Bloomberg: Dynamic Composable Analytics

Partha Nageswaran, Sudarshan Kadambi, BLOOMBERG L.P.

Page 2: Spark at Bloomberg:  Dynamically Composable Analytics

Bloomberg Spark Server

Spark Serverization at Bloomberg has culminated in the creation of the Bloomberg Spark Server

[Architecture diagram: a single JVM hosting a (Persistent) Spark Context, a Request Handler, Request Processors (including a Declarative Query Request Processor), the Function Transform Registry (FTR), the Managed DataFrame Registry, and the Ingestion Manager]

Page 3: Spark at Bloomberg:  Dynamically Composable Analytics

Spark Serverization – Motivation

• Stand-alone Spark Apps on isolated clusters pose challenges:

– Redundancy in:

» Crafting and managing of RDDs/DFs

» Coding of the same or similar types of transforms/actions

– Management of clusters, replication of data, etc.

– Analytics are confined to specific content sets, making Cross-Asset Analytics much harder

– Need to handle real-time ingestion in each App

[Diagram labels: Spark Cluster, Spark App, Spark Server]

Page 4: Spark at Bloomberg:  Dynamically Composable Analytics

Dynamic Composable Analytics

• Compositional Analytics are commonplace in the Financial Sector

Decile Rank the 14-day Relative Strength Index of Active Equity Stocks:

DECILE(RSI(Price, 14, ['IBM US Equity', 'VOD LN Equity', …]))

• Price is data abstracted as a Spark DataFrame

• RSI, DECILE are building-block analytics, expressible as Spark transforms and actions

Page 5: Spark at Bloomberg:  Dynamically Composable Analytics

Dynamic Composable Analytics

• Another use case may want to compose Percentile with RSI

Percentile Rank the 14-day Relative Strength Index of Active Equity Stocks:

PERCENTILE(RSI(Price, 14, ['IBM US Equity', 'VOD LN Equity', …]))

• Or Percentile with ROC, etc. And the compositions may be arbitrarily complex

Page 6: Spark at Bloomberg:  Dynamically Composable Analytics

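Dynamic Composable Analytics

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
// The 'symbol column syntax also needs the session's implicits in scope.

def RSI(df: DataFrame, period: Int = 14): DataFrame = {
  // Smoothed moving average weight for the i-th most recent row.
  val smmaCoeff = udf((i: Double) => scala.math.pow(period - 1, i - 1) / scala.math.pow(period, i))
  // n = weighted gains, d = weighted losses.
  val rsi_from_rs = udf((n: Double, d: Double) => 100 - 100 * d / (d + n))
  val rsi_window = Window.partitionBy('id).orderBy('date.desc)

  df.withColumn("weight", smmaCoeff(row_number().over(rsi_window)))
    .withColumn("diff", 'value - lead('value, 1).over(rsi_window))
    .withColumn("U", when('diff > 0, 'diff).otherwise(0))
    .withColumn("D", when('diff < 0, abs('diff)).otherwise(0))
    .groupBy('id)
    .agg(rsi_from_rs(sum('U * 'weight), sum('D * 'weight)) as 'value)
}

def Decile(df: DataFrame): DataFrame = {
  // Rank rows into 10 buckets by descending value.
  df.withColumn("value", ntile(10).over(Window.orderBy('value.desc)))
}

Ack: Andrew Foster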
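The arithmetic inside the RSI transform can be checked without a cluster. The following is a minimal plain-Scala sketch, not the Spark Server code: `RsiSketch`, `smmaWeight` and `rsi` are hypothetical names, and the input is assumed to be one instrument's prices ordered newest first, matching the window ordered by date descending.

```scala
object RsiSketch {
  // Weight of the i-th most recent observation (i = 1 is the newest),
  // mirroring smmaCoeff on the slide: (period-1)^(i-1) / period^i.
  def smmaWeight(i: Int, period: Int): Double =
    math.pow(period - 1, i - 1) / math.pow(period, i)

  // `prices` is assumed to be ordered newest first.
  def rsi(prices: Seq[Double], period: Int = 14): Double = {
    // diff for row i is price(i) - price(i + 1), i.e. newer minus older.
    val diffs = prices.zip(prices.drop(1)).map { case (newer, older) => newer - older }
    val weighted = diffs.zipWithIndex.map { case (d, idx) => (d, smmaWeight(idx + 1, period)) }
    val up = weighted.collect { case (d, w) if d > 0 => d * w }.sum
    val down = weighted.collect { case (d, w) if d < 0 => -d * w }.sum
    // Same formula as rsi_from_rs on the slide, with n = up and d = down.
    100 - 100 * down / (down + up)
  }
}
```

A series that only rises yields 100 and one that only falls yields 0; a completely flat series divides zero by zero and yields NaN, which a production transform would presumably need to handle.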

Page 7: Spark at Bloomberg:  Dynamically Composable Analytics

Function Transform Registry

• Maintain a Registry of Analytic functions (FTR) with functions expressed as Parametrized Spark Transforms and Actions

• Functions can compose other functions, along with additional transforms, within the Registry

• Registry supports 'bind' and 'lookup' operations

Function Transform Registry (FTR)

FUNCTIONS | SPARK IMPL.
Decile | …
Percentile | …
… | …
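The 'bind' and 'lookup' operations, plus composition of registered functions, can be sketched in plain Scala. This is an illustration under stated assumptions rather than the actual FTR: `FtrSketch`, `Registry`, `compose`, and the toy functions `HALVE` and `RANK` are hypothetical, and a `Map[String, Double]` stands in for a DataFrame of (id, value) rows.

```scala
object FtrSketch {
  // Stand-in for a DataFrame of (id, value) rows (an assumption for this sketch).
  type Frame = Map[String, Double]
  type Transform = Frame => Frame

  final class Registry {
    private val functions = scala.collection.mutable.Map.empty[String, Transform]

    def bind(name: String, f: Transform): Unit = functions.update(name, f)

    def lookup(name: String): Option[Transform] = functions.get(name)

    // Compose registered functions innermost first:
    // compose(Seq("RSI", "DECILE")) corresponds to DECILE(RSI(_)).
    // Returns None if any name is not bound.
    def compose(names: Seq[String]): Option[Transform] = {
      val found = names.map(n => lookup(n))
      val identityTransform: Transform = frame => frame
      if (found.exists(_.isEmpty)) None
      else Some(found.flatten.foldLeft(identityTransform)((f, g) => f.andThen(g)))
    }
  }

  def demo(): Frame = {
    val ftr = new Registry
    ftr.bind("HALVE", frame => frame.map { case (id, v) => id -> v / 2 })
    ftr.bind("RANK", frame =>
      frame.toSeq.sortBy { case (_, v) => -v }.zipWithIndex
        .map { case ((id, _), i) => id -> (i + 1).toDouble }.toMap)
    val price: Frame = Map("A" -> 10.0, "B" -> 30.0, "C" -> 20.0)
    ftr.compose(Seq("HALVE", "RANK")).map(f => f(price)).getOrElse(Map.empty)
  }
}
```

Keeping functions as plain `Frame => Frame` values is what makes arbitrary nesting cheap: a composed function is itself a `Transform` and could be bound back into the registry under a new name.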

Page 8: Spark at Bloomberg:  Dynamically Composable Analytics

Bloomberg Spark Server

[Architecture diagram: JVM hosting the (Persistent) Spark Context, the Request Handler, and the Function Transform Registry (FTR)]

Page 9: Spark at Bloomberg:  Dynamically Composable Analytics

Request Processor

• Request Processors (RPs) are Spark applications that orchestrate composition of analytics on DataFrames

– RPs comply with a specification that allows them to be hosted by the Bloomberg Spark Server

– Each request (such as: compute the Decile Rank of the RSI) is handled by a Request Processor that looks up functions from the FTR, composes them and applies them to DataFrames

[Diagram labels: Request Handler, Request Processor, Declarative Query Request Processor, FTR]

Page 10: Spark at Bloomberg:  Dynamically Composable Analytics

Bloomberg Spark Server

[Architecture diagram: JVM hosting the (Persistent) Spark Context, the Request Handler, the Function Transform Registry (FTR), and Request Processors, including a Declarative Query Request Processor]

Page 11: Spark at Bloomberg:  Dynamically Composable Analytics

Managed Data Frames

• Besides locating functions from the FTR, Request Processors have to pass in DataFrames to these functions as inputs

• Rather than instantiate DataFrames, look up DataFrames from a DataFrames Registry

– Such DataFrames are called Managed DataFrames (MDF)

– The Registry that manages these DataFrames is the Managed DataFrame Registry (MDF Registry)

Page 12: Spark at Bloomberg:  Dynamically Composable Analytics

Introducing Managed DataFrames (MDFs)

• A Managed DataFrame (MDF) is a named DataFrame, optionally combined with Execution Metadata

– MDFs can be located by name OR by any Column Name defined in the Schema of the corresponding DF

• Execution Metadata includes:

– Data Distribution metadata captures information about the data depth, histogram information, etc.

– E.g.: A managed DataFrame for pricing of stocks, representing 2 years of historical data, and another representing 30 years of historical data

[Diagram: two MDFs, each wrapping Price DF<ID, Price>]

MDF | Name: ShallowPriceMDF | Execution Metadata: 2 Yr Price History

MDF | Name: DeepPriceMDF | Execution Metadata: 30 Yr Price History

Page 13: Spark at Bloomberg:  Dynamically Composable Analytics

Managed DataFrames

– Data Derivation metadata, which are mathematical expressions that define how additional columns can be synthesized from existing columns in the schema

– E.g.: adjPrice is a derived Column, defined in terms of the base Price column

– In essence, an MDF with data derivation metadata has a Schema that is a union of the contained DF and the derived columns

[Diagram: two MDFs, each wrapping Price DF<ID, Price>]

MDF | Name: ShallowPriceDF | Execution Metadata: 2 Yr Price History; adjPrice = Price – 3% of Price

MDF | Name: DeepPriceDF | Execution Metadata: 30 Yr Price History; adjPrice = Price – 1.75% of Price

Page 14: Spark at Bloomberg:  Dynamically Composable Analytics

The MDF Registry

• The MDF Registry within the Bloomberg Spark Server provides support for:

– Binding MDFs by Name

– Looking up MDFs by Name

– Looking up MDF by a Column Name (an element of the MDF Schema), etc.

• The MDF Registry maintains a 'table' that associates the Name of the MDF with the DF reference and Columns in the DF

MDF Registry

Name | Columns | DF Ref. | MetaData
ShallowPriceDF | Price, adjPrice | … | …
DeepPriceDF | Price, adjPrice | … | …
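The registry 'table' and its lookups can be sketched as follows. This is a minimal illustration, not the Bloomberg implementation: `MdfRegistrySketch`, the field names, the `df-ref-*` strings (stand-ins for DataFrame references), the `history` metadata key, and the textual form of the derivation expressions are all assumptions. The schema of an MDF is taken to be the union of its base columns and its derived columns, as described for data derivation metadata.

```scala
object MdfRegistrySketch {
  final case class ManagedDataFrame(
      name: String,
      dfRef: String,                        // stand-in for a DataFrame reference
      baseColumns: Set[String],             // columns of the contained DF
      derivedColumns: Map[String, String],  // derived column -> defining expression
      metadata: Map[String, String]) {
    // Schema = columns of the contained DF plus the derived columns.
    def schema: Set[String] = baseColumns ++ derivedColumns.keySet
  }

  final class Registry {
    private val byName =
      scala.collection.mutable.LinkedHashMap.empty[String, ManagedDataFrame]

    def bind(mdf: ManagedDataFrame): Unit = byName.update(mdf.name, mdf)

    def lookupByName(name: String): Option[ManagedDataFrame] = byName.get(name)

    // Every MDF whose schema (base or derived) contains the column.
    def lookupByColumn(column: String): Seq[ManagedDataFrame] =
      byName.values.filter(_.schema.contains(column)).toSeq
  }

  def demo(): Registry = {
    val registry = new Registry
    registry.bind(ManagedDataFrame("ShallowPriceDF", "df-ref-1", Set("ID", "Price"),
      Map("adjPrice" -> "Price - 0.03 * Price"), Map("history" -> "2 years")))
    registry.bind(ManagedDataFrame("DeepPriceDF", "df-ref-2", Set("ID", "Price"),
      Map("adjPrice" -> "Price - 0.0175 * Price"), Map("history" -> "30 years")))
    registry
  }
}
```

A lookup by column can return more than one MDF, as with `adjPrice` here; choosing between the shallow and deep candidates is where execution metadata such as data depth would come in.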

Page 15: Spark at Bloomberg:  Dynamically Composable Analytics

Bloomberg Spark Server

[Architecture diagram: JVM hosting the (Persistent) Spark Context, the Request Handler, the Function Transform Registry (FTR), Request Processors (including a Declarative Query Request Processor), and the Managed DataFrame Registry]

Page 16: Spark at Bloomberg:  Dynamically Composable Analytics

Flow of Requests

• Request Processors within the Spark Server orchestrate analytics

– These RPs have access to the Registry and FTRs

– Are responsible for composing transforms and actions on one or more MDFs

– May dynamically bind additional MDFs (materialized or otherwise) for use by other Apps

[Flow diagram: Request Handler → Request Processor; MDF Registry: lookup MDFs; FTRs: apply Function; MDFs: decorate with Transforms; collect]

Page 17: Spark at Bloomberg:  Dynamically Composable Analytics

Bloomberg Spark Server

[Architecture diagram: Spark Context containing the Request Handler, Request Processors (including a Declarative Query Request Processor), the Function Transform Registry (FTR) with entries such as RSI, and the MDF Registry holding MDFs; Request Processors use the MDFs]

Page 18: Spark at Bloomberg:  Dynamically Composable Analytics

Bloomberg Spark Server

[Architecture diagram: Spark Context containing the Request Handler, Request Processors (including a Declarative Query Request Processor), the Function Transform Registry (FTR) with entries such as RSI, the MDF Registry, and the Ingestion Manager; MDF1 and MDF2 are shown with steps labeled 1 and 2]

Page 19: Spark at Bloomberg:  Dynamically Composable Analytics

Schema Repository

• Enterprise-wide data pipeline

• External (to Spark) schema repository and service

• Enables MDF lookup by a dataset schema element

• Analytic expressions can now be composed over data elements

Page 20: Spark at Bloomberg:  Dynamically Composable Analytics

Execution Metadata

• Dataset Source Connection Identifiers

• Backing Stores

• Real-time Topics

• Storage Level & Refresh Rate

• Subset Predicate, etc.

Page 21: Spark at Bloomberg:  Dynamically Composable Analytics

Ad-hoc Cross-Domain Analytics

• Registration of pre-materialized DataFrames

• Collaborative analytics between application workflows

• Dynamic creation of Managed DataFrames

• Spark Servers have data pertaining to a single domain materialized

• Ad-hoc cross-domain analytics requires capability to synthesize MDFs on demand

Page 22: Spark at Bloomberg:  Dynamically Composable Analytics

Content Subsetting

• High-value data sub-setted within Spark

• Reduce cost of querying external datastore

• Specified as a filter predicate at time of registration

• E.g. Member companies of popular indices [Dow 30, S&P 500, …] have records placed within Spark

Page 23: Spark at Bloomberg:  Dynamically Composable Analytics

Content Subsetting

• Seamless unification of data in Spark (DFsubset) and backing store (DFsubset')

(DFsubset ∪ DFsubset').filter(query) = DFsubset.filter(query) ∪ DFsubset'.filter(query)

• Dataset owners provided knobs for cost vs performance.

• LRU-cache-like mechanism planned in the future

• Makes sense as a capability native to Spark DataFrames
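The identity above says that a filter can be pushed to each side of the union independently, so the in-Spark subset and the backing store can each answer the query and have their results combined. A small plain-Scala check, using ordinary collections in place of DataFrames, illustrates this. The names (`SubsetUnionSketch`, `Row`, `query`) and all row values are hypothetical.

```scala
object SubsetUnionSketch {
  final case class Row(id: String, price: Double)

  // Rows held inside Spark (DFsubset) and rows left in the backing store (DFsubset').
  val inSpark: Seq[Row] = Seq(Row("IBM", 150.0), Row("VOD", 2.5))
  val inBackingStore: Seq[Row] = Seq(Row("XYZ", 40.0), Row("ABC", 7.0))

  // Illustrative query predicate.
  def query(row: Row): Boolean = row.price > 5.0

  // Left-hand side: union first, then filter.
  def filterWhole: Seq[Row] = (inSpark ++ inBackingStore).filter(r => query(r))

  // Right-hand side: filter each side, then union.
  def filterParts: Seq[Row] =
    inSpark.filter(r => query(r)) ++ inBackingStore.filter(r => query(r))
}
```

Both sides produce the same rows in the same order here because concatenation preserves order and the predicate looks at one row at a time; the identity would not hold as stated for operations that depend on the whole dataset, such as a rank.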

Page 24: Spark at Bloomberg:  Dynamically Composable Analytics

Ingestion: Periodic Refresh

• Periodic data pull into Spark from the backing store

• Subset criteria applied during data retrieval

• Used when a dataset has a backing store, but no real-time update stream that we can tap into

• Dataset owners have control over storage level of the DataFrames created within a given MDF

Page 25: Spark at Bloomberg:  Dynamically Composable Analytics

Ingestion: Stream Reconciliation

• Analytics needs to be low-latency with respect to queries, but also with respect to data freshness

• Since data is being sub-setted within Spark, need to keep the subset up to date

• Datasets published to different Kafka topics.

• 1:1 mapping between datasets, topics and DStreams.

Page 26: Spark at Bloomberg:  Dynamically Composable Analytics

Ingestion: Stream Reconciliation

[Diagram: Backing Store and Real-Time Stream feeding the MDF "PriceHistory". Stage labels: (Avro Deserialize, Subset Predicate), (convert to DF-seq), (update state). Element labels: U1 U2 U3 … UN, S1 S2 S3 … SN, DFsubset, DFN]

Similar intent as Structured Streaming, to be introduced in Spark 2.0
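The "subset predicate" and "update state" stages can be illustrated with a plain-Scala fold over one batch of updates. This is a sketch under assumptions, not the Ingestion Manager's logic: `StreamReconcileSketch`, `Update` and `reconcile` are hypothetical names, the snapshot is modeled as an id-to-price map, and the subset predicate is modeled as set membership.

```scala
object StreamReconcileSketch {
  final case class Update(id: String, price: Double)

  // Fold one batch of already-deserialized updates into the current snapshot.
  // Updates for ids outside the registered subset are dropped; for the rest,
  // later updates in the batch overwrite earlier ones.
  def reconcile(
      snapshot: Map[String, Double],
      batch: Seq[Update],
      members: Set[String]): Map[String, Double] =
    batch
      .filter(u => members.contains(u.id))
      .foldLeft(snapshot)((state, u) => state.updated(u.id, u.price))
}
```

Because each call returns a new immutable map, every batch yields a fresh generation of the subset while queries still holding the previous generation are unaffected.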

Page 27: Spark at Bloomberg:  Dynamically Composable Analytics

Ingestion: Data Transformation

• Data in backing stores may need representation transforms before being used in queries

• Data in multiple tables denormalized into a single DF within Spark

• Or, quickly see effect of different storage representations on performance, without changing the representation in the backing store

• Implemented via user transforms associated with a given MDF

Page 28: Spark at Bloomberg:  Dynamically Composable Analytics

Spark Server: Memory Management

• An MDF contains multiple generations of DFs, being generated and destroyed

• Multiple generations operated upon by RPs at a given point in time

• Reference counting to keep track of what DFs are being used and by whom

• Long-running queries aborted for forced reclamation
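Reference counting over DataFrame generations can be sketched as below. This is a hypothetical illustration, not the Spark Server's memory manager: `RefCountSketch`, `Generations` and its methods are made-up names, generations are identified by integers, and the forced abort of long-running queries is not modeled.

```scala
object RefCountSketch {
  final class Generations {
    private val refCounts = scala.collection.mutable.Map.empty[Int, Int]
    private var current: Option[Int] = None

    // Publish a new generation; older ones stay until no reader holds them.
    def publish(generation: Int): Unit = synchronized {
      if (!refCounts.contains(generation)) refCounts.update(generation, 0)
      current = Some(generation)
      reclaim()
    }

    // A Request Processor pins the current generation for the length of a query.
    def acquire(): Option[Int] = synchronized {
      current.map { g =>
        refCounts.update(g, refCounts(g) + 1)
        g
      }
    }

    def release(generation: Int): Unit = synchronized {
      refCounts.get(generation).foreach { count =>
        refCounts.update(generation, math.max(0, count - 1))
      }
      reclaim()
    }

    def live: Set[Int] = synchronized(refCounts.keySet.toSet)

    // Drop every non-current generation that nobody references any more.
    private def reclaim(): Unit = {
      val dead = refCounts.keys
        .filter(g => refCounts(g) == 0 && !current.contains(g))
        .toList
      dead.foreach(g => refCounts.remove(g))
    }
  }
}
```

In a real server, reclaiming a generation would presumably also unpersist the underlying DataFrame; here it only removes the bookkeeping entry.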

Page 29: Spark at Bloomberg:  Dynamically Composable Analytics

Query Consistency

• Multiple queries need to operate on the same snapshot of data

• How to achieve this, if data is constantly changing underneath?

• Each DF within an MDF is associated with a time epoch

• Registry lookup with a reference time

• Time-align sub-setted DataFrames with data in backing store
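A registry lookup with a reference time can be sketched as picking the newest generation whose epoch does not exceed that time. The sketch assumes epochs are millisecond timestamps and that "newest not after the reference time" is the intended rule; `EpochLookupSketch`, `Generation` and `lookup` are hypothetical names.

```scala
object EpochLookupSketch {
  // One DataFrame generation inside an MDF, tagged with its time epoch.
  final case class Generation(epochMillis: Long, dfRef: String)

  // Newest generation whose epoch is at or before the reference time, if any.
  def lookup(generations: Seq[Generation], referenceTime: Long): Option[Generation] =
    generations
      .filter(_.epochMillis <= referenceTime)
      .sortBy(_.epochMillis)
      .lastOption
}
```

Queries that share a reference time resolve to the same generation even while newer generations are being published, which is one way to give multiple queries the same snapshot.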

Page 30: Spark at Bloomberg:  Dynamically Composable Analytics

Spark for Online Analytics

– High Availability of Spark Driver

• High bootstrap cost to reconstruct cluster and cached state

• Naïve HA models (such as multiple active clusters) surface query inconsistency

– High Availability of RDD Partitions

• With subset or universe cached, lost RDD partitions kill query performance

– Performance Consistency

• Performance gated by slowest executor

• High Availability and Low Tail Latency closely related

– Interaction effects between low-latency queries and low-latency updates

• No to minimal sandboxing between jobs sharing executor JVMs

First Bloomberg contribution: SPARK-15352

Page 31: Spark at Bloomberg:  Dynamically Composable Analytics

Spark Server Acknowledgements

Andrew Foster Joe Davey Shubham Chopra

Nimbus Goehausen Tracy Liang

Page 32: Spark at Bloomberg:  Dynamically Composable Analytics

THANK YOU