Spark Summit EU talk by Pat Patterson

Building Data Pipelines with Spark and StreamSets. Pat Patterson, Community Champion, @metadaddy

Transcript of Spark Summit EU talk by Pat Patterson

Building Data Pipelines with Spark and StreamSets

Pat Patterson, Community Champion

@metadaddy

Agenda

Data Drift

StreamSets Data Collector

Running Pipelines on Spark Today

Future Spark Integration

Demo

The Evolution of Data-in-Motion

[Diagram labels: Data Sources, Data Stores, Data Consumers. Past: ETL. Emerging: Ingest, Analyze.]

Data Drift - a Data Engineering Headache

The unpredictable, unannounced and unending mutation of data characteristics caused by the operation, maintenance and modernization of the systems that produce the data

Structure Drift

Semantic Drift

Infrastructure Drift

SQL on Hadoop (Hive) Y/Y Click Through Rate

80% of analyst time is spent preparing and validating data, while the remaining 20% is actual data analysis

Example: Data Loss and Corrosion

Delayed and False Insights

Solving Data Drift

[Diagram labels: Data Sources, Data Stores, Data Consumers, Tools, Applications, Fixed-schema, Custom code, Data Drift, Poor Data Quality]

// DIY Custom Code

Solving Data Drift

[Diagram labels: Data Sources, Data Stores, Data Consumers, Tools, Applications, Intent-Driven, Drift-Handling, Data Drift, Data KPIs, Trusted Insights]

// DIY Custom Code

StreamSets Data Collector

Open source software for the rapid development and reliable operation of complex data flows.

➢ Intent-driven
➢ UI Abstraction
➢ Extensible

Handling Drift with Hive

➢ Monitor data structure
➢ Detect schema change
➢ Alter Hive Metadata
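
The monitor, detect, alter sequence on this slide can be sketched in a few lines of Python. This is an illustrative sketch only, not how StreamSets Data Collector implements it: the function names, the `web_logs` table, the field names, and the simplified type mapping are all assumptions made for the example. It compares an incoming record against the known columns and builds a Hive `ALTER TABLE ... ADD COLUMNS` statement for any field the table does not yet have.

```python
# Hypothetical sketch: detect new fields in a record and emit Hive DDL.
# Table name, field names, and the type mapping are illustrative assumptions.

def hive_type(value):
    """Map a Python value to a Hive column type (deliberately simplified)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "BOOLEAN"
    if isinstance(value, int):
        return "BIGINT"
    if isinstance(value, float):
        return "DOUBLE"
    return "STRING"


def detect_new_columns(known_columns, record):
    """Return {name: hive_type} for record fields missing from the table."""
    return {
        name: hive_type(value)
        for name, value in record.items()
        if name not in known_columns
    }


def alter_table_statement(table, new_columns):
    """Build an ALTER TABLE statement, or None when nothing changed."""
    if not new_columns:
        return None
    cols = ", ".join(f"`{name}` {ctype}" for name, ctype in new_columns.items())
    return f"ALTER TABLE {table} ADD COLUMNS ({cols})"


# Columns the table already has, and a record that has drifted.
known = {"ip": "STRING", "url": "STRING"}
record = {"ip": "107.3.137.195", "url": "/index.html", "latency_ms": 42}

new_cols = detect_new_columns(known, record)
statement = alter_table_statement("web_logs", new_cols)
print(statement)
```

A real pipeline would also need to handle type changes and removed fields, and would run the statement against the Hive metastore rather than printing it.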

Running Pipelines on Spark Today

spark-submit --num-executors ... --archives ... --files ... --jars ... --class ...

● Container on Spark
● Leverage KafkaRDD
● Scale out for performance

SDC on Spark - Connectivity

Sources
• Kafka

Destinations
• HDFS
• HBase
• S3
• Kudu
• MapR DB
• Cassandra
• ElasticSearch
• Kafka
• MapR Streams
• Kinesis
• etc, etc, etc!

Future Directions

Run Pipelines on Databricks

● Container on Databricks
● Leverage REST API
● Add S3 origin

{
  "name": "StreamSets Data Collector",
  "new_cluster": {
    "num_workers": 1
  },
  "spark_jar_task": {
    "main_class_name": "com.streamsets..."
  }
}
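
As a rough illustration of the "Leverage REST API" bullet, the sketch below wraps the job definition from the slide in a request to the Databricks Jobs API `jobs/create` endpoint. The workspace URL and token are placeholders, the helper name is made up for this example, and the main class name is left elided exactly as it is on the slide. The request is built but not sent, and a real job will likely need further cluster settings that the slide omits.

```python
import json
import urllib.request

# Placeholder values (assumptions): use a real workspace URL and token.
DATABRICKS_HOST = "https://example.cloud.databricks.com"
API_TOKEN = "dapi-EXAMPLE"

# Job definition copied from the slide; the main class is elided there
# and is left elided here.
job_spec = {
    "name": "StreamSets Data Collector",
    "new_cluster": {"num_workers": 1},
    "spark_jar_task": {"main_class_name": "com.streamsets..."},
}


def build_create_job_request(host, token, spec):
    """Build (but do not send) a POST request to the jobs/create endpoint."""
    return urllib.request.Request(
        url=f"{host}/api/2.0/jobs/create",
        data=json.dumps(spec).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_create_job_request(DATABRICKS_HOST, API_TOKEN, job_spec)
print(request.get_method(), request.full_url)
# To submit for real: urllib.request.urlopen(request)
```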

Break Out Spark Processor

● Standalone containers, Spark processor
● Leverage Spark code
● Custom RDD
● Start local Spark job for each batch
● Example use cases: running image classification, sentiment analysis

Local Mode

Spark Processor - Connectivity

Sources
• Kafka
• S3
• MapR Streams
• JDBC
• MongoDB
• Local Filesystem
• Redis
• JMS
• HTTP
• UDP
• etc, etc, etc!

Destinations
• HDFS
• HBase
• S3
• Kudu
• MapR DB
• Cassandra
• ElasticSearch
• Kafka
• MapR Streams
• JDBC
• etc, etc, etc!

Deepen Spark Integration

● Container on Spark, Spark processor
● Leverage Spark code
● Custom RDD
● Start Spark job ‘on cluster’ for each pipeline
● Example use cases: training image classification, sentiment analysis

Spark Integration Architecture

SDC on Spark - Connectivity Tomorrow

Sources
• Kafka
• S3
• MapR Streams
• JDBC
• MongoDB
• Redis
• JMS
• HTTP
• UDP
• ...any partitionable data source...

Destinations
• HDFS
• HBase
• S3
• Kudu
• MapR DB
• Cassandra
• ElasticSearch
• Kafka
• MapR Streams
• JDBC
• etc, etc, etc!

Demo

Conclusion

StreamSets Data Collector brings a UI abstraction to Spark

Standalone container + local Spark Processor bring wide connectivity to Spark code

Spark Container + Spark Processor allow iterative Spark code in pipelines

Resources

Download StreamSets Data Collector: https://streamsets.com/opensource

Contribute Code: https://github.com/streamsets/datacollector

Get Involved: https://streamsets.com/community

Thank You!

Backup Slides

Structure Drift

Data structures and formats evolve and change unexpectedly

Implication: Data Loss

Data Squandering

Delimited Data

107.3.137.195 fe80::21b:21ff:fe83:90fa

Attribute Format Changes
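
One way to read the two addresses above is as the same field before and after a move from IPv4 to IPv6. That reading is an interpretation, since the slide does not spell it out. Under that assumption, a minimal sketch of a format check using Python's standard `ipaddress` module might look like this (the helper names are made up for the example):

```python
import ipaddress


def address_version(text):
    """Return 4 or 6 for a valid IP address, or None if it does not parse."""
    try:
        return ipaddress.ip_address(text).version
    except ValueError:
        return None


def format_drifted(expected_version, value):
    """True when the value no longer matches the expected address family."""
    return address_version(value) != expected_version


before = "107.3.137.195"
after = "fe80::21b:21ff:fe83:90fa"

print(address_version(before), address_version(after))
print(format_drifted(4, after))
```

A downstream column sized or validated for dotted-quad IPv4 strings would reject or truncate the second value, which is the kind of silent loss the slide warns about.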

{"first": "jon", "last": "smith", "email": "[email protected]", "add1": "123 Washington", "add2": "", "city": "Tucson", "state": "AZ", "zip": "85756"}

{"first": "jane", "last": "smith", "email": "[email protected]", "add1": "456 Fillmore", "add2": "Apt 120", "city": "Fairfield", "state": "VA", "zip": "24435-1001", "phone": "401-555-1212"}

Data Structure Evolution
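
The difference between the two records above can be found mechanically. The following is a minimal sketch, with made-up helper names and a trimmed-down copy of the two records (the email fields are left out), that reports added and removed fields and flags the change from a five-digit ZIP code to the ZIP+4 form:

```python
import re

# Trimmed copies of the two records from the slide (email omitted).
jon = {"first": "jon", "last": "smith", "add2": "", "zip": "85756"}
jane = {
    "first": "jane",
    "last": "smith",
    "add2": "Apt 120",
    "zip": "24435-1001",
    "phone": "401-555-1212",
}


def field_drift(baseline, record):
    """Compare a record's field names with those of a baseline record."""
    base, seen = set(baseline), set(record)
    return {"added": sorted(seen - base), "removed": sorted(base - seen)}


def is_five_digit_zip(value):
    """True only for the plain five-digit ZIP form."""
    return re.fullmatch(r"\d{5}", value) is not None


drift = field_drift(jon, jane)
print(drift)
print(is_five_digit_zip(jon["zip"]), is_five_digit_zip(jane["zip"]))
```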

Structure Drift

Semantic Drift

Data semantics change with evolving applications

Implication: Data Corrosion

Data Loss

Semantic Drift

24122-52172 00-24122-52172

Account Number Expansion
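
A consumer that joins on account number has to treat both forms as the same account. The sketch below is a hedged example of one way to do that: it assumes, based only on the pair shown on the slide, that an old-style number maps to the expanded form with a `00` prefix. The patterns and the function name are illustrative.

```python
import re

# Assumed formats, inferred from the single example on the slide.
OLD_FORMAT = re.compile(r"\d{5}-\d{5}")
NEW_FORMAT = re.compile(r"\d{2}-\d{5}-\d{5}")


def normalize_account(value, default_prefix="00"):
    """Return the expanded form of an account number, or None if unrecognized."""
    if NEW_FORMAT.fullmatch(value):
        return value
    if OLD_FORMAT.fullmatch(value):
        return f"{default_prefix}-{value}"
    return None


print(normalize_account("24122-52172"))
print(normalize_account("00-24122-52172"))
```

Returning `None` for anything unrecognized, rather than passing it through, makes a further format change visible instead of letting it corrode downstream data.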

M134: user {jsmith} read access granted {ac:24122-52172}

M134: user {jsmith} read access granted {ca.ac:24122-52172}

Namespace Qualification

……
…,3588310669797950,$91.41,jcb,K1088-W#9,…
…,6759006011936944,$155.04,switch,A6504-Y#9,…
…,6771111111151415,$37.78,laser,Q9936-T#9,…
…,3585905063294299,$164.48,jcb,S4643-H#9,…
…,5363527828638736,$117.52,mastercard,X3286-P#9,…
…,4903080150282806,$168.03,switch,I9133-W#3,…
……

Outlier/Anomaly Detection

Infrastructure Drift

Physical and Logical Infrastructure changes rapidly

Implication: Poor Agility

Operational Downtime

[Diagram labels: Data Center 1, Data Center 2, Data Center n, 3rd Party Service Provider, App a, App k, App q, Cloud Infrastructure]

Infrastructure Drift