The Evolution of Data-in-Motion
Past: ETL, ETL
Emerging: Ingest, Analyze
Data Sources, Data Stores, Data Consumers
Data Drift - a Data Engineering Headache
The unpredictable, unannounced and unending mutation of data characteristics caused by the operation, maintenance and modernization of the systems that produce the data
Structure Drift
Semantic Drift
Infrastructure Drift
SQL on Hadoop (Hive): Y/Y Click Through Rate
80% of analyst time is spent preparing and validating data, while the remaining 20% is actual data analysis
Example: Data Loss and Corrosion
Delayed and False Insights
Solving Data Drift
Tools
Applications
Data Stores, Data Consumers
Poor Data Quality, Data Drift, Custom Code
Fixed-schema
Data Sources
// DIY Custom Code
Solving Data Drift
Trusted Insights, Data KPIs, Tools
Applications, Data Drift
Intent-Driven
Drift-Handling
Data Stores, Data Consumers, Data Sources
// DIY Custom Code
StreamSets Data Collector
Open source software for the rapid development and reliable operation of complex data flows.
➢ Intent-driven
➢ UI Abstraction
➢ Extensible
Running Pipelines on Spark Today
spark-submit --num-executors ... --archives ... --files ... --jars ... --class ...
● Container on Spark
● Leverage KafkaRDD
● Scale out for performance
SDC on Spark - Connectivity
Sources
• Kafka
Destinations
• HDFS
• HBase
• S3
• Kudu
• MapR DB
• Cassandra
• ElasticSearch
• Kafka
• MapR Streams
• Kinesis
• etc., etc., etc.!
Run Pipelines on Databricks
● Container on Databricks
● Leverage REST API
● Add S3 origin
{
  "name": "StreamSets Data Collector",
  "new_cluster": {
    "num_workers": 1
  },
  "spark_jar_task": {
    "main_class_name": "com.streamsets..."
  }
}
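A job definition like the one above could be submitted through the Databricks Jobs REST API. The following is only a sketch, assuming the Jobs API 2.0 `jobs/create` endpoint. The workspace URL, the token, and the `build_job_request` helper are hypothetical, and the main class name stays truncated exactly as on the slide. A real `new_cluster` block would also need fields such as a Spark version and node type, which the slide does not show.

```python
import json
import urllib.request


def build_job_request(workspace_url, token, main_class_name, num_workers=1):
    """Build (but do not send) an HTTP request that creates a Databricks job."""
    payload = {
        "name": "StreamSets Data Collector",
        "new_cluster": {"num_workers": num_workers},
        "spark_jar_task": {"main_class_name": main_class_name},
    }
    return urllib.request.Request(
        url=workspace_url.rstrip("/") + "/api/2.0/jobs/create",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": "Bearer " + token,
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_job_request(
    "https://example.cloud.databricks.com",  # hypothetical workspace URL
    "dapi-placeholder-token",                # hypothetical access token
    "com.streamsets...",                     # truncated on the slide; supply the real class
)
# Passing `request` to urllib.request.urlopen would submit it; omitted here.
```

Building the request separately from sending it keeps the payload easy to inspect before anything reaches the cluster.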
Break Out Spark Processor
● Standalone containers, Spark processor
● Leverage Spark code
● Custom RDD
● Start local Spark job for each batch
● Example use cases: running image classification, sentiment analysis
Local Mode
Spark Processor - Connectivity
Sources
• Kafka
• S3
• MapR Streams
• JDBC
• MongoDB
• Local Filesystem
• Redis
• JMS
• HTTP
• UDP
• etc., etc., etc.!
Destinations
• HDFS
• HBase
• S3
• Kudu
• MapR DB
• Cassandra
• ElasticSearch
• Kafka
• MapR Streams
• JDBC
• etc., etc., etc.!
Deepen Spark Integration
● Container on Spark, Spark processor
● Leverage Spark code
● Custom RDD
● Start Spark job ‘on cluster’ for each pipeline
● Example use cases: training image classification, sentiment analysis
SDC on Spark - Connectivity Tomorrow
Sources
• Kafka
• S3
• MapR Streams
• JDBC
• MongoDB
• Redis
• JMS
• HTTP
• UDP
• ... any partitionable data source ...
Destinations
• HDFS
• HBase
• S3
• Kudu
• MapR DB
• Cassandra
• ElasticSearch
• Kafka
• MapR Streams
• JDBC
• etc., etc., etc.!
Conclusion
StreamSets Data Collector brings a UI abstraction to Spark
Standalone container + local Spark Processor bring wide connectivity to Spark code
Spark Container + Spark Processor allow iterative Spark code in pipelines
Resources
Download StreamSets Data Collector: https://streamsets.com/opensource
Contribute Code: https://github.com/streamsets/datacollector
Get Involved: https://streamsets.com/community
Structure Drift
Data structures and formats evolve and change unexpectedly
Implication: Data Loss
Data Squandering
Delimited Data
107.3.137.195 fe80::21b:21ff:fe83:90fa
Attribute Format Changes
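The two addresses above are the same attribute arriving in IPv4 and then IPv6 form. One way to absorb that kind of format change is to parse the value instead of pattern-matching it. This is a small sketch using Python's standard `ipaddress` module; the `parse_client_ip` helper name is made up for illustration.

```python
import ipaddress


def parse_client_ip(raw):
    """Return a (version, normalized address) pair for IPv4 or IPv6 input."""
    address = ipaddress.ip_address(raw.strip())
    return address.version, address.compressed


for value in ("107.3.137.195", "fe80::21b:21ff:fe83:90fa"):
    print(parse_client_ip(value))
```

Downstream code can branch on the version number rather than assuming a dotted-quad string, so a column sized or validated for IPv4 no longer silently drops IPv6 values.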
{"first": "jon", "last": "smith", "email": "[email protected]", "add1": "123 Washington", "add2": "", "city": "Tucson", "state": "AZ", "zip": "85756"}
{"first": "jane", "last": "smith", "email": "[email protected]", "add1": "456 Fillmore", "add2": "Apt 120", "city": "Fairfield", "state": "VA", "zip": "24435-1001", "phone": "401-555-1212"}
Data Structure Evolution
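The second record above gains a "phone" field and switches to a ZIP+4 code. The sketch below shows one simple way such changes might be surfaced, by comparing each record's fields against a baseline and classifying the ZIP format. The `diff_structure` and `zip_format` helpers are hypothetical, and the records are trimmed to a few fields for brevity.

```python
import re


def diff_structure(baseline, record):
    """Report fields added to or missing from a record relative to a baseline."""
    baseline_fields, record_fields = set(baseline), set(record)
    return {
        "added": sorted(record_fields - baseline_fields),
        "missing": sorted(baseline_fields - record_fields),
    }


def zip_format(value):
    """Classify a US ZIP code as 'zip5', 'zip+4', or 'unknown'."""
    if re.fullmatch(r"\d{5}", value):
        return "zip5"
    if re.fullmatch(r"\d{5}-\d{4}", value):
        return "zip+4"
    return "unknown"


old = {"first": "jon", "last": "smith", "zip": "85756"}
new = {"first": "jane", "last": "smith", "zip": "24435-1001", "phone": "401-555-1212"}

print(diff_structure(old, new))
print(zip_format(old["zip"]), zip_format(new["zip"]))
```

A fixed-schema loader would typically discard the new field or reject the record; reporting the difference makes the drift visible instead.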
Structure Drift
Semantic Drift
Data semantics change with evolving applications
Implication: Data Corrosion
Data Loss
Semantic Drift
24122-52172 00-24122-52172
Account Number Expansion
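When an account number grows a prefix, joins and lookups keyed on the old form quietly stop matching. The sketch below normalizes both forms to the expanded one. It assumes the legacy form is two five-digit groups and that legacy accounts map to prefix "00", as the side-by-side example suggests; the `normalize_account` helper is hypothetical.

```python
import re

# Assumed formats: legacy "NNNNN-NNNNN" and expanded "PP-NNNNN-NNNNN".
ACCOUNT_PATTERN = re.compile(r"^(?:(\d{2})-)?(\d{5}-\d{5})$")


def normalize_account(raw, default_prefix="00"):
    """Return the account number in expanded form, adding a prefix if absent."""
    match = ACCOUNT_PATTERN.match(raw.strip())
    if match is None:
        raise ValueError(f"unrecognized account number: {raw!r}")
    prefix, body = match.groups()
    return f"{prefix or default_prefix}-{body}"


print(normalize_account("24122-52172"))
print(normalize_account("00-24122-52172"))
```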
M134: user {jsmith} read access granted {ac:24122-52172}
M134: user {jsmith} read access granted {ca.ac:24122-52172}
Namespace Qualification
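A parser written against the first log line would miss the account once the key becomes namespace-qualified. As an illustration only, the pattern below treats the namespace as optional so both lines yield the same account. The field names and the `parse_access` helper are assumptions, not part of any real log specification.

```python
import re

# Matches "{ac:...}" as well as "{ca.ac:...}"; the namespace prefix is optional.
ACCESS_PATTERN = re.compile(
    r"user\s*\{(?P<user>[^}]+)\}.*?\{(?:(?P<namespace>\w+)\.)?ac:(?P<account>[\d-]+)\}"
)


def parse_access(line):
    """Extract user, optional namespace, and account from an access log line."""
    match = ACCESS_PATTERN.search(line)
    return match.groupdict() if match else None


print(parse_access("M134: user {jsmith} read access granted {ac:24122-52172}"))
print(parse_access("M134: user {jsmith} read access granted {ca.ac:24122-52172}"))
```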
…,3588310669797950,$91.41,jcb,K1088-W#9,…
…,6759006011936944,$155.04,switch,A6504-Y#9,…
…,6771111111151415,$37.78,laser,Q9936-T#9,…
…,3585905063294299,$164.48,jcb,S4643-H#9,…
…,5363527828638736,$117.52,mastercard,X3286-P#9,…
…,4903080150282806,$168.03,switch,I9133-W#3,…
Outlier / Anomaly Detection
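In the sample rows above, every code ends in "#9" except the last, which ends in "#3". A very simple frequency check can flag that kind of value. This is a stand-in for illustration, not a description of how any particular product detects anomalies; the `rare_values` helper and the 20% threshold are assumptions.

```python
from collections import Counter


def rare_values(values, max_share=0.2):
    """Return values whose share of all observations is at or below max_share."""
    counts = Counter(values)
    total = len(values)
    return sorted(v for v, n in counts.items() if n / total <= max_share)


codes = ["K1088-W#9", "A6504-Y#9", "Q9936-T#9", "S4643-H#9", "X3286-P#9", "I9133-W#3"]
suffixes = [code.rsplit("#", 1)[1] for code in codes]

print(rare_values(suffixes))
```

A rare value is not necessarily wrong; it may be the first sign that an upstream system has started emitting a new code, which is exactly the situation semantic drift describes.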