Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summit East talk by Nirmal Sharma and Yan Zheng


Page 1:

LEARNINGS USING SPARK STREAMING & DATAFRAMES FOR WALMART SEARCH

Yan Zheng, Director, Search - Walmart Labs
Nirmal Sharma, Principal Engineer, Search - Walmart Labs

Page 2:

Search growth in the last couple of years

• Product catalog is growing exponentially
• Product updates by merchants happen almost in real time
• Price updates happen almost in real time
• Inventory updates happen almost in real time
• Data used for relevance signals increased 10x
• The number of functionalities/business use cases to support increased significantly
• The need for real-time analytics increased, so data can be analyzed quickly to make faster business decisions

Page 3:

Product catalog growth…

Page 4:

Other data growth…

[Bar chart: Data (TB) by year, 2014-2016; y-axis 0-140 TB]

Page 5:

Old Search Architecture (simplified version)

Page 6:

• The issue with the old architecture was that the index update happened only once a day, and that was holding us back in terms of user experience and business

Page 7:

New Search Architecture

Page 8:

How we did it

• Started capturing all catalog updates in Kafka, then using Spark Streaming to process these real-time events and make them available to our indexes (processing up to 10,000 events per second); a minimal sketch follows this list.

• Microservices using Spark Streaming to directly update price, inventory, and other details in the indexes (close to 8,000 events per second).

• Spark Streaming and a custom-built Elasticsearch data loader to load data directly into ES for real-time analytics (processing 20,000-25,000 events per second).

• Custom Hadoop and Spark jobs to process user data faster for all our data science signals (50-100 TB of data per batch) and to make data available sooner for analysts and business people.

• Started updating all our BI reports within a couple of hours; these updates used to take days.
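As a minimal sketch of the Kafka-to-index pattern described above (not the production code): a Spark Streaming job that consumes catalog update events with the Kafka 0.10 direct stream API. The broker address, topic name, consumer group, and the index-update step are all assumptions for illustration.

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object CatalogUpdateStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("catalog-update-stream")
    val ssc = new StreamingContext(conf, Seconds(5)) // micro-batch interval

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "kafka:9092",              // assumed broker address
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "search-index-updater",             // assumed consumer group
      "auto.offset.reset" -> "latest"
    )

    // Direct stream: each Kafka partition maps to one Spark partition.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent,
      Subscribe[String, String](Seq("catalog-updates"), kafkaParams)) // assumed topic

    stream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        // In production this step would push each event to the search index;
        // here we only log the key to keep the sketch self-contained.
        records.foreach(r => println(s"update for item ${r.key}"))
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}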

Page 9:

A deeper dive into the technologies used…

• Technology stack used for our data pipelines:
– Spark: Streaming, DataFrames
– Hadoop: Hive, MapReduce
– Cassandra: for lookups (fast reads/writes)
– Kafka: event processing
– Elasticsearch: for logging and analytics
– Solr: indexing walmart.com

Page 10:

How we used Spark DataFrames to build scalable and flexible pipelines…

Page 11:

• DATA PIPELINES:
1. Batch processing (analytics, data science)
2. Real-time processing (index updates, microservices)

Page 12:

What is required to build data pipelines

Page 13:

• So the issue is that there is no single, uniform language for building data pipelines

• There is no easy way to reuse the code, or to turn it into templates for others to reuse for similar work

Page 14:

DataFrames have the potential to become the next unified language for data engineering…
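To make that claim concrete, here is a hedged illustration (not from the talk): a transformation that would otherwise need a Hive query plus a separately packaged Java UDF, expressed entirely in the DataFrame API. The table and column names are made up.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lower, trim, udf}

object UnifiedPipelineExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("unified-example").getOrCreate()

    // An inline Scala UDF takes the place of a separately packaged Java Hive UDF.
    val tokenCount = udf((s: String) => s.trim.split("\\s+").length)

    // Extraction (Hive table), transformation (SQL-like expressions), and the
    // UDF all live in one language and one API.
    spark.table("polaris.search_queries")                     // assumed table name
      .withColumn("query_norm", trim(lower(col("raw_query")))) // assumed column
      .withColumn("token_count", tokenCount(col("query_norm")))
      .groupBy("token_count").count()
      .show()
  }
}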

Page 15:

Here are some examples to explain…

• This is the current code for K-Means clustering, using Python, Hive, and Java (for UDFs)

• https://gecgithub01.walmart.com/LabsSearch/DOD-BE/tree/master/src/main/scripts/query_categorization_daily

• The current Python code is more than 1,000 lines, covering first data preparation, then data transformation to calculate the feature vectors, then model training, and finally data post-processing to validate and store the data

Page 16:

This is the new code for the same K-Means clustering using Spark DataFrames

• The new code is hardly 60-70 lines
• The whole code is just one single file
• https://gecgithub01.walmart.com/LabsSearch/polaris-data-gen/blob/master/application/polaris-analytics/kmeans-query-tier/src/main/scala/QueryClustering.scala
• An additional advantage is that the whole code is parallelized and takes just 1/5th of the time taken by the original code (a DataFrame-based sketch of the same pattern follows)
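Since the actual QueryClustering.scala lives in an internal repository, here is a minimal sketch of the same pattern with the Spark ML DataFrame API. The input path, feature column names, k=10, and output path are assumptions, not the talk's actual values.

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object QueryClusteringSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("query-clustering").getOrCreate()

    // Hypothetical input: one row per query with pre-computed numeric signals.
    val queries = spark.read.parquet("/data/query_features") // assumed path

    // Assemble the numeric columns into the single vector column KMeans expects.
    val assembler = new VectorAssembler()
      .setInputCols(Array("clicks", "orders", "impressions")) // assumed feature names
      .setOutputCol("features")
    val features = assembler.transform(queries)

    // Train the clustering model and attach a cluster id to every query.
    val model = new KMeans().setK(10).setSeed(42L).fit(features)
    model.transform(features)
      .select("query", "prediction")
      .write.mode("overwrite").parquet("/data/query_clusters") // assumed output

    spark.stop()
  }
}

Data preparation, feature assembly, training, and post-processing all stay in one file and one API, which is the point the slide is making.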

Page 17:

Another example is our scalable Anomaly framework

• For any data pipeline, data quality is key:
– All data is correlated in one way or another, because almost all of it feeds our search
– Upfront data checks at the source are as important as final data checks at the target
– Cause-and-effect analysis
– Easy to use
– Pluggable into all kinds of data pipelines
– Easy to enhance
– Lightweight and easy to install

Page 18:

This is how we can run ETLs using templates, and process them using DataFrames…

• {"check_desc": "primary_Category_path_null_count","check_id" : 1,"check_metric" : ["count"],"backfill": [0],"fact_info" : [{"table_name" : "item_attribute_table","table_type" : [],"key_names" : [],"check_metric_field": "","database_name" : "polaris","filters" : ["primary_category_path is null"],"run_datetime" : "LATEST_DATE_MINUS_10","datetime_partition_field" : "data_date","datetime_partition_index" : 0,"other_partition_fields" : ""

}]
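A hedged sketch of how a template like this might be executed with DataFrames: deserialize the JSON into a config class, then turn the table, filters, and metric into a DataFrame query. CheckConfig and runCountCheck are illustrative names, not the framework's actual API.

import org.apache.spark.sql.SparkSession

// Assumed, simplified shape of one check template (see the JSON above).
case class CheckConfig(checkDesc: String, checkId: Int, databaseName: String,
                       tableName: String, filters: Seq[String])

object CheckRunner {
  // Runs a "count" check: how many rows match the template's filter conditions.
  def runCountCheck(spark: SparkSession, cfg: CheckConfig): Long = {
    val df = spark.table(s"${cfg.databaseName}.${cfg.tableName}")
    cfg.filters.foldLeft(df)((d, f) => d.filter(f)).count()
  }
}

For the template above, this would evaluate to the number of rows in polaris.item_attribute_table where primary_category_path is null.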

Page 19:

Code snippet…

// resultSet is assumed to be an RDD of (key, value) string pairs; metric is the
// check_metric array from the template. An empty metric passes the rows through
// unchanged; otherwise the first entry selects the aggregation applied per key.
val aggsResult = metric match {
  case s: Vector[String] if s.isEmpty => resultSet
  case _ =>
    val caseResult = metric(0) match {
      case "average" =>
        resultSet.mapValues(_.toDouble)
          .aggregateByKey(init)((a, b) => (a._1 + b, a._2 + 1), (x, y) => (x._1 + y._1, x._2 + y._2))
          .mapValues(x => x._1 / x._2) // init is the (sum, count) zero value defined elsewhere
      case "sum" =>
        resultSet.mapValues(_.toDouble).reduceByKey((a, b) => a + b)
      case "count" | "dups" =>
        resultSet.aggregateByKey(0.0)((a, b) => a + 1.0, (a, b) => a + b) // count per key
      case "max" =>
        resultSet.mapValues(_.toDouble).reduceByKey((a, b) => max(a, b))
      case "min" =>
        resultSet.mapValues(_.toDouble).reduceByKey((a, b) => min(a, b))
      case "countdistinct" =>
        resultSet.aggregateByKey(0.0)((a, b) => a + 1.0, (a, b) => a + b) // assumes values were de-duplicated upstream
    }
    caseResult.mapValues(_.toString)
}

Page 20:

Reusable templates…

• "alert_files": [ { "check_desc": "Index file size for gm should be in range", "check_id": 12, "check_metric": ["size"], "backfill": [0], "file_info": [ { "file_type": "hadoop", "file_path" : "/tmp/polaris/nsharm9/alertN ame=misc_outliers/LATEST_PAR T_DA TE_MIN US_10" } ] } ], "alert_cmds": [ { "check_desc": "file counts", "check_id": 14, "backfill": [0], "cmd_info": [ { "cmd": "ls -ls | wc -l" } ] } ], "alert_sqls": [ { "check_desc": "some counts", "check_id": 15, "backfill": [0], "query_info": [ { "query": "select count(*) as check_agg from users where partition_epoch= LATEST_PART_EPOC H " } ] } ]

Page 21:

Code snippet…

/* File processing: resolve the template's file path, fetch file-system
   details, and pick the metric requested by the check. */
val output = obj.file_info.map(x => (x.file_type, inMetric) match {
  case ("hadoop", "size") =>
    SearchFileSystem.getHDFSFileDetails(SearchFileSystem.evalFilePartition(x.file_path, fillDays)).getOrElse("length", 0.toDouble)
  case ("hadoop", "filecounts") =>
    SearchFileSystem.getHDFSFileDetails(SearchFileSystem.evalFilePartition(x.file_path, fillDays)).getOrElse("fileCount", 0.toDouble)
  case ("hadoop", "dircounts") =>
    SearchFileSystem.getHDFSFileDetails(SearchFileSystem.evalFilePartition(x.file_path, fillDays)).getOrElse("directoryCount", 0.toDouble)
  case ("local", "size") =>
    SearchFileSystem.getLocalFileDetails(SearchFileSystem.evalFilePartition(x.file_path, fillDays)).getOrElse("length", 0.toDouble)
  case ("local", "filecounts") =>
    SearchFileSystem.getLocalFileDetails(SearchFileSystem.evalFilePartition(x.file_path, fillDays)).getOrElse("fileCount", 0.toDouble)
  case ("local", "dircounts") =>
    SearchFileSystem.getLocalFileDetails(SearchFileSystem.evalFilePartition(x.file_path, fillDays)).getOrElse("directoryCount", 0.toDouble)
}).map(x => ETLOutput(inCheckId, inCheckId, x.toString))

Page 22:

Thank You.