Analytics on Spark & Shark @ Yahoo


Analytics on Spark & Shark @ Yahoo. Presented by Tim Tully, December 3, 2013. Overview: Legacy / Current Hadoop Architecture; Reflection / Pain Points; Why the movement towards Spark / Shark; New Hybrid Environment; Future Spark/Shark/Hadoop Stack; Conclusion.

Transcript of Analytics on Spark & Shark @ Yahoo


Analytics on Spark & Shark @ Yahoo
Presented by Tim Tully

December 3, 2013

Overview
  • Legacy / Current Hadoop Architecture
  • Reflection / Pain Points
  • Why the movement towards Spark / Shark
  • New Hybrid Environment
  • Future Spark/Shark/Hadoop Stack
  • Conclusion

Some Fun: Old-School Data Processing (1999-2007)
[Diagram labels: NFS mounts, MetaData, Perl Launcher, SSH, four C++ Workers; example paths /d1/data/20030901/partition/b1.gz and /d4/data/20030901/partition/b16.gz]

Current Analytics Architecture
  • Custom log collection infrastructure depositing onto NFS-based storage
  • Logs moved onto Hadoop HDFS
  • Multiple Hadoop instances
  • Pig/MR ETL processing, massive joins, load into warehouse
  • Aggregations / report generation in Pig, MapReduce, Hive
  • Reports loaded into RDBMS
  • UI / web services on top
  • Realtime stream processing: Storm on YARN
  • Persistence: HBase, HDFS/HCat, RDBMSs

Current High-Level Analytics Dataflow
[Diagram labels: Data Movement & Collection; Staging/Distribution; YARN; BI/OLAP; Adhoc | ML; RDBMS / NoSQL; Batch Processing / Data Pipelines; Real-time Stream Processing; Realtime Apps; ETL / HDFS; Mobile Apps; Colos; Ad Servers; Pixel Servers; Web Pages; Native MR; Pig / Hive; Stream Processing / Queues]

Legacy Architecture Pain Points
  • Massive data volumes per day (many, many TB)
  • Pure Hadoop stack throughout Data Wrangling
  • Report arrival latency quite high
  • Hours to perform joins, aggregate data
  • Culprit: raw data processing through MapReduce is just too slow
  • Many stages in pipeline chained together
  • Massive joins throughout ETL layer
  • Lack of interactive SQL
  • Expressibility of business logic in Hadoop MR is challenging
  • New reports and dimensions require engineering throughout the stack

Aggregate Pre-computation Problems
  • Problem: pre-computation of reports
  • How is time spent per user distributed across desktop and mobile for Y! Mail?
  • Extremely high cardinality dimensions, i.e., search query term
  • Count distincts
  • Problem: sheer number of reports along various dimensions
  • Report changes required in aggregate, persistence and UI layer
  • Potentially takes weeks to months
  • Business cannot wait

Problem Summary
  • Overwhelming need to make data more interactive
  • Shorten time to data access and report publication
  • Ad-hoc queries need to be much faster than Hive or pure Hadoop MR
  • Concept of Data Workbench: business-specific views into data
  • Expressibility of complicated business logic in Hadoop becoming a problem
  • Various verticals within Yahoo want to interpret metrics differently
  • Need interactive SQL querying
  • No way to perform data discovery (adhoc analysis/exploration)
  • Must always tweak MR Java code or SQL query and rerun big MR job
  • Cultural shift to BI tools on desktop with low latency query performance
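Why count distincts resist pre-computation can be shown with a minimal sketch in plain Python (not Spark or Shark code; the event rows and function names are hypothetical). Per-platform distinct-user counts cannot simply be summed into an overall count, because a user active on both desktop and mobile would be counted twice. A pre-aggregated report would have to keep the user sets, or some mergeable summary of them, rather than the bare counts.

```python
from collections import defaultdict

# Hypothetical fact rows: (user_id, platform, seconds_spent)
EVENTS = [
    ("u1", "desktop", 120),
    ("u1", "mobile", 30),
    ("u2", "desktop", 45),
    ("u3", "mobile", 300),
    ("u3", "mobile", 60),
]


def distinct_users_by_platform(events):
    """Return {platform: set of user ids}; sets stay mergeable, counts do not."""
    users = defaultdict(set)
    for user_id, platform, _seconds in events:
        users[platform].add(user_id)
    return dict(users)


def time_spent_per_user(events):
    """Return {(user_id, platform): total seconds spent}."""
    totals = defaultdict(int)
    for user_id, platform, seconds in events:
        totals[(user_id, platform)] += seconds
    return dict(totals)


by_platform = distinct_users_by_platform(EVENTS)

# Summing per-platform counts double-counts u1, who appears on both platforms.
naive_total = sum(len(users) for users in by_platform.values())

# The correct overall count needs the union of the underlying sets.
true_total = len(set().union(*by_platform.values()))
```

Here `naive_total` is 4 while `true_total` is 3, which is the gap a pre-computed report would silently carry for every extra dimension it rolls up.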

Where do we go from here?
  • How do we solve this problem within the Hadoop ecosystem?
  • Pig on Tez?
  • Hive on Tez?
  • No clear path yet to making native MR/Pig significantly faster
  • Balance pre-aggregated reporting with high demand for interactive SQL access against fact data via desktop BI tools
  • How do we provide data-savvy users direct SQL-query access to fact data?

Modern Architecture: Hadoop + Spark
  • Bet on YARN: Hadoop and Spark can coexist
  • Still using Hadoop MapReduce for ETL
  • Loading data onto HDFS / HCat / Hive warehouse
  • Serving MR queries on large Hadoop cluster
  • Spark-on-YARN side-by-side with Hadoop on same HDFS
  • Optimization: copy data to remote Shark/Spark clusters for predictable SLAs
  • While waiting for Shark on Spark on YARN (hopefully early 2014)

Analytics Stack of the Future
[Diagram labels: Data Movement & Collection; Staging/Distribution; YARN; BI/OLAP; Adhoc; RDBMS / NoSQL; Batch Processing / Data Pipelines; Real-time Stream Processing; Realtime Apps / Querying; ETL / HDFS; Mobile Apps; Colos; Ad Servers; Pixel Servers; Web Pages; Spark/MR; Hive; Stream Processing / Queues; Spark; Shark; View 1, View 2, View n]

Why Spark?
  • Cultural shift towards data-savvy developers in Yahoo
  • Recently, the barrier to entry for big data has been lowered
  • Solves the need for interactive data processing at REPL and SQL levels
  • In-memory data persistence is the obvious next step given the continually decreasing cost of RAM and SSDs
  • Collections API with high familiarity for Scala devs
  • Developers not restricted by rigid Hadoop MapReduce paradigm
  • Community support accelerating, reaching steady state
  • More than 90 developers, 25 companies
  • Awesome storage solution in HDFS, yet processing layer / data manipulation still sub-optimal
  • Hadoop not really built for joins
  • Many problems not Pig / Hive expressible
  • Slow
  • Seamless integration into existing Hadoop architecture

Why Spark? (Continued)
  • Up to 100x faster than Hadoop MapReduce
  • Typically less code (2-5x)
  • Seamless Hadoop/HDFS integration
  • RDDs, iterative processing, REPL, data lineage
  • Accessible source in terms of LOC and modularity
  • BDAS ecosystem: Spark, Spark Streaming, Shark, BlinkDB, MLlib
  • Deep integration into Hadoop ecosystem
  • Read/write Hadoop formats
  • Interop with other ecosystem components
  • Runs on Mesos & YARN
  • EC2, EMR
  • HDFS, S3

Spark BI/Analytics Use Cases
  • Obvious and logical next-generation ETL platform
  • Unwind chained MapReduce job architecture
  • ETL typically a series of MapReduce jobs with HDFS output between stages
  • Move to more fluid data pipeline
  • Java ecosystem means common ETL libraries between realtime and batch ETL
  • Faster execution
  • Lower data publication latency
  • Faster reprocessing times when anomalies discovered
  • Spark Streaming may be next-generation realtime ETL
  • Data Discovery / Interactive Analysis

Spark Hardware
  • 9.2TB addressable cluster
  • 96GB and 192GB RAM machines
  • 112 machines
  • SATA 1x500GB 7.2k
  • Dual hexa core Sandy Bridge
  • Looking at SSD exclusive clusters
  • 400GB SSD 1x400GB SATA 300MB/s

Why Shark?
  • First identified Shark at Hadoop Summit 2012
  • After seeing Spark at Hadoop Summit 2011
  • Common HiveQL provides seamless federation between Hive and Shark
  • Sits on top of existing Hive warehouse data
  • Multiple access vectors pointing at single warehouse
  • Direct query access against fact data from UI
  • Direct (O/J)DBC from desktop BI tools
  • Built on shared common processing platform

Added a feature to Shark that prunes columns not selected in a query, so the iterators for those columns never have to advance. This gives a substantial performance improvement when the total number of columns is much larger than the number selected (typical in OLAP).
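The effect of column pruning can be illustrated with a toy column store in plain Python. This is only a sketch of the idea, not Shark's implementation: the `ColumnarTable` class, its `values_read` counter and the sample columns are all hypothetical. Each column has its own iterator, and a pruned scan only advances the iterators for the columns the query selects.

```python
class ColumnarTable:
    """Toy column store (hypothetical): one Python list per column."""

    def __init__(self, columns):
        self.columns = columns      # {column name: [values...]}
        self.values_read = 0        # counts iterator advancements

    def _counting_iter(self, name):
        # Wrap a column so every value pulled from it is counted.
        for value in self.columns[name]:
            self.values_read += 1
            yield value

    def scan(self, selected, prune=True):
        """Yield row tuples holding only the selected columns."""
        to_read = selected if prune else list(self.columns)
        iters = {name: self._counting_iter(name) for name in to_read}
        n_rows = len(next(iter(self.columns.values())))
        for _ in range(n_rows):
            row = {name: next(it) for name, it in iters.items()}
            yield tuple(row[name] for name in selected)


table = ColumnarTable({
    "campaign": ["a", "b", "a"],
    "impressions": [10, 20, 30],
    "clicks": [1, 2, 3],
    "region": ["us", "eu", "us"],
})

pruned_rows = list(table.scan(["campaign", "clicks"]))
pruned_reads = table.values_read

table.values_read = 0
full_rows = list(table.scan(["campaign", "clicks"], prune=False))
full_reads = table.values_read
```

Both scans return the same rows, but the pruned scan advances 6 values (2 columns x 3 rows) against 12 for the unpruned one. The saving grows with the ratio of total columns to selected columns, which matches the OLAP case described above.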

Yahoo! Shark Deployments / Use Cases
  • Advertising / Analytics Data Warehouse
  • Campaign Reporting
  • Pivots, time series, multi-timezone reporting
  • Segment Reporting
  • Unique users across targeted segments
  • Ad impression availability for given segment
  • Overlap analysis: fact to fact overlap
  • Other Time Series Analysis
  • OLAP
  • Tableau on top of Shark
  • Custom in-house cubing and reporting systems
  • Dashboards
  • Adhoc analysis and data discovery

Yahoo! Contributions
  • Began work in 2012 on making Shark more usable for interactive analytics/warehouse scenarios
  • Shark Server for JDBC/ODBC access against Tableau
  • Multi-tenant connectivity
  • Threadsafe access
  • Map Split Pruning
  • Use statistics to prune partitions so jobs don't launch for splits without data
  • Bloom filter-based pruner for high cardinality columns
  • Column pruning: faster OLAP query performance
  • Map-side joins
  • Cached-table Columnar Compression (3-20x)
  • Query cancellation
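The map split pruning items above can be sketched in plain Python. This is an illustration under stated assumptions, not Shark's code: the `BloomFilter` and `Split` classes, the `splits_to_scan` helper, the filter sizes and the sample rows are all hypothetical. Each split keeps min/max statistics for a low-cardinality column (the day) and a Bloom filter for a high-cardinality column (the query term). A split is scanned only when neither statistic can rule it out.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


class Split:
    """A chunk of rows plus the statistics collected when it was loaded."""

    def __init__(self, name, rows):
        self.name = name
        self.rows = rows  # list of (day, query_term)
        days = [day for day, _term in rows]
        self.min_day, self.max_day = min(days), max(days)
        self.terms = BloomFilter()
        for _day, term in rows:
            self.terms.add(term)


def splits_to_scan(splits, day, term):
    """Keep only the splits whose statistics cannot rule out a match."""
    return [
        s for s in splits
        if s.min_day <= day <= s.max_day and s.terms.might_contain(term)
    ]


SPLITS = [
    Split("s1", [(20131201, "spark"), (20131201, "hadoop")]),
    Split("s2", [(20131202, "shark"), (20131202, "yarn")]),
    Split("s3", [(20131203, "spark"), (20131203, "hive")]),
]

kept = [s.name for s in splits_to_scan(SPLITS, 20131203, "spark")]
```

For this query only `s3` survives, so no work is launched for the other two splits. Because a Bloom filter can return false positives, a surviving split may still turn out to hold no matching rows; the pruner only guarantees that it never drops a split that does.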

Physical Architecture
  • Spark / Hadoop MR side-by-side on YARN
  • Satellite clusters running Shark
  • Predictable SLAs
  • Greedy pinning of RDDs to RAM
  • Addresses scheduling challenges
  • Long-term
  • Shark on Spark-on-YARN
  • Goal: early 2014
[Diagram labels: Historical DW (HDFS); Hadoop MR (Pig, Hive, MR); YARN; Spark; Large Hadoop Cluster; Satellite Shark Cluster (x2)]

Future Architecture
  • Prototype migration of ETL infrastructure to pure Spark jobs
  • Break up chained MapReduce pattern into single discrete Spark job
  • Port legacy Pig/MR ETL jobs to Spark (TBs / day)
  • Faster processing times (goal of 10x)
  • Less code, better maintainability, all in Scala/Spark
  • Leverage RDDs for more efficient joins
  • Prototype Shark on Spark on YARN on Hadoop cluster
  • Direct data access over JDBC/ODBC via desktop
  • Execute both Shark and Spark queries on YARN
  • Still employ satellite cluster model for predictable SLAs in low-latency situations
  • Use YARN as the foundation for cluster resource management

Conclusions
  • Barrier to entry for big data analytics reduced, Spark at the forefront
  • Yahoo! now using Spark/Shark for analytics on top of Hadoop ecosystem
  • Looking to move ETL jobs to Spark
  • Satellite cluster pattern quite beneficial for large datasets in RAM and predictable SLAs
  • Clear and obvious speedup compared to Hadoop
  • More flexible processing platform provides powerful base for analytics for the future
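The Future Architecture item about breaking up the chained MapReduce pattern can be pictured with a small plain-Python sketch. It does not use the Spark or Hadoop APIs; the record layout, the helper names and the use of temporary files as a stand-in for HDFS output between stages are all assumptions made for illustration. The chained version materializes every stage before the next one starts, while the single-pipeline version streams each record through all stages in one pass.

```python
import json
import os
import tempfile


def parse(line):
    """Parse a hypothetical 'user,page,milliseconds' log line."""
    user, page, ms = line.split(",")
    return {"user": user, "page": page, "ms": int(ms)}


def is_valid(rec):
    return rec["ms"] > 0


def chained_jobs(lines, workdir):
    """Each stage writes its full output to disk before the next one starts."""
    stage1 = os.path.join(workdir, "stage1.jsonl")
    with open(stage1, "w") as out:
        for line in lines:
            out.write(json.dumps(parse(line)) + "\n")

    stage2 = os.path.join(workdir, "stage2.jsonl")
    with open(stage1) as src, open(stage2, "w") as out:
        for raw in src:
            rec = json.loads(raw)
            if is_valid(rec):
                out.write(json.dumps(rec) + "\n")

    totals = {}
    with open(stage2) as src:
        for raw in src:
            rec = json.loads(raw)
            totals[rec["user"]] = totals.get(rec["user"], 0) + rec["ms"]
    return totals


def single_pipeline(lines):
    """Same logic as one pass over lazy stages, with no intermediate files."""
    parsed = (parse(line) for line in lines)
    valid = (rec for rec in parsed if is_valid(rec))
    totals = {}
    for rec in valid:
        totals[rec["user"]] = totals.get(rec["user"], 0) + rec["ms"]
    return totals


LINES = ["u1,home,120", "u2,mail,0", "u1,mail,80", "u3,home,40"]

with tempfile.TemporaryDirectory() as tmp:
    chained = chained_jobs(LINES, tmp)
fused = single_pipeline(LINES)
```

Both functions return the same per-user totals. The difference is that the chained version pays for two extra write-and-reread cycles, which is the cost the deck attributes to ETL built as a series of MapReduce jobs with HDFS output between stages.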