Apache Spark

Apache Spark is an open-source data analytics cluster computing framework originally developed in the AMPLab at UC Berkeley. Spark fits into the Hadoop open-source ecosystem, building on top of the Hadoop Distributed File System (HDFS). However, Spark is not tied to the two-stage MapReduce paradigm, and promises performance up to 100 times faster than Hadoop MapReduce for certain applications. Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster's memory and query it repeatedly, making it well suited to machine learning algorithms.
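To make the "load once, query repeatedly" idea concrete, here is a minimal pure-Python sketch (not Spark's actual API; the `CachedDataset` class is hypothetical) of why keeping a dataset in memory helps iterative workloads: the source is read a single time, and every subsequent query hits the in-memory copy.

```python
# Hypothetical sketch of in-memory caching for repeated queries.
# This is NOT Spark code; it only illustrates the core idea behind
# Spark's in-memory cluster-computing primitives.

class CachedDataset:
    def __init__(self, load_fn):
        self._load_fn = load_fn   # how to (re)load the data, e.g. from HDFS
        self._cache = None        # in-memory copy once loaded
        self.loads = 0            # counts how often the slow source was read

    def _data(self):
        if self._cache is None:   # load from the source only on first use
            self.loads += 1
            self._cache = self._load_fn()
        return self._cache

    def query(self, predicate):
        return [x for x in self._data() if predicate(x)]

ds = CachedDataset(lambda: list(range(10)))
evens = ds.query(lambda x: x % 2 == 0)  # first query: loads the source
odds = ds.query(lambda x: x % 2 == 1)   # second query: served from memory
# After both queries, ds.loads is still 1.
```

An iterative machine learning algorithm would issue many such queries over the same data, which is why avoiding a disk read per pass matters so much.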

Transcript of Apache Spark

  • 1. Guide: Mrs. Juhi Singh. Submitted by: Hitesh Dua, CSE 4th Year, 05510402711

2. Sustained exponential growth, as one of the most active Apache projects.

3. Apache Spark is an open source parallel processing framework that enables users to run large-scale data analytics applications across clustered computers. Apache Spark can process data from a variety of data repositories. It supports in-memory processing to boost the performance of big data analytics applications, but it can also do conventional disk-based processing when data sets are too large to fit into the available system memory.

4. Open source. An alternative to MapReduce for certain applications. A low-latency cluster computing system for very large data sets. Offers a higher-level library for stream processing, through Spark Streaming. May be 100 times faster than MapReduce for iterative algorithms and interactive data mining.

5. Started as a research project at the UC Berkeley AMPLab in 2009, and was open sourced in early 2010. After being released, Spark grew a developer community on GitHub and entered Apache in 2013 as its permanent home. Codebase size: Spark: 20,000 LOC; Hadoop 1.0: 90,000 LOC.

6. MapReduce greatly simplified big data analysis, but as soon as it got popular, users wanted more: more complex, multi-stage applications (e.g. iterative graph algorithms and machine learning), and more interactive ad-hoc queries. Both multi-stage and interactive apps require faster data sharing across parallel jobs.

7. Spark: Programming Model. Resilient Distributed Datasets (RDDs) are the basic building block: distributed collections of objects that can be cached in memory across cluster nodes, and that are automatically rebuilt on failure. RDD operations come in two kinds. Transformations create a new dataset from an existing one, e.g. map. Actions return a value to the driver program after running a computation on the dataset, e.g. reduce.

8. Spark Stack: Spark powers a stack of high-level tools, including Spark SQL, Spark Streaming, MLlib for machine learning, and GraphX. You can combine these frameworks seamlessly in the same application.

9. Spark Streaming is a Spark component that enables processing live streams of data. Examples of data streams include log files generated by production web servers, or queues of messages containing status updates posted by users of a web service.

10. GraphX is a library added in Spark 0.9 that provides an API for manipulating graphs (e.g., a social network's friend graph) and performing graph-parallel computations. It allows us to create a directed graph with arbitrary properties attached to each vertex and edge. GraphX also provides a set of operators for manipulating graphs and a library of common graph algorithms (e.g., PageRank and triangle counting).

11. MLlib provides multiple types of machine learning algorithms, including binary classification, regression, clustering, and collaborative filtering. It supports functionality such as model evaluation and data import, and is designed to scale out across a cluster. MLlib contains high-quality algorithms that leverage iteration, and can yield better results than the one-pass approximations sometimes used on MapReduce.

12. Spark SQL provides support for interacting with Spark via SQL, as well as the Apache Hive variant of SQL called the Hive Query Language (HiveQL). Spark SQL represents database tables as Spark RDDs and translates SQL queries into Spark operations, letting you query structured data as a distributed dataset (RDD) in Spark. Spark SQL includes a server mode with industry-standard JDBC and ODBC connectivity.
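The RDD programming model from slide 7 — lazy transformations that only record lineage, eager actions that trigger computation, and rebuild-on-failure by replaying lineage — can be sketched in a few lines of plain Python. This `MiniRDD` class is hypothetical and single-machine; it is not Spark's API, only an illustration of the model.

```python
# Hypothetical pure-Python sketch of the RDD model (NOT the Spark API):
# transformations are lazy and only record how to compute the data (lineage);
# actions run the whole lineage; a lost partition could be rebuilt the same way.

class MiniRDD:
    def __init__(self, compute):
        self._compute = compute  # lineage: a function that (re)produces the data

    def map(self, f):            # transformation: returns a new RDD, no work yet
        return MiniRDD(lambda: [f(x) for x in self._compute()])

    def filter(self, p):         # transformation: also lazy
        return MiniRDD(lambda: [x for x in self._compute() if p(x)])

    def reduce(self, f):         # action: forces evaluation of the lineage now
        data = self._compute()
        acc = data[0]
        for x in data[1:]:
            acc = f(acc, x)
        return acc

rdd = MiniRDD(lambda: list(range(1, 6)))       # "loads" 1..5
squares = rdd.map(lambda x: x * x)             # nothing computed yet
total = squares.reduce(lambda a, b: a + b)     # action: 1+4+9+16+25 = 55
even_sq = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * x) \
             .reduce(lambda a, b: a + b)       # 4 + 16 = 20
```

Because each `MiniRDD` carries only a recipe for its data, re-running `_compute` after a failure reproduces the dataset — the same property that lets real RDDs be rebuilt automatically on node failure.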
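Slide 12's claim that Spark SQL "translates SQL queries into Spark operations" can be illustrated with a toy example (plain Python, not Spark SQL; the table and query are made up): a `WHERE` clause becomes a filter step and a `SELECT` of columns becomes a map step over the rows of the table.

```python
# Hypothetical illustration (NOT Spark SQL): how a SQL query over a table
# maps onto filter/map operations over a collection of rows, which is the
# kind of translation Spark SQL performs onto RDD operations.

rows = [
    {"name": "a.log", "size": 120},
    {"name": "b.log", "size": 30},
    {"name": "c.log", "size": 900},
]

# SELECT name FROM files WHERE size > 100
# WHERE size > 100  -> filter the rows
# SELECT name       -> map each surviving row to its "name" column
result = [r["name"] for r in rows if r["size"] > 100]
# result is ["a.log", "c.log"]
```

In real Spark SQL the translated operations run distributed across the cluster, and an optimizer chooses the physical plan; the toy version only shows the shape of the translation.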