Spark Concepts - Spark SQL, Graphx, Streaming


Apache Spark
Concepts - Spark SQL, GraphX, Streaming
Petr Zapletal
Cake Solutions

Apache Spark and Big Data
  • History and market overview
  • Installation
  • MLlib and Machine Learning on Spark
  • Porting R code to Scala and Spark
  • Concepts - Spark SQL, GraphX, Streaming
  • Spark's distributed programming model
  • Deployment

Table of contents
  • Resilient Distributed Datasets
  • Spark SQL
  • GraphX
  • Spark Streaming
  • Q & A

Spark Modules

Resilient Distributed Datasets
  • Immutable, distributed collection of records
  • Lazy evaluation, caching option, can be persisted
  • Number of operations & transformations
  • Can be created from data storage or a different RDD
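The lazy-evaluation and caching behaviour listed above can be illustrated with a small plain-Python model. This is only a sketch: `LazyDataset` and `traced_square` are hypothetical names invented for the illustration, nothing here is the Spark API, and the model ignores distribution and fault tolerance entirely.

```python
from typing import Callable, Iterable, Iterator, List, Optional


class LazyDataset:
    """Toy stand-in for an RDD (hypothetical class, not part of Spark)."""

    def __init__(self, compute: Callable[[], Iterable]):
        self._compute = compute              # recipe that produces the records on demand
        self._use_cache = False
        self._cached: Optional[List] = None

    @classmethod
    def from_collection(cls, records: Iterable) -> "LazyDataset":
        frozen = tuple(records)              # copy the input so the dataset stays immutable
        return cls(lambda: frozen)

    # Transformations only describe work; they return a new dataset and run nothing.
    def map(self, f: Callable) -> "LazyDataset":
        return LazyDataset(lambda: (f(x) for x in self._records()))

    def filter(self, keep: Callable) -> "LazyDataset":
        return LazyDataset(lambda: (x for x in self._records() if keep(x)))

    def cache(self) -> "LazyDataset":
        self._use_cache = True               # keep results after the first computation
        return self

    def _records(self) -> Iterator:
        if not self._use_cache:
            return iter(self._compute())
        if self._cached is None:
            self._cached = list(self._compute())
        return iter(self._cached)

    # Actions trigger the actual computation.
    def collect(self) -> List:
        return list(self._records())

    def count(self) -> int:
        return sum(1 for _ in self._records())


evaluated = []                               # records every call to the mapped function


def traced_square(x: int) -> int:
    evaluated.append(x)
    return x * x


squares = LazyDataset.from_collection(range(1, 6)).map(traced_square).cache()
even_squares = squares.filter(lambda x: x % 2 == 0)
print(len(evaluated))                        # transformations alone evaluate nothing
print(even_squares.collect())                # the first action computes and caches the squares
print(squares.count(), len(evaluated))       # the second action is served from the cache
```

Building `squares` and `even_squares` calls `traced_square` zero times; only `collect()` and `count()` run the pipeline, and the cached dataset is computed once.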

Spark SQL
  • Spark's interface to work with structured or semi-structured data
  • Structured data
      • known set of fields for each record - schema
  • Main capabilities
      • load data from a variety of structured sources
      • query the data with SQL
      • integration between Spark (Java, Scala and Python API) and SQL (joining RDDs and SQL tables, using SQL functionality)
  • More than SQL
      • Unified interface for structured data

SchemaRDD
  • RDD of row objects, each representing a record
  • Known schema (i.e. data fields) of its rows
  • Behaves like a regular RDD, stored in a more efficient manner
  • Adds new operations, especially running SQL queries
  • Can be created from
      • external data sources
      • results of queries
      • regular RDD
  • Used in ML Pipeline API
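The idea of records with a known schema that can be queried with SQL can be sketched without Spark at all. The example below swaps in Python's built-in `sqlite3` module as a stand-in query engine, so it is not Spark SQL; the sample records, the `people` table and the helper names `load_json_lines` and `register_table` are all assumptions made up for the illustration.

```python
import json
import sqlite3

# Hypothetical input: one JSON record per line, all sharing the same fields.
JSON_LINES = """\
{"name": "alice", "age": 34, "city": "London"}
{"name": "bob", "age": 19, "city": "Prague"}
{"name": "carol", "age": 27, "city": "London"}
"""


def load_json_lines(text):
    """Parse JSON lines into dict rows and infer the schema (sorted field names)."""
    rows = [json.loads(line) for line in text.splitlines() if line.strip()]
    schema = sorted({field for row in rows for field in row})
    return schema, rows


def register_table(conn, table, schema, rows):
    """Store the rows under a table name so that SQL queries can refer to them."""
    columns = ", ".join(f'"{c}"' for c in schema)
    placeholders = ", ".join("?" for _ in schema)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(
        f'INSERT INTO "{table}" VALUES ({placeholders})',
        [tuple(row.get(c) for c in schema) for row in rows],
    )


conn = sqlite3.connect(":memory:")
schema, rows = load_json_lines(JSON_LINES)
register_table(conn, "people", schema, rows)

# Query the structured records with plain SQL.
londoners = conn.execute(
    "SELECT name, age FROM people WHERE city = ? ORDER BY age DESC", ("London",)
).fetchall()
print(schema)
print(londoners)
```

The steps mirror the workflow the slides describe: load semi-structured input, derive the set of fields, register the result under a name, then query it with SQL.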

Getting Started
  • Entry points:
      • HiveContext
          • superset functionality, Hive related
      • SQLContext

Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.[2] While initially developed by Facebook, Apache Hive is now used and developed by other companies such as Netflix.[3][4] Amazon maintains a software fork of Apache Hive that is included in Amazon Elastic MapReduce on Amazon Web Services.[5]

Query Example
  • Loads input JSON file into SchemaRDD
  • Uses context to execute query

Loading and Saving Data
  • Supports a number of structured data sources
  • Apache Hive
      • data warehouse infrastructure on top of Hadoop
      • summarization, querying (SQL-like interface) and analysis
  • Parquet
      • column-oriented storage format in Hadoop ecosystem
      • efficient storage of records with nested fields
  • JSON
  • RDDs

JDBC/ODBC Server
  • connecting Business Intelligence tools
  • remote access to Spark cluster

GraphX
  • New Spark API for graphs and graph-parallel computation
  • Resilient Distributed Property Graph (RDPG, extends RDD)
      • directed multigraph (-> parallel edges)
      • properties attached to each vertex and edge
  • Common graph operations (subgraph computation, joining vertices, ...)
  • Growing collection of graph algorithms

Motivation
  • Growing scale and importance of graph data
  • Application of data-parallel algorithms to graph computation is inefficient
  • Graph-parallel systems (Pregel, PowerGraph, ...) designed for efficient execution of graph algorithms
      • do not address graph construction & transformation
      • limited fault tolerance & data mining support

Performance Comparison

Connected Components and PageRank algorithms

For Spark we implemented the algorithms both using idiomatic dataflow operators (Naive Spark, as described in Section 3.2) and using an optimized implementation (Optimized Spark) that eliminates movement of edge data by pre-partitioning the edges to match the partitioning adopted by GraphX.

We have excluded Giraph and Optimized Spark from Figure 7c because they were unable to scale to the larger web-graph in the allotted memory of the cluster. While the basic Spark implementation did not crash, it was forced to re-compute blocks from disk and exceeded 8000 seconds per iteration. We attribute the increased memory overhead to the use of edge-cut partitioning and the need to store bi-directed edges and messages for the connected components algorithm.

Property Graph
  • Directed multigraph with user-defined objects attached to each vertex and edge

Graph

Triplet View
  • Logical join of vertex and edge properties
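A property graph and its triplet view can be modelled in a few lines of plain Python. This is a sketch of the concept only, not the GraphX API: the `Edge` and `Triplet` types, the `triplets` function and the sample data are hypothetical names chosen for the illustration.

```python
from typing import Dict, List, NamedTuple


class Edge(NamedTuple):
    src: int        # id of the source vertex
    dst: int        # id of the destination vertex
    attr: str       # user-defined edge property


class Triplet(NamedTuple):
    src: int
    src_attr: str   # property of the source vertex
    dst: int
    dst_attr: str   # property of the destination vertex
    attr: str       # property of the edge itself


def triplets(vertices: Dict[int, str], edges: List[Edge]) -> List[Triplet]:
    """Join every edge with the properties of both of its endpoints."""
    return [
        Triplet(e.src, vertices[e.src], e.dst, vertices[e.dst], e.attr)
        for e in edges
    ]


# Vertex properties keyed by vertex id.
vertices = {1: "alice", 2: "bob", 3: "carol"}

# A directed multigraph: the two parallel edges from 1 to 2 are both kept.
edges = [Edge(1, 2, "follows"), Edge(1, 2, "likes"), Edge(3, 1, "follows")]

for t in triplets(vertices, edges):
    print(f"{t.src_attr} {t.attr} {t.dst_attr}")
```

Each triplet carries the edge property together with both vertex properties, which is what makes per-edge computations such as "who follows whom" possible without separate lookups.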

Graph Operations
  • Basic information (numEdges, numVertices, inDegrees, ...)
  • Views (vertices, edges, triplets)
  • Caching (persist, cache, ...)
  • Transformation (mapVertices, mapEdges, ...)
  • Structure modification (reverse, subgraph, ...)
  • Neighbour aggregation (collectNeighbors, aggregations, ...)
  • Pregel API
  • Graph builders (various I/O operations)
  • ...

Graph Algorithms
  • Built-in algorithms
      • PageRank, Connected Components, Triangle Count, ...
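To give a feel for how a graph-parallel algorithm such as Connected Components proceeds, here is a single-machine Python sketch in the Pregel style: in every superstep, edges send the smaller label to the neighbouring vertex until no label changes. It is an illustration under simplifying assumptions (in-memory data, edges treated as undirected) and not the GraphX implementation; `connected_components` is a name chosen for this example.

```python
from typing import Dict, Iterable, List, Tuple


def connected_components(
    vertex_ids: Iterable[int], edges: List[Tuple[int, int]]
) -> Dict[int, int]:
    """Label every vertex with the smallest vertex id in its component."""
    labels = {v: v for v in vertex_ids}      # every vertex starts as its own component
    while True:
        # Message phase: along each edge, the endpoint with the smaller label
        # offers that label to the other endpoint (edge direction is ignored).
        inbox: Dict[int, int] = {}
        for src, dst in edges:
            for sender, receiver in ((src, dst), (dst, src)):
                if labels[sender] < labels[receiver]:
                    best = inbox.get(receiver, labels[sender])
                    inbox[receiver] = min(best, labels[sender])
        if not inbox:                        # no messages means the labels are stable
            return labels
        # Vertex phase: every message is a strict improvement, so adopt it.
        labels.update(inbox)


graph_edges = [(1, 2), (2, 3), (4, 5)]
print(connected_components(range(1, 7), graph_edges))
```

On this sample, vertices 1 to 3 end up sharing one label, vertices 4 and 5 another, and the isolated vertex 6 keeps its own id.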


Spark Streaming
  • Scalable, high-throughput, fault-tolerant stream processing

Architecture
  • Streams are chopped up into batches
  • Each batch is processed in Spark
  • Results pushed out in batches

Streaming Word Count
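The micro-batch idea behind a streaming word count can be sketched in plain Python: a time-ordered stream is chopped into fixed-length batches, and each batch is counted independently. This is a model of the architecture only, not Spark Streaming code; `discretize`, `word_count`, the sample events and the one-second interval are assumptions made for the illustration.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def discretize(
    events: Iterable[Tuple[float, str]], batch_interval: float
) -> Iterator[List[str]]:
    """Chop time-ordered (timestamp, line) events into one batch per interval.

    Assumes timestamps are non-negative and arrive in increasing order.
    """
    batch: List[str] = []
    current = 0                              # index of the interval being filled
    for timestamp, line in events:
        index = int(timestamp // batch_interval)
        while index > current:               # close finished intervals, even empty ones
            yield batch
            batch, current = [], current + 1
        batch.append(line)
    yield batch                              # flush the last, still open batch


def word_count(batch: List[str]) -> Counter:
    """Stateless per-batch computation: count the words of this batch only."""
    return Counter(word for line in batch for word in line.split())


events = [
    (0.2, "to be or"),
    (0.9, "not to be"),
    (1.5, "be quick"),
    (3.1, "done"),
]

for number, batch in enumerate(discretize(events, batch_interval=1.0)):
    print(number, dict(word_count(batch)))
```

The third interval contains no events, so it produces an empty batch and an empty result, which matches the picture of results being pushed out batch by batch.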


StreamingContext
  • Entry point for all streaming functionality
      • define input sources
      • stream transformations
      • output operations to DStreams
      • starts & stops streaming process
  • Limitations
      • once started, computations cannot be added
      • once stopped, cannot be restarted
      • only one active StreamingContext per JVM

Discretized Streams
  • Basic abstraction, represents a continuous stream of data
  • DStreams
      • Implemented as a series of RDDs

Stateless Transformations
  • Processing of each batch does not depend on previous batches
  • Transformation is separately applied to every batch
      • Map, flatMap, filter, reduce, groupBy, ...
  • Combining data from multiple DStreams
      • Join, cogroup, union, ...

cogroup - When called on DStream of (K, V) and (K, W) pairs, return a new DStream of (K, Seq[V], Seq[W]) tuples.
join - When called on two DStreams of (K, V) and (K, W) pairs, return a new DStream of (K, (V, W)) pairs with all pairs of elements for each key.
union - Return a new DStream that contains the union of the elements in the source DStream and otherDStream.
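The three definitions above are easy to confuse, so here is a plain-Python sketch of what each one produces for a single batch of key-value pairs. The functions are stand-ins written for this illustration (lists and dicts instead of DStreams), not the Spark API, and the sample pairs are made up.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple

Pairs = List[Tuple[Hashable, object]]


def union(left: Pairs, right: Pairs) -> Pairs:
    """All elements of both batches, duplicates included."""
    return list(left) + list(right)


def cogroup(left: Pairs, right: Pairs) -> Dict[Hashable, Tuple[list, list]]:
    """For every key seen on either side: (values from left, values from right)."""
    grouped = defaultdict(lambda: ([], []))
    for key, value in left:
        grouped[key][0].append(value)
    for key, value in right:
        grouped[key][1].append(value)
    return dict(grouped)


def join(left: Pairs, right: Pairs) -> List[Tuple[Hashable, tuple]]:
    """Inner join: every (v, w) combination for keys present on both sides."""
    return [
        (key, (v, w))
        for key, (vs, ws) in cogroup(left, right).items()
        for v in vs
        for w in ws
    ]


clicks = [("a", 1), ("b", 2), ("a", 3)]
labels = [("a", "x"), ("c", "y")]

print(union(clicks, labels))
print(cogroup(clicks, labels))
print(join(clicks, labels))
```

Note that `cogroup` keeps keys that appear on only one side (with an empty list for the other), while `join` drops them.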

Stateful Transformations
  • Use data or intermediate results from previous batches to compute the result of the current batch
  • Windowed operations
      • act over a sliding window of time periods
  • UpdateStateByKey
      • maintain state while continuously updating it with new information
  • Require checkpointing
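Both kinds of stateful transformation can be modelled on a list of batches in plain Python. The sketch below is not the Spark API: `windowed_counts` and `update_state_by_key` are hypothetical helpers, window length and slide interval are expressed in numbers of batches rather than durations, and checkpointing is left out.

```python
from collections import Counter, defaultdict, deque
from typing import Callable, Dict, List, Optional, Tuple


def windowed_counts(
    batches: List[List[str]], window_length: int, slide_interval: int
) -> List[Dict[str, int]]:
    """Count items over the last `window_length` batches, every `slide_interval` batches."""
    window = deque(maxlen=window_length)     # old batches fall out automatically
    results = []
    for number, batch in enumerate(batches, start=1):
        window.append(Counter(batch))
        if number % slide_interval == 0:
            total = Counter()
            for counts in window:
                total.update(counts)
            results.append(dict(total))
    return results


def update_state_by_key(
    batches: List[List[Tuple[str, int]]],
    update: Callable[[List[int], Optional[int]], Optional[int]],
) -> List[Dict[str, int]]:
    """Carry per-key state across batches; `update` returning None drops the key."""
    state: Dict[str, int] = {}
    snapshots = []
    for batch in batches:
        new_values = defaultdict(list)
        for key, value in batch:
            new_values[key].append(value)
        # Visit keys with new data as well as keys that only have old state.
        for key in set(state) | set(new_values):
            new_state = update(new_values.get(key, []), state.get(key))
            if new_state is None:
                state.pop(key, None)
            else:
                state[key] = new_state
        snapshots.append(dict(state))
    return snapshots


def running_sum(new_values: List[int], old: Optional[int]) -> int:
    return (old or 0) + sum(new_values)


print(windowed_counts([["a", "b"], ["a"], ["b", "b"], ["c"]], window_length=3, slide_interval=2))
print(update_state_by_key([[("a", 1), ("b", 1)], [("a", 1)], [("b", 2), ("c", 5)]], running_sum))
```

The windowed result forgets batches that have slid out of the window, whereas the keyed state keeps accumulating until the update function decides to drop a key.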

Output Operations
  • Specify what needs to be done with the final transformed data
  • Pushing to external DB, printing, ...
  • If not performed, DStream is not evaluated

Input Sources
  • Built-in support for a number of different data sources
  • Often in additional libraries (e.g. spark-streaming-kafka)
  • HDFS
  • Akka Actor Stream
  • Apache Kafka
  • Apache Flume
  • Twitter Stream
  • Kinesis
  • Custom Sources
  • ...

Demo

Conclusion
  • RDD repetition
  • Spark Modules Overview
      • Spark SQL
      • GraphX
      • Spark Streaming

Questions