An Introduction to Spark - Atlanta Spark Meetup


Data Engineering for Data Scientists

Apache Spark, an Introduction
Jonathan Lacefield, Solution Architect, DataStax

Disclaimer: The contents of this presentation represent my personal views and do not reflect or represent any views of my employer.

This is my take on Spark.

This is not DataStax's take on Spark.

Notes
Meetup Sponsor: Data Exchange Platform, Core Software Engineering, Equifax
Announcement: The Data Exchange Platform is currently hiring to build the next-generation data platform. We are looking for people with experience in one or more of the following: Spark, Storm, Kafka, Samza, Hadoop, Cassandra.
How to apply? Email aravind.yarram@equifax.com

Introduction
Jonathan Lacefield
Solutions Architect, DataStax
Former Dev, DBA, Architect, reformed PM
Email: jlacefie@gmail.com
Twitter: @jlacefie
LinkedIn: www.linkedin.com/in/jlacefield

This deck represents my own views and not the views of my employer.


DataStax Introduction
DataStax delivers Apache Cassandra in a database platform purpose-built for the performance and availability demands of IoT, web, and mobile applications, giving enterprises a secure, always-on database that remains operationally simple when scaled in a single datacenter or across multiple datacenters and clouds.

Includes:
- Apache Cassandra
- Apache Spark
- Apache Solr
- Apache Hadoop
- Graph (coming soon)

DataStax, What We Do (Use Cases)
- Fraud Detection
- Personalization
- Internet of Things
- Messaging
- Lists of Things (Products, Playlists, etc.)
- A smaller set of other things too!

We are all about working with temporal data sets at large volumes with high transaction counts (velocity).

Agenda
- Set a baseline (pre-distributed days and Hadoop)
- Spark conceptual introduction
- Spark key concepts (Core)
- A look at each Spark module: Spark SQL, MLlib, Spark Streaming, GraphX

In the Beginning...

[Slide diagram: OLTP (web application tier) and OLAP (statistical/analytical applications), connected by ETL]

Data Requirements Broke the Architecture

Along Came Hadoop, with...

MapReduce

Lifecycle of a MapReduce Job

But...

- Started in 2009 in Berkeley's AMPLab
- Open sourced in 2010
- Commercial provider is Databricks - http://databricks.com
- Solves 2 big Hadoop pain points:
  - Speed - in-memory and fault tolerant
  - Ease of use - an API of operations and datasets

Use Cases for Apache Spark
- Data ETL
- Interactive dashboard creation for customers
- Streaming (e.g., fraud detection, real-time video optimization)
- Complex analytics (e.g., anomaly detection, trend analysis)

Key Concepts - Core
- Resilient Distributed Datasets (RDDs) - Spark's datasets
- Spark Context - provides information on the Spark environment and the application
- Transformations - transform data
- Actions - trigger actual processing
- Directed Acyclic Graph (DAG) - Spark's execution algorithm
- Broadcast Variables - read-only variables on workers (see the sketch below)
- Accumulators - variables on workers that can be added to with an associated function (see the sketch below)
- Driver - main application container for Spark execution
- Executors - execute tasks on data
- Resource Manager - manages task assignment and status
- Worker - executes and caches

Resilient Distributed Datasets (RDDs)
- Fault-tolerant collection of elements that enables parallel processing
- Spark's main abstraction
- Transformations and Actions are executed against RDDs
- Can persist in memory, on disk, or both
- Can be partitioned to control parallel processing
- Can be reused
- HUGE efficiencies with processing
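To make the worker-side concepts concrete, here is a minimal spark-shell sketch of a broadcast variable and an accumulator. It assumes the shell's built-in SparkContext `sc`; the lookup table and log levels are made-up illustration data, not taken from the deck.

val severity = sc.broadcast(Map("ERROR" -> 3, "WARN" -> 2, "INFO" -> 1))  // read-only copy shipped to workers
val unknownLevels = sc.accumulator(0)                                     // workers can only add to it

val levels = sc.parallelize(List("INFO", "ERROR", "DEBUG", "WARN"))
val scored = levels.map { lvl =>
  if (!severity.value.contains(lvl)) unknownLevels += 1   // counted on the workers
  (lvl, severity.value.getOrElse(lvl, 0))
}

scored.collect()               // Action: triggers the map above on the workers
println(unknownLevels.value)   // read on the driver after the action has run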

RDDs - Resilient
[Diagram, source databricks.com: HDFS file -> filter(func = someFilter()) -> Filtered RDD -> map(func = someAction(...)) -> Mapped RDD]
RDDs track lineage information that can be used to efficiently recompute lost data.
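The lineage that makes this recomputation possible can be inspected directly. A small sketch, again assuming the spark-shell's built-in `sc` and an illustrative local file name:

val input  = sc.textFile("log.txt")                         // base RDD from a file
val errors = input.filter(line => line.contains("error"))   // derived RDD, remembers its parent
val pairs  = errors.map(line => (line.split(" ")(0), 1))    // another derived RDD

// toDebugString prints the chain of parent RDDs (the lineage) that Spark
// would replay to rebuild any lost partitions of `pairs`.
println(pairs.toDebugString)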

RDDs - Distributed
[Image source - http://1.bp.blogspot.com/-jjuVIANEf9Y/Ur3vtjcIdgI/AAAAAAAABC0/-Ou9nANPeTs/s1600/p1.png]

RDDs From the API

val someRdd = sc.textFile("someURL")

Creates an RDD from a text file.

val lines = sc.parallelize(List("pandas", "i like pandas"))

Creates an RDD from a list of elements.

RDDs can be created from many different sources. RDDs can, and should, be persisted in most cases:

lines.persist() or lines.cache()

See here for more info: http://spark.apache.org/docs/1.2.0/programming-guide.html

Transformations
- Create one RDD and transform the contents into another RDD
- Examples: map, filter, union, distinct, join
- Complete list - http://spark.apache.org/docs/1.2.0/programming-guide.html
- Lazy execution: Transformations aren't applied to an RDD until an Action is executed (sketched in code below)

inputRDD = sc.textFile("log.txt")
errorsRDD = inputRDD.filter(lambda x: "error" in x)

Actions
- Cause data to be returned to the driver or saved to output
- Cause data retrieval and execution of all Transformations on RDDs
- Common Actions: reduce, collect, take, saveAs...
- Complete list - http://spark.apache.org/docs/1.2.0/programming-guide.html

errorsRDD.take(1)
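A minimal sketch of lazy execution combined with caching, in Scala for consistency with the RDD API slides; `sc` is the spark-shell's built-in context and the file name is illustrative:

val input  = sc.textFile("log.txt")                       // nothing is read yet
val errors = input.filter(_.contains("error")).cache()    // still nothing: transformations are lazy

errors.count()   // first Action: reads the file, runs the filter, caches the filtered RDD
errors.take(1)   // second Action: served from the cache; the file is not re-read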

Example App

import sys
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext("local", "WordCount", sys.argv[0], None)

    lines = sc.textFile(sys.argv[1])
    counts = lines.flatMap(lambda s: s.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda x, y: x + y)

    counts.saveAsTextFile(sys.argv[2])

Based on source from databricks.com

Conceptual Representation
[Diagram: RDD -> RDD -> RDD -> RDD via Transformations, then an Action produces a Value]

1. lines = sc.textFile(sys.argv[1])
2. counts = lines.flatMap(lambda s: s.split(" ")) \
       .map(lambda word: (word, 1)) \
       .reduceByKey(lambda x, y: x + y)
3. counts.saveAsTextFile(sys.argv[2])

Based on source from databricks.com

Spark Execution

[Image source: Learning Spark - http://shop.oreilly.com/product/0636920028512.do]

Demo
Via the REPL

Spark SQL
An abstraction over the Spark API to support SQL-like interaction.

[Diagram: Spark SQL and HiveQL queries flow through SQL Core and Catalyst - Parse -> Analyze -> Logical Plan -> Optimize -> Physical Plan -> Execute]

Programming Guide - https://spark.apache.org/docs/1.2.0/sql-programming-guide.html (used as the source for the code examples)
Catalyst - http://spark-summit.org/talk/armbrust-catalyst-a-query-optimization-framework-for-spark-and-shark/

SQLContext and SchemaRDD

val sc: SparkContext // An existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// createSchemaRDD is used to implicitly convert an RDD to a SchemaRDD.
import sqlContext.createSchemaRDD

A SchemaRDD can be created:
- Using reflection to infer the schema structure from an existing RDD
- Using a programmatic interface to create a schema and apply it to an RDD

SchemaRDD Creation - Reflection

// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// createSchemaRDD is used to implicitly convert an RDD to a SchemaRDD.
import sqlContext.createSchemaRDD

// Define the schema using a case class.
// Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit,
// you can use custom classes that implement the Product interface.
case class Person(name: String, age: Int)

// Create an RDD of Person objects and register it as a table.
val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
people.registerTempTable("people")

// SQL statements can be run by using the sql methods provided by sqlContext.
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
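The result of sql() is itself a SchemaRDD, so normal RDD operations apply to it. A minimal sketch of consuming the query result, following the pattern in the 1.2.0 programming guide:

// teenagers is a SchemaRDD of Row objects; columns are accessed by ordinal.
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)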