Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014


Transcript of Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014


Introduction to Spark
Scala SB Meetup
December 18th 2014
Maxime Dumas
Systems Engineer, Cloudera
CONFIDENTIAL - RESTRICTED

Thirty Seconds About Max
  • Systems Engineer, aka Sales Engineer
  • SoCal, AZ, NV
  • former coder of PHP
  • teaches meditation + yoga
  • from Montreal, Canada

What Does Cloudera Do?
  • product
      • distribution of Hadoop components, Apache licensed
      • enterprise tooling
  • support
  • training
  • services (aka consulting)
  • community

Similar to the Red Hat model.

Hadoop elephant logo licensed for public use via Apache license: Apache Software Foundation, http://www.apache.org/foundation/marks/


The Apache Hadoop Ecosystem
Quick and dirty, for context.

We're going to breeze through these really quickly, just to show how Search plugs in later.

© 2014 Cloudera, Inc. All rights reserved.

Scalability
  • Simply scales just by adding nodes
  • Local processing to avoid network bottlenecks

Efficiency
  • Cost efficiency (

... line.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
    .collect()
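The start of the word-count chain above is missing from the transcript, so the following is only a sketch of what such a pipeline computes, written against plain Scala collections rather than a Spark RDD so that it runs without a cluster. The names `WordCountSketch` and `wordCount` are hypothetical, and the `groupBy` plus sum step stands in for the `reduceByKey(_ + _)` call shown on the slide.

```scala
object WordCountSketch {
  // Mirrors the slide's chain on a local Seq instead of an RDD.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split(" "))      // one element per word
      .map(word => (word, 1))                // pair each word with a count of 1
      .groupBy { case (word, _) => word }    // local stand-in for reduceByKey: group by word...
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) } // ...then add up the 1s
}
```

Calling `WordCountSketch.wordCount(Seq("to be or not to be"))` gives a map from each word to its number of occurrences. On a real RDD the same result would come back from `collect()` as an array of pairs.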

Logistic Regression
  • Read two sets of points
  • Look for a plane W that separates them
  • Perform gradient descent:
      • Start with random W
      • On each iteration, sum a function of W over the data
      • Move W in a direction that improves it

Intuition
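The gradient descent steps listed above can be sketched in plain Scala, with no Spark dependency. This is an illustration under stated assumptions, not the code from the talk: labels are taken to be -1.0 or +1.0, the plane passes through the origin, the summed function is the gradient of the logistic loss, and the names `Point`, `train`, `predict`, and `stepSize` are hypothetical.

```scala
import scala.math.exp
import scala.util.Random

object LogisticRegressionSketch {
  // A labeled point: features x and a label y assumed to be -1.0 or +1.0.
  final case class Point(x: Array[Double], y: Double)

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (u, v) => u * v }.sum

  def train(points: Seq[Point], dims: Int, iterations: Int,
            stepSize: Double, seed: Long): Array[Double] = {
    val rng = new Random(seed)
    // Start with random W, each component in [-1, 1).
    var w = Array.fill(dims)(2 * rng.nextDouble() - 1)
    for (_ <- 1 to iterations) {
      // Sum a function of W over the data: the logistic-loss gradient per point.
      val gradient = points
        .map { p =>
          val scale = (1.0 / (1.0 + exp(-p.y * dot(w, p.x))) - 1.0) * p.y
          p.x.map(_ * scale)
        }
        .reduce((a, b) => a.zip(b).map { case (u, v) => u + v })
      // Move W in a direction that improves it (against the gradient).
      w = w.zip(gradient).map { case (wi, gi) => wi - stepSize * gi }
    }
    w
  }

  // Classify by which side of the plane W the point falls on.
  def predict(w: Array[Double], x: Array[Double]): Double =
    if (dot(w, x) >= 0) 1.0 else -1.0
}
```

In a Spark version the `map` and `reduce` over `points` would run on a cached RDD, which is why iterative algorithms like this one benefit from keeping the data in memory between iterations.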

Logistic Regression

Logistic Regression Performance

Spark and Hadoop: a Framework within a Framework


[Diagram: platform stack labeled Integration, Storage, Resource Management, Metadata, HBase, Impala, Solr, Spark, MapReduce, System Management, Data Management, Support, Security]

Spark Streaming
  • Takes the concept of RDDs and extends it to DStreams
      • Fault-tolerant like RDDs
      • Transformable like RDDs
  • Adds new rolling window operations
      • Rolling averages, etc.
  • But keeps everything else!
      • Regular Spark code works in Spark Streaming
      • Can still access HDFS data, etc.
  • Example use cases:
      • On-the-fly ETL as data is ingested into Hadoop/HDFS.
      • Detecting anomalous behavior and triggering alerts.
      • Continuous reporting of summary metrics for incoming data.
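To make the rolling window idea above concrete, here is a small sketch in plain Scala that treats each inner sequence as one micro-batch and computes a rolling average over a window of batches. It does not use the Spark Streaming API. The names `RollingWindowSketch`, `rollingAverages`, `windowLength`, and `slideInterval` are hypothetical and only loosely echo the window length and slide interval that windowed DStream operations take.

```scala
object RollingWindowSketch {
  // batches: one inner Seq per micro-batch of numeric readings.
  // windowLength: how many consecutive batches each average covers.
  // slideInterval: how many batches the window advances each step.
  def rollingAverages(batches: Seq[Seq[Double]],
                      windowLength: Int,
                      slideInterval: Int = 1): List[Double] =
    batches
      .sliding(windowLength, slideInterval)    // consecutive groups of batches
      .map(_.flatten)                          // all readings inside the window
      .filter(_.nonEmpty)                      // skip windows with no readings
      .map(values => values.sum / values.size) // average over the window
      .toList
}
```

With three batches and a window of two, the function returns one average for batches 1 and 2 and one for batches 2 and 3, which is the same shape of result a rolling window over a stream would produce as new batches arrive.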

Micro-batching for on-the-fly ETL

What about SQL?

http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html
http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/

Fault Recovery Recap
  • RDDs store their dependency graph
  • Because RDDs are deterministic:
      • Missing RDDs are rebuilt in parallel on other nodes
  • Stateful RDDs can have infinite lineage
  • Periodic checkpoints to disk clear lineage
  • Faster recovery times
  • Better handling of stragglers vs row-by-row streaming

Why Spark?
  • Flexible like MapReduce
  • High performance
  • Machine learning, iterative algorithms
  • Interactive data explorations
  • Concise, easy API for developer productivity
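Returning to the fault recovery recap: the idea that a dataset can be rebuilt from its dependency graph, because every step is deterministic, can be illustrated with a toy model in plain Scala. This is not how Spark implements RDDs. `LineageSketch`, `Lineage`, `source`, and `rebuild` are hypothetical names for a structure that stores only the recipe for its data.

```scala
object LineageSketch {
  // Hypothetical stand-in for an RDD: it holds no results, only a description
  // of its lineage and a deterministic function that recomputes its contents.
  final class Lineage[A](val describe: String, recompute: () => Seq[A]) {
    def map[B](f: A => B): Lineage[B] =
      new Lineage[B](describe + " -> map", () => recompute().map(f))

    def filter(p: A => Boolean): Lineage[A] =
      new Lineage[A](describe + " -> filter", () => recompute().filter(p))

    // Replays the whole chain from the source, as after losing the results.
    def rebuild(): Seq[A] = recompute()
  }

  def source[A](data: Seq[A]): Lineage[A] =
    new Lineage[A]("source", () => data)
}
```

Because every call to `rebuild()` replays the same deterministic steps, it always yields the same data. The longer the chain, the more work a rebuild takes, which is the motivation for the periodic checkpoints mentioned in the recap.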


Demo Time!
  • Log file Analysis
  • Machine Learning
  • Spark Streaming

What's Next?
  • Download Hadoop!
  • CDH available at www.cloudera.com
  • Try it online: Cloudera Live
  • Cloudera provides pre-loaded VMs
      • http://tiny.cloudera.com/quickstartvm

Questions?
Preferably related to the talk or not.

Thank You!
Maxime Dumas
mdumas@cloudera.com

We're hiring.