Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014
Introduction to Spark
Scala SB Meetup
December 18th 2014
Maxime Dumas
Systems Engineer, Cloudera

Thirty Seconds About Max
  Systems Engineer aka Sales Engineer
  SoCal, AZ, NV
  former coder of PHP
  teaches meditation + yoga
  from Montreal, Canada

What Does Cloudera Do?
  product
    distribution of Hadoop components, Apache licensed
    enterprise tooling
  support
  training
  services (aka consulting)
  community
Similar to the Red Hat model.
Hadoop elephant logo licensed for public use via Apache license: Apache Software Foundation, http://www.apache.org/foundation/marks/

The Apache Hadoop Ecosystem
Quick and dirty, for context.
We're going to breeze through these really quick, just to show how Search plugs in later.
© 2014 Cloudera, Inc. All rights reserved.

Scalability
  Simply scales just by adding nodes
  Local processing to avoid network bottlenecks

Efficiency
  Cost efficiency ( [...]

[...] line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).collect()
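
The code fragment above looks like the tail of a word count. As a minimal sketch of that pattern, here is a version that runs on ordinary Scala collections, with no Spark dependency. The object name `WordCountSketch` and the helper `countWords` are hypothetical, and `groupBy` followed by a sum is only a local stand-in for what `reduceByKey(_ + _)` does on a Spark RDD. The missing start of the original snippet is not reconstructed here.

```scala
object WordCountSketch {
  // Hypothetical helper: counts words in an in-memory collection of lines.
  // On a Spark RDD the last two steps would be a single reduceByKey(_ + _).
  def countWords(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split(" "))  // split each line into words
      .filter(_.nonEmpty)                // ignore empty tokens from repeated spaces
      .map(word => (word, 1))            // pair every word with a count of 1
      .groupBy(_._1)                     // gather the pairs that share a word
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) } // add up the 1s
}
```

For example, `WordCountSketch.countWords(Seq("a b a", "b c"))` counts "a" twice, "b" twice and "c" once.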

Logistic Regression
  Read two sets of points
  Looks for a plane W that separates them
  Perform gradient descent:
    Start with random W
    On each iteration, sum a function of W over the data
    Move W in a direction that improves it

Intuition
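
The gradient-descent loop from the Logistic Regression slide can be sketched in plain Scala as below. This is an illustration under stated assumptions, not the code shown in the talk: labels are assumed to be +1.0 or -1.0, the step size is fixed at 1, and the names `LabeledPoint`, `LogisticRegressionSketch`, `pointGradient` and `train` are hypothetical. On a Spark RDD, the per-iteration sum would typically be a `map` followed by a `reduce` over the distributed data.

```scala
import scala.util.Random

// Hypothetical data type: features x and a label y that is +1.0 or -1.0.
final case class LabeledPoint(x: Vector[Double], y: Double)

object LogisticRegressionSketch {
  def dot(a: Vector[Double], b: Vector[Double]): Double =
    a.zip(b).map { case (u, v) => u * v }.sum

  // Gradient of the logistic loss for a single point at the current W.
  def pointGradient(w: Vector[Double], p: LabeledPoint): Vector[Double] = {
    val scale = (1.0 / (1.0 + math.exp(-p.y * dot(w, p.x))) - 1.0) * p.y
    p.x.map(_ * scale)
  }

  def train(data: Seq[LabeledPoint], iterations: Int, seed: Long): Vector[Double] = {
    require(data.nonEmpty, "need at least one point")
    val rand = new Random(seed)
    // Start with random W, each component in [-1, 1).
    var w = Vector.fill(data.head.x.length)(2 * rand.nextDouble() - 1)
    for (_ <- 1 to iterations) {
      // Sum a function of W over the data.
      val gradient = data
        .map(p => pointGradient(w, p))
        .reduce((a, b) => a.zip(b).map { case (u, v) => u + v })
      // Move W in a direction that improves it (step size assumed to be 1).
      w = w.zip(gradient).map { case (wi, gi) => wi - gi }
    }
    w
  }
}
```

After training, the sign of `dot(w, p.x)` is the predicted side of the plane for point `p`.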

Logistic Regression Performance

Spark and Hadoop: a Framework within a Framework
[Diagram labels: Integration, Storage, Resource Management, Metadata, HBase, Impala, Solr, Spark, MapReduce, System Management, Data Management, Support, Security]

Spark Streaming
  Takes the concept of RDDs and extends it to DStreams
    Fault-tolerant like RDDs
    Transformable like RDDs
  Adds new rolling window operations
    Rolling averages, etc.
  But keeps everything else!
    Regular Spark code works in Spark Streaming
    Can still access HDFS data, etc.
  Example use cases:
    On-the-fly ETL as data is ingested into Hadoop/HDFS.
    Detecting anomalous behavior and triggering alerts.
    Continuous reporting of summary metrics for incoming data.
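
To make the "rolling window operations" idea concrete, here is a small sketch of a rolling average over micro-batches using plain Scala collections. It does not use the Spark Streaming API; each inner `Seq` merely stands in for the data that arrived during one batch interval, and the names `RollingWindowSketch` and `rollingAverages` are hypothetical.

```scala
object RollingWindowSketch {
  // Hypothetical helper: for every window of `windowLength` consecutive
  // micro-batches (sliding forward one batch at a time), average all the
  // values that fall inside that window.
  def rollingAverages(batches: Seq[Seq[Double]], windowLength: Int): Seq[Double] =
    batches
      .sliding(windowLength)
      .filter(_.size == windowLength)  // skip a short window when there are too few batches
      .map { window =>
        val values = window.flatten
        values.sum / values.size       // NaN if every batch in the window is empty
      }
      .toList
}
```

With batches `Seq(Seq(1.0, 3.0), Seq(5.0), Seq(7.0, 9.0))` and a window of two batches, the first window averages 1, 3 and 5, and the second averages 5, 7 and 9.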

Micro-batching for on-the-fly ETL

What about SQL?
http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html
http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/

Fault Recovery Recap
  RDDs store dependency graph
  Because RDDs are deterministic:
    Missing RDDs are rebuilt in parallel on other nodes
  Stateful RDDs can have infinite lineage
  Periodic checkpoints to disk clear lineage
  Faster recovery times
  Better handling of stragglers vs row-by-row streaming

Why Spark?
  Flexible like MapReduce
  High performance
  Machine learning, iterative algorithms
  Interactive data exploration
  Concise, easy API for developer productivity

Demo Time!
  Log file Analysis
  Machine Learning
  Spark Streaming

What's Next?
  Download Hadoop!
  CDH available at www.cloudera.com
  Try it online: Cloudera Live
  Cloudera provides pre-loaded VMs
  http://tiny.cloudera.com/quickstartvm

Questions?
Preferably related to the talk or not.

Thank You!
Maxime Dumas
mdumas@cloudera.com