Learning Spark ch01 - Introduction to Data Analysis with Spark
CHAPTER 01 : INTRODUCTION TO DATA ANALYSIS WITH SPARK
Learning Spark by Holden Karau et al.
Overview: Introduction to Data Analysis with Spark
  What Is Apache Spark?
  A Unified Stack
    Spark Core
    Spark SQL
    Spark Streaming
    MLlib
    GraphX
    Cluster Managers
  Who Uses Spark, and for What?
    Data Science Tasks
    Data Processing Applications
  A Brief History of Spark
  Spark Versions and Releases
  Storage Layers for Spark
1.1 What Is Apache Spark?
Apache Spark is a cluster computing platform
Spark extends the MapReduce model to support different types of computation
  batch applications
  iterative algorithms
  interactive queries
  streaming
Runs computations in memory
Highly accessible
  simple APIs in Python, Java, Scala, and SQL
  rich built-in libraries
  access to Hadoop clusters and data sources
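The MapReduce model that Spark extends can be sketched in plain Python. This is not Spark code and calls no Spark API: the names map_phase, shuffle_phase, reduce_phase, and word_count are hypothetical, chosen only for illustration, and everything runs in a single process rather than on a cluster.

```python
from collections import defaultdict
from functools import reduce


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)


def shuffle_phase(pairs):
    """Group emitted values by key (a cluster would do this across machines)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Combine each key's values into one result by summing them."""
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in groups.items()}


def word_count(lines):
    """Chain the three phases: map, then shuffle, then reduce."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    sample = ["spark extends mapreduce", "spark runs in memory"]
    print(word_count(sample))
```

In classic MapReduce each such job reads its input from disk and writes its output back to disk. An iterative algorithm that repeats this pattern many times pays that cost on every pass, which is the overhead that in-memory computation is meant to avoid.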
edX and Coursera courses
  Introduction to Big Data with Apache Spark
  Spark Fundamentals I
  Functional Programming Principles in Scala
1.2 A Unified Stack
1.2.1 A Unified Stack: Core, SQL, Streaming
Spark Core
  Task scheduling
  Memory management
  Fault recovery
  Storage system interaction
  API that defines the Resilient Distributed Dataset (RDD)
Spark SQL
  Provides a SQL interface to Spark
  Allows programmatic data manipulation to be mixed with SQL
Spark Streaming
  Enables processing of live data streams, e.g. web logs
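To make the RDD idea from Spark Core more concrete, here is a minimal single-process sketch. ToyRDD and all of its methods are hypothetical teaching names, not Spark's actual API. It assumes only three properties of real RDDs: data is split into partitions, transformations are recorded lazily rather than run immediately, and any partition can be rebuilt from the source data by replaying the recorded steps.

```python
class ToyRDD:
    """Hypothetical single-process stand-in for an RDD (not Spark's API)."""

    def __init__(self, partitions, lineage=()):
        self._partitions = partitions    # source data, split into chunks
        self._lineage = tuple(lineage)   # recorded transformations, not yet run

    @classmethod
    def parallelize(cls, data, num_partitions=2):
        """Split a local collection into roughly equal partitions."""
        data = list(data)
        size = max(1, -(-len(data) // num_partitions))  # ceiling division
        return cls([data[i:i + size] for i in range(0, len(data), size)])

    def map(self, fn):
        """Record a map step; nothing is computed yet."""
        return ToyRDD(self._partitions, self._lineage + (("map", fn),))

    def filter(self, pred):
        """Record a filter step; nothing is computed yet."""
        return ToyRDD(self._partitions, self._lineage + (("filter", pred),))

    def compute_partition(self, index):
        """Rebuild one partition from source data by replaying the lineage."""
        items = iter(self._partitions[index])
        for kind, fn in self._lineage:
            # Inside a method body, map and filter refer to the builtins.
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)

    def collect(self):
        """Action: compute every partition and gather the results."""
        result = []
        for i in range(len(self._partitions)):
            result.extend(self.compute_partition(i))
        return result


if __name__ == "__main__":
    numbers = ToyRDD.parallelize(range(1, 7), num_partitions=2)
    even_squares = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
    print(even_squares.collect())
```

Because each ToyRDD keeps its lineage rather than its results, a lost partition can be recomputed on its own by calling compute_partition again. This is the intuition behind fault recovery in the list above.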
1.2.2 A Unified Stack: MLlib, GraphX, Cluster Managers
MLlib
  Contains common machine learning (ML) modules
    Classification, regression, clustering, collaborative filtering
    Model evaluation, data import, lower-level ML primitives
GraphX
  Extends the Spark RDD API, just like Spark SQL and Spark Streaming
  Contains graph algorithms
Cluster Managers
  Hadoop YARN, Apache Mesos
  Default: Standalone Scheduler
1.3 Who Uses Spark, and for What?
General-purpose framework for cluster computing, used by
  Data scientists
  Engineers
Data Scientists
  Analyze and model data
  SQL, statistics, predictive modeling (ML) using Python, R
  Use cases
    Interactive shells with Python, Scala
    Spark SQL
    Supporting MLlib libraries
    Calling out to MATLAB/R
Engineers
  Data processing applications
  Principles of software engineering (encapsulation, OOP, interface design)
1.4 A Brief History of Spark
2009: started at the UC Berkeley RAD Lab, which later became the AMPLab
  Starting point: Hadoop MapReduce was inefficient for interactive computing jobs
  Designed for interactive and iterative query performance
    In-memory storage
    Efficient fault recovery
  10-20x faster than MapReduce
Early adopters
  Spark PoweredBy page
  Spark Meetups
  Spark Summit
2011: Berkeley Data Analytics Stack (BDAS)
1.5 Spark Versions and Releases
May 2014: Spark 1.0.0
April 2015: Spark 1.3.1
Spark Documentation
1.6 Storage Layers for Spark
Spark can create distributed datasets from HDFS and from other storage systems supported by the Hadoop APIs
  Local filesystem
  Amazon S3
  Cassandra
  Hive
  HBase
  …etc
Supported file formats
  Text files
  SequenceFiles
  Avro
  Parquet
  Any other Hadoop InputFormat
Learn More about Apache Spark