
Matei Zaharia

Spark Community Update

An Exciting Year for Spark

                            May 2013    May 2014
Developers contributing           60         200
Companies contributing            17          50
Total lines of code           49,000     155,000
Commercial support              none    all major Hadoop distros

Community Growth

Spark 0.6 (Oct '12): 17 contributors
Spark 0.7 (Feb '13): 31 contributors
Spark 0.8 (Sept '13): 67 contributors
Spark 0.9 (Feb '14): 83 contributors
Spark 1.0 (May '14): 110 contributors

Community Growth

[Charts: activity in the last 30 days (patches, lines added, lines removed) for MapReduce, Storm, YARN, and Spark.]

Events

December 2-3, 2013
» Talks from 22 organizations
» 450 attendees

June 30-July 2, 2014
» Talks from 50+ organizations
» Sign up now!

Videos, slides, registration: spark-summit.org

Users & Presenters

Next-Gen MapReduce

Influential bloggers:

» “Leading successor of MapReduce” – Mike Olson, Cloudera
» “Two years ago and last year were about Hadoop; this year is about Spark” – Derrick Harris, GigaOM
» “Just about everybody seems to agree” that … “Spark will be the replacement of Hadoop MapReduce” – Curt Monash, DBMS2

What’s Happening Next?

Many Features Added to Core…

APIs
» Full parity in Java & Python, Java 8 lambda support

Management
» High availability, YARN security

Monitoring
» Greatly improved UI, metrics

But Most Action Now in Libraries

An expressive API is good, but even better to call your algorithm in 1 line!
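For instance, here is roughly what "one line" looks like with MLlib's k-means clustering. This is a minimal sketch: the SparkContext sc is assumed to exist, kmeans_data.txt is a hypothetical file of space-separated numeric features, and the parameter values are illustrative.

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Parse each line into a dense feature vector.
val data = sc.textFile("kmeans_data.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))

// The algorithm itself is one call: 10 clusters, 20 iterations.
val model = KMeans.train(data, 10, 20)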

Additions to the Stack

Spark Core
Spark Streaming (real-time)
Shark (SQL)
MLlib (machine learning)
GraphX (graph)

Spark SQL

Overview

Spark SQL = Catalyst optimizer framework + implementations of SQL & HiveQL on Spark

Provides native support for executing relational queries (SQL) in Spark

Alpha version in Spark 1.0
Led by another AMP alum: Michael Armbrust

Relationship to Shark

Shark modified the Hive backend to run over Spark, but had two challenges:
» Limited integration with Spark programs
» Hive optimizer not designed for Spark

Spark SQL reuses the best parts of Shark:

Borrows:
• Hive data loading
• In-memory column store

Adds:
• RDD-aware optimizer
• Rich language interfaces

Hive Compatibility

Interfaces to access data and code in the Hive ecosystem:

o Support for writing queries in HQL
o Catalog info from Hive MetaStore
o Table scan operator that uses Hive SerDes
o Wrappers for Hive UDFs, UDAFs, UDTFs
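As a minimal sketch of what this looks like in practice (assuming a Hive table hiveTable with columns key and val, like the one created in the joining example later in this deck), an HQL query run through a HiveContext can call a built-in Hive UDF such as upper() directly:

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
import hiveContext._

// Runs through Spark SQL; upper() is a standard Hive UDF, invoked via the UDF wrappers.
val upperVals = hql("SELECT key, upper(val) FROM hiveTable").collect()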

Parquet Compatibility

Native support for reading data in Parquet:

• Columnar storage avoids reading unneeded data.
• RDDs can be written to Parquet files, preserving the schema.
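A minimal sketch of that round trip, assuming an existing SQLContext named sqlContext and the people SchemaRDD built in the examples below:

// Write the SchemaRDD out; the schema is stored alongside the data.
people.saveAsParquetFile("people.parquet")

// Read it back as a SchemaRDD and register it for SQL queries.
val parquetPeople = sqlContext.parquetFile("people.parquet")
parquetPeople.registerAsTable("parquetPeople")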

Abstraction: SchemaRDDs

Resilient Distributed Datasets (RDDs) are Spark’s core abstraction.

• Pro: Distributed coarse-grained transformations

• Con: Operations opaque to engine

SchemaRDDs add:

• Awareness of names & types of data stored

• Optimization using database techniques

Examples

Consider a text file filled with people’s names and ages:

Michael, 30
Andy, 31
Justin Bieber, 19
…

Turning an RDD into a Relation

// Define the schema using a case class.
case class Person(name: String, age: Int)

// Create an RDD of Person objects and register it as a table.
val people = sc.textFile("people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))

people.registerAsTable("people")

Querying Using SQL

// SQL statements can be run by using the sql method provided by sqlContext.
val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

// The results of SQL queries are SchemaRDDs but also support normal RDD operations.
// The columns of a row in the result are accessed by ordinal.
val nameList = teenagers.map(t => "Name: " + t(0)).collect()

SQL + Machine Learning

val trainingDataTable = sql("""
  SELECT e.action, u.age, u.latitude, u.longitude
  FROM Users u
  JOIN Events e ON u.userId = e.userId""")

// Since sql returns an RDD, the results can be easily used in MLlib.
val trainingData = trainingDataTable.map { row =>
  val features = Array[Double](row(1), row(2), row(3))
  LabeledPoint(row(0), features)
}

val model = new LogisticRegressionWithSGD().run(trainingData)

Joining Diverse Sources

val hiveContext = new HiveContext(sc)
import hiveContext._

// Data in Hive
hql("CREATE TABLE IF NOT EXISTS hiveTable (key INT, val STRING)")
hql("LOAD DATA LOCAL INPATH 'kv.txt' INTO TABLE hiveTable")

// Data in existing RDDs
val rdd = sc.parallelize((1 to 100).map(i => Record(i, "val" + i)))
rdd.registerAsTable("rddTable")

// Data in Parquet
hiveContext.loadParquetFile("f.parquet").registerAsTable("parqTable")

// Query all sources at once!
sql("SELECT * FROM hiveTable JOIN rddTable JOIN parqTable WHERE ...")

Spark SQL in Java

public class Person implements Serializable {
  public String getName() { ... }
  public void setName(String name) { ... }
  public int getAge() { ... }
  public void setAge(int age) { ... }
}

JavaRDD<Person> people = sc.textFile("people.txt").map(line -> {
  String[] parts = line.split(",");
  Person person = new Person();
  person.setName(parts[0]);
  person.setAge(Integer.parseInt(parts[1]));
  return person;
});

JavaSQLContext ctx = new JavaSQLContext(sc);
JavaSchemaRDD peopleTable = ctx.applySchema(people, Person.class);

Spark SQL in Python

from pyspark.context import SQLContext
sqlCtx = SQLContext(sc)

lines = sc.textFile("people.txt")
parts = lines.map(lambda l: l.split(","))
people = parts.map(lambda p: {"name": p[0], "age": int(p[1])})

peopleTable = sqlCtx.applySchema(people)
peopleTable.registerAsTable("people")

teenagers = sqlCtx.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
teenNames = teenagers.map(lambda p: "Name: " + p.name)

Spark SQL Research

Catalyst framework: compact optimizer based on functional language techniques
» Pattern-matching, fixpoint convergence of rules (see the sketch below)

Complex analytics: expose and optimize MLlib and GraphX algos in SQL
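To give a flavor of what "pattern-matching, fixpoint convergence of rules" means, here is a toy constant-folding rule in plain Scala. The types and names are illustrative only, not Catalyst's actual classes:

// A tiny expression tree standing in for a query plan.
sealed trait Expr
case class Lit(v: Int) extends Expr
case class Add(l: Expr, r: Expr) extends Expr

// One rewrite rule, expressed with pattern matching: fold constant additions.
def constantFold(e: Expr): Expr = e match {
  case Add(Lit(a), Lit(b)) => Lit(a + b)
  case Add(l, r)           => Add(constantFold(l), constantFold(r))
  case other               => other
}

// Fixpoint convergence: apply the rule until the tree stops changing.
def toFixpoint(e: Expr): Expr = {
  val next = constantFold(e)
  if (next == e) e else toFixpoint(next)
}

toFixpoint(Add(Add(Lit(1), Lit(2)), Lit(3)))   // => Lit(6)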

Learn More

Visit spark.apache.org for the latest Spark news, docs & tutorials

Join us at this year’s Summit: spark-summit.org
