What’s New in Spark 0.6 and Shark 0.2

What’s New in Spark 0.6 and Shark 0.2 November 5, 2012 UC BERKELEY www.spark-project.org

Transcript of What’s New in Spark 0.6 and Shark 0.2

Page 1: What’s New in Spark 0.6 and Shark 0.2

What’s New in Spark 0.6 and Shark 0.2
November 5, 2012

UC BERKELEY
www.spark-project.org

Page 2: What’s New in Spark 0.6 and Shark 0.2

Agenda

Intro & Spark 0.6 tour (Matei Zaharia)
Standalone deploy mode (Denny Britz)
Shark 0.2 (Reynold Xin)
Q & A

Page 3: What’s New in Spark 0.6 and Shark 0.2

What Are Spark & Shark?

Spark: fast cluster computing engine based on general operators & in-memory computing
Shark: Hive-compatible data warehouse system built on Spark

Both are open source projects from the UC Berkeley AMP Lab

Page 4: What’s New in Spark 0.6 and Shark 0.2

What is the AMP Lab?

60-person lab focusing on big data
Funded by NSF, DARPA, 18 companies
Goal: build an open-source, next-generation analytics stack

[Figure: AMP Lab stack diagram. Labels: Spark, Shark, Streaming, Graph, Learning, Mesos, Hadoop, MPI, ...]

Page 5: What’s New in Spark 0.6 and Shark 0.2

Some Exciting News

Recently, three full-time developers joined AMP to work on these projects
We also encourage outside contributions!

»This release: Shark server (Yahoo!), improved accumulators (Quantifind)

Page 6: What’s New in Spark 0.6 and Shark 0.2

Spark 0.6 Release

Biggest release so far in terms of features
Biggest in terms of developers (18 total, 12 new)
Focus areas: ease-of-use and performance

Page 7: What’s New in Spark 0.6 and Shark 0.2

Ease-of-Use

Spark already had good traction despite two fairly researchy aspects:
»Scala language
»Requirement to run on Mesos

A big goal was to improve these:
»Java API (and upcoming API in Python)
»Simpler deployment (standalone mode, YARN)

Page 8: What’s New in Spark 0.6 and Shark 0.2

Java API

Scala:

lines.filter(_.contains("error")).count()

Java:

JavaRDD<String> lines = sc.textFile(...);

lines.filter(new Function<String, Boolean>() {
  // Keep only the lines that contain "error"
  public Boolean call(String s) {
    return s.contains("error");
  }
}).count();

Page 9: What’s New in Spark 0.6 and Shark 0.2

Java API Features

Supports all existing Spark features
»RDDs, accumulators, broadcast variables

Retains type safety through specific classes for RDDs of special types
»E.g. JavaPairRDD<K, V> for key-value pairs

Page 10: What’s New in Spark 0.6 and Shark 0.2

Using Key-Value Pairs

import scala.Tuple2;

JavaRDD<String> words = ...;

// Map each word to a (word, 1) pair
JavaPairRDD<String, Integer> ones = words.map(
  new PairFunction<String, String, Integer>() {
    public Tuple2<String, Integer> call(String s) {
      return new Tuple2<String, Integer>(s, 1);
    }
  });

// Can now call ones.reduceByKey(), groupByKey(), etc.

More info: spark-project.org/docs/0.6.0/

Page 11: What’s New in Spark 0.6 and Shark 0.2

Coming Next: PySpark

import sys

lines = sc.textFile(sys.argv[1])

counts = lines.flatMap(lambda x: x.split(' ')) \
              .map(lambda x: (x, 1)) \
              .reduceByKey(lambda x, y: x + y)
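To make the data flow of the word count above concrete, here is a minimal plain-Python sketch that runs on a local list with no cluster and no Spark installed. The helper names `flat_map` and `reduce_by_key` and the sample `lines` are assumptions made for illustration. They are not part of PySpark; they only mimic what the three chained operators compute.

```python
from collections import defaultdict
from functools import reduce


def flat_map(func, items):
    """Apply func to each item and flatten the resulting lists into one list."""
    return [out for item in items for out in func(item)]


def reduce_by_key(func, pairs):
    """Group (key, value) pairs by key, then fold each group's values with func."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce(func, values) for key, values in groups.items()}


# Hypothetical stand-in for the lines of an input file.
lines = ["to be or", "not to be"]

words = flat_map(lambda x: x.split(' '), lines)    # one flat list of words
pairs = [(x, 1) for x in words]                    # (word, 1) for every word
counts = reduce_by_key(lambda x, y: x + y, pairs)  # sum the 1s per word

print(counts)
```

In this local version everything happens in one process; the slide's version expresses the same three steps as operators on a distributed dataset.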

Page 12: What’s New in Spark 0.6 and Shark 0.2

Simpler Deployment

Refactored Spark’s scheduler to allow running on different cluster managers
Denny will talk about the standalone mode…

Page 13: What’s New in Spark 0.6 and Shark 0.2

Other Ease-of-Use Work

Documentation
»Big effort to improve Spark’s help and Scaladoc

Debugging hints (pointers to user code in logs)
Maven Central artifacts

spark-project.org/documentation.html

Page 14: What’s New in Spark 0.6 and Shark 0.2

Performance

New ConnectionManager and BlockManager
»Replace simple HTTP shuffle with faster, async NIO

Faster control-plane (task scheduling & launch)
Per-RDD control of storage level

Page 15: What’s New in Spark 0.6 and Shark 0.2

Some Graphs

[Figure: two bar charts of running time; Spark 0.5 is the only series label that survives. Panel "Large User App (2000 maps / 1000 reduces)": running time in minutes, axis 0 to 120. Panel "Wikipedia Search Demo": running time in ms, axis 0 to 1000.]

Page 16: What’s New in Spark 0.6 and Shark 0.2

Per-RDD Storage Level

import spark.storage.StorageLevel

val data = file.map(...)

// Keep in memory, recompute when out of space
// (default behavior with cache())
data.persist(StorageLevel.MEMORY_ONLY)

// Drop to disk instead of recomputing
data.persist(StorageLevel.MEMORY_AND_DISK)

// Serialize in-memory data
data.persist(StorageLevel.MEMORY_ONLY_SER)
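The difference between "recompute when out of space" and "drop to disk instead of recomputing" can be illustrated with a toy model. The sketch below is plain Python and does not reflect how Spark's BlockManager is implemented. The `ToyCache` class, its oldest-first eviction, and the `square_partition` function are all assumptions chosen to keep the example small.

```python
class ToyCache:
    """Hypothetical cache that holds at most `capacity` partitions in memory."""

    def __init__(self, compute, capacity, spill_to_disk):
        self.compute = compute              # function: partition id -> records
        self.capacity = capacity            # must be at least 1
        self.spill_to_disk = spill_to_disk  # False ~ memory only, True ~ memory and disk
        self.memory = {}                    # dicts keep insertion order: oldest first
        self.disk = {}                      # stands in for files on local disk
        self.computations = 0               # how many times compute() had to run

    def get(self, pid):
        if pid in self.memory:
            return self.memory[pid]
        if pid in self.disk:
            # Evicted earlier but spilled: read it back instead of recomputing.
            return self.disk[pid]
        self.computations += 1
        block = self.compute(pid)
        self._put(pid, block)
        return block

    def _put(self, pid, block):
        if len(self.memory) >= self.capacity:
            # Out of space: evict the oldest partition (a simplification).
            oldest = next(iter(self.memory))
            evicted = self.memory.pop(oldest)
            if self.spill_to_disk:
                self.disk[oldest] = evicted
        self.memory[pid] = block


def square_partition(pid):
    """Hypothetical 'expensive' computation for one partition."""
    return [pid * pid]


memory_only = ToyCache(square_partition, capacity=2, spill_to_disk=False)
memory_and_disk = ToyCache(square_partition, capacity=2, spill_to_disk=True)

# Touch three partitions with room for two, then revisit the first one.
for pid in [0, 1, 2, 0]:
    memory_only.get(pid)
    memory_and_disk.get(pid)

print(memory_only.computations, memory_and_disk.computations)
```

With room for only two partitions, the memory-only cache has to compute partition 0 a second time, while the spilling cache reads it back from its simulated disk.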

Page 17: What’s New in Spark 0.6 and Shark 0.2

Compatibility

We’ve always strived to stay source-compatible!
The only change in this release is in configuration: spark.cache.class is replaced with per-RDD storage levels

Page 18: What’s New in Spark 0.6 and Shark 0.2
Page 19: What’s New in Spark 0.6 and Shark 0.2

Shark 0.2

Hive compatibility improvements
Thrift server mode
Performance improvements
Simpler deployment (comes with Spark 0.6)

Page 20: What’s New in Spark 0.6 and Shark 0.2

Hive Compatibility

Hive 0.9 support
Full UDF/UDAF support
ADD FILE support for running scripts
User-supplied jars using ADD JAR

Page 21: What’s New in Spark 0.6 and Shark 0.2

Thrift Server

Contributed by Yahoo!, compatible with Hive Thrift server
Enables multiple clients to share cached tables
BI tool integration (e.g. Tableau)

Page 22: What’s New in Spark 0.6 and Shark 0.2

Performance

[Figure: two bar charts of running time; Shark 0.1 is the only series label that survives. Panel "Group By (1B items, 150M distinct)": running time in seconds, axis 0 to 70. Panel "Join (1B join 150M)": running time in seconds, axis 0 to 250.]

Page 23: What’s New in Spark 0.6 and Shark 0.2

Shark 0.3 Preview

In-memory columnar compression (dictionary encoding, run length encoding, etc.)
Map pruning
JVM bytecode generation for expression evals
Persist cached table metadata across sessions
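The two compression schemes named in the preview are standard techniques, and a small sketch may help show why they suit columnar data. The code below is plain Python and is not Shark's implementation. The function names and the sample `column` are assumptions for illustration only.

```python
def dictionary_encode(column):
    """Replace each value with a small integer code; return (dictionary, codes)."""
    dictionary = []   # code -> original value
    index = {}        # original value -> code
    codes = []
    for value in column:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


def run_length_encode(values):
    """Collapse consecutive repeats into [value, run_length] pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs


def run_length_decode(runs):
    """Expand [value, run_length] pairs back into the original sequence."""
    return [value for value, count in runs for _ in range(count)]


# Hypothetical column with few distinct values and long runs.
column = ["US", "US", "US", "DE", "DE", "US"]

dictionary, codes = dictionary_encode(column)
runs = run_length_encode(codes)
decoded = [dictionary[code] for code in run_length_decode(runs)]

print(dictionary, codes, runs)
```

Both schemes are lossless: decoding the runs and looking each code up in the dictionary reproduces the original column. They pay off when a column has few distinct values or long runs of repeats, which is more likely when the values of one column are stored together.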

Page 24: What’s New in Spark 0.6 and Shark 0.2

Spark 0.7+

Spark Streaming
PySpark: Python API for Spark
Memory monitoring dashboard

Page 25: What’s New in Spark 0.6 and Shark 0.2
Page 26: What’s New in Spark 0.6 and Shark 0.2