Enabling Apache Zeppelin and Spark for Data Science in the Enterprise

Transcript of Enabling Apache Zeppelin and Spark for Data Science in the Enterprise

Page 1: Enabling Apache Zeppelin and Spark for Data Science in the Enterprise

1 © Hortonworks Inc. 2011 – 2016. All Rights Reserved

Enabling Apache Zeppelin* and Spark* for Data Science in the Enterprise

Bikas Saha @bikassaha

*Apache Hadoop, Falcon, Atlas, Tez, Sqoop, Flume, Kafka, Pig, Hive, HBase, Accumulo, Storm, Solr, Spark, Ranger, Knox, Ambari, ZooKeeper, Oozie, Zeppelin and the Hadoop elephant logo are trademarks of the Apache Software Foundation.

Page 2: Enabling Apache Zeppelin and Spark for Data Science in the Enterprise

Agenda

Making Big Data Science easy to approach

What are the current issues for the enterprise

Making Apache Zeppelin enterprise ready

Future Roadmap

Page 3: Enabling Apache Zeppelin and Spark for Data Science in the Enterprise

Apache Zeppelin

Page 4: Enabling Apache Zeppelin and Spark for Data Science in the Enterprise

Zeppelin makes Big Data Science Easy to Approach

Zero install – just connect via a web browser and ready to run

Support for multiple execution platforms (Apache Spark, JDBC, Hive…)

Support for multiple languages (Scala, SQL, Python…)

Support for built-in visualizations

Support for reporting

Support for sharing and collaborative work

Does NOT have machine learning built in. That is where Apache Spark comes in (or your favorite engine, such as Apache Flink, Drill or Hive… and 30+ others)

Page 5: Enabling Apache Zeppelin and Spark for Data Science in the Enterprise

Zeppelin for Sharing

Page 6: Enabling Apache Zeppelin and Spark for Data Science in the Enterprise

Agenda

Making Big Data Science easy to approach

What are the current issues for the enterprise

Making Apache Zeppelin enterprise ready

Future Roadmap

Page 7: Enabling Apache Zeppelin and Spark for Data Science in the Enterprise

Current Apache Zeppelin and Spark integration

[Diagram labels: User, Zeppelin Server, Spark Driver, Spark Executors]

Page 8: Enabling Apache Zeppelin and Spark for Data Science in the Enterprise

Architectural Issue with Secure Data Access

[Diagram labels: User 1, Zeppelin Server, Spark Driver, Spark Executors, Zeppelin Server User, HDFS]

Page 9: Enabling Apache Zeppelin and Spark for Data Science in the Enterprise

Architectural Issues with Multi-Tenancy – Fault Tolerance

[Diagram labels: User 1, User 2, Zeppelin Server, Spark Driver, Spark Executors]

User 1 failure affects User 2

Heavy-weight Spark drivers

Page 10: Enabling Apache Zeppelin and Spark for Data Science in the Enterprise

Architectural Issues with Multi-Tenancy – Privacy

[Diagram labels: User 1, User 2, Zeppelin Server, Spark Driver, Spark Executors]

User 1 can access User 2 data

Page 11: Enabling Apache Zeppelin and Spark for Data Science in the Enterprise

Agenda

Making Big Data Science easy to approach

What are the current issues for the enterprise

Enterprise Ready Big Data Science

Future Roadmap

Page 12: Enabling Apache Zeppelin and Spark for Data Science in the Enterprise

Livy Server as a Session Management Service

[Diagram labels: Livy Server, Interactive REST API, Batch REST API, Session, Remote Spark Driver, Remote Context, Standard Spark Batch Job, Spark Executors]
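The two REST entry points named on this slide can be sketched as plain HTTP requests. This is a minimal sketch, not the speaker's code: the host name, the session id 0, the jar path, and the class name are placeholders, while the /sessions, /sessions/{id}/statements, and /batches endpoints and their "kind", "code", "file", and "className" fields come from the Apache Livy REST API. The requests are only built and printed here, not sent.

```python
import json
from urllib.request import Request

# Hypothetical host; 8998 is Livy's default port.
LIVY_URL = "http://livy.example.com:8998"


def livy_request(path, payload):
    """Build (but do not send) a JSON POST request for the Livy REST API."""
    return Request(
        url=LIVY_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Interactive API: create a session, then submit code statements to it.
create_session = livy_request("/sessions", {"kind": "spark"})

# Session id 0 is assumed here; in practice it comes from the create response.
run_statement = livy_request(
    "/sessions/0/statements",
    {"code": "sc.parallelize(1 to 100).sum()"},
)

# Batch API: submit a standard Spark batch job (placeholder jar and class).
submit_batch = livy_request(
    "/batches",
    {"file": "/path/to/app.jar", "className": "com.example.App"},
)

for req in (create_session, run_statement, submit_batch):
    print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```

Passing any of these objects to urllib.request.urlopen would send it to a running Livy server.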

Page 13: Enabling Apache Zeppelin and Spark for Data Science in the Enterprise

Secure Data Access - Solved

[Diagram labels: User, Zeppelin Server, Livy Interpreter, Livy Server, Session, Remote Spark Driver, Remote Context, Spark Executors, User, HDFS]
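The slide does not spell out the mechanism, but one plausible reading is that the Livy session runs as the end user rather than as the Zeppelin server account. A minimal sketch of the session-creation body under that assumption follows. The "proxyUser" field is part of the Livy REST API for POST /sessions; the helper name and the user "alice" are hypothetical, and impersonation is assumed to be enabled on the cluster.

```python
import json


def session_payload(end_user, kind="spark"):
    """Body for POST /sessions asking Livy to run the session as end_user."""
    # "proxyUser" tells Livy which user to impersonate for this session.
    return {"kind": kind, "proxyUser": end_user}


print(json.dumps(session_payload("alice"), sort_keys=True))
# {"kind": "spark", "proxyUser": "alice"}
```

With a body like this, HDFS permission checks would apply to the end user's identity, not to a shared service account.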

Page 14: Enabling Apache Zeppelin and Spark for Data Science in the Enterprise

Multi Tenancy - Solved

[Diagram labels: User 1, User 2, Zeppelin Server, two Livy Interpreters, Livy Server, Session 1, Session 2, each session with its own Remote Spark Driver, Remote Context, and Spark Executor]
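The isolation idea on this slide, one session and one remote driver per user, can be illustrated with a toy model. This is a sketch only: SessionRegistry and its methods are hypothetical names, and no real Livy or Spark calls are made.

```python
class SessionRegistry:
    """Toy model: one session (and thus one Spark driver) per user."""

    def __init__(self):
        self._sessions = {}  # user name -> session id
        self._next_id = 0

    def session_for(self, user):
        # Create a dedicated session the first time a user shows up.
        if user not in self._sessions:
            self._sessions[user] = self._next_id
            self._next_id += 1
        return self._sessions[user]

    def fail(self, user):
        # Simulate a driver crash: only that user's session is dropped.
        self._sessions.pop(user, None)

    def active_users(self):
        return sorted(self._sessions)


registry = SessionRegistry()
registry.session_for("user1")
registry.session_for("user2")
registry.fail("user1")
print(registry.active_users())  # ['user2']
```

Because each user maps to a separate session, dropping user1's session leaves user2's untouched, in contrast to the shared-driver setup described on the earlier slides.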

Page 15: Enabling Apache Zeppelin and Spark for Data Science in the Enterprise

Agenda

Making Big Data Science easy to approach

What are the current issues for the enterprise

Making Apache Zeppelin enterprise ready

Future Roadmap

Page 16: Enabling Apache Zeppelin and Spark for Data Science in the Enterprise

Near Term Improvements

Session Management

Debuggability

Unified session for all languages

Better visualizations for Machine Learning

Support for Spark 2.0

Page 17: Enabling Apache Zeppelin and Spark for Data Science in the Enterprise

Long Term Improvements

Controlled sharing of sessions for collaboration

Data exploration and browsing with metadata

Taking the model from training to production

Page 18: Enabling Apache Zeppelin and Spark for Data Science in the Enterprise

Thank You