Blurring the line between Developer and Data Scientist with PixieDust

David TaiebSTSM – Watson Data PlatformDeveloper [email protected]

Developer of the RiseBlurring the line between Developer and Data Scientist with PixieDustDEVOXX US 2017, San Jose

©2017 IBM Corporation

Why are you here?–More and more, companies make bet-the-business data driven

decisions about their customers, competitors and products• Good news: they are drowning in Data• Bad news: they are drowning in Data

–Solving the Data problems of tomorrow cannot be done by data scientists alone.

–Decision making has moved from the elite few to the empowered many: Developers (YOU!) are getting more and more involved with Data Science, moving from stovepipe applications to “data pipelines” that integrate data and analytics.


Developers are the “New Builders”“The most successful companies today are those that understand the strategic role that developers will play in their success or failure. Not just successful technology companies – virtually every company today needs a developer strategy. There’s a reason that ESPN and Sears have rolled out API programs, that companies are being bought not for their products but their people. The reason is that developers are the most valuable resource in business.”

https://thenewkingmakers.com/


Let’s start with a story…That we all know too well

DisclaimerAll characters and events depicted in this Story are entirely fictitious. Any similarity to actual use cases, events or persons is actually intentional

How do we blur the lines between developers and data scientists?


Meet Ben “the developer”

• Hold a master degree in computer science• 10 year experience, 6 years with the company• Full stack Web developer• Languages of choice: Java, Node.js, HTML5/CSS3• Data: No SQL (Cloudant, Mongo), relational• Protocols: REST, JSON, MQTT• No major experience with Big data

Favorite Quote:“The Best Line of Code is the One I Didn't Have to

Write!”


Meet Natasha “the data scientist”

• Hold a PHD in data science• 5 year experience, 2 years with the company• Experienced in Python and R• Expert in Machine Learning and Data visualization• Software engineering is not her thing

Favorite Quote:“In God we trust.

All others bring data”W. Edwards Deming


Surprise meeting with VP of development

We have an urgent need for our marketing departmentto build an application that can provide real-time sentiment analysis on Twitter data

courtesy of http://linkq.com.vn/


Key Constraints

• You only have 6 weeks to build the application• Target consumer is the LOB user: must be easy to use even for

non technical people• Web interface: should be accessible from a standard browser

(desktop or mobile)• It must scale out of the box: I want you to look at Apache

Spark


Some learning to do

What exactly is Data Science?


What is Data Science?“Data science, also known as data-driven science, is an interdisciplinary field about scientific methods, processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured, similar to Knowledge Discovery

in Databases (KDD).”https://en.wikipedia.org/wiki/Data_science

http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

https://en.wikipedia.org/wiki/Data_science




Some learning to do

What exactly is Apache Spark?


What is Apache spark ?

Spark is an open sourcein-memory

computing framework for distributed data processing and

iterative analysis on massive data volumes


Spark Core Libraries

general compute engine, handles distributed task dispatching, scheduling and basic I/O functions

Spark SQL

executes SQL statements

performs streaming

analytics using micro-batches

common machine learning and

statistical algorithmsdistributed graph

processing framework

Spark Streaming

MllibMachine Learning

GraphXGraph

Spark Core


Spark is evolved quickly and has strong tractionSpark is one of the most active open source projects

Interest over time (Google Trends)

Source: https://www.google.com/trends/explore#q=apache%20spark&cmpt=q&tz=http://www.indeed.com/jobanalytics/jobtrends?q=apache+spark&l=

Job Trends (Indeed.com)

https://www.google.com/trends/explore


Consuming Spark

15

Spark Application(driver)

Master(cluster Manager)

Worker Node

Worker Node

Worker Node

Worker Node

…

Spark Cluster

Kernel

Master(cluster Manager)

Worker Node

Worker Node

…Spark Cluster

Notebook Server

Browser

Http/WebSockets

Kernel Protocol

Batch Job(Spark-Submit)

Interactive Notebook

• RDD Partitioning• Task packaging and

dispatching• Worker node

scheduling


Ben and Natasha start brainstorming

• I’ll work on data acquisition from Twitter and enrichment with sentiment analysis scores using Spark Streaming

• I know Java very well, but I don’t have time to learn Python.

• However, I am willing to learn Scala if that helps improve my productivity

• I’ll perform the data exploration and analysis

• I know Python and R, but am not familiar enough with Java or Scala

• I like pandas and numpy. I’m ok to learn Spark but expect the same level of apis

• I need to work iteratively with the data

I’ll need to do some data exploration too I’ll need apis to access my data


Can we collaborate using Notebooks?

What exactly is a Notebook?


What is a Notebook

Text, Annotations

Code, Data

Visualizations, Widgets, Output

• Web based UI for running apache spark console commands

• Easy, no install spark accelerator

• Best way to start working with spark


What is Jupyter?

with a “y”, clever ah? Browser

Kernel

Code Output

https://www.bluetrack.com/uploads/items_images/kernel-of-corn-stress-balls1_thumb.jpg?r=1

Look! Apache Toree for Spark

"Open source, interactive data science and scientific computing"

– Formerly IPython– Large, open, growing community and ecosystem

Very popular– “~2 million users for IPython” [1]

– $6m in funding in 2015 [3]

– 200 contributors to notebook subproject alone [4]

– 275,000 public notebooks on GitHub [2]


Notebook hosting optionsLocal Cloud

http://datascience.ibm.com/https://databricks.com/

…

http://localhost:8888/…

http://datascience.ibm.com/

https://databricks.com/

http://localhost:8888/

http://localhost:8888/


Big Data Analysis

Kernel with Apache Spark

support

code

results

Worker

…

Hosted environment

Worker

Worker

Services:Cognitive, …

Libraries:Statistics, Math, Machine

Learning, Plotting,…

Data(flat files, relational database, NoSQL database, …)


Notebooks are powerful tools for data scientists

But they seem complicated for

developers like me


PixieDust

https://github.com/ibm-cds-labs/pixiedust

Developers can:• Visualize my data, without having to learn Matplotlib• Explore my data in an interactive interface• Use Spark without having to learn it• Save my data locally or on the cloud with a simple menu• Extend PixieDust with new visualizations

Data Scientists can:• Use Python and Scala in the same notebook• Share variables between Scala and Python• Access Spark Libraries written in Scala from Python Notebooks• Monitor the progress of my Spark Jobs


Demo

One simple display() call

Results in this interactive chart


Demo

PixieDust can render geo-coded data

Into a nice map using MapBox


What if PixieDust doesn’t have the visualization I want?

Can I write my own?


Demo: PixieDust extensibility APIs

1. Provide you html fragment using Jinja2 templating

2. Specify the metadata

3. Behold the new menu


I’m ok to use Python

But I’m really more comfortable

with Scala…


PixieDust also work with Scala Notebooks

Same PixieDust Scala Apis as in Python


OK, I’m sold on Notebooks, Spark and PixieDust

Watson Tone Analyzer

Input Stream

Enrich data with Emotion Tone Scores

Processed data

Notebook

Let’s now agree on the architecture


Dividing the tasks

• Implement a Spark Streaming connector to Twitter

• Call Watson Tone Analyzer for each tweets

• Return a Spark DataFrame with the tweets enriched with Tone scores

• Code written in Scala, delivered as a Jar

• Works in a Python Notebook• Using PixieDust PackageManager, install

the Scala library delivered by Ben to load the twitter data with Tone scores

• Using PixieDust display() api, perform the data exploration and analysis: trending hashtags and sentiments

• Produce visualizations to LOB Users


A word on Watson Tone Analyzer• Uses linguistic analysis to detect 3 types of tones: Emotion, Social Tendencies and Language styles• Available as a cloud service on IBM Bluemix

Input

http://www.ibm.com/watson/developercloud/tone-analyzer.html

Results


Let’s take a look at the application


Meeting back with the VP

This is great, but C-Suite executives need to be able to run the application from Notebook, select filters and see real-time charts without writing code!


Embedded application with PixieDust plugin


Conclusion• Solving the Data problems of tomorrow

cannot be done by data scientists alone.

• Notebooks, considered by most to be the domain of data scientists, can help break down traditional silos and help team of all types who are working on data problems

• Try it for yourself today:• IBM Data Science Experience:

http://datascience.ibm.com/• Locally using PixieDust automated

installer: https://ibm-cds-labs.github.io/pixiedust/install.html


Resources• https://github.com/ibm-cds-labs/pixiedust • https://ibm-cds-labs.github.io/pixiedust/• https://medium.com/ibm-watson-data-lab/i-am-not-a-data-scientist-efe7ca6ceba2• http://spark.apache.org• www.ibm.com/analytics/us/en/technology/cloud-data-services/spark-as-a-service • http://datascience.ibm.com • www.ibm.com/watson/developercloud/tone-analyzer.html • https://medium.com/ibm-watson-data-lab/real-time-sentiment-analysis-of-twitter-hasht

ags-with-spark-7ee6ca5c1585#.

• https://developer.ibm.com/tv/builders/

https://github.com/ibm-cds-labs/pixiedust

https://ibm-cds-labs.github.io/pixiedust/

http://spark.apache.org/

http://spark.apache.org/

http://www.ibm.com/analytics/us/en/technology/cloud-data-services/spark-as-a-service

http://datascience.ibm.com/

http://www.ibm.com/watson/developercloud/tone-analyzer.html

https://medium.com/ibm-watson-data-lab/real-time-sentiment-analysis-of-twitter-hashtags-with-spark-7ee6ca5c1585

https://medium.com/ibm-watson-data-lab/real-time-sentiment-analysis-of-twitter-hashtags-with-spark-7ee6ca5c1585

https://developer.ibm.com/tv/builders/

Register Now! seti.org/ML4SETI

from The SETI InstituteHackathon

Galvanize, San FranciscoJune 10-11, 2017

Code ChallengeJune/July, 2017

© 2017 IBM Corporation

Mission Badge #6:SMS Text

Your mission should you choose to accept it….

Join us at the IBM Booth for hands-on labs, demos, games and talk to our developers.

Text PixieDust to 41411and learn more about how Notebooks aren’t just for Data Scientists

Enter the raffle by completing missions for a chance to win• a Drone• TJBot Kit • VR glasses


Questions!

David [email protected]@DTAIEB55

courtesy: http://omogemura.com/

Blurring the line between Developer and Data Scientist with PixieDust

Technology

Transcript of Blurring the line between Developer and Data Scientist with PixieDust