© Cloudera, Inc. All rights reserved.

Page 1

Transforming Analytics with Cloudera Data Science Workbench

Process data, develop and serve predictive models.

Page 2

Age of Machine Learning

[Chart: cost of compute and data volume plotted against time, 1950s to 2010s, with regions labeled "NO Machine Learning" and "Machine Learning"]

Page 3

Our current platform

OPERATIONS: Cloudera Manager, Cloudera Director

DATA MANAGEMENT: Cloudera Navigator, Encrypt and KeyTrustee, Optimizer

INTEGRATE
• STRUCTURED: Sqoop
• UNSTRUCTURED: Kafka, Flume

PROCESS, ANALYZE, SERVE
• BATCH: Spark, Hive, Pig, MapReduce
• STREAM: Spark
• SQL: Impala
• SEARCH: Solr
• OTHER: Kite

UNIFIED SERVICES
• RESOURCE MANAGEMENT: YARN
• SECURITY: Sentry, RecordService

STORE
• FILESYSTEM: HDFS
• RELATIONAL: Kudu
• NoSQL: HBase
• OTHER: Object Store

Page 4

Apache Spark

De facto Data Processing and Modern Analytic Engine

Page 5

Apache Spark

Fast and flexible general purpose data processing for Hadoop

Data Engineering

Stream Processing

Data Science & Machine Learning

Unified API and processing Engine for large scale data

Page 6

Spark Addresses Common Limitations

Access and Usability

One of the key advantages of Apache Spark is its intuitive and flexible API for big-data processing, available in popular programming languages. Prior to Apache Spark, users had access only to very limited, inflexible abstractions for processing large distributed data, with poor support outside Java.

Data Processing Performance

MapReduce made big strides in enabling cost-effective batch processing of large volumes of data. However, businesses continue to see a need to shorten data processing windows and consume data faster, which requires a new framework with significantly better performance.

Machine Learning at Scale

Data science and machine learning on big data are exciting areas of focus. However, they require libraries that enable building models on large distributed data, and APIs that allow flexible exploration of data.

Page 7

Apache Spark

Apache Spark is at the core of our data science experience

• Libraries for common machine learning

• Trusted in production by our customers

• Delivered with expert support and training

• A requirement for our Data Science Workbench

Apache Spark is a huge driver for machine learning

• Native language development tools

• Reliable operation at big data scale

• Native access to Hadoop data for testing and training

Spark 2.1 is here

• Separate parcel for easy implementation for multiple Spark instances

• Better Streaming Performance

• Machine Learning Persistence

Page 8

Machine Learning

Page 9

Machine Learning on Hadoop

[Diagram: flow from raw data to the end user]

Raw data: many sources, many formats, varying validity

Data Engineering: cleaning, merging, filtering

Well-formatted data: training, validation, and test data

Data Science: model building, model training, hyper-parameter tuning

Validated ML models

Data Engineering: pipeline execution, production operation, ongoing data ingestion

End user: consumption for analysis
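As a rough illustration of the data engineering steps on this slide, the sketch below cleans raw records of varying validity and splits the well-formatted result into training, validation, and test sets. It uses only the Python standard library. The function names, record fields, and split fractions are assumptions made for this example, not part of CDH or Spark.

```python
import random

def clean(records):
    """Drop records with missing fields and normalise the text field."""
    cleaned = []
    for rec in records:
        if rec.get("text") is None or rec.get("label") is None:
            continue  # varying validity: skip incomplete rows
        cleaned.append({"text": rec["text"].strip().lower(), "label": rec["label"]})
    return cleaned

def split_dataset(rows, train_frac=0.6, val_frac=0.2, seed=42):
    """Shuffle rows reproducibly and cut them into train / validation / test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains becomes the test set
    return train, val, test

# Hypothetical raw input with two invalid rows.
raw = [
    {"text": "  Spark ", "label": 1},
    {"text": None, "label": 0},
    {"text": "Hive", "label": None},
    {"text": "Impala", "label": 0},
    {"text": "Kudu", "label": 1},
    {"text": "HBase ", "label": 0},
    {"text": "Solr", "label": 1},
]
well_formatted = clean(raw)
train, val, test = split_dataset(well_formatted)
```

On a cluster the same steps would normally run as distributed jobs over the full data set; the in-memory lists here only show the shape of the work.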

Page 10

Machine Learning Deployment Patterns

Train

• Build in notebooks

• Train on CDH (Spark ML)

• Deliver on transactional systems or run batch

Build and Train

• Build on CDH (Workbench)

• Train on CDH (Spark ML)

• Deliver on transactional systems

Build, Train, and Serve

• Build on CDH (Workbench)

• Train on CDH (Spark ML)

• Deliver on CDH (HBase/Kudu/Spark Streaming)
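All three patterns separate training from delivery, which means the trained model has to be persisted and then loaded where predictions are served. The sketch below shows that hand-off in plain Python: a batch step fits a tiny linear model, writes its parameters to a JSON file, and a separate serving step reloads them. The JSON layout and every function name are assumptions made for this example; this is not the Spark ML persistence API.

```python
import json
import os
import tempfile

def train(rows):
    """Batch step: fit y = slope * x + intercept by ordinary least squares."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in rows)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = cov_xy / var_x
    return {"slope": slope, "intercept": mean_y - slope * mean_x}

def save_model(model, path):
    """Persist the fitted parameters so another system can load them."""
    with open(path, "w") as f:
        json.dump(model, f)

def load_model(path):
    """Serving side: read the parameters back."""
    with open(path) as f:
        return json.load(f)

def predict(model, x):
    return model["slope"] * x + model["intercept"]

# Train (batch side); the toy data lies exactly on y = 2x + 1.
rows = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
model_path = os.path.join(tempfile.mkdtemp(), "model.json")
save_model(train(rows), model_path)

# Serve (delivery side); this could be a different process or system.
served = load_model(model_path)
prediction = predict(served, 4.0)
```

Keeping the persisted artifact independent of the training code is what lets the delivery target vary between the three patterns.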

Page 11

Apache Spark MLlib

Collection of mainstream machine learning algorithms built on Spark

Including:

• Classifiers: logistic regression, boosted trees, random forests, etc

• Clustering: k-means, Latent Dirichlet Allocation (LDA)

• Recommender Systems: Alternating Least Squares

• Dimensionality Reduction: Principal Component Analysis (PCA) and Singular Value Decomposition (SVD)

• Feature Engineering & Selection: TF-IDF, Word2Vec, Normalizer, etc

• Statistical Functions: Chi-Squared Test, Pearson Correlation, etc
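To make one of the listed feature-engineering techniques concrete, here is a minimal single-machine sketch of TF-IDF in plain Python. It is not MLlib code and does not call Spark. The smoothed inverse document frequency, log((m + 1) / (df + 1)), follows the form given in the Spark MLlib feature-extraction guide; the function name and sample documents are assumptions made for this example.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    m = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF: terms that appear in many documents get lower weight.
    idf = {term: math.log((m + 1) / (count + 1)) for term, count in df.items()}
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        weights.append({term: tf[term] * idf[term] for term in tf})
    return weights

# Hypothetical corpus of three tiny documents.
docs = [
    "spark runs on hadoop".split(),
    "spark mllib trains models".split(),
    "kudu stores data".split(),
]
weights = tf_idf(docs)
```

In this corpus "spark" appears in two of the three documents, so it receives a lower weight than "hadoop", which appears in only one.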

Page 12

Cloudera Data Science

Self-Service Data Science for the Enterprise

Page 13

Types of data science

Exploratory (discover and quantify opportunities)

• Team: Data scientists and analysts

• Goal: Understand data, develop and improve models, share insights

• Data: New and changing; often sampled

• Environment: Local machine, sandbox cluster

• Tools: R, Python, SAS/SPSS, SQL; notebooks; data wrangling/discovery tools, …

• End State: Reports, dashboards, PDF, MS Office

Operational (deploy production systems)

• Team: Data engineers, developers, SREs

• Goal: Build and maintain applications, improve model performance, manage models in production

• Data: Known data; full scale

• Environment: Production clusters

• Tools: Java/Scala, C++; IDEs; continuous integration, source control, …

• End State: Online/production applications

Page 14

https://medium.com/@KevinSchmidtBiz/data-engineer-vs-data-scientist-vs-business-analyst-b68d201364bc

Page 15

Common Limitations

Access

Secured clusters are often hard for data science professionals to connect to, either because they don’t have the right permissions or because resources are too scarce to afford them access. In addition, popular frameworks and libraries don’t read Hadoop data formats out of the box.

Scale

Notebook environments seldom have enough data storage for medium data, let alone big data. Data scientists are often relegated to sample data and constrained when working on distributed systems. Popular frameworks and libraries don’t easily parallelize across the cluster.

Developer Experience

Popular notebooks don’t work well with access engines like Spark, and package deployment and dependency management across multiple software versions are often hard to manage. Once a model is built, there is no easy path from model development to production.

Page 16

Open data science in the enterprise

IT: drive adoption while maintaining compliance

Data Scientist: explore, experiment, iterate

Page 17

Page 18

Solving Data Science is a Full-Stack Problem

• Leverage Big Data
  ✓ Hadoop

• Enable real-time use cases
  ✓ Kafka, Spark Streaming

• Provide sufficient toolset for the Data Analysts
  ✓ Spark, Hive, Hue

• Provide sufficient toolset for the Data Scientists + Data Engineers
  ✓ Data Science Workbench (beta)

• Provide standard data governance capabilities
  ✓ Navigator + Partners

• Provide standard security across the stack
  ✓ Kerberos, Sentry, Record Service, KMS/KTS

• Provide flexible deployment options
  ✓ Cloudera Director

• Integrate with partner tools
  ✓ Rich Ecosystem

• Provide management tools that make it easy for IT to deploy/maintain
  ✓ Cloudera Manager/Director

Page 19

Data Science Workbench

Self-service data science for the enterprise

Page 20

Introducing Cloudera Data Science Workbench

Self-service data science for the enterprise

Accelerates data science from development to production with:

• Secure self-service environments for data scientists to work against Cloudera clusters

• Support for Python, R, and Scala, plus project dependency isolation for multiple library versions

• Workflow automation, version control, collaboration and sharing

Page 21

Key Benefits

How is Cloudera Data Science different?

Works with fully secured clusters

One tool for multiple languages (Python, R, Scala)

Multi-tenant Architecture

Common Platform

Page 22

Security, Lineage and Governance

[Architecture diagram; component labels:]

Ingestion: Flume/Sqoop/Kafka

Analytics: Hive/Impala/Spark/Search

ML: spark.mllib

Deep Learning Frameworks

HDFS

Sessions: Session A, Session B, … Session N

Cloudera Manager

Page 23

How does CDSW help?

• Visualize results

• Change and compile source code

• Retrain and redeploy

• Extensible engines

• Configurable sessions

• Trivial to tweak parameters

• Multiple users

• Roles/Governance

• CDH

Page 24

Thank You