Data Engineering and Streaming Analytics

Page 1:

Data Engineering and Streaming Analytics

Page 2:

Welcome and Housekeeping

● You should have received instructions on how to participate in the training session

● If you have questions, you can use the Q&A window in GoToWebinar

● The recording of the session will be made available after the event

Page 3:

About Your Instructor

Doug Bateman is Director of Training and Education at Databricks. Prior to this role he was Director of Training at NewCircle.

Page 4:

Apache Spark - Genesis and Open Source

Spark was originally created at the AMP Lab at Berkeley. The original creators went on to found Databricks.

Spark was created to bring data and machine learning together.

Spark was donated to the Apache Software Foundation to create the Apache Spark open source project.

Page 5:

VISION

Accelerate innovation by unifying data science, engineering and business

SOLUTION

Unified Analytics Platform

WHO WE ARE

• Original creators of Apache Spark

• 2000+ global companies use our platform across the big data & machine learning lifecycle

Page 6:

Apache Spark: The 1st Unified Analytics Engine

Uniquely combined Data & AI technologies

Big Data Processing: ETL + SQL + Streaming

Machine Learning: MLlib + SparkR

Runtime

Delta

Spark Core Engine

Page 7:

Introducing Delta Lake

A New Standard for Building Data Lakes

● Open Format Based on Parquet

● With Transactions

● Apache Spark APIs

Page 8:

Apache Spark - A Unified Analytics Engine

Page 9:

Apache Spark

“Unified analytics engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing”

● Research project at UC Berkeley in 2009

● APIs: Scala, Java, Python, R, and SQL

● Built by more than 1,200 developers from more than 200 companies

Page 10:

HOW TO PROCESS LOTS OF DATA?

Page 11:

M&Ms

Page 12:

Spark Cluster

One Driver and many Executor JVMs
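The driver/executor split can be illustrated with a small sketch. It assumes the M&Ms slide refers to counting candies by color, which the transcript does not confirm. Threads stand in for executor JVMs, and the names `count_colors` and `driver_count` are hypothetical, not Spark APIs.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_colors(partition):
    """Executor-side work: count the colors in one partition."""
    return Counter(partition)


def driver_count(items, num_executors=4):
    """Driver-side work: split the data, hand out partitions, merge results."""
    # Round-robin split so every "executor" gets roughly the same share.
    partitions = [items[i::num_executors] for i in range(num_executors)]
    with ThreadPoolExecutor(max_workers=num_executors) as pool:
        partials = list(pool.map(count_colors, partitions))
    total = Counter()
    for partial in partials:
        total.update(partial)  # adds the partial counts together
    return total


# Hypothetical input: 7,000 candies in four colors.
candies = ["red", "blue", "green", "red", "yellow", "blue", "red"] * 1000
totals = driver_count(candies)
```

Each worker only ever sees its own partition; only the small per-partition counts travel back to the driver, which is the general shape of a distributed aggregation.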

Page 13:

Data Lakes - A Key Enabler of Analytics

Data Lake

Data Science and ML

• Recommendation Engines

• Risk, Fraud, & Intrusion Detection

• Customer Analytics

• IoT & Predictive Maintenance

• Genomics & DNA Sequencing

Page 14:

Data Lake

Data Lake Challenges

● Unreliable, Low Quality Data

● Slow Performance

Data Science and ML

• Recommendation Engines

• Risk, Fraud, & Intrusion Detection

• Customer Analytics

• IoT & Predictive Maintenance

• Genomics & DNA Sequencing

> 65% of big data projects fail, per Gartner

Page 15:

1. Data Reliability Challenges

✗ Failed production jobs leave data in a corrupt state, requiring tedious recovery

✗ Lack of schema enforcement creates inconsistent and low quality data

✗ Lack of consistency makes it almost impossible to mix appends and reads, batch and streaming

Page 16:

2. Performance Challenges

● Too many small or very big files: more time is spent opening & closing files than reading their contents (worse with streaming)

● Partitioning, aka “poor man’s indexing”: breaks down if you pick the wrong fields or when the data has many dimensions or high-cardinality columns

● No caching: cloud storage throughput is low (S3 is 20-50 MB/s/core vs 300 MB/s/core for local SSDs)
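To make the throughput gap concrete, here is a back-of-the-envelope calculation using the per-core figures quoted on the slide. The dataset size and core count are hypothetical, and the model ignores everything except raw sequential read throughput.

```python
def scan_seconds(data_gb, mb_per_sec_per_core, cores):
    """Idealized time to read data_gb at the given per-core throughput."""
    total_mb = data_gb * 1024
    return total_mb / (mb_per_sec_per_core * cores)


DATA_GB = 100  # hypothetical dataset size
CORES = 8      # hypothetical cluster size

s3_slow = scan_seconds(DATA_GB, 20, CORES)     # S3 at 20 MB/s/core
s3_fast = scan_seconds(DATA_GB, 50, CORES)     # S3 at 50 MB/s/core
local_ssd = scan_seconds(DATA_GB, 300, CORES)  # local SSD at 300 MB/s/core

print(f"S3:  {s3_fast:.0f}-{s3_slow:.0f} s")
print(f"SSD: {local_ssd:.0f} s")
```

Under these assumptions the same scan takes roughly 6x to 15x longer from S3 than from local SSD, which is the gap that caching is meant to close.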

Page 17:

Databricks Delta

Next-generation engine built on top of Spark

● Co-designed compute & storage

● Compatible with Spark APIs

● Built on open standards (Parquet)

[Diagram: Databricks Delta with Indexes & Stats, Transactional Log, and Versioned Parquet Files; leverages your cloud blob storage]

Page 18:

Delta Makes Data Reliable

Reliable data always ready for analytics

Key Features

● ACID Transactions

● Schema Enforcement

● Upserts

● Data Versioning

[Diagram: Streaming, Batch, and Updates/Deletes; Delta Table with Transactional Log and Versioned Parquet Files]
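The idea of a transactional log over immutable data files can be sketched in a few lines of plain Python. This is a toy in-memory model, not Delta's actual implementation or file format; `ToyDeltaTable`, `commit`, and `read` are hypothetical names chosen for illustration.

```python
class ToyDeltaTable:
    """Toy model: immutable data files plus an append-only commit log."""

    def __init__(self, schema):
        self.schema = set(schema)  # column names every row must have
        self.files = {}            # file name -> rows; never modified after write
        self.log = []              # log[i] = file names visible at version i

    def commit(self, rows):
        # Schema enforcement: validate everything before touching any state,
        # so a rejected write leaves the table exactly as it was.
        for row in rows:
            if set(row) != self.schema:
                raise ValueError(f"schema mismatch: {sorted(row)}")
        name = f"part-{len(self.files):05d}"
        self.files[name] = list(rows)
        visible = self.log[-1] if self.log else ()
        # The single log append is the commit point: readers see either
        # the old file list or the new one, never a partial write.
        self.log.append(visible + (name,))
        return len(self.log) - 1

    def read(self):
        """Return the rows of the latest committed version."""
        if not self.log:
            return []
        return [row for name in self.log[-1] for row in self.files[name]]


table = ToyDeltaTable(["id", "value"])
table.commit([{"id": 1, "value": "a"}])
table.commit([{"id": 2, "value": "b"}])
try:
    table.commit([{"id": 3, "wrong_column": "c"}])  # rejected by schema check
except ValueError:
    pass
rows = table.read()
```

Because older log entries are never rewritten, every earlier version of the table stays addressable, which is what makes data versioning possible in this model.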

Page 19:

Delta Makes Data More Performant

Fast, highly responsive queries at scale

Key Features

● Compaction

● Caching

● Data skipping

● Z-ordering

[Diagram: Delta Engine with I/O & Query Optimizations; Open Spark APIs; Delta Table with Transactional Log and Versioned Parquet Files]
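Data skipping can be sketched as a filter over per-file min/max statistics. The sketch below assumes each file records the minimum and maximum of one numeric column; the file names, the statistics layout, and `files_to_scan` are hypothetical and do not reflect Delta's real metadata format.

```python
def files_to_scan(file_stats, low, high):
    """Return names of files whose [min, max] range overlaps [low, high]."""
    return [
        f["name"]
        for f in file_stats
        if f["max"] >= low and f["min"] <= high
    ]


# Values clustered by file: each file covers a narrow, disjoint range.
clustered = [
    {"name": "part-0", "min": 1, "max": 100},
    {"name": "part-1", "min": 101, "max": 200},
    {"name": "part-2", "min": 201, "max": 300},
]

# Same value range, but spread across all files.
scattered = [
    {"name": "part-0", "min": 1, "max": 298},
    {"name": "part-1", "min": 2, "max": 299},
    {"name": "part-2", "min": 3, "max": 300},
]

# Query: WHERE value BETWEEN 150 AND 160
clustered_hits = files_to_scan(clustered, 150, 160)
scattered_hits = files_to_scan(scattered, 150, 160)
```

With clustered data only one file has to be opened; with scattered data the statistics cannot rule anything out. Layout techniques such as Z-ordering aim to keep related values in the same files so that skipping stays effective.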

Page 20:

Get Started with Delta using Spark APIs

Instead of parquet...

CREATE TABLE ...
USING parquet...

dataframe.write.format("parquet").save("/data")

… simply say delta

CREATE TABLE ...
USING delta…

dataframe.write.format("delta").save("/data")

Page 21:

Using Delta with your Existing Parquet Tables

Step 1: Convert Parquet to Delta Tables

CONVERT TO DELTA parquet.`path/to/table` [NO STATISTICS]
[PARTITIONED BY (col_name1 col_type1, col_name2 col_type2, ...)]

Step 2: Optimize Layout for Fast Queries

OPTIMIZE events
WHERE date >= current_timestamp() - INTERVAL 1 day
ZORDER BY (eventType)

Page 22:

Upsert/Merge: Fine-grained Updates

MERGE INTO customers -- Delta table
USING updates
ON customers.customerId = updates.customerId
WHEN MATCHED THEN
  UPDATE SET address = updates.address
WHEN NOT MATCHED THEN
  INSERT (customerId, address) VALUES (updates.customerId, updates.address)
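The semantics of the MERGE statement above can be mimicked on plain Python dictionaries. This is only an illustration of the matched/not-matched logic, assuming `customerId` is unique in both inputs; `merge_upsert` is a hypothetical helper, not a Spark or Delta API.

```python
def merge_upsert(target, updates, key="customerId"):
    """Apply updates to target: update matched rows, insert unmatched ones."""
    merged = {row[key]: dict(row) for row in target}  # copy, keyed by id
    for upd in updates:
        if upd[key] in merged:
            # WHEN MATCHED THEN UPDATE SET address = updates.address
            merged[upd[key]]["address"] = upd["address"]
        else:
            # WHEN NOT MATCHED THEN INSERT (customerId, address)
            merged[upd[key]] = {key: upd[key], "address": upd["address"]}
    return list(merged.values())


customers = [
    {"customerId": 1, "address": "1 Old Road"},
    {"customerId": 2, "address": "2 Main Street"},
]
updates = [
    {"customerId": 1, "address": "10 New Avenue"},  # existing customer moved
    {"customerId": 3, "address": "3 Side Lane"},    # new customer
]
result = merge_upsert(customers, updates)
```

Customer 1 is updated in place, customer 2 is left untouched, and customer 3 is inserted, all in one pass over the updates.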

Page 23:

Time Travel

Reproduce experiments & reports

SELECT count(*) FROM events
TIMESTAMP AS OF timestamp

SELECT count(*) FROM events
VERSION AS OF version

spark.read.format("delta").option("timestampAsOf", timestamp_string).load("/events/")

Rollback accidental bad writes

INSERT INTO my_table
SELECT * FROM my_table TIMESTAMP AS OF date_sub(current_date(), 1)
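One way to picture time travel is as a lookup in the commit history: a timestamp query resolves to the latest version committed at or before that time. The sketch below assumes that resolution rule and uses made-up integer timestamps; `version_as_of` and `read_as_of` are hypothetical helpers, not Delta APIs.

```python
from bisect import bisect_right

# Hypothetical commit history: (commit time, table contents at that version).
history = [
    (100, ["a"]),       # version 0
    (200, ["a", "b"]),  # version 1
    (300, ["a"]),       # version 2: "b" was deleted by accident
]


def version_as_of(commit_times, timestamp):
    """Return the latest version committed at or before timestamp."""
    index = bisect_right(commit_times, timestamp) - 1
    if index < 0:
        raise ValueError("timestamp is earlier than the first commit")
    return index


def read_as_of(history, timestamp):
    """Return the table contents as they were at timestamp."""
    commit_times = [t for t, _ in history]
    return history[version_as_of(commit_times, timestamp)][1]


before_delete = read_as_of(history, 250)  # resolves to version 1
latest = read_as_of(history, 999)         # resolves to version 2
```

Reading the table as of time 250 recovers the row that the later commit removed, which is the basis for both reproducing old reports and repairing bad writes.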

Page 24:

Apple: Threat Detection at Scale with Delta

> 100TB new data/day

> 300B events/day

Data Science

Machine Learning

Databricks Delta

Streaming Refinement Alerts

BEFORE DELTA

● Took 20 engineers; 24 weeks to build

● Only able to analyze a 2-week window of data

WITH DELTA

● Took 2 engineers; 2 weeks to build

● Analyze 2 years of batch with streaming data

● Detect signal across user, application and network logs

● Quickly analyze the blast radius with ad hoc queries

● Respond quickly in an automated fashion

● Scale across petabytes of data and 100s of security analysts

KEYNOTE TALK

Page 26:

Questions?

Further Training Options: http://bit.ly/DBTrng

● Live Onsite Training

● Live Online

● Self-Paced

Meet one of our Spark experts: http://bit.ly/ContactUsDB