Data Engineering and Streaming Analytics
Welcome and Housekeeping
● You should have received instructions on how to participate in the training session
● If you have questions, you can use the Q&A window in GoToWebinar
● The recording of the session will be made available after the event
About Your Instructor
Doug Bateman is Director of Training and Education at Databricks. Prior to this role he was Director of Training at NewCircle.
Apache Spark - Genesis and Open Source
Spark was originally created at the AMPLab at UC Berkeley. Its original creators went on to found Databricks.
Spark was created to bring data and machine learning together.
Spark was donated to the Apache Software Foundation to create the Apache Spark open source project.
VISION: Accelerate innovation by unifying data science, engineering and business
WHO WE ARE: Original creators of Apache Spark; 2000+ global companies use our platform across the big data & machine learning lifecycle
SOLUTION: Unified Analytics Platform
Apache Spark: The 1st Unified Analytics Engine
Databricks Runtime: Delta + Spark Core Engine
● Big Data Processing: ETL + SQL + Streaming
● Machine Learning: MLlib + SparkR
● Uniquely combines Data & AI technologies
Delta: an open format based on Parquet, with transactions, accessed through the Apache Spark APIs
Introducing Delta Lake
A New Standard for Building Data Lakes
Apache Spark - A Unified Analytics Engine
Apache Spark
“Unified analytics engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing”
● Research project at UC Berkeley in 2009
● APIs: Scala, Java, Python, R, and SQL
● Built by more than 1,200 developers from more than 200 companies
HOW TO PROCESS LOTS OF DATA?
M&Ms
Spark Cluster
One Driver and many Executor JVMs
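As a minimal sketch of this architecture (hypothetical app name and dataset; not from the slides), the driver below builds the query plan while the executors count records in parallel across partitions:

from pyspark.sql import SparkSession

# The driver JVM runs this code and coordinates the work.
spark = SparkSession.builder.appName("mnm-count").getOrCreate()

# Each executor JVM processes a subset of the 8 partitions in parallel.
counts = (spark.range(0, 100_000_000, numPartitions=8)   # hypothetical dataset
          .selectExpr("id % 6 AS color")                  # pretend there are 6 M&M colors
          .groupBy("color")
          .count())

counts.show()  # results are gathered back to the driver for display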
Data Lakes - A Key Enabler of Analytics
Data Lake
Data Science and ML
• Recommendation Engines
• Risk, Fraud, & Intrusion Detection
• Customer Analytics
• IoT & Predictive Maintenance
• Genomics & DNA Sequencing
Data Lake Challenges
• Unreliable, low-quality data
• Slow performance
> 65% of big data projects fail, per Gartner
1. Data Reliability Challenges
Failed production jobs leave data in a corrupt state, requiring tedious recovery
Lack of schema enforcement creates inconsistent and low-quality data
Lack of consistency makes it almost impossible to mix appends and reads, or batch and streaming
2. Performance Challenges
Too many small (or very large) files: more time is spent opening and closing files than reading their contents (worse with streaming)
Partitioning, aka "poor man's indexing": breaks down if you pick the wrong fields, or when data has many dimensions or high-cardinality columns
No caching: cloud storage throughput is low (S3 is 20-50 MB/s/core vs. 300 MB/s/core for local SSDs)
Databricks Delta: Next-generation engine built on top of Spark
● Co-designed compute & storage
● Compatible with Spark APIs
● Built on open standards (Parquet)
Databricks Delta
● Indexes & stats
● Transactional log
● Versioned Parquet files
● Leverages your cloud blob storage
Delta Makes Data Reliable
Delta Table = transactional log + versioned Parquet files, fed by streaming, batch, and updates/deletes
Key Features
● ACID Transactions
● Schema Enforcement
● Upserts
● Data Versioning
Reliable data always ready for analytics
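To make schema enforcement concrete, here is a hedged sketch (hypothetical paths and column names, assuming a Spark session with Delta Lake configured): an append whose schema does not match the table is rejected rather than silently corrupting it, and schema evolution has to be requested explicitly.

# Create a small Delta table at a hypothetical path.
events = spark.createDataFrame([(1, "click")], ["id", "eventType"])
events.write.format("delta").save("/tmp/delta/events")

# Schema enforcement: appending a DataFrame with an unexpected column fails
# with an AnalysisException instead of writing low-quality data.
bad = spark.createDataFrame([(2, "view", "oops")], ["id", "eventType", "unexpected"])
try:
    bad.write.format("delta").mode("append").save("/tmp/delta/events")
except Exception as err:
    print("Write rejected:", err)

# Evolving the schema requires opting in explicitly.
bad.write.format("delta").mode("append") \
   .option("mergeSchema", "true").save("/tmp/delta/events")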
Delta Makes Data More Performant
Fast, highly responsive queries at scale
Key Features
● Compaction
● Caching
● Data skipping
● Z-ordering
Delta Engine: I/O & query optimizations behind open Spark APIs, on top of the Delta Table (transactional log + versioned Parquet files)
Get Started with Delta using Spark APIs
Instead of parquet...
CREATE TABLE ...
USING parquet ...
dataframe.write.format("parquet").save("/data")
... simply say delta:
CREATE TABLE ...
USING delta ...
dataframe.write.format("delta").save("/data")
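A quick round-trip sketch (hypothetical path and data, assuming an existing SparkSession `spark` with the Delta Lake libraries available): writing and reading a Delta table uses the familiar DataFrame reader/writer, and the same location can be exposed to SQL.

# Write: identical to a Parquet write except for the format name.
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
df.write.format("delta").mode("overwrite").save("/tmp/delta/people")

# Read it back as a DataFrame, or register it as a table for SQL access.
people = spark.read.format("delta").load("/tmp/delta/people")
spark.sql("CREATE TABLE IF NOT EXISTS people USING delta LOCATION '/tmp/delta/people'")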
Using Delta with your Existing Parquet Tables
Step 1: Convert Parquet to Delta Tables
CONVERT TO DELTA parquet.`path/to/table` [NO STATISTICS]
[PARTITIONED BY (col_name1 col_type1, col_name2 col_type2, ...)]
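The same conversion is available from the Python API. A hedged sketch, assuming an existing partitioned Parquet table at a hypothetical path:

from delta.tables import DeltaTable

# Convert the Parquet directory in place: a Delta transaction log is added,
# the existing Parquet data files are not rewritten.
DeltaTable.convertToDelta(
    spark,
    "parquet.`/data/events`",   # hypothetical path to the Parquet table
    "date DATE"                 # partition schema, only if the table is partitioned
)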
Step 2: Optimize Layout for Fast Queries
OPTIMIZE events
WHERE date >= current_timestamp() - INTERVAL 1 day
ZORDER BY (eventType)
Upsert/Merge: Fine-grained Updates
MERGE INTO customers -- Delta table
USING updates
ON customers.customerId = updates.customerId
WHEN MATCHED THEN
  UPDATE SET address = updates.address
WHEN NOT MATCHED THEN
  INSERT (customerId, address) VALUES (updates.customerId, updates.address)
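The same upsert can be expressed with the DeltaTable Python API. A minimal sketch, assuming a hypothetical table path and an `updatesDF` DataFrame of new customer records:

from delta.tables import DeltaTable

customers = DeltaTable.forPath(spark, "/delta/customers")   # hypothetical path

# Update matching customers' addresses, insert the rest as new rows.
(customers.alias("customers")
  .merge(updatesDF.alias("updates"),
         "customers.customerId = updates.customerId")
  .whenMatchedUpdate(set={"address": "updates.address"})
  .whenNotMatchedInsert(values={
      "customerId": "updates.customerId",
      "address": "updates.address"})
  .execute())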
Time Travel
SELECT count(*) FROM events
TIMESTAMP AS OF timestamp

SELECT count(*) FROM events
VERSION AS OF version

spark.read.format("delta").option("timestampAsOf", timestamp_string).load("/events/")

INSERT INTO my_table
SELECT * FROM my_table TIMESTAMP AS OF date_sub(current_date(), 1)

Reproduce experiments & reports. Roll back accidental bad writes.
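Version-based reads work the same way from Python; a small sketch with a hypothetical version number, alongside the rollback pattern from the slide:

# Read the table as of a specific version (or use "timestampAsOf" as above).
df_v5 = spark.read.format("delta") \
    .option("versionAsOf", 5) \
    .load("/events/")

# Rollback pattern: re-insert yesterday's snapshot of the table.
spark.sql("""
  INSERT INTO my_table
  SELECT * FROM my_table TIMESTAMP AS OF date_sub(current_date(), 1)
""")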
Apple: Threat Detection at Scale with Delta
> 100 TB of new data/day; > 300 B events/day
Pipeline on Databricks Delta: Streaming → Refinement → Alerts, feeding Data Science and Machine Learning
BEFORE DELTA
● Took 20 engineers; 24 weeks to build
● Only able to analyze 2 week window of data
WITH DELTA
● Took 2 engineers; 2 weeks to build
● Analyze 2 years of batch with streaming data
● Detect signals across user, application, and network logs
● Quickly analyze the blast radius with ad hoc queries
● Respond quickly in an automated fashion
● Scale across petabytes of data and hundreds of security analysts
KEYNOTE TALK
Spark References
● Databricks
● Apache Spark ML Programming Guide
● Scala API Docs
● Python API Docs
● Spark Key Terms
Questions?
Further Training Options: http://bit.ly/DBTrng
● Live Onsite Training
● Live Online
● Self Paced
Meet one of our Spark experts: http://bit.ly/ContactUsDB