Boost your Cloud Data & Analytics

Boost your Cloud Data & Analytics Avoiding vendor lock-ins with Multi Cloud Data Lakes Big Data & AI in Finance, Banking & Insurance / 5th May 2021 Toma Buchinsky / Slavomir Krivak / Prashant Gangwar

Transcript of Boost your Cloud Data & Analytics

Page 1: Boost your Cloud Data & Analytics

Boost your Cloud Data & Analytics: Avoiding vendor lock-ins with Multi Cloud Data Lakes

Big Data & AI in Finance, Banking & Insurance / 5th May 2021

Toma Buchinsky / Slavomir Krivak / Prashant Gangwar

Page 2: Boost your Cloud Data & Analytics

Presenters

Slavomir Krivak, Big Data & Cloud Architect

Prashant Gangwar, IT Consultant

Toma Buchinsky, CEO Adastra Germany & Data Architect

Page 3: Boost your Cloud Data & Analytics

Agenda

01 Introduction

02 Getting started with Cloud Data & Analytics

03 Avoiding Cloud Vendor Lock-ins

04 Case Study 1: Multi-Cloud D&A with Azure DevOps

05 Case Study 2: Portable D&A Platform Spark & Kubernetes

Page 4: Boost your Cloud Data & Analytics

Our Solution Portfolio

ARTIFICIAL INTELLIGENCE

Machine Learning

Deep Learning

Statistical Analysis

Text Mining

Exploratory Data Analysis

Visual Analytics

Feature Engineering

CLOUD SERVICES

Readiness Assessment

Cloud Provider Evaluation

Cloud Migration

Managed Services

Azure, AWS, GCP

DIGITAL BUSINESS

Digital Transformation

Robotic Process Automation (RPA)

Internet of Things (IoT)

Blockchain

Mobile Apps

DATA ENGINEERING

Data Strategy

Modern Data Warehousing

Data Lake

Data Integration

Data Visualization

Business Intelligence

GOVERNANCE

Data Governance

Data Quality

Master/Reference Data Management

Data Lineage

Metadata Management

Page 5: Boost your Cloud Data & Analytics

Adastra Group worldwide

2000+ professionals

1000+ projects in 46 countries

20 offices in 10 countries

DE: Frankfurt, Wolfsburg, Munich, Hannover, Magdeburg, Darmstadt
CA: Toronto, Vancouver
US: Detroit, Stamford
SK: Bratislava
CZ: Prague
UK: London
BG: Sofia, Varna, Plovdiv
RU: Moscow
TH: Bangkok
AU: Sydney

Page 6: Boost your Cloud Data & Analytics

The good news: German top managers want to transform their companies into data-driven businesses.

Page 7: Boost your Cloud Data & Analytics

Traditional DWH (on-prem)

Data Lakes

Age of cloud-first programs

Page 8: Boost your Cloud Data & Analytics

What do large cloud providers sell?

[Figure: Azure reference architecture for data & analytics, laid out in the stages SOURCES, INGEST, RAW, ANALYZE and PRESENT, with an Azure structured analytics path and an Azure advanced analytics path. Components shown: on-premise sources (files, Active Directory), cloud SaaS sources and IoT data; Azure IoT Hub, managed file transfer, Azure Data Factory (batch loads, ADF IR, data gateway), Azure Logic Apps (event handling), Azure Stack Edge (local compute) and Stream Analytics (IoT streaming); Azure Data Lake Storage with stage and curated zones for structured (orc, parquet, csv, ...), semi-structured (json, xml, ...) and unstructured (image, wav, doc, ...) data; Azure Synapse Analytics with SQL provisioned (integrated data warehouse, PolyBase), SQL on demand, Spark provisioned and Spark on demand (Python, R, Scala, .NET; transactional Delta Lake) and workspace notebooks; Databricks (data science), Azure ML Services (custom models) and Azure Cognitive Services; Azure Cosmos DB and Azure SQL Database (transactional, Synapse Link); Power BI (enterprise datasets in DAX; dashboards, reports, portal), analyst access and downstream targets. Azure Active Directory, data governance (master data, reference data, data cleansing, data lineage) and Azure Synapse Studio (data development, data exploration, data access) span all stages.]

Page 9: Boost your Cloud Data & Analytics

…expecting euphoric clients.


Page 10: Boost your Cloud Data & Analytics

Clients: reluctance and skepticism

⁄ Uncertainty regarding IT security and data protection

⁄ Concerns due to the US CLOUD Act

⁄ Lack of know-how

⁄ Complexity and costs of migration

⁄ Large investments in DWHs and data lakes in the recent past make it difficult to sell new, expensive initiatives to CFOs

⁄ Reluctance due to possible vendor lock-ins


Page 11: Boost your Cloud Data & Analytics

Cloud adoption: Where to start?

Don’ts:

⁄ Big Bang Approach: all at once

⁄ Do nothing and ignore Cloud

⁄ Long & costly conceptual phase


Page 12: Boost your Cloud Data & Analytics

Cloud adoption: Where to start?

Dos:

⁄ Quick identification of a suitable existing use case

⁄ Approval by IT security and data protection departments

⁄ Quick upload of relevant data (subsets) to the cloud storage (e.g. as CSV or JSON)

⁄ Data access for analysts and data scientists

⁄ Deployment of the first MVP within 3 months at most
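
The "quick upload of relevant data subsets" step could look roughly like the following sketch. It only prepares the CSV and JSON payloads; the actual upload depends on the chosen cloud storage and is left out. All names and the sample rows (`contracts`, `select_subset`, the column names) are hypothetical and not taken from the talk.

```python
import csv
import io
import json


def select_subset(rows, columns, predicate):
    """Keep only rows matching the predicate, and only the listed columns."""
    return [{c: r[c] for c in columns} for r in rows if predicate(r)]


def to_csv(rows, columns):
    """Serialize rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_json_lines(rows):
    """Serialize rows as newline-delimited JSON, one object per line."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in rows) + "\n"


# Hypothetical source rows, e.g. fetched from the existing DWH.
contracts = [
    {"id": 1, "region": "EU", "premium": 120.0, "customer_name": "A"},
    {"id": 2, "region": "US", "premium": 80.0, "customer_name": "B"},
    {"id": 3, "region": "EU", "premium": 200.0, "customer_name": "C"},
]

# Only the columns needed for the use case leave the source system.
columns = ["id", "region", "premium"]
subset = select_subset(contracts, columns, lambda r: r["region"] == "EU")
csv_text = to_csv(subset, columns)
json_text = to_json_lines(subset)
```

Restricting the export to the columns the use case needs may also simplify the approval by IT security and data protection, since fewer sensitive attributes are involved.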

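[Figure: Relevant data from the traditional DWH (persistent data storage) lands as csv, json or parquet files in cloud storage in the public cloud. Interactive query services act as an SQL engine on top: they make the data visible through SQL and store no data themselves. Analysts connect with their BI tool, data scientists with data science notebooks.]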

Page 13: Boost your Cloud Data & Analytics

Interactive Query Services

Azure Synapse Analytics SQL on-demand

Page 14: Boost your Cloud Data & Analytics

Interactive Query Services

Advantages

⁄ Quick and easy setup

⁄ Data stays in the cheap cloud object storage

⁄ Data can be accessed with widely used SQL

⁄ Support of different data formats (csv, json, parquet etc.)

⁄ Simple cost structure - price is based on the volume of requested data

Disadvantages

⁄ Poor performance with complex queries and large amounts of data

⁄ Limits on concurrent queries

⁄ No user defined functions
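
The "price is based on the volume of requested data" point can be made concrete with a little arithmetic. The sketch below assumes a pure pay-per-data-scanned model; the price per TB, the data volumes and the query frequency are made-up illustration values, not figures from the talk or from any provider's price list.

```python
TB = 10**12  # decimal terabyte; check which unit your provider actually bills in


def query_cost(bytes_scanned, price_per_tb):
    """Cost of a single query when billing depends only on the data scanned."""
    return bytes_scanned / TB * price_per_tb


def monthly_cost(bytes_per_query, queries_per_day, price_per_tb, days=30):
    """Cost of running the same query repeatedly over a month."""
    return query_cost(bytes_per_query, price_per_tb) * queries_per_day * days


ASSUMED_PRICE_PER_TB = 5.0  # hypothetical price in your billing currency

full_scan = 200 * 10**9     # e.g. a CSV data set that is read completely
partial_scan = 20 * 10**9   # e.g. a columnar file where only some columns are read

full_scan_month = monthly_cost(full_scan, 50, ASSUMED_PRICE_PER_TB)
partial_scan_month = monthly_cost(partial_scan, 50, ASSUMED_PRICE_PER_TB)
```

Under these assumptions, the amount of data a query touches drives the bill directly, which is one reason file format and partitioning matter even for a first MVP.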

Azure Synapse Analytics SQL on-demand

Page 15: Boost your Cloud Data & Analytics

Next step: Cloud Data Processing & Analytical Data Stores

[Figure: Data from the sources and the traditional DWH lands as csv, json or parquet files in cloud storage in the public cloud. Cloud data processing handles complex data transformations or ML model training and feeds an analytical data store. Analysts use the analytical data store, data scientists work in data science notebooks.]

Page 16: Boost your Cloud Data & Analytics

What to learn from the first Cloud MVPs?

Get a first impression of…

⁄ how fast new data & analytics applications can be implemented in the cloud

⁄ how cloud consumption costs develop and how to keep them under control (hence an MVP, not just a PoC)

⁄ if necessary, whether the legacy system can be moved to the cloud quickly, at manageable cost and ideally automatically

Page 17: Boost your Cloud Data & Analytics

Cloud Data Lake & Modern DWH

Sources (Structured data)

Sources (Semi-structured data)

Data Storage

Data Processing

Analytical Data Store

(or)

Query Engine

Analytics & Reporting

Orchestration & Monitoring

Data Governance

Users / Downstream systems

Machine Learning

DevOps

Page 18: Boost your Cloud Data & Analytics

Do you have to use cloud-native technologies/tools only?


No!

Page 19: Boost your Cloud Data & Analytics

Cloud Analytical Data Stores

Cloud-native:

Amazon Redshift, MS Azure Provisioned SQL (SQL DW)

Alternatives:

Page 20: Boost your Cloud Data & Analytics

Analytical Data Stores: avoiding lock-ins

⁄ Consider the analytical data store as an SQL engine box that executes analytical queries ...

→ with acceptable performance

→ at an acceptable price

→ with the necessary elasticity

⁄ Transform/aggregate data in advance (e.g. with Apache Spark or ETL tools)

⁄ Avoid platform-native functionality such as stored procedures
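
A minimal sketch of this idea: the aggregation happens before the load, and the analytical store only sees plain SQL. Here `sqlite3` from the Python standard library merely stands in for the SQL engine box, and a small Python function stands in for the Spark or ETL job; the table and column names are hypothetical.

```python
import sqlite3
from collections import defaultdict


def aggregate_sales(rows):
    """Pre-aggregate raw rows to one row per (region, month).

    Stand-in for a Spark or ETL job that runs outside the analytical store.
    """
    totals = defaultdict(float)
    for region, month, amount in rows:
        totals[(region, month)] += amount
    return [(region, month, total) for (region, month), total in sorted(totals.items())]


def load_and_query(conn, aggregated):
    """Load the aggregate and query it using only widely supported SQL.

    No stored procedures or engine-specific functions are involved. Note that
    the "?" placeholder style is driver-specific and may differ per engine.
    """
    cur = conn.cursor()
    cur.execute(
        "CREATE TABLE sales_monthly "
        "(region VARCHAR(10), month VARCHAR(7), total DECIMAL(12,2))"
    )
    cur.executemany("INSERT INTO sales_monthly VALUES (?, ?, ?)", aggregated)
    cur.execute(
        "SELECT region, SUM(total) FROM sales_monthly GROUP BY region ORDER BY region"
    )
    return cur.fetchall()


raw_rows = [
    ("EU", "2021-01", 100.0),
    ("EU", "2021-01", 50.0),
    ("US", "2021-01", 70.0),
    ("EU", "2021-02", 30.0),
]
aggregated = aggregate_sales(raw_rows)
result = load_and_query(sqlite3.connect(":memory:"), aggregated)
```

Because the store only holds a ready-made table and answers a standard query, replacing it with another engine would mainly mean changing the connection, not the logic.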


Page 21: Boost your Cloud Data & Analytics

Cloud Data Processing

Cloud-native:

⁄ AWS EMR / AWS Glue: Apache Spark jobs

⁄ Apache Spark in Azure Synapse Analytics or Azure Databricks

⁄ Apache Spark in Google Cloud Dataproc

Alternatives:

⁄ Data integration tools available in the cloud

⁄ Fully integrated data management platforms (Apache Spark / Hive / Apache NiFi jobs)

⁄ Best-of-breed portable Data & Analytics platform as a Kubernetes cluster

Page 22: Boost your Cloud Data & Analytics

Cloud Data Processing: avoiding lock-ins

⁄ Pay attention to portability from the design phase onwards

⁄ Define development guidelines that ensure portable code

⁄ Avoid technology-specific libraries and dependencies (e.g., AWS extensions to vanilla Spark)

⁄ Use as many reusable components as possible, so that only these components have to be adapted during a migration

⁄ Automate tests to make validation after a migration easy
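
One possible way to follow these guidelines is to keep all platform-specific access behind a small interface, so that job logic never imports a provider SDK. The sketch below is an illustration only: `ObjectStore`, `InMemoryStore` and `normalize_codes` are hypothetical names, and the in-memory class is a test double rather than a real storage adapter.

```python
from typing import Dict, Protocol


class ObjectStore(Protocol):
    """The only storage operations the job logic is allowed to use."""

    def read_text(self, key: str) -> str: ...

    def write_text(self, key: str, data: str) -> None: ...


class InMemoryStore:
    """Test double. A real adapter per platform would offer the same two methods."""

    def __init__(self) -> None:
        self._objects: Dict[str, str] = {}

    def read_text(self, key: str) -> str:
        return self._objects[key]

    def write_text(self, key: str, data: str) -> None:
        self._objects[key] = data


def normalize_codes(store: ObjectStore, src: str, dst: str) -> int:
    """Portable job logic: trim and upper-case one code per line, drop blank lines."""
    lines = store.read_text(src).splitlines()
    cleaned = [line.strip().upper() for line in lines if line.strip()]
    store.write_text(dst, "\n".join(cleaned))
    return len(cleaned)


store = InMemoryStore()
store.write_text("raw/codes.txt", " ab\n\ncd \n")
written = normalize_codes(store, "raw/codes.txt", "prepared/codes.txt")
```

In a migration, only the adapter behind `ObjectStore` would need to change, and the same automated test can be rerun against the new platform.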


Page 23: Boost your Cloud Data & Analytics

Case Study 1: Multi-Cloud D&A with Azure DevOps (Automotive)

Page 24: Boost your Cloud Data & Analytics

CAP Data pipelines

Infrastructure as Code (Terraform) used for provisioning and maintaining all cloud infrastructure

S3 object storage used for the best possible value for money

ETL processing based on Spark ETL jobs (AWS Glue serverless service) implemented in Scala

Page 25: Boost your Cloud Data & Analytics

Multi-Cloud CI/CD pipeline design for CAP

[Figure: Multi-cloud CI/CD pipeline in Azure DevOps (build pipelines, release pipelines, secret management, staging environments). Stages: Source Code (GitHub), Build, Validation, Release and deployment.]

Validation:

• Spark ETL unit tests

• Terraform config tested in an isolated sandbox

• Terraform checkov test

• DB change validation

• Shared libraries validation

• Automated code checks and risk analysis (security, license, Black Duck scan)

• Code quality checks (SonarQube)

Release and deployment to AWS (S3, AWS Glue, Redshift):

• Upload source code and metadata

• Apply Terraform config

• Apply DB changes

Release and deployment to Azure (Azure Blob storage, Azure Databricks):

• Upload metadata

• Apply Terraform config

• Deploy notebooks

⁄ CI/CD pipeline implemented in Azure DevOps

⁄ Integrated with AWS, Azure and GitHub

⁄ Build and validation of Scala-based Glue ETL Spark jobs

⁄ Automated unit testing: unit tests are executed on Azure DevOps agents

⁄ Pull request validation pipeline supports code reviews

⁄ Infrastructure as Code (Terraform): all infrastructure changes are validated and released through the CI/CD pipeline

⁄ Pipeline configured with the staging environments dev, test and prod

⁄ Dedicated database change management pipeline for validating and releasing Redshift/PostgreSQL changes
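
Automated unit tests on a build agent are easiest when the transformation logic is a plain function without I/O. The project described here used Scala-based Spark jobs; the sketch below is a Python stand-in for the same idea, with a hypothetical `deduplicate_latest` transformation and hypothetical field names.

```python
import unittest


def deduplicate_latest(records):
    """Keep the most recent record per id, based on the 'updated' field."""
    latest = {}
    for rec in records:
        current = latest.get(rec["id"])
        if current is None or rec["updated"] > current["updated"]:
            latest[rec["id"]] = rec
    return sorted(latest.values(), key=lambda r: r["id"])


class DeduplicateLatestTest(unittest.TestCase):
    def test_keeps_newest_record_per_id(self):
        records = [
            {"id": 1, "updated": "2021-01-01", "value": "old"},
            {"id": 1, "updated": "2021-03-01", "value": "new"},
            {"id": 2, "updated": "2021-02-01", "value": "only"},
        ]
        result = deduplicate_latest(records)
        self.assertEqual([r["value"] for r in result], ["new", "only"])

    def test_empty_input(self):
        self.assertEqual(deduplicate_latest([]), [])


def run_tests():
    """Run the suite programmatically, as a CI step would."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(DeduplicateLatestTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

A pipeline step could call `run_tests()` (or simply `python -m unittest`) and fail the build when `wasSuccessful()` returns false.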


Page 26: Boost your Cloud Data & Analytics

Case Study 2: Portable D&A Platform with Spark & Kubernetes (Automotive)

Page 27: Boost your Cloud Data & Analytics

Case Study: Portable Data & Analytics Platform for a Large German Corporation

First Use Case

⁄ Creation of a platform for Analytics on car configuration data, driven by new regulatory requirements

⁄ The data must be processed locally in a unified way in the covered regions (EU, US, China)

⁄ The platform should accommodate further Big Data & Analytics programs in the future

⁄ It should be possible to deploy the platform on-prem as well as in the cloud

After assessing several possible solutions (incl. cloud and legacy DWH solutions), a cloud-agnostic best-of-breed approach based on Kubernetes, Apache Spark and object storage was selected.

Page 28: Boost your Cloud Data & Analytics

Portable D&A Platform with Spark & Kubernetes

⁄ Kubernetes: "Run K8s Anywhere - Kubernetes is open source giving you the freedom to take advantage of on-premises, hybrid, or public cloud infrastructure, letting you effortlessly move workloads to where it matters to you." (https://kubernetes.io/de/)

⁄ Apache Spark: one of the most widely used platforms for parallel processing of big data, first in on-prem and now in cloud data lakes

⁄ Since 2018, Kubernetes has been available as an alternative Apache Spark cluster manager (alongside Mesos and YARN)
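
With Kubernetes as the cluster manager, a Spark job is submitted against the Kubernetes API server instead of a YARN or Mesos master. The sketch below only assembles such a `spark-submit` command line; the API server URL, image name, namespace and application path are placeholders, and a real setup would typically need further settings (service account, credentials, resources).

```python
def spark_submit_args(api_server, image, app, executors=2, namespace="default"):
    """Build a spark-submit command line for cluster mode on Kubernetes."""
    return [
        "spark-submit",
        "--master", f"k8s://{api_server}",   # k8s:// prefix selects Kubernetes
        "--deploy-mode", "cluster",          # the driver runs in a pod as well
        "--name", "portable-etl",
        "--conf", f"spark.kubernetes.namespace={namespace}",
        "--conf", f"spark.kubernetes.container.image={image}",
        "--conf", f"spark.executor.instances={executors}",
        app,
    ]


# Placeholder values for illustration only.
args = spark_submit_args(
    api_server="https://k8s.example.internal:6443",
    image="registry.example.internal/etl/spark-job:1.0",
    app="local:///opt/app/etl_job.py",
    executors=3,
    namespace="etl",
)
command_line = " ".join(args)
```

The `local://` scheme refers to a file that is already inside the container image, so the same image and command can be reused on any conformant cluster.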


Page 29: Boost your Cloud Data & Analytics

Solution Overview

[Figure: Data Sources (Databases, Files, Web Interface) → Raw Layer → Prepared Layer → Presentation Layer → User Interface (Web Services, Data Scientists). Metadata and Control, Logging and a Schema Store support all layers. Data Orchestration/ETL: Manage, Schedule & Monitor.]

Page 30: Boost your Cloud Data & Analytics

Agility with CI/CD Pipelines


1. Commit the code to SCM

2a. SCM Polling through Web-hook enabled for application code

4. Argo workflow to trigger the job on Kubernetes cluster

5. Pull the image and run the job

Repositories: application code repo, ETL job DAGs repo (yaml)

On the k8s cluster (driver pod, executor pods):

a) Argo workflow submits the job to k8s

b) k8s creates the driver pod

c) the driver requests k8s to create executor pods

d) k8s creates the executors
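
The workflow definitions kept in the ETL job DAGs repo might look roughly like the manifest generated below. This is a sketch of a single-step Argo Workflow that starts a container running `spark-submit`; the image name, step name and command are placeholders, and the manifest is emitted as JSON only because that keeps the example within the Python standard library.

```python
import json


def argo_workflow(name_prefix, image, command):
    """Build a minimal single-step Argo Workflow manifest as a dictionary."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        "metadata": {"generateName": f"{name_prefix}-"},
        "spec": {
            "entrypoint": "run-job",
            "templates": [
                {
                    "name": "run-job",
                    # The container submits the Spark job; Kubernetes then
                    # creates the driver pod and, on request, the executors.
                    "container": {"image": image, "command": command},
                }
            ],
        },
    }


manifest = argo_workflow(
    name_prefix="etl-cars",  # placeholder
    image="registry.example.internal/etl/spark-job:1.0",  # placeholder
    command=["spark-submit", "--deploy-mode", "cluster", "local:///opt/app/etl_job.py"],
)
manifest_text = json.dumps(manifest, indent=2)
```

Keeping such definitions in their own repository separates scheduling concerns from the application code, which matches the two-repo layout shown on the slide.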

Page 31: Boost your Cloud Data & Analytics

Portable from Day One

1. First prototype developed in the Adastra Germany Data Engineering Lab (AWS): AWS S3 object data storage, AWS EC2

2. Following sprints in the customer's AWS account: AWS S3 object data storage, AWS EC2

3. Final development on the customer's on-prem infrastructure: object data storage on-prem (Ceph), VM on-prem

Page 32: Boost your Cloud Data & Analytics

One Framework, deployed everywhere

Prod:

⁄ On-prem: object data storage on-prem (e.g. Ceph), VM on-prem

⁄ AWS: AWS S3 object data storage, AWS EC2

⁄ Azure: Azure Data Lake Storage, Azure VM

⁄ Google Cloud: Google Cloud Storage, GCP Compute Engine

⁄ Telekom: Telekom Object Storage Services, Telekom ECS

Dev and test: AWS EC2
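
Deploying one framework to several targets usually comes down to keeping the storage location and connector settings in configuration rather than in code. The sketch below assumes Spark with the common Hadoop connectors (`s3a://`, `abfss://`, `gs://`) and an S3-compatible gateway in front of Ceph; bucket, account and endpoint names are placeholders, and the target names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class StorageTarget:
    """Where the data lake lives in one environment, plus extra Spark settings."""

    base_uri: str
    spark_conf: Dict[str, str] = field(default_factory=dict)

    def path(self, layer: str, dataset: str) -> str:
        """Build the location of a dataset in a given layer (raw, prepared, ...)."""
        return f"{self.base_uri.rstrip('/')}/{layer}/{dataset}"


# Placeholder targets; only the configuration differs between environments.
TARGETS = {
    "aws-prod": StorageTarget("s3a://example-lake"),
    "azure-prod": StorageTarget("abfss://lake@exampleaccount.dfs.core.windows.net"),
    "gcp-prod": StorageTarget("gs://example-lake"),
    "onprem-prod": StorageTarget(
        "s3a://example-lake",
        {
            # Assumes the on-prem object store exposes an S3-compatible endpoint.
            "spark.hadoop.fs.s3a.endpoint": "https://ceph.example.internal",
            "spark.hadoop.fs.s3a.path.style.access": "true",
        },
    ),
}


def dataset_locations(layer, dataset):
    """Resolve the same logical dataset for every configured target."""
    return {name: target.path(layer, dataset) for name, target in TARGETS.items()}


locations = dataset_locations("raw", "car_configurations")
```

Job code would then ask for `TARGETS[env].path(...)` and never hard-code a bucket or container, so adding a further environment is a configuration change.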

Page 33: Boost your Cloud Data & Analytics

Adastra Modular Cloud D&A Framework Overview

Page 34: Boost your Cloud Data & Analytics

Adastra Modular Cloud D&A Framework Overview

Infrastructure as Code design and development:

⁄ Templates, scripts, policies

⁄ Networking, apps, storage, security, cloud infra

⁄ CI/CD pipeline

Reusable modules:

⁄ Ingestion

⁄ Historization and CDC

⁄ Control and metadata capture

⁄ Generic data transformation

Best practices:

⁄ Data lake best practices

⁄ Coding best practices

Page 35: Boost your Cloud Data & Analytics

Thank you!

Adastra GmbH

Niedenau 36, 60325 Frankfurt am Main

+49 (0)69 719 779 790

www.adastragrp.com