DevOps for Big Data - Data 360 2014 Conference

Post on 22-Nov-2014


Enabling Continuous Delivery for data analytics applications based on Hadoop, Vertica, and Tableau.

Transcript of DevOps for Big Data - Data 360 2014 Conference

DevOps for Big Data
Enabling Continuous Delivery for data analytics applications based on Hadoop, Vertica, and Tableau

1

Max Martynov, VP of Technology, Grid Dynamics

2

Introductions

• Grid Dynamics

─ Solutions company, specializing in eCommerce

─ Experts in mission-critical applications (IMDGs, Big Data)

─ Implementing Continuous Integration and Continuous Delivery for 5+ years

• Qubell

─ Enterprise DevOps platform

─ Focused on self-service environments, service orchestration, and continuous upgrades

─ Targets web-scale and big data applications

3

State of DevOps and Continuous Delivery

Continuous Delivery Value

• Agility

• Transparency

• Efficiency

• Consistency

• Quality

• Control

Findings from The 2014 State of DevOps Report

• Strong IT performance is a competitive advantage

• DevOps practices improve IT performance

• Organizational culture matters

• Job satisfaction is the No. 1 predictor of organizational performance

4

Continuous Delivery Infrastructure

• Environments

─ Reliable and repeatable deployment automation

─ Database schema management

─ Data management

─ Application properties management

─ Dynamic environments

• Quality

─ Test automation

─ Test data management (again)

─ Code analysis and review

• Process

─ Source code management, branching strategy

─ Agile requirements and project management

─ CICD pipeline

* Big Data applications make each of these areas harder: data volumes are large, business logic is complex, and environments run at large scale.

5

Implementing Continuous Delivery for Big Data: Initial State of the Project

• Medium size distributed development team

• Diverse technology stack – Hadoop + Vertica + Tableau

• Only one environment existed and it was production

• Delivery pipeline: Development Team → Production (no intermediate environments)

• Procurement of hardware for a new environment was taking months

6

Development in Production

It is fun until somebody misses the nail

7

Hadoop Analytical Application

[Diagram: Hadoop cluster — Master, Slaves 1-N, Manager, Database]

10+ TB of data; 10+ nodes in production; 10+ applications; manually pre-deployed on hardware servers.
How to quickly reproduce this environment for dev-test purposes?

8

1. Stop-Gap Measure

• Same hardware, different logical “zones” implemented on the file system

• Automated build and deployment

• Delivery pipeline: Development Team → Production cluster, with logical zones on the shared file system (/test1-N, /stage, /prod)
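The zone approach can be sketched as a simple path-mapping convention: one physical cluster, with each environment namespaced under its own root directory. A minimal sketch, assuming hypothetical zone and path names (the deck does not show the actual layout):

```python
# One physical cluster, each logical environment under its own root
# directory. Zone and path names here are illustrative only.
ZONES = ("test1", "test2", "stage", "prod")

def zone_path(zone: str, relative: str) -> str:
    """Map an application-relative path into the zone's directory tree."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    return f"/{zone}/{relative.lstrip('/')}"

# The same job code runs everywhere; only the zone setting changes.
print(zone_path("test1", "warehouse/events"))  # /test1/warehouse/events
print(zone_path("prod", "warehouse/events"))   # /prod/warehouse/events
```

Because the zone is plain configuration, promoting a job from test to production changes no code, only the zone it is pointed at.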

9

1. Stop-Gap Measure: Pros and Cons

Pros

• Better than before: code can be tested before it goes to production

• All logical environments have access to the same production data

• Zero additional environment costs

Cons

• Stability, security and compliance issues: dev, test and prod environments share the same hardware

• Performance issues: tests affect production performance

• Impossible to run “destructive” tests that affect shared production data

• Impossible to test upgrades of middleware (new versions of H* components)

10

2. Hadoop Dynamic Environments

[Diagram: Dev/QA/Ops request an environment (Dev, QA, Stage, or Prod) through a self-service portal; an orchestrator provisions the environment and deploys application components, services, data, and custom code according to environment policies.]

11

2. Hadoop Dynamic Environments (continued)

• Dev/QA/Ops teams got a self-service portal to

─ provision environments

─ deploy applications

• A new environment can be created from scratch in 2-3 hours

─ single-node dev sandbox

─ multi-node QA

─ big clusters for scalability and performance

• An application can be deployed to an environment within 10 minutes
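A self-service request along these lines might carry little more than the environment kind, size, and application version. A hypothetical sketch; the portal's real API and field names are not shown in the deck:

```python
import json

# Hypothetical shape of a self-service environment request; the actual
# portal API is an assumption for illustration.
def environment_request(kind: str, nodes: int, app_version: str) -> str:
    """Build a JSON payload describing the environment to provision."""
    if kind not in ("dev-sandbox", "qa", "perf"):
        raise ValueError(f"unsupported environment kind: {kind}")
    payload = {
        "environment": kind,
        # dev sandboxes are single-node; larger environments scale out
        "nodes": 1 if kind == "dev-sandbox" else nodes,
        "application": {"version": app_version},
    }
    return json.dumps(payload)

print(environment_request("qa", nodes=4, app_version="1.7.2"))
```

Keeping the request declarative is what lets the orchestrator, not the requester, decide how to provision and deploy.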

12

3. Vertica and Tableau Dynamic Environments

[Diagram: the same self-service flow for Vertica and Tableau — Dev/QA/Ops request Dev/QA/Stage/Prod environments; the orchestrator provisions them and deploys components (UDFs, VSQL, configuration, data); Tableau runs as a shared service.]

13

4. Tests & Test Data

• Dev and QA teams implemented automated tests

• Two options to handle data on dev-test environments:

1. Tests generate data for themselves

2. A reduced representative snapshot of obfuscated production data (10TB -> 10GB)

Test pyramid:

─ Unit Tests: Java code, auto-generated data; build-time validation

─ Component Tests: auto tests on “API” level, testing job output; test-generated data

─ Integration Tests (integration with data): auto tests on “API” level, validating job output; snapshot of production data

─ Exploratory Tests: manual tests; snapshot of production data
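The snapshot option (option 2 above) boils down to sampling a small, reproducible fraction of production rows and masking sensitive fields. A minimal sketch, assuming illustrative field names and a 0.1% sampling rate; the real pipeline ran at Hadoop scale (10 TB -> 10 GB):

```python
import hashlib
import random

def obfuscate(value: str) -> str:
    """Replace a sensitive value with a stable, irreversible token."""
    return hashlib.sha256(value.encode()).hexdigest()[:12]

def snapshot(rows, rate=0.001, seed=42):
    """Sample a small fraction of rows and mask the PII fields."""
    rng = random.Random(seed)  # fixed seed keeps the snapshot reproducible
    for row in rows:
        if rng.random() < rate:
            yield {**row, "email": obfuscate(row["email"])}

production = ({"id": i, "email": f"user{i}@example.com", "amount": i % 100}
              for i in range(100_000))
sample = list(snapshot(production))
print(len(sample))  # around 100 of 100,000 rows at a 0.1% rate
```

Hashing rather than randomizing the sensitive field keeps joins across tables consistent in the snapshot, which matters for the integration tests above.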

14

5. CICD pipeline

With all the components ready, implementing the CICD pipeline is easy:

[Diagram: Development Team → Dev Sandbox → QA Environment, following GitHub Flow]

1. Develop & experiment

2. Commit

3. Build & unit test

4. Deploy

5. Test

6. Release
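The six steps above amount to a fail-fast sequence of stages. A sketch of that idea; the stage names follow the slide, but the runner itself is illustrative, not the project's actual CI tooling:

```python
# Minimal fail-fast pipeline runner (illustrative sketch).
def run_pipeline(stages):
    """Run stages in order; stop at the first failure."""
    for name, step in stages:
        print(f"== {name} ==")
        if not step():
            print(f"pipeline stopped: {name} failed")
            return False
    return True

stages = [
    ("build & unit test", lambda: True),
    ("deploy to dev sandbox", lambda: True),
    ("test on QA environment", lambda: True),
    ("release", lambda: True),
]
print(run_pipeline(stages))  # prints True when every stage passes
```

Stopping at the first failing stage is what keeps unvalidated changes from ever reaching the release step.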

15

6. Release Button

[Diagram: Ops/RE promote a validated Release Candidate to Release and push it to Production with a single button.]

16

Assembly Line

17

Results

• Reduced risk and higher quality

─ No more development in production

─ Developers have sandboxes; tests run on separate environments

─ Features are deployed to production only after validation

• Increased efficiency

─ A new environment can be provisioned within 2 hours

─ Developers can freely experiment with new changes

─ No resource contention

• Reduced costs

─ No need to procure in-house hardware or manage an in-house datacenter

─ Dynamic environments save money by running only when they are needed

18

Enabling Technologies

Agile Software Factory
Software Engineering Assembly Line

griddynamics.com

Qubell
Enterprise DevOps Platform

qubell.com


Thank You

19

Max Martynov, VP of Technology, Grid Dynamics
mmartynov@griddynamics.com

Victoria Livschitz, CEO and Founder, Qubell
vlivschitz@qubell.com