DevOps for Big Data - Data 360 2014 Conference
DevOps for Big Data: Enabling Continuous Delivery for data analytics applications based on Hadoop, Vertica, and Tableau
1
Max Martynov, VP of Technology, Grid Dynamics
2
Introductions
• Grid Dynamics
─ Solutions company, specializing in eCommerce
─ Experts in mission-critical applications (IMDGs, Big Data)
─ Implementing Continuous Integration and Continuous Delivery for 5+ years
• Qubell
─ Enterprise DevOps platform
─ Focused on self-service environments, service orchestration, and continuous upgrades
─ Targets web-scale and big data applications
3
State of DevOps and Continuous Delivery
Continuous Delivery Value
• Agility
• Transparency
• Efficiency
• Consistency
• Quality
• Control
Findings from The 2014 State of DevOps Report
• Strong IT performance is a competitive advantage
• DevOps practices improve IT performance
• Organizational culture matters
• Job satisfaction is the No. 1 predictor of organizational performance
4
Continuous Delivery Infrastructure
• Environments
─ Reliable and repeatable deployment automation
─ Database schema management
─ Data management
─ Application properties management
─ Dynamic environments
• Quality
─ Test automation
─ Test data management (again)
─ Code analysis and review
• Process
─ Source code management, branching strategy
─ Agile requirements and project management
─ CI/CD pipeline
* Big Data applications bring additional challenges in all of these areas due to the large amounts of data, the complexity of business logic, and large-scale environments.
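The "application properties management" item above is often handled by rendering one config template with per-environment values. A minimal sketch, assuming a simple key/value template per environment (all names and values here are illustrative, not from the talk):

```python
from string import Template

# Per-environment overrides; values are illustrative placeholders.
ENV_PROPERTIES = {
    "dev":  {"namenode_url": "hdfs://dev-nn:8020",  "replication": "1"},
    "prod": {"namenode_url": "hdfs://prod-nn:8020", "replication": "3"},
}

# One template shared by all environments; only the values differ.
APP_CONFIG_TEMPLATE = Template(
    "fs.defaultFS=$namenode_url\n"
    "dfs.replication=$replication\n"
)

def render_config(env: str) -> str:
    """Render the application properties file for one environment."""
    return APP_CONFIG_TEMPLATE.substitute(ENV_PROPERTIES[env])
```

Keeping the template in one place and varying only the value maps is what makes deployments repeatable across environments.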
5
Implementing Continuous Delivery for Big Data: Initial State of the Project
• Medium-sized distributed development team
• Diverse technology stack – Hadoop + Vertica + Tableau
• Only one environment existed and it was production
• Delivery pipeline: Development Team → Production
• Procurement of hardware for a new environment was taking months
6
Development in Production
It is fun until somebody misses the nail
7
Hadoop Analytical Application
Master
Database
Slaves 1 - N
Manager
10+ TB of data; 10+ nodes in production; 10+ applications; manually pre-deployed on hardware servers.
How to quickly reproduce this environment for dev-test purposes?
8
1. Stopgap Measure
• Same hardware, different logical “zones” implemented on the file system
• Automated build and deployment
• Delivery pipeline: Development Team → Production cluster, with logical zones /test1-N, /stage, and /prod on the shared file system
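The logical zones described above can be modeled as path prefixes on the shared cluster, so every application resolves its paths through the zone rather than hard-coding `/prod`. A minimal sketch; the zone names follow the slide, everything else is assumed:

```python
# Logical "zones" on one shared cluster, implemented as path prefixes.
ZONES = {"prod": "/prod", "stage": "/stage"}
# Test zones /test1 .. /testN are allocated as needed (N=5 here).
for i in range(1, 6):
    ZONES[f"test{i}"] = f"/test{i}"

def zone_path(zone: str, relative: str) -> str:
    """Resolve an application path inside a logical zone."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    return f"{ZONES[zone]}/{relative.lstrip('/')}"
```

This gives code-level isolation between dev, test, and prod data, but, as the next slide notes, no hardware or performance isolation.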
9
1. Stopgap Measure: Pros and Cons
Pros
• Better than before: code can be tested before it goes to production
• All logical environments have access to the same production data
• Zero additional environment costs
Cons
• Stability, security, and compliance issues: dev, test, and prod environments share the same hardware
• Performance issues: tests affect production performance
• Impossible to run “destructive” tests that affect shared production data
• Impossible to test upgrades of middleware (new versions of H* components)
10
2. Hadoop Dynamic Environments
[Diagram: Dev/QA/Ops teams request environments through a self-service portal, which orchestrates environment provisioning and application deployment for Dev, QA, Stage, and Prod, combining data, custom application components, services, and environment policies.]
11
2. Hadoop Dynamic Environments (continued)
• Dev/QA/Ops teams got a self-service portal to
─ provision environments
─ deploy applications
• A new environment can be created from scratch in 2-3 hours:
─ single-node dev sandbox
─ multi-node QA
─ big clusters for scalability and performance
• An application can be deployed to an environment within 10 minutes
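The self-service flow above can be sketched as a request object plus an orchestration call. This is a hypothetical model for illustration only; the field names and the single-node sandbox rule are assumptions, not Qubell's actual API:

```python
import uuid
from dataclasses import dataclass, field

# Hypothetical request model for a self-service environment portal.
@dataclass
class EnvironmentRequest:
    profile: str                  # e.g. "dev-sandbox", "qa", "perf-cluster"
    nodes: int
    applications: list = field(default_factory=list)
    env_id: str = field(default_factory=lambda: uuid.uuid4().hex[:8])

def provision(request: EnvironmentRequest) -> dict:
    """Pretend to orchestrate provisioning; returns a status record."""
    # Assumed policy: dev sandboxes are single-node (see the slide above).
    if request.profile == "dev-sandbox" and request.nodes != 1:
        raise ValueError("dev sandboxes are single-node")
    return {
        "env_id": request.env_id,
        "profile": request.profile,
        "nodes": request.nodes,
        "deployed": list(request.applications),
        "status": "ready",
    }
```

The point of the pattern is that Dev/QA/Ops request an environment profile, and policy checks and orchestration happen behind the portal rather than by hand.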
12
3. Vertica and Tableau Dynamic Environments
[Diagram: the same self-service flow as for Hadoop, with Vertica and Tableau specifics: data, UDFs, and VSQL configuration as components, Tableau as a shared service; Dev/QA/Ops request environments, and the platform orchestrates provisioning and deployment across Dev, QA, Stage, and Prod.]
13
4. Tests & Test Data
• Dev and QA teams implemented automated tests
• Two options to handle data on dev-test environments:
1. Tests generate data for themselves
2. A reduced, representative snapshot of obfuscated production data (10 TB -> 10 GB)
Test pyramid:
─ Unit Tests: Java code, auto-generated data; build-time validation
─ Component Tests: automated tests on the "API" level, testing job output; test-generated data
─ Integration Tests (integration with data): automated tests on the "API" level, validating job output; snapshot of production data
─ Exploratory Tests: manual tests; snapshot of production data
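Option 2 above combines sampling with obfuscation. A minimal sketch of the idea, assuming row-oriented records with named PII fields; real snapshots would also need referentially consistent sampling across tables, which this toy version ignores:

```python
import hashlib

def obfuscate(value: str, salt: str = "demo-salt") -> str:
    """Replace a sensitive value with a stable, irreversible token."""
    return hashlib.sha256((salt + value).encode()).hexdigest()[:12]

def reduce_snapshot(rows, keep_every=1000, pii_fields=("email", "name")):
    """Sample roughly 1/keep_every of the rows and mask PII fields.

    A 1000x reduction mirrors the 10 TB -> 10 GB ratio from the slide.
    """
    sample = rows[::keep_every]
    return [
        {k: (obfuscate(v) if k in pii_fields else v) for k, v in row.items()}
        for row in sample
    ]
```

Because the tokens are deterministic for a given salt, joins on obfuscated keys still work across the reduced dataset, while the original values cannot be recovered.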
14
5. CI/CD Pipeline
With all components ready, implementing the CI/CD pipeline is easy:
GitHub Flow across the Development Team, Dev Sandbox, and QA Environment:
1. Develop & Experiment → 2. Commit → 3. Build & unit test → 4. Deploy → 5. Test → 6. Release
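The six steps above can be sketched as an ordered list of stages where any failure stops the release. This is an illustrative skeleton, not a real pipeline definition; each stage would wrap a build, deploy, or test command:

```python
# Each stage is (name, callable returning True on success);
# a failing stage stops the pipeline before release.
def run_pipeline(stages):
    completed = []
    for name, stage in stages:
        if not stage():
            return {"completed": completed, "failed": name}
        completed.append(name)
    return {"completed": completed, "failed": None}

# Placeholder stages mirroring the slide's six steps.
STAGES = [
    ("develop", lambda: True),
    ("commit", lambda: True),
    ("build_and_unit_test", lambda: True),
    ("deploy_to_qa", lambda: True),
    ("test_on_qa", lambda: True),
    ("release", lambda: True),
]
```

The ordering guarantee is the whole point: a change can only reach the release step after it has been built, deployed to a separate environment, and tested there.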
15
6. Release Button
Ops/RE promote a Release Candidate to a Release with one button and deploy it to Production
16
Assembly Line
17
Results
• Reduced risk and higher quality
─ No more development in production
─ Developers have sandboxes; tests run on separate environments
─ Features are deployed to production only after validation
• Increased efficiency
─ A new environment can be provisioned within 2 hours
─ Developers can freely experiment with new changes
─ No resource contention
• Reduced costs
─ No need to procure in-house hardware or manage an in-house datacenter
─ Dynamic environments save money because they are used only when needed
18
Enabling Technologies
Agile Software Factory: Software Engineering Assembly Line
griddynamics.com
Qubell: Enterprise DevOps Platform
qubell.com
April 8, 2023
Thank You
19
Max Martynov, VP of Technology, Grid Dynamics, [email protected]
Victoria Livschitz, CEO and Founder, Qubell, [email protected]