Elephant grooming: quality with Hadoop


Roman Nikitchenko, 14.02.2015

SUBJECTIVE ELEPHANT GROOMING WITH QUALITY

2

The world's first DATA OS

A 10,000-node computer... Recent technology changes focus on higher scale, better resource control, lower latency, stronger security and fault tolerance.

3

[Figure: machines x MAX+ = BIG DATA, repeated across the slide]

Hadoop in one picture

4

An OPEN SOURCE framework for big data: both distributed storage and processing.

Provides RELIABILITY and fault tolerance BY SOFTWARE design. Example: the file system uses a replication factor of 3 by default (a sketch follows below).

Unique horizontal scalability, from a single computer up to thousands of nodes.
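To make the replication point concrete, here is a minimal sketch (not from the slides; the file path is illustrative) showing the default dfs.replication value and how a client can raise it per file:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // 3 is the shipped default; set explicitly here for clarity.
        conf.setInt("dfs.replication", 3);
        FileSystem fs = FileSystem.get(conf);
        // Raise replication for a particularly important file.
        fs.setReplication(new Path("/data/important.dat"), (short) 5);
        fs.close();
    }
}
```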

5

HADOOP: this is what everybody tells you, starting with your Hadoop distribution vendor.

6

HADOOP: what you really get is ...

Small, dirty, clumsy, always hungry … good if it is alive at all.

7

YARN: our HADOOP is healthy and growing.

8

TODAY WE FOCUS ON QUALITY

Billions of medical records processed every day. Tens of millions of medical histories.

9

QUALITY: fears and reality

10

FEAR OF UNKNOWN

A feeling of something wrong. WHAT IS THE ROOT CAUSE? LOOKS TOO EASY.

11

How everyone (who usually sells something) depicts Hadoop complexity:

[Figure: a SMALL CUTE CORE surrounded by GREAT BIG INFRASTRUCTURE, with YOUR APPLICATION on top — SAFE and FRIENDLY]

12

How it looks from the real user point of view: a feeling of something wrong.

[Figure: YOUR APPLICATION sits on INFRASTRUCTURE — SOMETHING YOU UNDERSTAND — which sits on CORE HADOOP, COMPLETELY UNKNOWN. FEAR OF UNKNOWN]

13

REALITY: NO BACKUPS

14

REALITY: Most failures in Hadoop are not about your functionality but about the infrastructure. Any testing strategy has to account for this.

15

REALITY: Failure is the normal case

Failures are normal in Hadoop and happen EVERY DAY. This is a MAJOR difference for software. Do your tests cover severe performance degradation or disaster recovery procedures?

16

REALITY: No more isolated testing. You are grooming not only the elephant.

The infrastructure is really complex. Just here: Hive, Hadoop, Giraph, Tez, Pig, Tomcat, WildFly...

17

REALITY: BIG DATA is not about the data. It is about OUR ABILITY TO HANDLE IT.

18

VERIFICATION INFRASTRUCTURE

The BIGGEST BIG DATA failure IS ... NO DATA. The same holds for the testing infrastructure.

19

STAGING: COMMON GIRL ISSUES

● Everybody needs a staging environment, but who should perform its maintenance?

● Can you drop all data on the staging cluster if another team needs it?

● This is exactly what usually happens in production.

20

STAGING: TIME DIVISION MULTIPLEXING

● Grants limited isolation from other teams' work.

● Don't get addicted to this drug! You can miss serious integration issues in production.

● Your environment definitely gets underused this way.

21

STAGING: MULTINODE CLUSTERS

[Figure: try to find something here]

Multinode — hard to investigate issues. Single node — does not cover scalability cases.

22

STAGING: PHYSICAL CLUSTERS

● A good start: a USED single-chassis server with 4 nodes (1 master + 3 workers), each with 2x6 cores and 64G RAM. HDD is up to you. About $5K (2014).

● You can save on SSDs and sophisticated I/O. Do not save on CPU. Have a memory upgrade plan.

23

SUBJECTIVE: VIRTUAL STAGING CLUSTERS — NOT-SO-REAL ELEPHANT

● If your production cluster is virtual — here you go!

● Public clouds — unclear budget and resources. A great fast start.

● A single-node virtual machine for developers — hard to support.

24

OUR ENVIRONMENT FOR AUTOMATED TESTING

[Diagram: the Surefire TESTING SEQUENCE — unit testing, then integration testing — runs through a TESTING UTILITY (OUR OWN WRAPPER) on top of the MINICLUSTER with LOCAL WORKERS. ARTIFACTS: org.apache.hbase:hbase-testing-util and org.apache.hadoop:hadoop-minicluster]

● Everything is in SINGLE JVM scope, including the code under test, so you can attach a debugger or profiler, or measure test coverage.

● The environment starts in about 10 seconds. Everything is just a test dependency in Maven. It could actually be used even in the unit testing sequence (a sketch follows below).

● All services are dynamic, so more than one developer can run tests on a single host. Configuration is dynamic.
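As an illustration of this single-JVM approach, a minimal sketch built on the hbase-testing-util artifact listed above (HBase 1.x API; table and column family names are made up):

```java
import org.apache.hadoop.hbase.HBaseTestingUtility;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class MiniClusterSmokeTest {
    public static void main(String[] args) throws Exception {
        HBaseTestingUtility util = new HBaseTestingUtility();
        // Starts ZooKeeper, HDFS and HBase inside the current JVM
        // on dynamically assigned ports.
        util.startMiniCluster();
        try {
            Table table = util.createTable(
                    TableName.valueOf("smoke"), Bytes.toBytes("cf"));
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"),
                    Bytes.toBytes("value"));
            table.put(put);
            // ... run the code under test against this cluster ...
        } finally {
            util.shutdownMiniCluster();
        }
    }
}
```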

25

STEP FORWARD: DEVELOPMENT ENVIRONMENT

● Now the mini-cluster is a service. It starts in about 30 seconds, with static service ports, and YARN and logging as in a real cluster. But mostly we reuse the auto-testing cluster components.

● A developer can use LOCAL WORKERS for MapReduce and Spark (a single JVM with their code) or use cluster services close to a REAL cluster.

● The HBase Lily indexer, SOLR and Hive are started as separate JVMs. Everything is pulled in through Maven dependencies.

[Diagram: the MINICLUSTER CORE from the previous slide runs in a SINGLE JVM; the other services run as SEPARATE JVMS]

26

STAGING: Hadoop — don't do it yourself

● Yet we build a real hardware staging cluster.

● We use Cloudera solutions and ready Maven artifacts, exchange experience at conferences, and much more.

27

TEST STRATEGY: unit testing, then integration testing

● Everything outside Big Data is to be checked BEFORE Big Data adds complexity.

● No elephants before the integration test phase. Packages are to be ready. Only simple things like general logic checks can be done before.

● Any test environment is to be created from scratch, or at least checked for consistency (a sketch follows below).
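A minimal sketch of the "created from scratch" rule, assuming an HBase table under test (table and family names are illustrative; the Admin instance would come from the cluster connection in real setup code):

```java
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.junit.Before;

public class FreshEnvironmentTest {
    private Admin admin; // obtained from the cluster Connection elsewhere
    private final TableName table = TableName.valueOf("under_test");

    @Before
    public void recreateTable() throws Exception {
        // Drop whatever a previous run may have left behind.
        if (admin.tableExists(table)) {
            admin.disableTable(table);
            admin.deleteTable(table);
        }
        // Recreate with the same descriptor as production uses.
        HTableDescriptor descriptor = new HTableDescriptor(table);
        descriptor.addFamily(new HColumnDescriptor("cf"));
        admin.createTable(descriptor);
    }
}
```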

28

BETTER TO TEST YOUR ARCHITECTURE FIRST, AND ONLY THEN THE IMPLEMENTATION

29

TESTING IN PRODUCTION

● You cannot avoid testing in production if you do Hadoop.

● It is great if your application can work in the same cluster but with different data.

● Bring in security and resource control so your test runs cannot harm production jobs (a sketch follows below).

● If your solution is non-realtime and has unused time slots, use them.
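One hedged sketch of such resource control: routing verification jobs to a dedicated, capacity-capped YARN queue. The queue name "verification" is an assumption; the queue itself and its capacity limit are defined in the cluster's scheduler configuration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ProductionSafeTestJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Submit to a dedicated queue so test runs cannot starve
        // production jobs ("verification" is an assumed queue name).
        conf.set("mapreduce.job.queuename", "verification");
        Job job = Job.getInstance(conf, "verification-run");
        // ... set mapper/reducer, input and output paths as usual ...
    }
}
```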

30

APPROACH TO PACKAGE AND DEPENDENCY MANAGEMENT

31

MAKING IT EASIER

Lowering verification effort

32

WHY IS HADOOP TESTING SO HARD?

Source of complexity → What to do:

● Extra work because of an unfamiliar environment → Make sure verification engineers have adequate Linux knowledge!

● An inadequate environment: with memory overcommitted before the test, you get everything wrong → Provide adequate hardware resources that can reproduce production issues, including scalability ones.

● Issues come from outside of your tests → Check test preconditions! (a sketch follows below)
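A minimal sketch of precondition checking with JUnit's Assume (the memory threshold is arbitrary): the test is skipped, rather than failing with misleading results, when the environment cannot give a meaningful run.

```java
import static org.junit.Assume.assumeTrue;

import org.junit.Before;
import org.junit.Test;

public class PerformanceSensitiveTest {
    @Before
    public void checkPreconditions() {
        // Skip (do not fail) when the JVM heap is nearly exhausted,
        // e.g. because something else runs on the same host.
        long freeHeap = Runtime.getRuntime().freeMemory();
        assumeTrue("not enough free heap for a meaningful run",
                freeHeap > 256L * 1024 * 1024);
    }

    @Test
    public void throughputStaysWithinBudget() {
        // ... the actual performance assertion goes here ...
    }
}
```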

33

MEASURE AND PUSH YOUR CODE QUALITY TO GET SOLUTION QUALITY

● QA are to monitor code quality metrics, not only developers.

● Project size metrics matter. Track the correlation between the number of lines of code, comment lines and uncovered lines.

● Branch coverage matters enormously in scalable solutions. Think about statement coverage if you use Scala.

We do it with SonarQube.

34

PUT IN EFFORT TO LOWER EFFORT

● Force developers to reuse approaches, solutions, infrastructure and components. The probability of finding something unexpected in an already tested, reused approach is much lower.

● Force QA to automate test processes, environment setup and release engineering. Consider automatic code validation before integration.

35

AUTO-GROOMING … at least some steps

Continuous integration environment:

● Development in a private branch: the developer works in a private branch; builds are local, with all possible checks.

● Review: after review, the change is pushed to a remote integration branch.

● Integration build: integration branches are monitored, locally merged to master and built, with unit testing and integration testing.

● Once the integration build passes, the merged branch is pushed to master.

● MASTER build: MASTER gets built with all checks. Metrics go to SonarQube and results are published.

● Then further release engineering.

36

WHEN TO STOP VERIFICATION?

NEVER! No surprise.

37

SCALING QUALITY

Grooming a growing elephant.

38

UNIFIED HADOOP DOES NOT EXIST

● You start with 4 virtual machines inside an AMD 4-core / 16G desktop, just to try.

● Then you buy i5 4-core / 32G desktops to build something working.

● Then you add E5 12-core / 48G servers to get results.

● As you grow, you 100% go heterogeneous with better hardware. A 'partial' failure on a unified cluster puts you in the same state.

● So think about heterogeneity RIGHT FROM THE START, from both the design and the testing point of view.

39

CLUSTERS DIFFER

Every cluster is different: a 1-node development mini-cluster, a 4-node staging cluster, a large production cluster. Add scale configuration to your tests and client configuration (a sketch follows below).
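A hedged sketch of what such scale configuration can look like: test expectations derived from a cluster-size property instead of constants baked in for one environment. The property name "test.cluster.workers" is made up, not a standard Hadoop key.

```java
import org.apache.hadoop.conf.Configuration;

public class ScaleConfig {
    // Each environment (mini-cluster, staging, production) sets its
    // own value for this illustrative custom property.
    public static int workers(Configuration conf) {
        return conf.getInt("test.cluster.workers", 1);
    }

    // Derive scale-dependent expectations from cluster size, e.g.
    // how many reducers a test job should request.
    public static int expectedReducers(Configuration conf) {
        return Math.max(1, workers(conf) * 2);
    }
}
```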

40

PERFORMANCE: do you know your enemy?

● The majors: memory, CPU, I/O, network. QA MUST detect resource usage skews.

● CPU is the easiest to see. Always balance between optimization and new hardware. The most important thing for QA is to understand how it is used.

● Memory usage can be hard to understand. Avoid swapping at almost any cost. The global trend is that this resource is vital.

● HDDs are cheap. Just buy more space if you need it.

● Network comes last but is hard to tame. If your bottleneck is the switch, replace it; but it is hard to upgrade the channel to every node, so here QA should track architecture scalability.

41

TESTING FOREVER

Give your verification team access to production! They must detect issues before they are reported by the support team.

Establish and constantly monitor data quality metrics and resource usage in production.

Be proactive. Explain the unknown before it gets you into trouble.

42

GAPS: I WANT A BETTER ELEPHANT

43

CURRENT GAP

● Resolving resource bottlenecks is a manual process. No silver bullet.

● It is much easier if you can reproduce the issue on a single node or within a single VM scope.

44

CURRENT GAP

● Operational mistakes happen, and it is really hard to catch them by verification.

● Usually they can be handled by better design. So prefer to test your approaches and architecture, not just the implementation.

45

Questions and discussion