Teradata Partners Conference Oct 2014 Big Data Anti-Patterns

Big Data Anti-Patterns: Lessons from the Front Lines Douglas Moore Principal Data Architect Think Big, a Teradata Company

description

Big Data Anti-Patterns: Lessons from the Front Lines Drawn from over 50 client engagements, big data anti-patterns are common practices that make for bad solutions.

Transcript of Teradata Partners Conference Oct 2014 Big Data Anti-Patterns

Page 1: Teradata Partners Conference Oct 2014   Big Data Anti-Patterns


Big Data Anti-Patterns: Lessons from the Front Lines

Douglas Moore

Principal Data Architect

Think Big, a Teradata Company

Page 2: Teradata Partners Conference Oct 2014   Big Data Anti-Patterns

About Douglas Moore

Think Big – 3 Years
- Roadmaps
- Delivery
  • BDW, Search, Streaming
- Tech Assessments

Before Big Data
- Data Warehousing
- OLTP
- Systems Architecture
- Electricity
- High End Graphics
- Supercomputers
- Numerical Analysis

Contact me at: @douglas_ma

Page 3: Teradata Partners Conference Oct 2014   Big Data Anti-Patterns

Think Big

4yr Old “Big Data” Professional Services Firm
- Roadmaps
- Engineering
- Data Science
- Hands on Training

Recently acquired by Teradata
• Maintaining Independence

Page 4: Teradata Partners Conference Oct 2014   Big Data Anti-Patterns


Content Drawn From Vast Amounts of Experience

50+ Clients

Leading security software vendor

Leading Discount Retailer

Page 5: Teradata Partners Conference Oct 2014   Big Data Anti-Patterns


I started out with just 3 topics…

Then while on the road to Strata,

I met 7 big data architects

- Who had 7 clients

• Who had 7 projects

• That demonstrated 7 Anti-Patterns

Introduction

Big Data Anti-pattern: “Commonly applied but bad solution”

[Image source: Wikipedia, I-95]

Page 6: Teradata Partners Conference Oct 2014   Big Data Anti-Patterns

Three Focus Areas

• Hardware and Infrastructure
• Tooling
• Big Data Warehousing

Page 7: Teradata Partners Conference Oct 2014   Big Data Anti-Patterns

Hardware & Infrastructure

Reference Architecture Driven
- 90’s & 00’s data center patterns
- Servers MUST NOT FAIL
- Standard Server Config
  • $35,000/node
  • Dual Power supply
  • RAID
  • SAS 15K RPM
  • SAN
  • VMs for Production
  • Flat Network

[Image source: HP: The transformation to HP Converged Infrastructure]

Automated provisioning is a good thing!

Page 8: Teradata Partners Conference Oct 2014   Big Data Anti-Patterns

#1 Locality

Locality Locality Locality
- Bring Computation to Data

Co-locate data and compute

Locally Attached Storage

Localize & isolate network traffic

Rack Awareness

[Diagram labels: VM Cluster, Hadoop Cluster]
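One way to read "bring computation to data" is as a placement preference: run a task on a node that already holds the data block, fall back to a node in the same rack, and only then send data across racks. The sketch below illustrates that ordering only. The node names, the rack map, and the `pick_node` helper are hypothetical and are not Hadoop's scheduler API.

```python
from typing import Dict, List, Set, Tuple

def pick_node(replica_nodes: Set[str],
              free_nodes: List[str],
              rack_of: Dict[str, str]) -> Tuple[str, str]:
    """Return (node, locality_level) for one task, preferring data-local placement."""
    if not free_nodes:
        raise ValueError("no free nodes available")
    # 1. Node-local: a free node already stores a replica of the block.
    for node in free_nodes:
        if node in replica_nodes:
            return node, "node-local"
    # 2. Rack-local: a free node shares a rack with at least one replica.
    replica_racks = {rack_of[n] for n in replica_nodes}
    for node in free_nodes:
        if rack_of[node] in replica_racks:
            return node, "rack-local"
    # 3. Off-rack: the block has to cross the core network.
    return free_nodes[0], "off-rack"

# Hypothetical four-node, three-rack cluster.
rack_of = {"n1": "rackA", "n2": "rackA", "n3": "rackB", "n4": "rackC"}
replicas = {"n1", "n3"}

print(pick_node(replicas, ["n2", "n1"], rack_of))  # ('n1', 'node-local')
print(pick_node(replicas, ["n2", "n4"], rack_of))  # ('n2', 'rack-local')
print(pick_node(replicas, ["n4"], rack_of))        # ('n4', 'off-rack')
```

On a SAN-backed VM cluster every read is effectively "off-rack", which is why the slide pairs locality with locally attached storage.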

Page 9: Teradata Partners Conference Oct 2014   Big Data Anti-Patterns

#2 Sequential IO

Sequential IO >> Random Access

Large block IO

Append only writes

JBOD

http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html

Image credit: Wikipedia.org
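A rough back-of-the-envelope model shows why large-block sequential reads beat small random reads on spinning disks. The numbers here are assumptions chosen for illustration, not measurements from the talk: a 10 ms seek per block, a 100 MB/s transfer rate, and block sizes of 4 KB versus 128 MB.

```python
def read_time_seconds(total_bytes: int, block_bytes: int,
                      seek_s: float = 0.010,                  # assumed seek cost per block
                      transfer_bytes_per_s: float = 100e6     # assumed sustained transfer rate
                      ) -> float:
    """Time to read total_bytes when each block costs one seek plus its transfer time."""
    blocks = -(-total_bytes // block_bytes)  # ceiling division
    return blocks * seek_s + total_bytes / transfer_bytes_per_s

GB = 10**9
small_blocks = read_time_seconds(GB, 4 * 1024)            # many small random reads
large_blocks = read_time_seconds(GB, 128 * 1024 * 1024)   # a few large sequential reads

print(f"4 KB blocks:   {small_blocks:,.1f} s")
print(f"128 MB blocks: {large_blocks:,.1f} s")
```

Under these assumptions the seek cost dominates the small-block case, while the large-block case is almost entirely transfer time.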

Page 10: Teradata Partners Conference Oct 2014   Big Data Anti-Patterns

#3 Increase Parallelism

Increase # parallel components
- Reduce component cost

Data block replication
- Performance
- Availability

Commodity++ (2014)
- High density data nodes
- $8-12,000
- ~12 drives
- ~12-16 cores
- Buy more servers for the cost of one
  • 4-5x spindles
  • 4-5x cores

[Diagram label: Hadoop Cluster]
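The cost argument can be checked with the figures quoted on the slides: $35,000 for the standard enterprise node and $8,000 to $12,000 for a commodity data node with about 12 drives and 12 to 16 cores. This is a minimal sketch of that arithmetic; the helper name is hypothetical, and the enterprise node's own drive and core counts are not given in the deck, so only the commodity side is totalled.

```python
ENTERPRISE_NODE_COST = 35_000           # "Standard Server Config" figure from the deck
COMMODITY_NODE_COSTS = (8_000, 12_000)  # low and high end of the quoted range
DRIVES_PER_NODE = 12
CORES_PER_NODE = (12, 16)

def commodity_nodes_for(budget: int, node_cost: int) -> int:
    """Whole commodity nodes that fit in the given budget."""
    return budget // node_cost

for cost in COMMODITY_NODE_COSTS:
    n = commodity_nodes_for(ENTERPRISE_NODE_COST, cost)
    print(f"${cost:,}/node -> {n} nodes, {n * DRIVES_PER_NODE} drives, "
          f"{n * CORES_PER_NODE[0]}-{n * CORES_PER_NODE[1]} cores")
```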

Page 11: Teradata Partners Conference Oct 2014   Big Data Anti-Patterns

#4 Failure

Expect Failure

Rack Awareness

Data Block Replication

Task Retry

Node Black Listing

Monitor Everything

Name Node HA

[Diagram label: Hadoop Cluster]
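Task retry and node blacklisting can be sketched together: a failed task is rerun on another node, and a node that keeps failing stops receiving work. This is an illustration of the idea only, not Hadoop's implementation; `run_with_retry`, its thresholds, and the node names are hypothetical.

```python
from collections import Counter
from typing import Callable, List

def run_with_retry(task: Callable[[str], str], nodes: List[str],
                   max_attempts: int = 4, blacklist_after: int = 2) -> str:
    """Retry a task across nodes; skip any node that has failed blacklist_after times."""
    failures: Counter = Counter()
    attempts = 0
    while attempts < max_attempts:
        healthy = [n for n in nodes if failures[n] < blacklist_after]
        if not healthy:
            break  # every node is blacklisted
        # Prefer the node with the fewest recorded failures.
        node = min(healthy, key=lambda n: failures[n])
        attempts += 1
        try:
            return task(node)
        except RuntimeError:
            failures[node] += 1
    raise RuntimeError(f"task failed after {attempts} attempts")

def flaky_task(node: str) -> str:
    # Hypothetical task: node "n1" has a bad disk, every other node works.
    if node == "n1":
        raise RuntimeError("disk error")
    return f"ok on {node}"

print(run_with_retry(flaky_task, ["n1", "n2"]))  # ok on n2
```

The point of the design is that a single node failure costs one retry, not a failed job.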

Page 12: Teradata Partners Conference Oct 2014   Big Data Anti-Patterns


Hadoop Ecosystem Tools

Tooling

http://en.wikipedia.org/wiki/File:Bicycle_multi-tool.JPG

Page 13: Teradata Partners Conference Oct 2014   Big Data Anti-Patterns

Tooling: Just looking inside the box

“If it came in the box then I should use it”

Example
- Oozie for scheduling

Best Practice:
• Use your current enterprise scheduler

Page 14: Teradata Partners Conference Oct 2014   Big Data Anti-Patterns

Tooling: NoSQL

• “Now I have all of my log data in NoSQL, let’s do analytics over it”

Example
- Streaming data into Mongo DB
  • Running aggregates
  • Running MR jobs

Page 15: Teradata Partners Conference Oct 2014   Big Data Anti-Patterns

Best Practice:
• Split the stream
  • Real-time access in NoSQL
  • Batch analytics in Hadoop
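Splitting the stream means every incoming event is written twice: once to a serving store that answers fast lookups, and once to an append-only log that batch jobs scan later. In this minimal sketch a `dict` stands in for the NoSQL store and a `list` stands in for append-only files in Hadoop; both stand-ins and the `split_stream` helper are assumptions for illustration.

```python
import json
from typing import Dict, Iterable, List

def split_stream(events: Iterable[dict],
                 realtime_store: Dict[str, dict],
                 batch_log: List[str]) -> None:
    """Fan each event out to a lookup store and an append-only batch log."""
    for event in events:
        # Serving path: keep only the latest state per key for fast lookups.
        realtime_store[event["id"]] = event
        # Batch path: append every raw event; nothing is updated in place.
        batch_log.append(json.dumps(event, sort_keys=True))

store: Dict[str, dict] = {}
log: List[str] = []
split_stream(
    [{"id": "a", "v": 1}, {"id": "a", "v": 2}, {"id": "b", "v": 3}],
    store, log,
)
print(store["a"])  # {'id': 'a', 'v': 2}
print(len(log))    # 3
```

The serving store holds two keys while the log keeps all three events, so aggregates run over the full history without loading the lookup store.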

Page 16: Teradata Partners Conference Oct 2014   Big Data Anti-Patterns

Right Framework, Right Need…

Hadoop Streaming
- Integrate legacy code
- Integrate analytic tools
  • Data science libs

Hadoop integrates any type of application tooling
- Java
- Python
- R
- C, C++
- Fortran
- Cobol
- Ruby
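Hadoop Streaming runs any executable as a mapper or reducer by feeding it lines on standard input and reading tab-separated key/value lines from standard output, which is what lets legacy code and data science libraries plug in. The sketch below shows that contract with a word count, written as plain functions over lines so it can be run locally; the `sorted` call stands in for the shuffle-and-sort phase, and the function names are illustrative.

```python
from itertools import groupby
from typing import Iterable, Iterator

def mapper(lines: Iterable[str]) -> Iterator[str]:
    """Emit 'word<TAB>1' for every word on every input line."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines: Iterable[str]) -> Iterator[str]:
    """Sum counts per key; input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"

# Local simulation of map -> sort -> reduce.
mapped = mapper(["big data", "big cluster"])
for out in reducer(sorted(mapped)):
    print(out)
# big	2
# cluster	1
# data	1
```

In a real job each function would be wrapped in a small script that iterates over `sys.stdin` and prints its output, and the scripts would be passed to the streaming jar through its `-mapper` and `-reducer` options.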

Page 17: Teradata Partners Conference Oct 2014   Big Data Anti-Patterns

Right Use Case – ETL, Wrong Framework

Got to love Ruby
- Very Cool (or it was)
- Dynamic Language
- Expressive
- Compact
- Fast Iteration

Got to Hate Ruby
- Slow
- Hard to follow & debug
- Does not play well with threading

“It’s much faster to develop in, developer time is valuable, just throw a couple more boxes at it”

Bench tested Ruby ETL framework at 5,000 records / second

Page 18: Teradata Partners Conference Oct 2014   Big Data Anti-Patterns

Right Use Case – ETL, Wrong Framework…

Best Practice:
• Write new code in the fastest execution framework
• For high value legacy code and analytic tools, use Hadoop Streaming
• Innovation is Important: Test and Learn

DO THE MATH:

Storm Java: ~ 1MM+ events / second / server

Storm Ruby: 5,000 * 12 cores = 60,000 events / second / server

1,000,000 / 60,000 = 16.67 times more servers for Ruby

bit.ly/1t0HXJH
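The "do the math" comparison can be reproduced directly from the slide's figures, treating the 5,000 records / second bench result as a per-core rate as the slide does. The 3 million events / second target used at the end is a hypothetical workload added only to turn the ratio into server counts.

```python
import math

JAVA_EVENTS_PER_SERVER = 1_000_000  # "~1MM+ events / second / server" from the slide
RUBY_EVENTS_PER_CORE = 5_000        # bench-tested rate from the slide
CORES_PER_SERVER = 12

ruby_events_per_server = RUBY_EVENTS_PER_CORE * CORES_PER_SERVER
server_ratio = JAVA_EVENTS_PER_SERVER / ruby_events_per_server

def servers_needed(target_events_per_s: int, events_per_server: int) -> int:
    """Whole servers required to sustain the target event rate."""
    return math.ceil(target_events_per_s / events_per_server)

print(ruby_events_per_server)   # 60000
print(round(server_ratio, 2))   # 16.67

target = 3_000_000  # hypothetical workload, not from the talk
print(servers_needed(target, JAVA_EVENTS_PER_SERVER))  # 3
print(servers_needed(target, ruby_events_per_server))  # 50
```

"Just throw a couple more boxes at it" therefore means roughly seventeen boxes for every one, which is the cost the quoted developer-time argument leaves out.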

Page 19: Teradata Partners Conference Oct 2014   Big Data Anti-Patterns


Hadoop Use Cases

1. ETL Offload

2. Data Warehousing

Big Data Warehousing

Hadoop Data Types

1. Structured

2. Semi-structured

3. Multi or Unstructured

Page 20: Teradata Partners Conference Oct 2014   Big Data Anti-Patterns

First Principles: #5 Schema on Read

Don’t over curate:

“We are going to
- Define and parse 1,000 attributes from the machine log files on ETL servers,
- load just what we need to,
- this will take 6 months”

HCatalog

Navigator, Loom,…

UDFs, UDTFs
- JSON, Regex built in
- Custom Java
- Hadoop Streaming (e.g. use Python, Perl)

Hive Partitions

Recursive directory reads

Bucket Joins

Columnar formats
- ORC
- Parquet

Best Practices:
• Define what you need to
• Parse on Demand
• Structure to optimize
• Beware the data palace fountain & data swamp
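"Parse on demand" can be illustrated by keeping raw log lines as they arrive and extracting a single attribute only when a query asks for it, instead of defining all 1,000 attributes up front. The log format, field names, and `extract` helper below are hypothetical; in Hive the same idea would typically be expressed with a regex or JSON function at query time.

```python
import re
from typing import Iterable, Iterator, Optional

# Raw lines are stored untouched; this key=value format is an assumed example.
RAW_LOG = [
    "2014-10-20T10:00:01 host=web01 status=200 latency_ms=12 user=alice",
    "2014-10-20T10:00:02 host=web02 status=500 latency_ms=340 user=bob",
    "malformed line that a strict up-front parser would have rejected",
]

def extract(lines: Iterable[str], field: str) -> Iterator[Optional[str]]:
    """Parse one key=value attribute at read time; yield None when it is absent."""
    pattern = re.compile(rf"\b{re.escape(field)}=(\S+)")
    for line in lines:
        match = pattern.search(line)
        yield match.group(1) if match else None

print(list(extract(RAW_LOG, "status")))      # ['200', '500', None]
print(list(extract(RAW_LOG, "latency_ms")))  # ['12', '340', None]
```

A new question about a different field needs a new call, not a new ETL release, and the malformed line stays available for later inspection.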

Page 21: Teradata Partners Conference Oct 2014   Big Data Anti-Patterns


Right Schema

Data Warehouse

OLTP

Hadoop

Page 22: Teradata Partners Conference Oct 2014   Big Data Anti-Patterns


Workload Hadoop NoSQL MPP, Reporting DBs, Mainframe

ETL

Business Intelligence

Cross business reporting

Sub-set analytics

Full scan analytics

Decision Support TBs-PBs GB-TBs

Operational Reports

Complex security requirements

Search

Fast Lookup

Right Workload, Right Tool

Page 23: Teradata Partners Conference Oct 2014   Big Data Anti-Patterns

Summary

Understand strengths & weaknesses of each choice
- Get help as needed to make your first effort successful

Deploy the right tool for the right workload

Test and Learn

http://www.keepcalm-o-matic.co.uk/p/keep-calm-and-climb-on-94/

Page 24: Teradata Partners Conference Oct 2014   Big Data Anti-Patterns


Thank You

Work with the best on a wide range of cool projects: [email protected]

@douglas_ma

Douglas Moore

Page 25: Teradata Partners Conference Oct 2014   Big Data Anti-Patterns

DATA SCIENTISTS

DATA ARCHITECTS

DATA SOLUTIONS

Think Big Start Smart Scale Fast

Work with the Leading Innovator in Big Data
