Teradata Partners Conference Oct 2014 Big Data Anti-Patterns
| 1
Big Data Anti-Patterns: Lessons from the Front Lines
Douglas Moore
Principal Data Architect
Think Big, a Teradata Company
| 2
About Douglas Moore
Think Big – 3 Years
- Roadmaps
- Delivery
  • BDW, Search, Streaming
- Tech Assessments
Before Big Data
- Data Warehousing
- OLTP
- Systems Architecture
- Electricity
- High End Graphics
- Supercomputers
- Numerical Analysis
Contact me at: @douglas_ma
| 3
Think Big
4-year-old “Big Data” Professional Services Firm
- Roadmaps
- Engineering
- Data Science
- Hands on Training
Recently acquired by Teradata
• Maintaining Independence
| 4
Content Drawn From Vast Amounts of Experience
…
50+ Clients
Leading security software vendor
Leading Discount Retailer
| 5
I started out with just 3 topics…
Then while on the road to Strata,
I met 7 big data architects
- Who had 7 clients
• Who had 7 projects
• That demonstrated 7 Anti-Patterns
Introduction
Big Data Anti-pattern: “Commonly applied but bad solution”
[Image source: I-95, Wikipedia]
| 6
Three Focus Areas
• Hardware and Infrastructure
• Tooling
• Big Data Warehousing
| 7
Hardware & Infrastructure
Reference Architecture Driven
- 90’s & 00’s data center patterns
- Servers MUST NOT FAIL
- Standard Server Config
  • $35,000/node
  • Dual Power supply
  • RAID
  • SAS 15K RPM
  • SAN
  • VMs for Production
  • Flat Network
[Image source: HP: The transformation to HP Converged Infrastructure]
Automated provisioning is a good thing!
| 8
#1 Locality
Locality Locality Locality
- Bring Computation to Data
- Co-locate data and compute
- Locally Attached Storage
- Localize & isolate network traffic
- Rack Awareness (a minimal topology-script sketch follows below)
[Diagram: VM Cluster vs. Hadoop Cluster]
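Rack awareness is usually wired in through a small topology script that HDFS calls via the net.topology.script.file.name property in core-site.xml: Hadoop passes host IPs or names as arguments and expects one rack path per argument on stdout. A minimal sketch; the IP-to-rack mapping is made up for illustration.

#!/usr/bin/env python
# Hypothetical HDFS rack-topology script (referenced by
# net.topology.script.file.name). Hadoop invokes it with one or more
# host IPs/names and expects one rack path per argument on stdout.
import sys

RACKS = {                      # made-up mapping for illustration
    "10.0.1.11": "/dc1/rack1",
    "10.0.1.12": "/dc1/rack1",
    "10.0.2.21": "/dc1/rack2",
}
DEFAULT_RACK = "/dc1/default-rack"

for host in sys.argv[1:]:
    print(RACKS.get(host, DEFAULT_RACK))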
| 9
#2 Sequential IO
Sequential IO >> Random Access
http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html
[Image credit: Wikipedia.org]
- Large block IO
- Append only writes
- JBOD
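A back-of-the-envelope comparison makes the point; the drive figures below are rough assumptions for a single 7200 RPM SATA disk, not measurements from the deck.

# Rough arithmetic: scanning 1 TiB sequentially vs. touching it with
# random 4 KiB reads on one assumed 7200 RPM SATA drive.
TOTAL_BYTES = 1 * 1024 ** 4        # 1 TiB
SEQ_MB_PER_SEC = 120.0             # assumed sequential throughput
SEEK_SEC = 0.010                   # assumed ~10 ms per random seek
RANDOM_IO_BYTES = 4 * 1024         # 4 KiB per random read

seq_hours = TOTAL_BYTES / (SEQ_MB_PER_SEC * 1024 ** 2) / 3600
rand_hours = (TOTAL_BYTES / RANDOM_IO_BYTES) * SEEK_SEC / 3600

print("sequential scan: %.1f hours" % seq_hours)   # ~2.4 hours
print("random 4 KiB:    %.0f hours" % rand_hours)  # ~746 hours, seek-bound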
| 10
#3 Increase Parallelism
Increase # parallel components
- Reduce component cost
Data block replication
- Performance
- Availability
Commodity++ (2014)
- High density data nodes
- $8-12,000
- ~12 drives
- ~12-16 cores
- Buy more servers for the cost of one (worked out in the sketch below)
  • 4-5x spindles
  • 4-5x cores
Hadoop Cluster
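Checking the “4-5x per dollar” claim with assumed configurations; the legacy and commodity specs below are illustrative, not the deck’s exact numbers.

# Sanity-check "buy more servers for the cost of one" with assumed specs.
reference = {"cost": 35000, "drives": 8, "cores": 12}   # assumed legacy enterprise node
commodity = {"cost": 10000, "drives": 12, "cores": 14}  # mid-point of $8-12k, ~12 drives, ~12-16 cores

nodes_for_same_spend = float(reference["cost"]) / commodity["cost"]             # ~3.5 nodes
spindle_ratio = nodes_for_same_spend * commodity["drives"] / reference["drives"]
core_ratio = nodes_for_same_spend * commodity["cores"] / reference["cores"]

print("commodity nodes per legacy node: %.1f" % nodes_for_same_spend)  # ~3.5
print("spindles per dollar: %.1fx" % spindle_ratio)                    # ~5.2x
print("cores per dollar:    %.1fx" % core_ratio)                       # ~4.1x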
| 11
#4 Failure
Expect Failure
Hadoop Cluster
- Rack Awareness
- Data Block Replication
- Task Retry
- Node Black Listing
- Monitor Everything
- Name Node HA
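To make task retry and node blacklisting concrete, here is a toy scheduler loop; this is not Hadoop’s actual implementation, and the node names and failure simulation are invented.

import random
import time

MAX_ATTEMPTS = 4      # Hadoop retries failed task attempts a few times by default
blacklist = set()     # nodes with repeated failures are avoided for later attempts

def execute_on(node, task):
    """Stand-in for launching a task attempt on a worker node."""
    if random.random() < 0.3:                        # simulate a flaky node
        raise RuntimeError("attempt failed on %s" % node)
    return "output of %s from %s" % (task, node)

def run_task(task, nodes):
    """Toy scheduler: retry a failed task, steering away from bad nodes."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        candidates = [n for n in nodes if n not in blacklist] or list(nodes)
        node = random.choice(candidates)
        try:
            return execute_on(node, task)
        except RuntimeError:
            blacklist.add(node)                      # naive one-strike blacklisting
            time.sleep(0.1 * attempt)                # brief back-off before retrying
    raise RuntimeError("%s failed after %d attempts" % (task, MAX_ATTEMPTS))

print(run_task("map-0001", ["node1", "node2", "node3"]))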
| 12
Tooling
Hadoop Ecosystem Tools
[Image source: http://en.wikipedia.org/wiki/File:Bicycle_multi-tool.JPG]
| 13
Tooling: Just looking inside the box
“If it came in the box then I should use it”
Example
- Oozie for scheduling
Best Practice:
• Use your current enterprise scheduler
| 14
Tooling: NoSQL
• “Now I have all of my log data in NoSQL, let’s do analytics over it”
Example
- Streaming data into MongoDB
  • Running aggregates
  • Running MR jobs
| 15
Best Practice
• Split the stream (see the sketch below)
  • Real-time access in NoSQL
  • Batch analytics in Hadoop
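One way a split stream can look in practice: every incoming event is written both to a NoSQL store for low-latency lookups and to an append-only spool that feeds Hadoop for batch analytics. A minimal sketch assuming MongoDB via pymongo; the connection string, spool path, and event shape are all illustrative.

import json
from pymongo import MongoClient

mongo = MongoClient("mongodb://localhost:27017")    # assumed local MongoDB
realtime = mongo.logs.events                        # serves operational, per-key lookups

# Append-only spool file, periodically pushed into HDFS for batch analytics.
batch_spool = open("/data/spool/events-000001.json", "a")

def handle_event(event):
    realtime.insert_one(dict(event))                # real-time path (NoSQL)
    batch_spool.write(json.dumps(event) + "\n")     # batch path (Hadoop)
    batch_spool.flush()

handle_event({"user": "u42", "action": "login", "ts": "2014-10-20T12:00:00Z"})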
| 18
Right Framework, Right Need…
Hadoop integrates any type of application tooling
- Java
- Python
- R
- C, C++
- Fortran
- Cobol
- Ruby
Hadoop Streaming (minimal mapper sketch below)
- Integrate legacy code
- Integrate analytic tools
  • Data science libs
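Hadoop Streaming’s contract is simply “read lines on stdin, emit tab-separated key/value pairs on stdout”, which is why almost any language or legacy binary can participate. A minimal Python mapper sketch; the log layout (HTTP status in the ninth field) is an assumption for illustration.

#!/usr/bin/env python
# Minimal Hadoop Streaming mapper: count web hits by HTTP status code.
# Launched with something like (paths and reducer name are placeholders):
#   hadoop jar hadoop-streaming-*.jar \
#     -input /logs/raw -output /logs/status_counts \
#     -mapper mapper.py -reducer sum_reducer.py
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split()
    if len(fields) > 8:                 # assumed Apache-style access log layout
        status = fields[8]
        print("%s\t1" % status)         # key = status code, value = a count of 1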
| 19
Right Use Case – ETL, Wrong Framework
Got to love Ruby
- Very Cool (or it was)
- Dynamic Language
- Expressive
- Compact
- Fast Iteration
Got to Hate Ruby
- Slow
- Hard to follow & debug
- Does not play well with threading
“It’s much faster to develop in, developer time is valuable, just throw a couple more boxes at it”
Bench tested Ruby ETL framework at 5,000 records / second
| 20
Right Use Case – ETL, Wrong Framework…
Best Practice:
• Write new code in the fastest execution framework
• For high-value legacy code and analytic tools, use Hadoop Streaming
• Innovation is Important: Test and Learn
DO THE MATH:
Storm + Java: ~1MM+ events / second / server
Storm + Ruby: 5,000 records/second × 12 cores = 60,000 events / second / server
→ 16.67 times more servers needed
bit.ly/1t0HXJH
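The arithmetic behind that ratio, spelled out (the throughput figures are the slide’s rough benchmarks, not new measurements):

# Do the math: servers needed for the same event rate, Ruby vs. Java on Storm.
ruby_records_per_core = 5000        # bench-tested Ruby ETL throughput
cores_per_server = 12
java_events_per_server = 1000000.0  # ~1MM+ events/sec/server with Storm + Java

ruby_events_per_server = ruby_records_per_core * cores_per_server    # 60,000
server_multiplier = java_events_per_server / ruby_events_per_server  # ~16.67

print("Ruby events/sec/server: %d" % ruby_events_per_server)
print("Extra servers needed vs. Java: %.2fx" % server_multiplier)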
| 21
Big Data Warehousing
Hadoop Use Cases
1. ETL Offload
2. Data Warehousing
Hadoop Data Types
1. Structured
2. Semi-structured
3. Multi- or Unstructured
| 22
First Principles: #5 Schema on Read
Don’t over curate:
“We are going to
- define and parse 1,000 attributes from the machine log files on ETL servers,
- load just what we need to,
- this will take 6 months”
- HCatalog
- Navigator, Loom, …
- UDFs, UDTFs
  • JSON, Regex built in
  • Custom Java
  • Hadoop Streaming (e.g. use Python, Perl)
- Hive Partitions
- Recursive directory reads
- Bucket Joins
- Columnar formats
  • ORC
  • Parquet
Best Practices:
• Define what you need to
• Parse on Demand (sketch below)
• Structure to optimize
• Beware the data palace fountain & data swamp
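“Parse on Demand” can be as simple as landing the raw line untouched and extracting only the attributes a given analysis needs at read time. A minimal Python sketch with a made-up log format; in practice this might live in a Hive UDF or a Hadoop Streaming step.

import json
import re

# A raw line lands in Hadoop exactly as emitted; nothing is pre-modeled.
raw = '2014-10-20T12:00:01Z app42 {"user":"u7","action":"checkout","cart_total":31.50}'

LINE = re.compile(r'^(?P<ts>\S+)\s+(?P<source>\S+)\s+(?P<payload>\{.*\})$')

def parse_on_demand(line, wanted=("user", "action")):
    """Extract only the attributes this query needs, when it needs them."""
    m = LINE.match(line)
    if not m:
        return None                     # tolerate oddballs; that's schema on read
    payload = json.loads(m.group("payload"))
    record = {"ts": m.group("ts"), "source": m.group("source")}
    record.update({k: payload.get(k) for k in wanted})
    return record

print(parse_on_demand(raw))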
| 23
Right Schema
[Diagram: Data Warehouse | OLTP | Hadoop]
| 24
Right Workload, Right Tool
Workload comparison across Hadoop | NoSQL | MPP, Reporting DBs, Mainframe:
- ETL
- Business Intelligence
- Cross business reporting
- Sub-set analytics
- Full scan analytics
- Decision Support (TBs-PBs vs. GB-TBs)
- Operational Reports
- Complex security requirements
- Search
- Fast Lookup
| 25
Summary
- Understand strengths & weaknesses of each choice
  • Get help as needed to make your first effort successful
- Deploy the right tool for the right workload
- Test and Learn
[Image source: http://www.keepcalm-o-matic.co.uk/p/keep-calm-and-climb-on-94/]
| 26
Thank You
Douglas Moore
@douglas_ma
Work with the best on a wide range of cool projects: [email protected]
DATA SCIENTISTS
DATA ARCHITECTS
DATA SOLUTIONS
Think Big Start Smart Scale Fast
Work with the Leading Innovator in Big Data